VOICEBOX: Speech Processing Toolbox for MATLAB

Introduction

VOICEBOX is a speech processing toolbox consisting of MATLAB routines that are maintained and mostly written by Mike Brookes, Department of Electrical & Electronic Engineering, Imperial College, Exhibition Road, London SW7 2BT, UK.

The routines are available as a GitHub repository (or as a zip archive, which is often slightly out of date) and are made available under the terms of the GNU General Public License. To avoid name conflicts, all routine names begin with a "v_" prefix. For compatibility with legacy code, aliased versions without the prefix are included, but these are likely to be removed in the future (the routine v_voicebox_update.m is included to update legacy code to the new names).

Installation

  1. Clone the GitHub repository (or unzip the zip archive, which may be out of date) into any suitable folder (assumed below to be C:\sap-voicebox).
  2. Start MATLAB, click "Set Path", click "Add Folder ...", navigate to C:\sap-voicebox\voicebox, click "Select Folder" then click "Save".
  3. [Optional] The routine v_voicebox.m contains various installation-dependent parameters which may need to be altered before using the toolbox. In particular, it contains a number of default directory paths indicating where temporary files should be created, where speech data normally resides, etc. You can override the defaults by editing v_voicebox.m directly or, more conveniently, by setting an environment variable VOICEBOX to the path of an initializing m-file. See the comments in v_voicebox.m for a fuller description.
  4. [Optional] You may find it convenient to install the non-unicode IPA phonetic symbol fonts developed by SIL which are in the C:\sap-voicebox\external\silipa93 folder.
  5. [Optional] The folder C:\sap-voicebox\external\shorten contains the source code and compiled executable for the SHORTEN program written by Tony Robinson and SoftSound Limited (www.softsound.com). This is needed for reading compressed SPHERE format files. You may wish to move it elsewhere but, if so, you will need to edit v_voicebox.m to give its location.
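
Step 2 can also be carried out from the MATLAB command line. The following is a minimal sketch, assuming the repository was placed in C:\sap-voicebox as above. It uses only the built-in MATLAB functions addpath, savepath, exist and getenv; the variable names vbdir and found are illustrative.

```matlab
% Folder from step 1 (an assumption: change this if you installed elsewhere)
vbdir = fullfile('C:\sap-voicebox', 'voicebox');

if exist(vbdir, 'dir') == 7
    addpath(vbdir);     % equivalent to "Set Path" > "Add Folder ..."
    savepath;           % equivalent to clicking "Save"
else
    warning('VOICEBOX folder not found: %s', vbdir);
end

% Confirm that MATLAB can now see the toolbox (2 means "file on the path")
found = (exist('v_voicebox', 'file') == 2);
fprintf('v_voicebox on path: %d\n', found);

% Show the VOICEBOX environment variable from step 3 (empty if not set)
disp(getenv('VOICEBOX'));
```

If savepath reports that it cannot write the path file, the addpath call can instead be placed in a startup.m file.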

Please send any comments, suggestions, bug reports, etc. to mike.brookes@ic.ac.uk.


Toolbox Contents


Audio File Input/Output: Read and write WAV and other speech file formats
Frequency Scales: Convert between Hz, mel, ERB and MIDI frequency scales
Fourier/DCT/Hartley Transforms: Various related transforms
Random Number and Probability Distributions: Generate random vectors and noise signals
Vector Distances: Calculate distances between vector lists
Speech Analysis: Active level estimation, spectrograms
LPC Analysis of Speech: Linear Predictive Coding routines
Speech Synthesis: Text-to-speech synthesis and glottal waveform models
Speech Enhancement: Spectral noise subtraction
Speech Coding: PCM coding, vector quantisation
Speech Recognition: Front-end processing for recognition
Signal Processing: Miscellaneous signal processing functions
Information Theory: Routines for entropy calculation and symbol codes
Computer Vision: Routines for 3D rotation
Printing and Display Functions: Utilities for printing and graphics
Voicebox Parameters and System Interface: Get or set VOICEBOX and Windows system parameters
Utility Functions: Miscellaneous utility functions


Audio File Input/Output

Routines are available to read and, in some cases, write a variety of file formats:

Read       Write       Suffix  Description
v_readwav  v_writewav  .wav    These routines allow an arbitrary number of channels and can deal with linear PCM (any precision up to 32 bits), A-law PCM, mu-law PCM and floating-point formats. Large files can be read and written in small chunks.
wavread    wavwrite    .wav    Emulations of legacy MATLAB WAV file routines.
v_readhtk  v_writehtk  .htk    Read and write waveform and parameter files used by Microsoft's Hidden Markov Toolkit.
v_readsfs  -           .sfs    Speech Filing System files from Mark Huckvale at UCL.
v_readsph  -           .sph    NIST SPHERE format files (including TIMIT). Needs SHORTEN for compressed files.
v_readaif  -           .aif    AIFF format (Audio Interchange File Format) used by Mac users.
v_readcnx  -           .cnx    Read Connex database files (from BT).
v_readflac -           .flac   Read FLAC audio files.
v_readau   -           .au     Read AU audio files (from Sun).

Frequency Scale Conversion

From f      To f        Scale  Description
v_frq2bark  v_bark2frq  bark   The bark scale is based on critical bands and masking in the human ear.
v_frq2cent  v_cent2frq  cent   The cent scale is in increments of 0.01 semitones.
v_frq2erb   v_erb2frq   erb    The erb scale is based on the equivalent rectangular bandwidths of the human ear.
v_frq2mel   v_mel2frq   mel    The mel scale is based on the human perception of sinewave pitch.
v_frq2midi  v_midi2frq  midi   The MIDI standard specifies a numbering of semitones with middle C being 60. The routines can use the normal equal-tempered scale or else the Pythagorean scale of just intonation. They can also output note names in character format.
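
As a rough illustration of what these conversions compute, the sketch below evaluates two of the scales directly with built-in MATLAB functions. The MIDI mapping follows from middle C being note 60 on the equal-tempered scale, which puts A4 (440 Hz) at note 69. The mel formula shown is the widely used 2595*log10(1+f/700) form; it is an assumption that the VOICEBOX routines use the same constants, so use v_frq2mel and v_frq2midi for real work.

```matlab
f = [261.6256 440 1000];              % frequencies in Hz (middle C, A4, 1 kHz)

% Equal-tempered MIDI note number: 12 semitones per octave, A4 (440 Hz) = 69
midi = 69 + 12*log2(f/440);

% Common mel-scale formula (assumed constants, see text)
mel = 2595*log10(1 + f/700);

% Inverse mappings
f_from_midi = 440*2.^((midi - 69)/12);
f_from_mel  = 700*(10.^(mel/2595) - 1);

disp([f; midi; mel]);                 % one column per input frequency
```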

Fourier, DCT and Hartley Transforms

Forward     Inverse     Description
v_rfft      v_irfft     Forward and inverse discrete Fourier transforms on real data. Only the first half of the conjugate-symmetric transform is generated. For even-length data, the inverse routine is asymptotically twice as fast as the built-in MATLAB routine.
v_rsfft     -           Forward transform of real, symmetric data to give the first half only of the real, symmetric transform.
v_zoomfft   -           Calculate the discrete Fourier transform at an arbitrary set of linearly spaced frequencies. Can be used to zoom into a subset of the full frequency range.
v_rdct      v_irdct     Forward and inverse discrete cosine transform on real data.
v_rhartley  v_rhartley  Hartley transform on real data (forward and inverse transforms are the same).
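
The "first half only" convention used for real data can be illustrated with the built-in fft and ifft. This is a sketch of the idea rather than the toolbox code: for a real signal of length n the transform is conjugate symmetric, so its first fix(n/2)+1 values determine the rest.

```matlab
x = [1; 2; 0; -1; 3; 1; -2; 4];      % real test signal (column vector)
n = numel(x);

X = fft(x);
h = X(1:fix(n/2)+1);                 % keep the non-redundant half only

% Rebuild the full spectrum from the half using conjugate symmetry
if mod(n, 2) == 0
    full = [h; conj(h(end-1:-1:2))]; % even n: the Nyquist bin is not repeated
else
    full = [h; conj(h(end:-1:2))];   % odd n: there is no Nyquist bin
end

xr = real(ifft(full));               % recovers x to within rounding error
fprintf('max reconstruction error = %g\n', max(abs(xr - x)));
```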

Random Numbers and Probability Distributions


Vector Distance

v_disteusq calculates the squared Euclidean distance between all pairs of rows of two matrices.
v_distitar calculates the Itakura spectral distances between sets of AR coefficients.
v_distitpf calculates the Itakura spectral distances between power spectra.
v_distisar calculates the Itakura-Saito spectral distances between sets of AR coefficients.
v_distispf calculates the Itakura-Saito spectral distances between power spectra.
v_distchar calculates the COSH spectral distances between sets of AR coefficients.
v_distchpf calculates the COSH spectral distances between power spectra.
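
For reference, the basic quantity computed by v_disteusq, the squared Euclidean distance between every row of one matrix and every row of another, can be written with built-in functions as follows. This is an illustrative sketch, not the toolbox implementation, and it ignores any options the routine may offer.

```matlab
x = [0 0; 1 1];            % 2 vectors of dimension 2 (one per row)
y = [1 0; 0 0; 3 4];       % 3 vectors of dimension 2 (one per row)

% |x_i - y_j|^2 = |x_i|^2 + |y_j|^2 - 2*x_i*y_j'
d = bsxfun(@plus, sum(x.^2, 2), sum(y.^2, 2).') - 2*(x*y.');
d = max(d, 0);             % guard against tiny negative values from rounding

disp(d);                   % d(i,j) is the squared distance from x(i,:) to y(j,:)
```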

Speech Analysis

v_activlev calculates the active level of a speech segment according to ITU-T recommendation P.56.
v_activlevg calculates the active level of a speech segment robustly to added noise
v_dypsa estimates the glottal closure instants from the speech waveform.
v_earnoise Add noise to an audio signal to represent the listener's hearing threshold
v_enframe can be used to split a signal up into frames. It can optionally apply a window to each frame.
v_correlogram Calculates a 3D correlogram [slowly]
v_ewgrpdel calculates the energy-weighted group delay waveform.
v_fram2wav interpolates a sequence of frame-based values into a waveform
v_filtbankm Transformation matrix for a linear/mel/erb/bark-spaced filterbank from dft output
v_fxpefac PEFAC pitch tracker
v_fxrapt is an implementation of the RAPT pitch tracker by David Talkin.
v_gammabank Determine a bank of IIR gammatone filters
v_importsii calculate the SII importance function
v_istftw Inverse Short-time Fourier Transform (ISTFT)
v_modspect Calculate the modulation spectrogram
v_mos2pesq Convert MOS values to PESQ speech quality scores
v_overlapadd Join frames up using overlap-add processing. Commonly used with v_enframe.
v_pesq2mos Convert PESQ speech quality scores to MOS values
v_phon2sone Convert signal levels from phons to sones
v_psycdigit experimental estimation of monotonic/unimodal psychometric function using TIDIGITS
v_psycest experimental estimation of monotonic psychometric function
v_psycestu experimental estimation of unimodal psychometric function
v_psychofunc calculate psychometric function
v_sigma estimate glottal opening and closure instants from the laryngograph/EGG waveform
v_snrseg calculate segmental SNR and global SNR relative to a reference signal
v_sone2phon Convert signal levels from sones to phons
v_soundspeed gives the speed of sound as a function of temperature
v_spgrambw draws a spectrogram with many options. See tutorial.
v_txalign finds the best alignment (in a least squares sense) between two sets of time markers (e.g. glottal closure instants).
v_stftw Short-time Fourier Transform (STFT)
v_stoi2prob Convert STOI intelligibility measure to probability of correct recognition
v_vadsohn voice activity detector
v_ppmvu Calculate the PPM, VU or EBU levels of a signal
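
Several of the routines above work on signals that have been divided into overlapping frames, which is the job of v_enframe. The sketch below shows the underlying idea using only built-in functions. The frame length n, the hop inc and the Hamming window are illustrative choices, and samples that do not fill a complete final frame are simply dropped here.

```matlab
x   = (1:10).';                    % toy signal (column vector)
n   = 4;                           % frame length in samples
inc = 2;                           % hop between frame starts (50% overlap)

nf  = max(fix((numel(x) - n + inc)/inc), 0);   % number of complete frames
idx = bsxfun(@plus, (0:nf-1).'*inc, 1:n);      % sample indices, one frame per row
frames = reshape(x(idx), nf, n);

% Optional: apply a Hamming window to every frame
w = 0.54 - 0.46*cos(2*pi*(0:n-1)/(n-1));
wframes = bsxfun(@times, frames, w);

disp(frames);
```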

LPC Analysis of Speech

 
The routines relating to LPC are described in more detail on another page. A large number of conversion routines are included for changing the form of the LPC coefficients (e.g. AR coefficients, reflection coefficients etc.): these are of the form v_lpcxx2yy where xx and yy denote the coefficient sets.
v_ccwarpf performs frequency warping in the complex cepstrum domain.
v_lpcauto performs autocorrelation LPC analysis
v_lpccovar perform covariance LPC analysis
v_lpcbwexp performs bandwidth expansion on an LPC filter.
v_lpcifilt performs inverse filtering to estimate the glottal waveform from the speech signal and the lpc coefficients.
v_lpcrand can be used to generate random, stable filters for testing purposes.
v_lpcrr2am calculates LPC filters for all orders up to a given maximum.
v_lpcstable determines filter stability and forces filter stability

Speech Synthesis

v_sapisynth Text-to-speech synthesis (TTS) of a string or matrix entries
v_glotros Calculates the Rosenberg model of the glottal flow waveform
v_glotlf Calculates the Liljencrants-Fant model of the glottal flow waveform

Speech Enhancement

v_estnoiseg uses an MMSE algorithm to estimate the noise spectrum from a noisy speech signal that has been divided into frames.
v_estnoisem uses a minimum-statistics algorithm to estimate the noise spectrum from a noisy speech signal that has been divided into frames.
v_specsub performs speech enhancement using spectral subtraction
v_spendred performs speech enhancement and dereverberation
v_ssubmmse performs speech enhancement using the MMSE or log MMSE criteria
v_ssubmmsev performs speech enhancement using the MMSE or log MMSE criteria with VAD-based noise estimate

Speech Coding

v_lin2pcma converts an audio waveform to 8-bit A-law PCM format
v_lin2pcmu converts an audio waveform to 8-bit mu-law PCM format
v_pcma2lin converts 8-bit A-law PCM to a waveform
v_pcmu2lin converts 8-bit mu-law PCM to a waveform
v_kmeanlbg vector quantisation using the LBG algorithm
v_kmeanhar vector quantisation using the K-harmonic means algorithm
v_potsband calculates a bandpass filter corresponding to the standard telephone passband.
v_kmeans vector quantisation using the K-means algorithm

Speech Recognition

v_melcepst implements a mel-cepstrum front end for a recogniser
v_melbankm constructs a bandpass filterbank with mel-spaced centre frequencies
v_cep2pow converts multivariate Gaussian means and covariances from the log power or cepstral domain to the power domain
v_pow2cep converts multivariate Gaussian means and covariances from the power domain to the log power or cepstral domain
v_ldatrace performs Linear Discriminant Analysis with optional constraints on the transform matrix

Signal Processing

v_convfft 1-dimensional convolution/correlation using FFT
v_ditherq adds dither and quantizes a signal
v_dlyapsq solves the discrete lyapunov equation using an efficient square root algorithm
v_filterbank Apply a bank of IIR filters to a signal
v_maxfilt performs running maximum filter
v_meansqtf calculates the output power of a rational filter with a white noise input
v_momfilt generate running moments from a signal
v_resample Resamples a signal: identical to MATLAB resample but removes filter transients
v_sigalign align a clean reference with a noisy signal and find optimum gain
v_schmitt passes a signal through a schmitt trigger having hysteresis
v_teager calculate the Teager energy waveform
v_addnoise add noise to a signal at a chosen SNR
v_findpeaks finds the peaks in a signal
v_windows generates window functions
v_windinfo calculate window properties and figures of merit
v_zerocros finds the zero crossings of a signal with interpolation

Information Theory

v_huffman calculates optimum D-ary symbol code from a probability mass vector
v_entropy calculates entropy and conditional entropy for discrete and continuous distributions
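
As a reminder of the basic quantity involved, the entropy of a discrete distribution with probability mass vector p is H = -sum(p.*log2(p)) bits. The following sketch computes it with built-in functions only; v_entropy handles more general cases, and its calling convention is not shown here.

```matlab
p = [0.5 0.25 0.25];            % probability mass vector (must sum to 1)

q = p(p > 0);                   % 0*log2(0) is taken as 0, so skip zero entries
H = -sum(q .* log2(q));         % entropy in bits

fprintf('Entropy = %.4f bits\n', H);
```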

Computer Vision

v_imagehomog Apply a homography transformation to an image with bilinear interpolation
v_polygonarea Calculates the area of a polygon
v_polygonwind Determines whether points are inside or outside a polygon
v_polygonxline Determines where a line crosses a polygon
v_qrabs Absolute value of a real quaternion
v_qrdivide divide two real quaternions (or invert one)
v_qrdotdiv elementwise division of two real quaternion arrays
v_qrdotmult elementwise multiplication of two real quaternion arrays
v_qrmult  multiply two real quaternion arrays
v_qrpermute permute the indices of a quaternion array
v_rectifyhomog Apply rectifying homographies to a set of cameras to make their optical axes parallel
v_rot--2-- converts between the following representations of rotations: rotation matrix (ro), Euler angles (eu), axis of rotation (ax), plane of rotation (pl), real quaternion vector (qr), real quaternion matrix (mr), complex quaternion vector (qc), complex quaternion matrix (mc).
v_rotqrmean Find the average of several rotation quaternions
v_rotqrvec Apply a quaternion rotation to an array of 3D vectors
v_skew3d Convert between vectors and skew symmetric matrices: 3x3 matrix <-> 3x1 vector and 4x4 Plucker matrix <-> 6x1 vector.
v_sphrharm forward and inverse spherical harmonic transform using uniform, Gaussian or arbitrary inclination (elevation) grids and a uniform azimuth grid.
v_upolyhedron Calculate the vertex coordinates and other characteristics of a uniform polyhedron

Printing and Display Functions

v_axisenlarge enlarge the axes of a figure slightly
v_bitsprec rounds values to a precision of n bits
v_cblabel add a label to the colourbar
v_figbolden makes the lines on a figure bold, enlarges font sizes and adjusts colours for printing clearly
v_fig2emf optionally makes the lines on a figure bold and then saves in windows metafile format
v_fig2pdf optionally makes the lines on a figure bold and then saves in ps, eps or pdf format
v_frac2bin converts numbers to fixed-point binary strings
v_lambda2rgb convert wavelength to an RGB or XYZ triplet
v_sprintcpx prints the real and imaginary parts of a complex number
v_sprintsi prints a value with the correct standard SI multiplier (e.g. 2100 prints as 2.1 k)
v_texthvc add text to plots with specified alignment and colour
v_tilefigs arrange all figures on the screen
v_colormap set and display colormap information including colormaps that print well in monochrome
v_xtickint Restrict x-axis tick marks to integer values
v_xticksi Label the x-axis tick marks using SI multipliers for large and small values. Particularly useful for logarithmic plots.
v_ytickint Restrict y-axis tick marks to integer values
v_yticksi Label the y-axis tick marks using SI multipliers for large and small values. Particularly useful for logarithmic plots.
   

Voicebox Parameters and System Interface

v_hostipinfo Gives information about computer name and internet connections
v_regexfiles Recursively find files that match a regular expression pattern
v_unixwhich searches the Windows system path for an executable (like the UNIX which command)
v_voicebox contains a number of installation-dependent global parameters and is likely to need editing for each particular setup.
v_voicebox_update update old code to new names by inserting a "v_" prefix where needed
v_winenvar Obtains Windows environment variables

Utility Functions

v_atan2sc arctangent function that returns the sin and cos of the angle
v_besselratio calculate the Bessel function ratio: besseli(v+1,x)./besseli(v,x)
v_besselratioi calculate the Inverse of v_besselratio (currently only for v=0)
v_bitsprec Rounds values to a precision of n bits
v_choosenk all possible ways of choosing k elements out of the numbers 1:n without duplications
v_choosrnk all possible ways of choosing k elements out of the numbers 1:n with duplications allowed
v_dlyapsq Solve the discrete lyapunov equation
v_dualdiag simultaneously diagonalises two matrices: this is useful in computing LDA or IMELDA transforms.
v_finishat Estimate the finishing time of a long loop
v_fopenmkd Equivalent to FOPEN() but creates any missing directories/folders
v_gammalns Calculates log(gamma(x)) for signed real-valued x
v_horizdiff Estimates the horizontal difference between two functions of x
v_hypergeom1f1 Confluent Hypergeometric Function (Kummer's M function)
v_logsum calculates log(sum(exp(x))) without overflow problems.
v_minspane Calculates the minimum spanning tree (a.k.a. shortest spanning tree) of a set of n-dimensional points
v_mintrace Find a row permutation to minimize the trace of a matrix
v_m2htmlpwd Create HTML documentation of matlab routines in the current directory
v_nearnonz Replace zero elements by the nearest non-zero elements
v_paramsetch Set a parameter structure and do validity checks
v_permutes all possible permutations of the numbers 1:n
v_quadpeak find a quadratically-interpolated peak in a N-dimensional array by fitting a quadratic function to the array values
v_rotation generates rotation matrices
v_skew3d Generate 3x3 skew symmetric matrices
v_sort Same as normal sort() but with an additional output that is an inverse index
v_zerotrim removes from a matrix any trailing rows and columns that are all zero.
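
The overflow problem that v_logsum avoids is easy to demonstrate. The sketch below uses the standard max-subtraction trick with built-in functions; it illustrates the technique and is not necessarily how the toolbox routine is written.

```matlab
x = [1000 1000.5 999];                 % exp(x) overflows to Inf in double precision

naive = log(sum(exp(x)));              % Inf
m     = max(x);
safe  = m + log(sum(exp(x - m)));      % finite: every exponent is <= 0

fprintf('naive = %g, safe = %.4f\n', naive, safe);
```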