V_DYPSA Derive glottal closure instances from speech [gci,goi] = (s,fs) Note: Needs to be combined with a voiced-voiceless detector to eliminate spurious closures in unvoiced and silent regions. Inputs: s is the speech signal fs is the sampling frequncy Outputs: gci is a vector of glottal closure sample numbers gco is a vector of glottal opening sample numbers derived from an assumed constant closed-phase fraction References: [1] P. A. Naylor, A. Kounoudes, J. Gudnason, and M. Brookes, "Estimation of Glottal Closure Instants in Voiced Speech using the DYPSA Algorithm," IEEE Trans on Speech and Audio Processing, vol. 15, pp. 34-43, Jan. 2007. [2] M. Brookes, P. A. Naylor, and J. Gudnason, "A Quantitative Assessment of Group Delay Methods for Identifying Glottal Closures in Voiced Speech," IEEE Trans on Speech & Audio Processing, vol. 14, no. 2, pp. 456-466, Mar. 2006. [3] A. Kounoudes, P. A. Naylor, and M. Brookes, "The DYPSA algorithm for estimation of glottal closure instants in voiced speech," in Proc ICASSP 2002, vol. 1, Orlando, 2002, pp. 349-352. [4] C. Ma, Y. Kamp, and L. F. Willems, "A Frobenius norm approach to glottal closure detection from the speech signal," IEEE Trans. Speech Audio Processing, vol. 2, pp. 258-265, Apr. 1994. [5] A. Kounoudes, "Epoch Estimation for Closed-Phase Analysis of Speech," PhD Thesis, Imperial College, 2001.
0001 function [gci,goi] = v_dypsa(s,fs) 0002 %V_DYPSA Derive glottal closure instances from speech [gci,goi] = (s,fs) 0003 % Note: Needs to be combined with a voiced-voiceless detector to eliminate 0004 % spurious closures in unvoiced and silent regions. 0005 % 0006 % Inputs: 0007 % s is the speech signal 0008 % fs is the sampling frequncy 0009 % 0010 % Outputs: 0011 % gci is a vector of glottal closure sample numbers 0012 % gco is a vector of glottal opening sample numbers derived from 0013 % an assumed constant closed-phase fraction 0014 % 0015 % References: 0016 % [1] P. A. Naylor, A. Kounoudes, J. Gudnason, and M. Brookes, "Estimation of Glottal Closure 0017 % Instants in Voiced Speech using the DYPSA Algorithm," IEEE Trans on Speech and Audio 0018 % Processing, vol. 15, pp. 34-43, Jan. 2007. 0019 % [2] M. Brookes, P. A. Naylor, and J. Gudnason, "A Quantitative Assessment of Group Delay Methods 0020 % for Identifying Glottal Closures in Voiced Speech," IEEE Trans on Speech & Audio Processing, 0021 % vol. 14, no. 2, pp. 456-466, Mar. 2006. 0022 % [3] A. Kounoudes, P. A. Naylor, and M. Brookes, "The DYPSA algorithm for estimation of glottal 0023 % closure instants in voiced speech," in Proc ICASSP 2002, vol. 1, Orlando, 2002, pp. 349-352. 0024 % [4] C. Ma, Y. Kamp, and L. F. Willems, "A Frobenius norm approach to glottal closure detection 0025 % from the speech signal," IEEE Trans. Speech Audio Processing, vol. 2, pp. 258-265, Apr. 1994. 0026 % [5] A. Kounoudes, "Epoch Estimation for Closed-Phase Analysis of Speech," PhD Thesis, 0027 % Imperial College, 2001. 0028 0029 % Algorithm Parameters 0030 % The following parameters are defined in v_voicebox() 0031 % 0032 % dy_cpfrac=0.3; % presumed closed phase fraction of larynx cycle 0033 % dy_cproj=0.2; % cost of projected candidate 0034 % dy_cspurt=-0.45; % cost of a talkspurt 0035 % dy_dopsp=1; % Use phase slope projection (1) or not (0)? 0036 % dy_ewdly=0.0008; % window delay for energy cost function term [~ energy peak delay from closure] (sec) 0037 % dy_ewlen=0.003; % window length for energy cost function term (sec) 0038 % dy_ewtaper=0.001; % taper length for energy cost function window (sec) 0039 % dy_fwlen=0.00045; % window length used to smooth group delay (sec) 0040 % dy_fxmax=500; % max larynx frequency (Hz) 0041 % dy_fxmin=50; % min larynx frequency (Hz) 0042 % dy_fxminf=60; % min larynx frequency (Hz) [used for Frobenius norm only] 0043 % dy_gwlen=0.0030; % group delay evaluation window length (sec) 0044 % dy_lpcdur=0.020; % lpc analysis frame length (sec) 0045 % dy_lpcn=2; % lpc additional poles 0046 % dy_lpcnf=0.001; % lpc poles per Hz (1/Hz) 0047 % dy_lpcstep=0.010; % lpc analysis step (sec) 0048 % dy_nbest=5; % Number of NBest paths to keep 0049 % dy_preemph=50; % pre-emphasis filter frequency (Hz) (to avoid preemphasis, make this very large) 0050 % dy_spitch=0.2; % scale factor for pitch deviation cost 0051 % dy_wener=0.3; % DP energy weighting 0052 % dy_wpitch=0.5; % DP pitch weighting 0053 % dy_wslope=0.1; % DP group delay slope weighting 0054 % dy_wxcorr=0.8; % DP cross correlation weighting 0055 % dy_xwlen=0.01; % cross-correlation length for waveform similarity (sec) 0056 0057 % Revision History: 0058 % 0059 % 3.0 - 29 Jun 2006 - Rewrote DP function to improve speed 0060 % 2.6 - 29 Jun 2006 - Tidied up algorithm parameters 0061 % 2.4 - 10 Jun 2006 - Made into a single file aand put into VOICEBOX 0062 % 2.3 - 18 Mar 2005 - Removed 4kHz filtering of phase-slope function 0063 % 2.2 - 05 Oct 2004 - dpgci uses the slopes returned from xewgrdel 0064 % - gdwav from speech with fs<9000 is not filtered 0065 % - Various outputs and inputs of functions have been 0066 % removed since now there is no plotting 0067 % 1.0 - 30 Jan 2001 - Initial version [5] 0068 0069 % Bugs: 0070 % 1. Allow the projections only to extend to the end of the larynx cycle 0071 % 2. Compensate for false pitch period cost at the start of a voicespurt 0072 % 3. Should include energy and phase-slope costs for the first closure of a voicespurt 0073 % 4. should delete candidates that are too close to the beginning or end of speech for the cost measures 0074 % currently this is 200 samples fixed in the main routine but it should adapt to window lengths of 0075 % cross-correlation, lpc and energy measures. 0076 % 5. Should have an integrated voiced/voiceless detector 0077 % 6. Allow v_dypsa to be called in chunks for a long speech file 0078 % 7. Do forward & backward passes to allow for gradient descent and/or discriminative training 0079 % 8. Glottal opening approximation does not really belong in this function 0080 % 9. The cross correlation window is asymmetric (and overcomplex) for no very good reason 0081 % 10. Incorporate -0.5 factor into dy_wxcorr and abolish erroneous (nx2-1)/(nx2-2) factor 0082 % 11. Add in talkspurt cost at the beginning rather than the end of a spurt (more efficient) 0083 % 12. Remove qmin>2 condition from voicespurt start detection (DYPSA 2 compatibility) in two places 0084 % 13. Include energy and phase-slope costs at the start of a voicespurt 0085 % 14. Single-closure voicespurt are only allowed if nbest=1 (should always be forbidden) 0086 % 15. Penultimate closure candidate is always acceptd 0087 % 16. Final element of gcic, Cfn and Ch is unused 0088 % 17. Needs to cope better with irregular voicing (e.g. creaky voice) 0089 % 18. Should give non-integer GCI positions for improved accuracy 0090 % 19. Remove constraint that first voicespurt cannot begin until qrmax after the first candidate 0091 0092 % Copyright (C) Tasos Kounoudes, Jon Gudnason, Patrick Naylor and Mike Brookes 2006 0093 % Version: $Id: v_dypsa.m 10865 2018-09-21 17:22:45Z dmb $ 0094 % 0095 % VOICEBOX is a MATLAB toolbox for speech processing. 0096 % Home page: http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html 0097 % 0098 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0099 % This program is free software; you can redistribute it and/or modify 0100 % it under the terms of the GNU General Public License as published by 0101 % the Free Software Foundation; either version 2 of the License, or 0102 % (at your option) any later version. 0103 % 0104 % This program is distributed in the hope that it will be useful, 0105 % but WITHOUT ANY WARRANTY; without even the implied warranty of 0106 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the 0107 % GNU General Public License for more details. 0108 % 0109 % You can obtain a copy of the GNU General Public License from 0110 % http://www.gnu.org/copyleft/gpl.html or by writing to 0111 % Free Software Foundation, Inc.,675 Mass Ave, Cambridge, MA 02139, USA. 0112 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0113 0114 % Extract algorithm constants from VOICEBOX 0115 0116 0117 dy_preemph=v_voicebox('dy_preemph'); 0118 dy_lpcstep=v_voicebox('dy_lpcstep'); 0119 dy_lpcdur=v_voicebox('dy_lpcdur'); 0120 dy_dopsp=v_voicebox('dy_dopsp'); % Use phase slope projection (1) or not (0)? 0121 dy_ewtaper=v_voicebox('dy_ewtaper'); % Prediction order of FrobNorm method in seconds 0122 dy_ewlen=v_voicebox('dy_ewlen'); % windowlength of FrobNorm method in seconds 0123 dy_ewdly=v_voicebox('dy_ewdly'); % shift for assymetric speech shape at start of voiced cycle 0124 dy_cpfrac=v_voicebox('dy_cpfrac'); % presumed ratio of larynx cycle that is closed 0125 dy_lpcnf=v_voicebox('dy_lpcnf'); % lpc poles per Hz (1/Hz) 0126 dy_lpcn=v_voicebox('dy_lpcn'); % lpc additional poles 0127 dy_xwlen=v_voicebox('dy_xwlen'); % cross-correlation length for waveform similarity (sec) 0128 dy_fxminf=v_voicebox('dy_fxminf'); % minimum pitch for Frobenius norm calculation 0129 0130 lpcord=ceil(fs*dy_lpcnf+dy_lpcn); % lpc poles 0131 0132 %PreEmphasise input speech 0133 s_used=filter([1 -exp(-2*pi*dy_preemph/fs)],1,s); 0134 0135 % perform LPC analysis, AC method with Hamming windowing 0136 [ar, e, k] = v_lpcauto(s_used,lpcord,floor([dy_lpcstep dy_lpcdur]*fs)); 0137 0138 if any(any(isinf(ar))) % if the data is bad and gives infinite prediction coefficients we return with a warning 0139 warning('No GCIs returned'); 0140 gci=[]; 0141 return; 0142 end; 0143 0144 % compute the prediction residual 0145 r = v_lpcifilt(s_used,ar,k); 0146 0147 % compute the group delay function: EW method from reference [2] above 0148 [zcr_cand,sew,gdwav,toff]=xewgrdel(r,fs); 0149 gdwav=-[zeros(toff,1); gdwav(1:end-toff)]; 0150 zcr_cand=[round(zcr_cand), ones(size(zcr_cand))]; %flag zero crossing candidates with ones 0151 0152 sew=0.5+sew'; %the phase slope cost of each candidate 0153 0154 pro_cand=[]; 0155 if dy_dopsp ~= 0 0156 pro_cand = psp(gdwav,fs); 0157 pro_cand = [pro_cand, zeros(length(pro_cand),1)]; %flag projected candidates with zeros 0158 sew = [sew zeros(1,size(pro_cand,1))]; %the phase slope cost of a projected candidate is zero 0159 end; 0160 0161 %Sort the zero crossing and projected candidates together and remove any candidates that 0162 %are to close to the start or end of the speech signal because the cost functions 0163 %need room either side. 0164 0165 [gcic,sin] = sortrows([zcr_cand; pro_cand],1); 0166 sew=sew(sin); 0167 saf=max([200,dy_xwlen*fs/2+1,fs/dy_fxminf]); 0168 sin=find(and(saf<gcic,gcic<length(gdwav)-saf)); 0169 gcic=gcic(sin,:); 0170 sew=sew(sin); 0171 0172 % compute the frobenious norm function used for a cost in the DP 0173 fnwav=frobfun(s_used,dy_ewtaper*fs,dy_ewlen*fs,dy_ewdly*fs); 0174 0175 %Dynamic programming, picks the most likely candidates based on the pitch consistency, energy etc. 0176 [gci] = dpgci(gcic, s_used(:), sew, fnwav, fs); 0177 0178 %Evaluate goi ... determined to be dy_cpfrac percentage of the larynx cylce away from the last gci 0179 goi=simplegci2goi(gci,dy_cpfrac); 0180 0181 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0182 0183 function z = psp(g,fs) 0184 %PSP Calculates the phase slope projections of the group delay function 0185 % Z = PSP(G) computes the 0186 0187 % Revision History: 0188 % V1.0 July 12th 2002: 0189 % Nov. 6th 2002 : if statement added to remove "last" midpoint 0190 0191 g = g(:); 0192 0193 gdot = [diff(g);0]; 0194 gdotdot = [diff(gdot);0]; 0195 0196 % find the turning points as follows: [tp_number, index_of_tp, min(1) or max(-1), g(index_of_tp)] 0197 turningPoints = zcr(gdot); 0198 turningPoints = [[1:length(turningPoints)]', turningPoints, sign(gdotdot(turningPoints)), g(turningPoints)]; 0199 0200 % useful for debug/plotting 0201 %tplot = zeros(length(g),1); 0202 %tplot(turningPoints(:,1)) = turningPoints(:,2); 0203 0204 % find any maxima which are < 0 0205 negmaxima = turningPoints(find(turningPoints(:,3) == -1 & turningPoints(:,4) < 0 & turningPoints(:,1)~=1),:); %Change 01.05.2003 JG: The first row can't be included 0206 0207 % find the midpoint between the preceding min and the negative max 0208 nmi = negmaxima(:,1); 0209 midPointIndex = turningPoints(nmi-1,2) + round(0.5*(turningPoints(nmi,2) - turningPoints(nmi-1,2))); 0210 midPointValue = g(midPointIndex); 0211 0212 % project a zero crossing with unit slope 0213 nz = midPointIndex - round(midPointValue); 0214 0215 % find any minima which are > 0 0216 posminima = turningPoints(find(turningPoints(:,3) == 1 & turningPoints(:,4) > 0),:); 0217 0218 % find the midpoint between the positive min and the following max 0219 pmi = posminima(:,1); 0220 0221 %Remove last midpoint if it is the last sample 0222 if ~isempty(pmi), if pmi(end)==size(turningPoints,1), pmi=pmi(1:end-1); end; end; 0223 0224 midPointIndex = turningPoints(pmi,2) + round(0.5*(turningPoints(pmi+1,2) - turningPoints(pmi,2))); 0225 midPointValue = g(midPointIndex); 0226 0227 % project a zero crossing with unit slope 0228 pz = midPointIndex - round(midPointValue); 0229 0230 z = sort([nz;pz]); 0231 0232 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0233 0234 function i = zcr(x, p) 0235 %ZCR Finds the indices in a vector to zero crossings 0236 % I = ZCR(X) finds the indices of vector X which are closest to zero-crossings. 0237 % I = ZCR(X, P) finds indices for positive-going zeros-crossings for P=1 and 0238 % negative-going zero-crossings for P=0. 0239 0240 x = x(:); 0241 0242 if (nargin==2) 0243 if (p==0) 0244 z1 = zcrp(x); % find positive going zero-crossings 0245 elseif (p==1) 0246 z1 = zcrp(-x); % find negative going zero-crossings 0247 else 0248 error('ZCR: invalid input parameter 2: must be 0 or 1'); 0249 end 0250 else 0251 z1 = [zcrp(x); zcrp(-x)]; 0252 end 0253 0254 % find crossings when x==0 exactly 0255 z0 = find( (x(1:length(x))==0) & ([x(2:length(x));0] ~= 0)); 0256 0257 % concatenate and sort the two types of zero-crossings 0258 i = sort([z0; z1]); 0259 0260 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0261 0262 function zz = zcrp(xx) %only used in zcr 0263 % find positive-going zero-crossing 0264 z1 = find(diff(sign(xx)) == -2); 0265 % find which out of current sample or next sample is closer to zero 0266 [m, z2] = min([abs(xx(z1)), abs(xx(z1+1))], [], 2); 0267 zz = z1 -1 + z2; 0268 0269 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0270 0271 function [frob]=frobfun(sp,p,m,offset) 0272 0273 % [frob]=frobfun(sp,p,m) 0274 % 0275 % sp is the speech signal assumed to be preemphasised 0276 % p is the prediction order : recomended to be 1 ms in above paper 0277 % m is the window length : recomended to be 1 ms in above paper 0278 % offset is shift for assymetric speech shape at start of voiced cycle - 0279 % default 1.5ms. 0280 % 0281 % This function implements the frobenius norm based measure C defined in [4] below. 0282 % It equals the square of the Frobenius norm of the m by p+1 data matrix divided by p+1 0283 % 0284 % Reference: 0285 % [4] C. Ma, Y. Kamp, and L. F. Willems, "A Frobenius norm approach to glottal closure detection 0286 % from the speech signal," IEEE Trans. Speech Audio Processing, vol. 2, pp. 258"265, Apr. 1994. 0287 0288 0289 % Revision History: 0290 % V1.0 July 12th 2002: 0291 % Nov. 6th 2002 : if statement added to remove "last" midpoint 0292 0293 %force p m and offset to be integers 0294 p=round(p); 0295 m=round(m); 0296 offset=round(offset); 0297 0298 w=(p+1)*ones(1,m+p); 0299 w(1:p)=1:p; 0300 w(m+1:p+m)=p:-1:1; 0301 0302 w=w./(p+1); 0303 frob=filter(w,1,sp.^2); 0304 frob(1:(round((p+m-1)/2) + offset))=[]; 0305 0306 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0307 0308 function goi=simplegci2goi(gci,pr) 0309 0310 % Estimate glottal opening instants by assuming a fixed closed-phase fraction 0311 0312 gci=round(gci); 0313 maxpitch=max(medfilt1(diff(gci),7)); 0314 0315 % calculate opening instants 0316 for kg=1:length(gci)-1 0317 goi(kg)=gci(kg)+min(pr*(gci(kg+1)-gci(kg)),pr*maxpitch); 0318 end; 0319 kg=kg+1; 0320 goi(kg)=round(gci(kg)+pr*(gci(kg)-gci(kg-1))); %use the previous pitch period instead 0321 goi=round(goi); 0322 0323 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0324 0325 function [tew,sew,y,toff]=xewgrdel(u,fs) 0326 0327 % implement EW group delay epoch extraction 0328 0329 dy_gwlen=v_voicebox('dy_gwlen'); % group delay evaluation window length 0330 dy_fwlen=v_voicebox('dy_fwlen'); % window length used to smooth group delay 0331 0332 % perform group delay calculation 0333 0334 gw=2*floor(dy_gwlen*fs/2)+1; % force window length to be odd 0335 ghw=window('hamming',gw,'s'); 0336 ghw = ghw(:); % force to be a column (dmb thinks window gives a row - and he should know as he wrote it!) 0337 ghwn=ghw'.*(gw-1:-2:1-gw)/2; % weighted window: zero in middle 0338 0339 u2=u.^2; 0340 yn=filter(ghwn,1,u2); 0341 yd=filter(ghw,1,u2); 0342 yd(abs(yd)<eps)=10*eps; % prevent infinities 0343 y=yn(gw:end)./yd(gw:end); % delete filter startup transient 0344 toff=(gw-1)/2; 0345 fw=2*floor(dy_fwlen*fs/2)+1; % force window length to be odd 0346 if fw>1 0347 daw=window('hamming',fw,'s'); 0348 y=filter(daw,1,y)/sum(daw); % low pass filter 0349 toff=toff-(fw-1)/2; 0350 end 0351 [tew,sew]=v_zerocros(y,'n'); % find zero crossings 0352 0353 tew=tew+toff; % compensate for filter delay and frame advance 0354 0355 0356 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0357 0358 function Cfn=fnrg(gcic,frob,fs) 0359 0360 %Frobenious Energy Cost 0361 0362 dy_fxminf=v_voicebox('dy_fxminf'); 0363 frob=frob(:)'; 0364 mm=round(fs/dy_fxminf); 0365 mfrob=v_maxfilt(frob,1,mm); 0366 mfrob=[mfrob(floor(mm/2)+1:end) max(frob(end-ceil(mm/2):end))*ones(1,floor(mm/2))]; 0367 rfr=frob./mfrob; 0368 Cfn=0.5-rfr(round(gcic)); 0369 0370 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0371 0372 function gci=dpgci(gcic, s, Ch, fnwav, fs) 0373 0374 %DPGCI Choose the best Glottal Closure Instances with Dynamic Programming 0375 % gci=dpgci(gcic, s(:), Ch, fnwav, fs) returns vectors of sample indices corresponding 0376 % to the instants of glottal closure in the speech signal s at sampling frequency fs Hz. 0377 % 0378 % Inputs: 0379 % gcic is a matrix whos first column are the glottal closure instance candidates and 0380 % the second column is 1 if the corresponding gci is derived from a zero crossing 0381 % but zero if the gci is from a a projected zero crossing 0382 % s is the speech signal - MUST be a column vector 0383 % Ch the phase slope cost of every candidate 0384 % fnwav is the frobenious norm function of s 0385 % fs is the sampling frequncy 0386 % 0387 % Outputs: 0388 % gci is a vector of glottal closure instances chosen by the DP 0389 0390 0391 0392 % Revision History: 0393 % Bugs: Constants are hardwired but defined in a structure like pv (defined in grpdelpv) 0394 % 0395 0396 % get algorithm parameters from v_voicebox() 0397 0398 dy_fxmin=v_voicebox('dy_fxmin'); % min larynx frequency (Hz) 0399 dy_fxmax=v_voicebox('dy_fxmax'); % min larynx frequency (Hz) 0400 dy_xwlen=v_voicebox('dy_xwlen'); % cross-correlation length for waveform similarity (sec) 0401 dy_nbest=v_voicebox('dy_nbest'); % Number of NBest paths to keep 0402 dy_spitch=v_voicebox('dy_spitch'); % scale factor for pitch deviation cost 0403 wproj=v_voicebox('dy_cproj'); % cost of projected candidate 0404 dy_cspurt=v_voicebox('dy_cspurt'); % cost of a talkspurt 0405 dy_wpitch=v_voicebox('dy_wpitch'); % DP pitch weighting 0406 dy_wener=v_voicebox('dy_wener'); % DP energy weighting 0407 dy_wslope=v_voicebox('dy_wslope'); % DP group delay slope weighting 0408 dy_wxcorr=v_voicebox('dy_wxcorr'); % DP cross correlation weighting 0409 0410 0411 0412 %Constants 0413 Ncand=length(gcic); 0414 sv2i=-(2*dy_spitch^2)^(-1); % scale factor for pitch deviation cost 0415 nxc=ceil(dy_xwlen*fs); % cross correlation window length in samples 0416 % === should delete any gci's that are within this of the end. 0417 % === and for energy window 0418 0419 %Limit the search: 0420 qrmin=ceil(fs/dy_fxmax); 0421 qrmax=floor(fs/dy_fxmin); 0422 0423 %Cost and tracking r = current, q = previous, p = preprevious 0424 cost=zeros(Ncand, dy_nbest); cost(:,:)=inf; %Cost matrix, one row for each candidate 0425 maxcost=zeros(Ncand,1); maxcost(:,:)=inf; %Maximum cost in each row 0426 imaxcost=ones(Ncand,1); %Index of maximum cost 0427 0428 prev = ones(Ncand, dy_nbest); %index of previous, q candidates 0429 ind = ones(Ncand, dy_nbest); %index of p in row q (from prev) 0430 qbest = [zeros(Ncand,1), ones(Ncand,2)]; % the minimum cost in any previous q [cost,q,i] 0431 0432 Cfn=fnrg(gcic(:,1),fnwav,fs); %Frob.Energy Cost 0433 0434 %Add start and end state 0435 % === should probably delete candidates that are too close to either end of the input 0436 % === why do we ever need the additional one at the tail end ? 0437 gcic=[[gcic(1,1)-qrmax-2 0];gcic;[gcic(end,1)+qrmax+2 0]]; 0438 Cfn=[0 Cfn 0]; 0439 Ch = [0 Ch 0]; 0440 0441 % first do parallelized version 0442 0443 0444 % rather complicated window specification is for compatibility with DYPSA 2 0445 % === +1 below is for compatibility - probably a bug 0446 wavix=(-floor(nxc/2):floor(nxc/2)+1)'; % indexes for segments [nx2,1] 0447 nx2=length(wavix); 0448 sqnx2=sqrt(nx2); 0449 g_cr=dy_wener*Cfn+dy_wslope*Ch+wproj*(1-gcic(:,2))'; % fixed costs 0450 0451 g_n=gcic(:,1)'; % gci sample number [1,Ncand+2] 0452 g_pr=gcic(:,2)'; % unprojected flag [1,Ncand+2] 0453 g_sqm=zeros(1,Ncand+1); % stores: sqrt(nx2) * mean for speech similarity waveform 0454 g_sd=zeros(1,Ncand+1); % stores: 1/(Std deviation * sqrt(nx2)) for speech similarity waveform 0455 f_pq=zeros((Ncand+1)*dy_nbest,1); % (q-p) period for each node 0456 f_c=repmat(Inf,(Ncand+1)*dy_nbest,1); % cumulative cost for each node - initialise to inf 0457 f_c(1)=0; % initial cost of zero for starting node 0458 % f_costs=zeros(Ncand*dy_nbest,6); % === debugging only remember costs of candidate 0459 f_f=ones((Ncand+1)*dy_nbest,1); % previous node in path 0460 f_fb=ones((Ncand+1),1); % points back to best end-of-spurt node 0461 fbestc=0; % cost of best end-of-spurt node 0462 0463 qmin=2; 0464 for r=2:Ncand+1 0465 % if r==86 0466 % r; 0467 % end 0468 r_n=g_n(r); % sample number of r = current candidate 0469 rix=dy_nbest*(r-1)+(1:dy_nbest); % index range within node variables 0470 0471 % determine the range of feasible q candidates 0472 qmin0=qmin; 0473 qmin=find(g_n(qmin0-1:r-1)<r_n-qrmax); % qmin is the nearest candidate that is >qrmax away 0474 qmin=qmin(end)+qmin0-1; % convert to absolute index of first viable candidate 0475 qmax=find(g_n(qmin-1:r-1)<=r_n-qrmin); % qmax is the nearest candidate that is >=qrmin away 0476 qmax=qmax(end)+qmin-2; 0477 0478 0479 % calculate waveform similarity cost measure statistics 0480 0481 sr=s(r_n+wavix); % note s MUST be a column vector so sr is also 0482 wsum=sum(sr); 0483 g_sqm(r)=wsum/sqnx2; % mean * sqrt(nx2) 0484 g_sd(r)=1/sqrt(sr.'*sr-wsum^2/nx2); % 1/(Std deviation * sqrt(nx2)) 0485 0486 % now process the candidates 0487 0488 if qmin<=qmax 0489 qix=qmin:qmax; % q index 0490 nq=length(qix); 0491 % === should integrate the -0.5 into dy_wxcorr 0492 % === the factor (nx2-1)/(nx2-2) is to compensate for a bug in swsc() 0493 q_cas=-0.5*(nx2-1)/(nx2-2)*dy_wxcorr*(sum(s(repmat(g_n(qix),nx2,1)+repmat(wavix,1,nq)).*repmat(sr,1,nq),1)-g_sqm(qix)*g_sqm(r)).*g_sd(qix)*g_sd(r); 0494 % compare: i=35; Ca=swsc(g_n(qix(i)),g_n(r),s,fs); [i qix(i) r g_n(qix(i)) g_n(r) dy_wxcorr*Ca q_cas(i)] 0495 0496 % now calculate pitch deviation cost 0497 0498 fix = 1+(qmin-1)*dy_nbest:qmax*dy_nbest; % node index range 0499 f_qr=repmat(r_n-g_n(qix),dy_nbest,1); % (r-p) period for each node 0500 f_pr=f_qr(:)+f_pq(fix); 0501 % === could absorb the 2 into sv2i 0502 f_nx=2-2*f_pr./(f_pr+abs(f_qr(:)-f_pq(fix))); 0503 f_cp=dy_wpitch*(0.5-exp(sv2i*f_nx.^2)); 0504 % === fudge to match dypsa2.4 - could more efficiently be added 0505 % === onto the cost of a talkspurt end 0506 % === should be a v_voicebox parameter anyway 0507 f_cp(f_pq(fix)==0)=dy_cspurt*dy_wpitch; 0508 0509 % now find the N-best paths 0510 0511 [r_cnb,nbix]=sort(f_c(fix)+f_cp+reshape(repmat(q_cas,dy_nbest,1),nq*dy_nbest,1)); 0512 f_c(rix)=r_cnb(1:dy_nbest)+g_cr(r); % costs 0513 f_f(rix)=nbix(1:dy_nbest)+(qmin-1)*dy_nbest; % traceback nodes 0514 f_pq(rix)=f_qr(nbix(1:dy_nbest)); % previous period 0515 % === f_costs is only for debugging 0516 % r; 0517 % f_costs(rix,1)=f_c(fix(nbix(1:dy_nbest))); 0518 % f_costs(rix,2)=wproj*(1-gcic(r,2)); 0519 % f_costs(rix,3)=f_cp(nbix(1:dy_nbest)); 0520 % f_costs(rix,4)=dy_wener*Cfn(r); 0521 % f_costs(rix,5)=dy_wslope*Ch(r); 0522 % f_costs(rix,6)=reshape(q_cas(1+floor((nbix(1:dy_nbest)-1)/dy_nbest)),dy_nbest,1); 0523 0524 % check cost of using this candidate as the start of a new spurt 0525 % ==== the qmin>2 condition is for compatibility with v_dypsa 2 and 0526 % prevents any spurts starting until at least qrmax past the first 0527 % gci. This is probably a bug (see again below) 0528 iNb=rix(end); 0529 if (qmin>2) && (f_c(f_fb(qmin-1))+wproj*(1-gcic(r,2))<f_c(iNb)) % compare with worst of Nbest paths 0530 f_f(iNb)=f_fb(qmin-1); 0531 % === for now we exclude the energy and phase-slope costs for compatibility with dypsa2 0532 % === this is probably a bug 0533 f_c(iNb)=f_c(f_fb(qmin-1))+wproj*(1-gcic(r,2)); % replace worst of the costs with start voicespurt cost 0534 f_pq(iNb)=0; % false pq period 0535 end 0536 if f_c(rix(1))<fbestc 0537 f_fb(r)=rix(1); % points to the node with lowest end-of-spurt cost 0538 % === should compensate for the pitch period cost incurred at the start of the next spurt 0539 % === note that a node can never be a one-node voicespurt on its own unless dy_nbest=1 0540 % since the start voices[purt option replaced the worst Nbest cost. This is probably good but 0541 % is a bit inconsistent. 0542 fbestc=f_c(rix(1)); 0543 else 0544 f_fb(r)=f_fb(r-1); 0545 end 0546 else % no viable candidates - must be the start of a voicespurt if anything 0547 % === for now we exclude the energy and phase-slope costs for compatibility with dypsa2 0548 % === this is probably a bug 0549 % ==== the qmin>2 condition is for compatibility with v_dypsa 2 and 0550 % prevents any spurts starting until at least qrmax past the first 0551 % gci. This is probably a bug (see again above) 0552 if (qmin>2) 0553 f_c(rix(1))=f_c(f_fb(qmin-1))+wproj*(1-gcic(r,2)); % cost of new voicespurt 0554 f_f(rix)=f_fb(qmin-1); % traceback to previous talkspurt end 0555 f_pq(rix)=0; % previous period 0556 end 0557 f_fb(r)=f_fb(r-1); % cannot be the end of a voicespurt 0558 end 0559 end 0560 0561 % now do the traceback 0562 0563 gci = zeros(1,Ncand+1); 0564 0565 % === for compatibility with dypsa2, we force the penultimate candidate to be accepted 0566 % === should be: i=f_fb(Ncand+1) but instead we pick the best of the penultimate candidate 0567 i=rix(1)-dy_nbest; 0568 if f_c(i-dy_nbest+1)<f_c(i) % check if start of a talkspurt 0569 i=i-dy_nbest+1; 0570 end 0571 k=1; 0572 while i>1 0573 j=1+floor((i-1)/dy_nbest); % convert node number to candidate number 0574 gci(k)=g_n(j); 0575 i=f_f(i); 0576 k=k+1; 0577 end 0578 gci=gci(k-1:-1:1); % put into ascending order 0579 0580 0581 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 0582 0583