Transducer Mismatch Compensation for Robust Speaker Verification

Man-wai Mak (Jan. 2003)

Current efforts to address the problems of transducer mismatches have focused on two main areas: feature transformation and model transformation.

Feature Transformation:
Feature-based approaches attempt to modify the distorted features so that the resulting features fit the clean speech models better. One of the earliest techniques is cepstral mean subtraction (CMS) [2], which approximates a linear channel by the long-term average of the distorted cepstral vectors. Subsequent efforts include pole-filtered cepstral mean subtraction [22] and signal bias removal [24]. These approaches, however, do not consider the effect of background noise. A more general approach, which handles both channel distortion and background noise, is codeword-dependent cepstral normalization (CDCN) [1]. In CDCN, additive noise and convolutive distortion are modeled as codeword-dependent cepstral biases. CDCN, however, works well only when the background noise level is low.
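
The idea behind CMS can be illustrated with a minimal sketch. It assumes a fixed linear channel that adds the same bias to every cepstral frame; the toy numbers and the function name are illustrative only and are not taken from [2].

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient long-term mean from every cepstral frame."""
    num_frames = len(frames)
    dim = len(frames[0])
    mean = [sum(frame[i] for frame in frames) / num_frames for i in range(dim)]
    return [[frame[i] - mean[i] for i in range(dim)] for frame in frames]


# Toy 2-dimensional "cepstra" (made-up numbers, not real speech).
clean = [[1.0, -0.5], [0.2, 0.3], [-0.6, 0.8], [0.4, -0.2]]

# Assumption: a fixed linear channel adds the same cepstral bias to every frame.
channel_bias = [0.7, -1.2]
distorted = [[c + b for c, b in zip(frame, channel_bias)] for frame in clean]

clean_cms = cepstral_mean_subtraction(clean)
distorted_cms = cepstral_mean_subtraction(distorted)

# After CMS the clean and distorted features should coincide (up to rounding).
max_diff = max(
    abs(a - b)
    for fa, fb in zip(clean_cms, distorted_cms)
    for a, b in zip(fa, fb)
)
print(f"largest difference after CMS: {max_diff:.2e}")
```

Because the subtraction removes any constant offset, it also removes the long-term mean of the speech itself, and it does nothing about additive background noise, which is the limitation noted above.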

When stereo corpora are available, channel distortion can be estimated directly by comparing the clean features against their corresponding distorted features. For example, in SNR-dependent cepstral normalization (SDCN) [1], cepstral biases for different signal-to-noise ratios are estimated in a maximum likelihood framework. In probabilistic optimum filtering [23], the transformation is a set of multi-dimensional least-squares filters whose outputs are probabilistically combined. These methods, however, rely on the availability of stereo corpora, which can be difficult to obtain. The requirement for stereo corpora can be avoided by making use of the information embedded in the clean speech models. For example, in stochastic matching [27], the cepstral biases and affine transformation matrices are determined by maximizing the likelihood of observing the distorted features given the clean models. The linear transformation in [27] can be replaced by a neural network to compensate for non-linear distortion [29]. However, closed-form solutions can then no longer be obtained, and the generalized EM algorithm is required.
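
As a rough illustration of estimating a bias from the clean model alone, the sketch below runs EM for a single scalar bias b in y = x + b, where x is assumed to follow a one-dimensional Gaussian mixture. This is a simplified bias-only case written for illustration; the function names, the toy data, and the model parameters are assumptions and do not reproduce the full formulation in [27].

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def estimate_bias(observations, weights, means, variances, num_iters=20):
    """EM estimate of a scalar bias b in y = x + b, with x drawn from a 1-D GMM."""
    bias = 0.0
    for _ in range(num_iters):
        numerator = 0.0
        denominator = 0.0
        for y in observations:
            # E-step: posterior of each clean-model component given y - bias.
            likes = [w * gaussian_pdf(y - bias, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(likes)
            for like, m, v in zip(likes, means, variances):
                gamma = like / total
                # M-step accumulators: variance-weighted residuals.
                numerator += gamma * (y - m) / v
                denominator += gamma / v
        bias = numerator / denominator
    return bias


# Hypothetical clean model: two equally weighted unit-variance Gaussians.
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [1.0, 1.0]

# Toy "clean" values shifted by a bias that the estimator does not see.
true_bias = 0.5
distorted = [x + true_bias for x in [-2.5, -2.0, -1.5, 1.5, 2.0, 2.5]]

estimated_bias = estimate_bias(distorted, weights, means, variances)
print(f"estimated bias: {estimated_bias:.3f}")
```

No clean recording of the same utterance is needed here: the clean model supplies the reference, which is the point of avoiding stereo corpora.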

Although the above methods have been successful in reducing channel mismatches, all of them except [29] assume that the channel effect can be approximated by a linear filter. Most telephone handsets, in fact, exhibit energy-dependent frequency responses [25], for which a linear filter may be a poor approximation. Therefore, a more complex representation of handset characteristics is required.

Model Transformation:
Model-based approaches attempt to modify the clean speech models such that the density functions of the resulting models fit the distorted data better. Influential approaches include stochastic matching [27] and stochastic additive transforms [26], where the models’ means and variances are adjusted by stochastic biases; maximum likelihood linear regression (MLLR) [10], where the mean vectors of clean speech models are linearly transformed; and the constrained reestimation of Gaussian mixtures [3], where both mean vectors and covariance matrices are transformed. Recently, MLLR has been extended to maximum-likelihood linear transformation [5], in which the transformation matrices for the variances can be different from those for the mean vectors. Meanwhile, the constrained transformation in [3] has been extended to piecewise-linear stochastic transformation [4], where a collection of linear transformations is shared by all the Gaussians in each mixture. The random bias in [27] has also been replaced by a neural network to compensate for non-linear distortion [29]. All these extensions show improvement in recognition accuracy.
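
To make the MLLR-style mean transformation concrete, the sketch below applies one shared affine transform to several Gaussian mean vectors. It shows only how a given transform is applied; estimating the transform by maximizing the likelihood of the adaptation data, as in [10], is omitted. The matrix W, the mean vectors, and the function name are hypothetical values chosen for illustration.

```python
def mllr_transform_mean(W, mean):
    """Return W @ [1, mean], i.e. A @ mean + b with W laid out as [b | A]."""
    extended = [1.0] + list(mean)
    return [sum(w * x for w, x in zip(row, extended)) for row in W]


# Hypothetical 2-D transform: b = (0.5, -1.0), A = [[1.1, 0.0], [0.2, 0.9]].
W = [
    [0.5, 1.1, 0.0],
    [-1.0, 0.2, 0.9],
]

# Hypothetical mean vectors of three Gaussians in the clean model.
clean_means = [[1.0, 2.0], [-1.0, 0.0], [0.0, 3.0]]

# Every Gaussian shares the same transform, so few parameters are adapted.
adapted_means = [mllr_transform_mean(W, m) for m in clean_means]
for before, after in zip(clean_means, adapted_means):
    print(before, "->", [round(v, 3) for v in after])
```

Sharing one transform across many Gaussians is what allows adaptation from little data, at the cost of a coarse description of the distortion.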

Because the above methods adjust the model parameters "indirectly" via a small number of transformations, they may not be able to capture the fine structure of the distortion. This limitation can be overcome by Bayesian techniques [6,9], in which model parameters are adjusted "directly", but the Bayesian approach requires a large amount of adaptation data to be effective. As direct and indirect adaptation each have their own strengths and weaknesses, a natural extension is to combine them so that the two approaches complement each other [21,28].
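
The data requirement of direct adaptation can be seen in a minimal sketch of a MAP-style update for a single Gaussian mean. It uses a commonly used form in which the prior mean and the posterior-weighted data are interpolated; the weight `tau`, the function name, and the toy frames are assumptions made for illustration and are not taken from [6,9].

```python
def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """Interpolate between a prior mean and posterior-weighted adaptation data.

    tau is a hypothetical prior weight: larger values trust the prior more.
    """
    occupancy = sum(posteriors)
    dim = len(prior_mean)
    weighted_sum = [
        sum(g * f[i] for g, f in zip(posteriors, frames)) for i in range(dim)
    ]
    return [
        (tau * prior_mean[i] + weighted_sum[i]) / (tau + occupancy)
        for i in range(dim)
    ]


prior = [0.0]  # mean of the clean model
for num_frames in (0, 2, 200):
    # Toy adaptation data: every frame equals 1.0 and is fully assigned
    # to this Gaussian (posterior 1.0).
    frames = [[1.0]] * num_frames
    posteriors = [1.0] * num_frames
    adapted = map_adapt_mean(prior, frames, posteriors)
    print(num_frames, "frames ->", round(adapted[0], 3))
```

With no frames the mean stays at the prior, with two frames it barely moves, and only with many frames does it approach the data mean. A Gaussian that receives no adaptation data is not updated at all, which is why each directly adapted parameter needs its own share of data.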
 

Previous Work of M.W. Mak
From 1990 to 1993, M.W. Mak investigated the applications of neural networks to speaker recognition [18,19]. These investigations form the framework of his more recent work, in which probabilistic neural classifiers [17,32], threshold determination [36], scoring normalization [35], channel compensation methods [11,15,31], and recurrent learning algorithms [7,8,16,20] were proposed. This work has demonstrated that (1) the structure of speaker models can be optimized automatically [13], (2) elliptical basis function networks outperform radial basis function networks in speaker verification, (3) embedding anti-speakers in the speaker models greatly improves the reliability of the a priori decision thresholds, (4) combining the world model and the cohort model in a two-stage scoring normalization approach improves speaker verification performance, and (5) the channel effects can be greatly reduced by looking at the clean speech cepstrum. These previous efforts are now the foundation of this proposal.
 

References

  1. Acero, A. (1992). Acoustical and Environmental Robustness in Automatic Speech Recognition, Kluwer Academic Pub., Dordrecht.
  2. Atal, B.S. (1974). "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification," J. Acoust. Soc. Am. 55 (6), 1304-1312.
  3. Digalakis, V., Rtischev, D. and Neumeyer, L. (1995). "Speaker adaptation using constrained reestimation of Gaussian mixtures," IEEE Trans. on Speech and Audio Processing, 3, 357-366.
  4. Diakoloukas, V.D. and Digalakis, V. (1999). "Maximum-likelihood stochastic-transformation adaptation of hidden Markov models," IEEE Trans. on Speech and Audio Processing, 7(2), 177-187.
  5. Gales, M.J.F. (1998). "Maximum-likelihood linear transformation for HMM-based speech recognition," Computer Speech and Language, 12, 75-98.
  6. Huo, Q., Chan, C., and Lee, C.H. (1997). "On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate," IEEE Trans. on Speech and Audio Processing, 5 (2), 161-172.
  7. Ku, K.W., Mak, M.W. and Siu, W.C. (1999), "Adding learning to cellular genetic algorithms for training recurrent neural networks," IEEE Trans. on Neural Networks, Vol. 10, No. 2, pp. 239-252.
  8. Ku, K.W., Mak, M.W. and Siu, W.C. (2000), "A study of the Lamarckian evolution of recurrent neural networks," IEEE Trans. on Evolutionary Computation, Vol. 4, No. 1, pp. 31-42.
  9. Lee, C.H., Lin, C.H., and Juang, B.H. (1991). "A study on speaker adaptation of the parameters of continuous density hidden Markov models," IEEE Trans. on Acoustics, Speech and Signal Processing, 39 (4), 806-814.
  10. Leggetter, C.J. and Woodland, P.C. (1995). "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, 9 (2), 171-185.
  11. Li, X., Mak, M.W., and Kung, S.Y. (2001). "Robust speaker verification over the telephone by feature recuperation," International Symposium on Intelligent Multimedia, Video and Speech Processing, 2001, pp. 433-436.
  12. Li, X., Mak, M.W., and Kung, S.Y. (2001). "An EBF-based non-linear feature mapper for robust speaker verification," submitted to IEEE Trans. on Neural Networks.
  13. Li, X., Mak, M.W. and Li, C.K. (1999), "Determining the optimal number of clusters by an extended RPCL algorithm," Int. J. of Advanced Computational Intelligence, Vol. 3, No. 6, pp. 467-473.
  14. Lin, S.H., Kung, S.Y. and Lin, L.J. (1997). "Face recognition/detection by probabilistic decision-based neural network," IEEE Trans. on Neural Networks, 8 (1), pp. 114-132.
  15. Lo, T.F., Mak, M.W. and Yiu, K.K. (1999), "A new cepstrum-based channel compensation method for speaker verification," Eurospeech’99.
  16. Mak, M.W. (1995), "A learning algorithm for recurrent radial basis function networks," Neural Processing Letters, Vol. 2, No. 1, pp. 27-31.
  17. Mak, M.W. and Kung, S.Y. (2000). "Estimation of elliptical basis function parameters by the EM algorithms with application to speaker verification," IEEE Trans. on Neural Networks, Vol. 11, No. 4, pp. 961-969.
  18. Mak, M.W. et al. (1993a). "Comparing multi-layer perceptrons and radial basis function networks in speaker identifications," J. of Micro. Applications, 16, 147-159.
  19. Mak, M.W. et al. (1994), "Speaker Identification using Multi Layer Perceptrons and Radial Basis Functions Networks," Neurocomputing, 6 (1), 99-118.
  20. Mak, M.W., Ku, K.W. and Lu, Y.L. (1999). "On the improvement of the real time recurrent learning algorithm," Neurocomputing, Vol. 24, pp. 13-36.
  21. Mokbel, C. (2001). "Online adaptation of HMMs to real-life conditions: A unified framework," IEEE Trans. on Speech and Audio Processing, 9 (4), 342-357.
  22. Naik, D. (1995). "Pole-filtered cepstral mean subtraction," Proc. ICASSP’95, 1, 157-160.
  23. Neumeyer, L. and Weintraub, M. (1994). "Probabilistic optimum filtering for robust speech recognition," ICASSP’94, pp. 417-420.
  24. Rahim, M. G. and Juang, B. H. (1996). "Signal bias removal by maximum likelihood estimation for robust telephone speech recognition," IEEE Trans. on Speech and Audio Processing, 4 (1), 19-30.
  25. Reynolds, D.A. et al. (1995). "The effects of telephone transmission degradations on speaker recognition performance," ICASSP’95.
  26. Rose, R. C., Hofstetter, E. M. and Reynolds, D. A. (1994). "Integrated models of signal and background with application to speaker identification in noise," IEEE Trans. on Speech and Audio Processing, 2 (2), 245-257.
  27. Sankar, A. and Lee, C.H. (1996). "A maximum-likelihood approach to stochastic matching for robust speech recognition," IEEE Trans. on Speech and Audio Processing, 4 (3), pp. 190-202.
  28. Siohan, O., Chesta, C. and Lee, C.H. (2001). "Joint maximum a posteriori adaptation of transformation and HMM parameters," IEEE Trans. on Speech and Audio Processing, 9 (4), 417-428.
  29. Surendran, A.C., Lee, C.H. and Rahim, M. (1999). "Nonlinear compensation for stochastic matching," IEEE Trans. on Speech and Audio Processing, 7 (6), pp. 643-655.
  30. Yiu, K.K. (2000). Speaker Verification Based on Probabilistic Neural Networks with A Priori Decision Thresholds, MPhil Thesis, The HK Polytechnic University, (Supervised by M.W. Mak).
  31. Yiu, K.K., Mak, M.W. and Kung, S.Y. (2001). "Channel distortion compensation based on the measurement of handset’s frequency responses," International Symposium on Intelligent Multimedia, Video and Speech Processing, 2001, pp. 197-200.
  32. Yiu, K.K., Mak, M.W. and Li, C.K. (1999), "Gaussian mixture models and probabilistic decision-based neural networks for pattern classification: A comparative study," Neural Computing and Applications, 8, 235-245.
  33. Yiu, K.K., Mak, M.W. and Kung, S.Y. (2001). "A GMM-Based Handset Selector for Channel Mismatch Compensation with Applications to Speaker Identification," Second IEEE Pacific-Rim Conference on Multimedia 2001 (PCM'2001).
  34. Yiu, K.K., Mak, M.W. and Kung, S.Y. (2001). "Minimization of channel distortion by the measurement of handset’s frequency responses," submitted to Speech Communication.
  35. Zhang, W.D., Mak, M.W. and He, M.X. (2000). "A two-stage scoring method combining world and cohort models for speaker verification," Proc. ICASSP, Vol. 2, pp. 1193-1196.
  36. Zhang, W.D., Yiu, K.K., Mak, M.W., Li, C.K. and He, M.X. (1999), "A priori threshold determination for phrase-prompted speaker verification," Eurospeech'99, Vol. 2, pp. 1023-1026, Sept. 1999.