Speaker Verification System

Cantonese Digit Speaker Verification Corpus (readme.txt)


An Overview of Speaker Recognition

M.W. Mak (Sept. 1999)

Efforts to address the problems of telephone-based speaker verification systems have centered on four main areas: channel mismatch equalization, background noise compensation, speaker modeling, and decision strategies.

Channel Mismatch Equalization: Telephone speech is typically collected under different acoustic environments and over different communication channels, causing mismatches between the speech gathered during enrollment and that gathered during verification. Three main schools of thought have been used to address this problem. The first looks at the local spectral characteristics of a given frame of speech. Early attempts included cepstral weighting (Tohkura, 1987) and bandpass liftering (Juang et al., 1987). These approaches, however, assume that all frames are subject to the same distortion. This assumption has been avoided by more recent proposals, such as adaptive component weighting (Assaleh & Mammone, 1994) and pole-zero post-filtering (Zilovic et al., 1998). However, the experiments in these studies have limitations, as the speech corpora they used (KING and TIMIT with channel simulators) do not allow handset variability to be properly examined. Effort must therefore be made to investigate whether these approaches are robust to handset variation.
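
As an illustration of the first school, the following is a minimal sketch of bandpass liftering, assuming the raised-sine weights w_k = 1 + (L/2) sin(pi k / L) commonly associated with Juang et al. (1987). The function names and the frames-by-coefficients array layout are assumptions made for this example. Note that one fixed weight vector is applied to every frame, which is exactly the "same distortion for all frames" assumption criticized above.

```python
import numpy as np

def bandpass_lifter(num_coeffs: int) -> np.ndarray:
    """Raised-sine lifter weights w_k = 1 + (L/2) * sin(pi * k / L), k = 1..L."""
    k = np.arange(1, num_coeffs + 1)
    return 1.0 + (num_coeffs / 2.0) * np.sin(np.pi * k / num_coeffs)

def lifter_frames(cepstra: np.ndarray) -> np.ndarray:
    """Weight cepstral coefficients c_1..c_L (columns) of every frame (rows).

    The same weight vector is used for all frames; nothing here adapts
    to frame-specific distortion.
    """
    return cepstra * bandpass_lifter(cepstra.shape[1])
```

The weights de-emphasize both the lowest-order and the highest-order coefficients and peak at the middle of the range.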

The second school exploits the temporal variability of feature vectors. Typical examples include cepstral mean subtraction (Atal, 1974), pole-filtered cepstral mean subtraction (Naik, 1995), delta cepstrum (Furui, 1981), relative spectral processing (Hermansky & Morgan, 1994), signal bias removal (Rahim & Juang, 1996), and the modified-mean cepstral mean normalization with frequency warping (Garcia & Mammone, 1999). Although these methods have been successfully used in reducing channel mismatches, they have limitations, as they assume that the channel effect can be approximated by a linear filter. Most telephone handsets, however, exhibit energy-dependent frequency responses (Reynolds et al., 1995) for which a linear filter may be a poor approximation. Therefore, a more complex representation of handset characteristics is required.
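
The linear-filter assumption can be made concrete with a small sketch of cepstral mean subtraction. A fixed linear channel multiplies the speech spectrum, so it adds a constant vector to every cepstral frame; subtracting the per-coefficient mean over the utterance removes that constant. The synthetic data, array shapes, and function name below are assumptions for illustration only.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra: np.ndarray) -> np.ndarray:
    """Subtract the utterance-level mean of each cepstral coefficient.

    cepstra has shape (num_frames, num_coeffs).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

# Synthetic demonstration (random stand-ins, not real speech features).
rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 12))   # hypothetical clean cepstra
channel = rng.normal(size=12)        # fixed channel -> constant cepstral offset
distorted = clean + channel          # the same offset is added to every frame

# After CMS the constant offset has vanished from the distorted features.
same = np.allclose(cepstral_mean_subtraction(distorted),
                   cepstral_mean_subtraction(clean))
```

If the handset response depends on signal energy, the offset is no longer the same for every frame, and a single mean vector cannot remove it; this is the limitation noted above.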

The third school uses an affine transformation to correct the mismatches (Mammone et al., 1996), so that speaker models can be trained on clean speech and operated on environmentally distorted speech without any retraining. This approach has the advantage that both convolutional distortion and additive noise can be compensated simultaneously. Additional computation, however, is required during verification to compute the transformation matrices.
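
The idea of an affine correction y = A x + b can be sketched as follows. This is not the estimation procedure of Mammone et al. (1996); it is a plain least-squares fit that assumes paired source and target feature vectors are available, which is a simplification made only for illustration. The function names are hypothetical.

```python
import numpy as np

def fit_affine(src: np.ndarray, dst: np.ndarray):
    """Least-squares estimate of A, b such that dst ≈ src @ A.T + b.

    src and dst have shape (num_frames, dim) and are assumed to be paired
    frame by frame (an assumption of this sketch).
    """
    # Append a column of ones so the bias b is estimated jointly with A.
    aug = np.hstack([src, np.ones((src.shape[0], 1))])
    sol, *_ = np.linalg.lstsq(aug, dst, rcond=None)
    A = sol[:-1].T
    b = sol[-1]
    return A, b

def apply_affine(x: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map feature vectors (rows of x) through the affine transformation."""
    return x @ A.T + b
```

The matrix A can model the convolutional part of the mismatch and the vector b the additive part, which is why both can be compensated in one step. The fit itself is the extra computation referred to above.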

Background Noise Compensation: It has been shown that about 40% of telephone conversations contain competing speech, music, or traffic noise (Dobroth et al., 1989). This figure suggests the importance of background noise compensation in telephone-based speaker verification. Early approaches include spectral subtraction (Boll, 1979) and projection-based distortion measures (Mansour & Juang, 1989). More recently, statistical methods such as the noise integration model (Rose et al., 1994) and signal bias removal (Rahim & Juang, 1996) have been proposed. The advantage of using statistical methods is that clean reference templates are no longer required. This property is particularly important to telephone-based applications, as clean speech is usually not available. Despite the promise of these noise compensation methods in speech recognition, they have not been widely applied to speaker verification because convolutional distortion, rather than additive noise, is believed to be the major factor that degrades verification performance.
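
The core step of spectral subtraction can be sketched as below. This is a simplified version, not Boll's complete algorithm: it assumes magnitude spectra have already been computed, that a noise magnitude estimate (for example, averaged over non-speech frames) is available, and it uses a hypothetical `floor` parameter to keep the result non-negative.

```python
import numpy as np

def spectral_subtract(noisy_mag: np.ndarray,
                      noise_mag: np.ndarray,
                      floor: float = 0.02) -> np.ndarray:
    """Subtract a noise magnitude estimate from each frame's magnitude spectrum.

    noisy_mag: shape (num_frames, num_bins), magnitudes of the noisy speech.
    noise_mag: shape (num_bins,), estimated noise magnitude per frequency bin.
    floor:     fraction of the noisy magnitude used as a lower bound
               (an assumption of this sketch).
    """
    cleaned = noisy_mag - noise_mag
    # Bins where the estimate exceeds the observation would go negative;
    # clamp them to a small fraction of the observed magnitude instead.
    return np.maximum(cleaned, floor * noisy_mag)
```

The enhanced magnitudes are normally recombined with the phase of the noisy signal before resynthesis or feature extraction.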

Joint Additive and Convolutional Bias Compensation: There have been several proposals aimed at addressing convolutional distortion and additive noise simultaneously. In addition to the affine transformation mentioned above, these proposals include stochastic pattern matching (Sankar & Lee, 1996), parallel model combination (Gales & Young, 1995), state-based compensation for continuous density hidden Markov models (Afify et al., 1998), and maximum likelihood estimation of channels’ autocorrelation functions and noise (Zhao, 1999). Despite their success in telephone speech recognition, these methods have not been widely applied to speaker verification. This is because adapting a speaker model to new environments affects its ability to recognize speakers (Beaufays & Weintraub, 1997).

Speaker Modeling: The choice of speaker models depends mainly on whether the verification is text-dependent or text-independent. In the former, it is possible to compare the claimant’s utterance with that of the reference speaker by aligning the two utterances at equivalent points in time using dynamic time warping (DTW) techniques (Furui, 1981). An alternative is to model the statistical variation in the spectral features. This is known as hidden Markov modeling (HMM), which has been shown to outperform the DTW-based methods (Naik et al., 1989). In text-independent speaker verification, methods that look at long-term speech statistics (Markel et al., 1977) or consider individual spectral vectors as independent of each other have been proposed. The latter include vector quantization (VQ) (Soong et al., 1985), Gaussian mixture models (GMMs) (Reynolds, 1995), and neural networks (Oglesby & Mason, 1990).
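
The time alignment used in text-dependent verification can be sketched with a basic dynamic time warping recursion. This is a textbook form with unit step weights and a Euclidean frame distance, both of which are assumptions of this example; it is not the specific configuration used by Furui (1981).

```python
import numpy as np

def dtw_distance(ref: np.ndarray, test: np.ndarray) -> float:
    """Accumulated distance of the best monotonic alignment of two utterances.

    ref and test have shapes (n_frames, dim) and (m_frames, dim).
    """
    n, m = len(ref), len(test)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(ref[i - 1] - test[j - 1])
            # Allowed moves: advance in ref only, in test only, or in both.
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return float(cost[n, m])
```

A claimant would be accepted when the accumulated distance to the reference template falls below a threshold. Because the recursion absorbs differences in speaking rate, a slower rendition of the same frame sequence still yields a small distance.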

Decision Strategies: Recent research has focused on the normalization of speaker scores to minimize error rates. Early work includes the likelihood ratio scoring proposed by Higgins et al. (1991) and the cohort normalized scoring by Rosenberg et al. (1992). Subsequent work based on likelihood normalization (Matsui & Furui, 1995; Liu et al., 1996) and minimum verification error training (Rosenberg et al., 1998) also shows that including an impostor model not only improves speaker separability but also allows the decision threshold to be set easily. Rosenberg and Parthasarathy (1996) established some principles for constructing impostor models and showed that those with speech closest to the reference speaker’s model perform best. Their result, however, differs from that of Reynolds (1995), who found that a gender-balanced, randomly selected impostor model performs better, suggesting that more work is required in this area.
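
Score normalization with an impostor model can be sketched as a log-likelihood ratio test. For brevity, each model here is a single diagonal-covariance Gaussian standing in for the GMMs or HMMs used in the cited work; the model parameters, function names, and the zero threshold are all assumptions made for this illustration.

```python
import numpy as np

def diag_gauss_loglik(frames: np.ndarray, mean: np.ndarray, var: np.ndarray) -> float:
    """Average per-frame log-likelihood under a diagonal Gaussian."""
    diff = frames - mean
    ll = -0.5 * (np.log(2.0 * np.pi * var) + diff ** 2 / var).sum(axis=1)
    return float(ll.mean())

def normalized_score(frames, speaker, impostor) -> float:
    """log p(X | claimed speaker) - log p(X | impostor model).

    speaker and impostor are (mean, var) tuples (hypothetical model format).
    """
    return diag_gauss_loglik(frames, *speaker) - diag_gauss_loglik(frames, *impostor)

def verify(frames, speaker, impostor, threshold: float = 0.0) -> bool:
    """Accept the identity claim when the normalized score exceeds the threshold."""
    return bool(normalized_score(frames, speaker, impostor) > threshold)
```

Because channel and session effects tend to raise or lower both likelihoods together, their difference is more stable than the raw speaker score, which is one reason a single threshold becomes easier to set.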

Previous Work of M.W. Mak: M.W. Mak and his students are currently working on a research project that extends M.W. Mak's previous investigations into speaker recognition (Mak et al., 1993a, 1993b, 1994; Mak, 1995), in which multi-layer perceptrons (MLPs), radial basis function (RBF) networks, and recurrent networks were applied to speaker recognition. It was found that RBF networks not only require significantly less training time than MLPs but also achieve a lower error rate. These investigations have become the framework of our recent work, in which robust neural classifiers (Yiu et al., 1999; Mak, 1996; Mak et al., 1998; Mak & Li, 1999), threshold determination methods (Zhang et al., 1999), and channel compensation methods (Lo et al., 1999) were proposed. We have shown that the structure of speaker models can be optimized automatically by our recently proposed extended-RPCL algorithm (Li et al., 1999), that embedding anti-speakers in the speaker models greatly improves the reliability of the a priori decision thresholds, and that channel effects can be greatly reduced by looking at the clean speech cepstrum. These results have formed the foundation of the current work.

References
Afify, M. et al. (1998). “A general joint additive and convolutive bias compensation approach applied to noisy Lombard speech recognition,” IEEE Trans. on Speech and Audio Processing, 6 (6), 524-537.

Assaleh, K.T. and Mammone, R.J. (1994). “New LP-derived features for speaker identification,” IEEE Trans. on Speech and Audio Processing, 2 (4), 630-638.

Atal, B.S. (1974). “Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,” J. Acoust. Soc. Am. 55 (6), 1304-1312.

Beaufays, F. and Weintraub, M. (1997). “Model transformation for robust speaker recognition from telephone data,” Proc. ICASSP’97.

Boll, S. F. (1979). “Suppression of acoustic noise in speech using spectral subtraction,” IEEE Trans. on Acoust., Speech, Signal Processing, ASSP-27 (2), 113-120.

Dobroth, K. M., Zeigler, B. L. and Karis, D. (1989). “Future directions for audio interface research: characteristics of human-human order-entry conversations,” in Proc. Am. Voice Input/Output Soc., Sept. 1989.

Furui, S. (1981). “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoust. Speech, Signal Processing, 29, 254-272.

Gales, M.J.F. and Young, S.J. (1995). “Robust speech recognition in additive and convolutional noise using parallel model combination,” Speech Comm. 9, 289-307.

Garcia, A. and Mammone, R. J. (1999). “Channel-robust speaker identification using modified-mean cepstral mean normalization with frequency warping,” in Proc. ICASSP’99.

Hermansky, H. and Morgan, N. (1994). “RASTA processing of speech,” IEEE Trans. on Speech and Audio Processing, 2 (4), 578-589.

Higgins, A., Bahler, L. and Porter, J. (1991). “Speaker verification using randomized phrase prompting,” Digital Signal Processing, 1, 89-106.

Juang, B. H., Rabiner, L. R. and Wilpon, J. G. (1987). “On the use of bandpass liftering in speech recognition,” IEEE Trans. on Acoust., Speech, Signal Processing, 35, 947-954.

Li, X., Mak, M.W. and Li, C.K. (1999). “Determining the optimal number of clusters by an extended RPCL algorithm,” Int. J. of Advanced Computational Intelligence (accepted).

Liu, C. S., Wang, H. C. and Lee, C. H. (1996). “Speaker verification using normalized log-likelihood score,” IEEE Trans. on Speech and Audio Processing, 4 (1), 56-60.

Lo, T.F., Mak, M.W. and Yiu, K.K. (1999). “A new cepstrum-based channel compensation method for speaker verification,” Eurospeech’99.

Mak, M.W. et al. (1993a). “Comparing multi-layer perceptrons and radial basis function networks in speaker identifications,” J. of Micro. Applications, 16, 147-59.

Mak, M.W. et al. (1993b). “Speaker Identification using Radial Basis Functions,” The 3rd IEE Int. Conf. on Artificial Neural Networks, pp. 138-142, U.K.

Mak, M.W. et al. (1994). “Speaker Identification using Multi Layer Perceptrons and Radial Basis Functions Networks,” Neurocomputing, 6 (1), 99-118.

Mak, M.W. (1995). “Speaker identification using modular recurrent neural networks,” 4th IEE Int. Conf. on Artificial Neural Networks, 1-6.

Mak, M.W. (1996). “Text-Independent Speaker Verification Over a Telephone Network by Radial Basis Function Networks,” Proc. Int. Sym. Multi-Technology Information Processing, 145-150.

Mak, M.W., Li, C.K. and Li, X. (1998). “Maximum likelihood estimation of elliptical basis function parameters with application to speaker verification,” ICSP’98, 1287-1290.

Mak, M.W. and Li, C.K. (1999). “Elliptical basis function networks and radial basis function networks for speaker verification: A comparative study,” IJCNN’99.

Mammone, R. J., Zhang, X. and Ramachandran, R. P. (1996). “Robust speaker recognition,” IEEE Signal Processing Magazine, 13, Sept., 58-71.

Mansour, D. and Juang, B. H. (1989). “A family of distortion measures based upon projection operation for robust speech recognition,” IEEE Trans. on Acoust., Speech, Signal Processing, 37 (11), 1659-1671.

Markel, J. D., Oshika, B. T. and Gray, A. H. (1977). “Long-term feature averaging for speaker recognition,” IEEE Trans. Acoust. Speech Signal Proc., ASSP-25, 330-337.

Matsui, T. and Furui, S. (1995). “Likelihood normalization for speaker verification using a phoneme- and speaker-independent model,” Speech Communication, 17, 109-116.

Naik, D. (1995). “Pole-filtered cepstral mean subtraction,” Proc. ICASSP’95, 1, 157-160.

Naik, J. M., Netsch, L. P. and Doddington, G. R. (1989). “Speaker verification over long distance telephone lines,” Proc. ICASSP’89, 524-527.

Oglesby, J. and Mason, J. S. (1990). “Optimization of neural models for speaker identification,” Proc. ICASSP’90, 261-264.

Rahim, M. G. and Juang, B. H. (1996). “Signal bias removal by maximum likelihood estimation for robust telephone speech recognition,” IEEE Trans. on Speech and Audio Processing, 4 (1), 19-30.

Reynolds, D. A. (1995). “Speaker identification and verification using Gaussian mixture speaker models,” Speech Communication, 17, 91-108.

Reynolds, D.A. et al. (1995). “The effects of telephone transmission degradations on speaker recognition performance,” Proc. ICASSP’95, 329-332.

Rose, R. C., Hofstetter, E. M. and Reynolds, D. A. (1994). “Integrated models of signal and background with application to speaker identification in noise,” IEEE Trans. on Speech and Audio Processing, 2 (2), 245-257.

Rosenberg, A. E. and Parthasarathy, S. (1996). “Speaker background models for connected digit password speaker verification,” Proc. ICASSP’96, 81-84.

Rosenberg, A. E., DeLong, J., Lee, C. H., Juang, B. H. and Soong, F. K. (1992). “The use of cohort normalized scores for speaker verification,” Proc. ICSLP 92, 2, 599-602.

Rosenberg, A. E., Siohan, O. and Parthasarathy, S. (1998). “Speaker verification using minimum verification error training,” Proc. ICASSP’98, 105-108.

Sankar, A. and Lee, C. H. (1996). “A maximum-likelihood approach to stochastic matching for robust speech recognition,” IEEE Trans. on Speech and Audio Processing, 4 (3), 190-202.

Soong, F. K., Rosenberg, A. E., Rabiner, L. R. and Juang, B. H. (1985). “A vector quantization approach to speaker recognition,” Proc. ICASSP’85, 387-390.

Tohkura, Y. (1987). “A weighted cepstral measure for speech recognition,” IEEE Trans. Acoust. Speech Signal Processing, ASSP-35, 1414-1422.

Yiu, K.K., Mak, M.W. and Li, C.K. (1999). “Gaussian mixture models and probabilistic decision-based neural networks for pattern classification: A comparative study,” Neural Computing and Applications, 8, 235-245.

Zhang, W.D., Yiu, K.K., Mak, M.W., Li, C.K. and He, M.X. (1999). “A Priori Threshold Determination for Phrase-Prompted Speaker Verification,” Eurospeech’99.

Zhao, Y. (1999). “An EM algorithm for linear distortion channel estimation based on observations from a mixture of Gaussian sources,” IEEE Trans. Speech and Audio Processing, 7 (4), 400-413.

Zilovic, M.S., Ramachandran, R.P. and Mammone, R.J. (1998). “Speaker identification based on the use of robust cepstral features obtained from pole-zero transfer functions,” IEEE Trans. on Speech and Audio Processing, 6 (3), 260-267.


M.W. Mak's Homepage

http://www.eie.polyu.edu.hk/~mwmak/mypage.htm