Transducer Mismatch Compensation for Robust Speaker Verification
Man-wai Mak (Jan. 2003)
Current efforts to address transducer mismatch have focused on two main
areas: feature transformation and model transformation.
Feature Transformation:
Feature-based approaches attempt to modify the distorted features so
that the resulting features fit the clean speech models better. One of
the earliest techniques is cepstral mean subtraction (CMS) [2], which estimates
the effect of a linear channel as the long-term average of the distorted cepstral
vectors and subtracts that average from every frame.
Subsequent efforts include pole-filtered cepstral mean subtraction [22]
and signal bias removal [24]. These approaches, however, do not consider
the effect of background noise. A more general approach, which handles
both channel distortion and background noise, is the codeword-dependent
cepstral normalization (CDCN) [1]. In CDCN, additive noise and convolutive
distortion are modeled as codeword-dependent cepstral biases. CDCN,
however, works well only when the background noise level is low.
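The idea behind CMS can be illustrated with a minimal sketch. It assumes that a linear channel adds a constant bias to every cepstral frame of an utterance; the function name and the toy values are hypothetical and are not taken from [2].

```python
from typing import List

Frame = List[float]


def cepstral_mean_subtraction(frames: List[Frame]) -> List[Frame]:
    """Subtract the per-utterance mean vector from every cepstral frame."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Long-term average of the cepstral vectors, one value per coefficient.
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


# Toy "clean" cepstra and a fixed channel bias (both hypothetical).
clean = [[1.0, -0.5], [0.2, 0.3], [-0.6, 0.8]]
channel_bias = [0.7, -1.2]
distorted = [[c + b for c, b in zip(frame, channel_bias)] for frame in clean]

cms_clean = cepstral_mean_subtraction(clean)
cms_distorted = cepstral_mean_subtraction(distorted)
```

Under this assumption the two normalized sequences coincide, because the constant bias is absorbed into the mean. Additive background noise does not behave as a constant cepstral bias, which is why CMS alone does not handle it.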
When stereo corpora (parallel recordings of the same speech over clean and
distorted channels) are available, channel distortion can be estimated
directly by comparing the clean features against their corresponding distorted
features. For example, in SNR-dependent cepstral normalization (SDCN) [1],
cepstral biases for different signal-to-noise ratios are estimated in a
maximum likelihood framework. In probabilistic optimum filtering [23],
the transformation is a set of multi-dimensional least-squares filters
whose outputs are probabilistically combined. These methods, however, rely
on the availability of stereo corpora, which can be difficult to obtain.
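A minimal sketch of stereo-based bias estimation follows. It is a simplification and not the SDCN procedure of [1]: frames are grouped into SNR bins, and the bias of each bin is the average difference between clean and distorted frames. If the residual within a bin is assumed Gaussian, this average is its maximum-likelihood estimate. The bin width, function names, and data are hypothetical.

```python
from typing import Dict, List, Sequence


def snr_bin(snr_db: float, width_db: float = 5.0) -> int:
    """Map an SNR in dB to an integer bin index (the bin width is an assumption)."""
    return int(snr_db // width_db)


def estimate_snr_biases(
    clean: Sequence[Sequence[float]],
    distorted: Sequence[Sequence[float]],
    snr_db: Sequence[float],
    width_db: float = 5.0,
) -> Dict[int, List[float]]:
    """Average (clean - distorted) over the stereo frames that fall in each SNR bin."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y, s in zip(clean, distorted, snr_db):
        b = snr_bin(s, width_db)
        acc = sums.setdefault(b, [0.0] * len(x))
        for d in range(len(x)):
            acc[d] += x[d] - y[d]
        counts[b] = counts.get(b, 0) + 1
    return {b: [v / counts[b] for v in acc] for b, acc in sums.items()}


def compensate(
    frame: Sequence[float],
    snr: float,
    biases: Dict[int, List[float]],
    width_db: float = 5.0,
) -> List[float]:
    """Add the bias of the frame's SNR bin; leave the frame unchanged if the bin is unseen."""
    bias = biases.get(snr_bin(snr, width_db))
    if bias is None:
        return list(frame)
    return [y + r for y, r in zip(frame, bias)]


# Toy one-dimensional stereo data (hypothetical values).
clean = [[1.0], [2.0], [3.0], [4.0]]
distorted = [[0.5], [1.5], [2.0], [3.0]]
snr = [3.0, 4.0, 12.0, 13.0]

biases = estimate_snr_biases(clean, distorted, snr)
restored = compensate([2.0], 12.5, biases)
```

The sketch also makes the practical limitation visible: every bias is learned from paired clean and distorted frames, so nothing can be estimated without a stereo corpus.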
The requirement of stereo corpora can be avoided by making use of the information
embedded in the clean speech models. For example, in stochastic matching
[27], the cepstral biases and affine transformation matrices are determined
by maximizing the likelihood of observing the distorted features given
the clean models. The linear transformation in [27] can be replaced by
a neural network to compensate for non-linear distortion [29]. However,
closed-form solutions are then no longer available, and the generalized EM
algorithm is required.
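The bias-only case of this likelihood maximization can be sketched as follows. The sketch assumes one-dimensional features, a clean model given as a small Gaussian mixture, and a distortion of the form y = x + b; the function names and all numbers are hypothetical. Each EM iteration computes component posteriors for the compensated features y - b and then re-estimates b in closed form. It illustrates the idea only and is not the full method of [27].

```python
import math
from typing import List, Sequence, Tuple

Component = Tuple[float, float, float]  # (weight, mean, variance)


def gaussian(x: float, mean: float, var: float) -> float:
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def estimate_bias(y: Sequence[float], gmm: Sequence[Component], iterations: int = 20) -> float:
    """EM estimate of a scalar bias b that maximizes the likelihood of y - b under the clean GMM."""
    b = 0.0
    for _ in range(iterations):
        num = 0.0
        den = 0.0
        for yt in y:
            # E-step: posterior of each clean component for the compensated frame.
            likes = [w * gaussian(yt - b, m, v) for w, m, v in gmm]
            total = sum(likes)
            if total == 0.0:
                continue
            for like, (_, m, v) in zip(likes, gmm):
                gamma = like / total
                num += gamma * (yt - m) / v
                den += gamma / v
        if den == 0.0:
            break
        # M-step: closed-form update of the bias.
        b = num / den
    return b


# Hypothetical clean model and data distorted by a bias of 0.6.
clean_gmm: List[Component] = [(0.5, -2.0, 0.25), (0.5, 2.0, 0.25)]
clean_samples = [-2.3, -2.0, -1.7, 1.7, 2.0, 2.3]
distorted_samples = [x + 0.6 for x in clean_samples]

b_hat = estimate_bias(distorted_samples, clean_gmm)
```

No clean recordings of the test speech are needed here: the clean model alone supplies the reference against which the bias is estimated.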
Although the above methods have been successful in reducing channel
mismatches, all of them except [29] assume that the channel effect can be
approximated by a linear filter. Most telephone handsets,
in fact, exhibit energy-dependent frequency responses [25] for which a
linear filter may be a poor approximation. Therefore, a more complex representation
of handset characteristics is required.
Model Transformation:
Model-based approaches attempt to modify the clean speech models such
that the density functions of the resulting models fit the distorted data
better. Influential approaches include stochastic matching [27] and stochastic
additive transforms [26], where the models’ means and variances are adjusted
by stochastic biases; maximum likelihood linear regression (MLLR) [10],
where the mean vectors of clean speech models are linearly transformed;
and the constrained reestimation of Gaussian mixtures [3], where both mean
vectors and covariance matrices are transformed. Recently, MLLR has been
extended to maximum-likelihood linear transformation [5], in which the
transformation matrices for the variances can be different from those for
the mean vectors. Meanwhile, the constrained transformation in [3] has
been extended to piecewise-linear stochastic transformation [4], where
a collection of linear transformations is shared by all the Gaussians
in each mixture. The random bias in [27] has also been replaced by a neural
network to compensate for non-linear distortion [29]. All of these extensions
have been reported to improve recognition accuracy.
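The linear transformation of mean vectors used in MLLR can be illustrated in one dimension. This is a minimal sketch, assuming scalar features, a variance shared by all Gaussians, and a fixed alignment of adaptation frames to Gaussians. Under these assumptions, the maximum-likelihood estimate of the transform mu' = a * mu + b reduces to a count-weighted linear regression of the per-Gaussian data means on the clean means. The function name and numbers are hypothetical, and the multivariate estimation in [10] is more involved.

```python
from typing import Sequence, Tuple


def mllr_1d(
    means: Sequence[float],
    data_means: Sequence[float],
    counts: Sequence[float],
) -> Tuple[float, float]:
    """Estimate (a, b) in mu' = a * mu + b by count-weighted least squares."""
    n = sum(counts)
    m_bar = sum(c * m for c, m in zip(counts, means)) / n
    y_bar = sum(c * y for c, y in zip(counts, data_means)) / n
    sxx = sum(c * (m - m_bar) ** 2 for c, m in zip(counts, means))
    if sxx == 0.0:
        raise ValueError("need adaptation data for at least two distinct means")
    sxy = sum(
        c * (m - m_bar) * (y - y_bar)
        for c, m, y in zip(counts, means, data_means)
    )
    a = sxy / sxx
    b = y_bar - a * m_bar
    return a, b


# Clean means of three Gaussians, the average of the adaptation frames
# aligned to each, and the frame counts (all values hypothetical).
clean_means = [-1.0, 0.0, 2.0]
adapt_means = [-0.3, 0.5, 2.1]
frame_counts = [10, 20, 5]

a, b = mllr_1d(clean_means, adapt_means, frame_counts)

# One shared transform also moves a Gaussian that received no adaptation data.
unseen_adapted = a * 4.0 + b
```

Because a single transform is shared across Gaussians, components without any adaptation frames are still updated, which is what makes this family of methods usable with little data.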
Because the above methods adjust the model parameters "indirectly" via a
small number of transformations, they may not be able to capture the fine
structure of the distortion. This limitation can be overcome by Bayesian
techniques [6,9], in which the model parameters are adjusted "directly",
but the Bayesian approach requires a large amount of adaptation data to be
effective. As direct and indirect adaptation each have their own strengths
and weaknesses, a natural extension is to combine them so that the two
approaches complement each other [21,28].
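The data requirement of direct adaptation can be seen in a minimal sketch of a Bayesian (MAP) update of a single Gaussian mean. It assumes a scalar feature and a prior weight tau that acts like a count of pseudo-observations; the function name and the value of tau are hypothetical and are not taken from [6,9].

```python
from typing import Sequence


def map_mean(prior_mean: float, samples: Sequence[float], tau: float = 10.0) -> float:
    """MAP estimate of a Gaussian mean: a count-weighted blend of prior and data."""
    n = len(samples)
    return (tau * prior_mean + sum(samples)) / (tau + n)


# Prior mean 0.0; every adaptation frame equals 1.0 (toy data).
few = map_mean(0.0, [1.0] * 2)     # 2 frames: stays close to the prior
many = map_mean(0.0, [1.0] * 200)  # 200 frames: close to the data mean
```

With two frames the estimate barely moves from the prior, and it approaches the data mean only after many frames. Each Gaussian is also updated only from its own frames, so components that see no adaptation data keep their prior values.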
Previous Work of M.W. Mak
From 1990 to 1993, M.W. Mak investigated the applications of neural
networks to speaker recognition [18,19]. These investigations have become
the framework of his recent work, in which probabilistic neural classifiers
[17,32], threshold determination [36], scoring normalization [35], channel
compensation methods [11,15,31], and recurrent learning algorithms [7,8,16,20]
were proposed. This work has demonstrated that (1) the structure of
speaker models can be optimized automatically [13], (2) elliptical basis
function networks outperform radial basis function networks in speaker
verification, (3) embedding anti-speakers in the speaker models greatly
improves the reliability of the a priori decision thresholds, (4) combining
the world model and the cohort model in a two-stage scoring normalization
approach improves speaker verification performance, and (5) the channel
effects can be greatly reduced by looking at the clean speech cepstrum.
These previous efforts are now the foundation of this proposal.
References
[1] Acero, A. (1992). Acoustical and Environmental Robustness in Automatic
Speech Recognition, Kluwer Academic Pub., Dordrecht.
[2] Atal, B.S. (1974). "Effectiveness of linear prediction characteristics
of the speech wave for automatic speaker identification and verification,"
J. Acoust. Soc. Am., 55 (6), 1304-1312.
[3] Digalakis, V., Rtischev, D. and Neumeyer, L. (1995). "Speaker adaptation
using constrained reestimation of Gaussian mixtures," IEEE Trans. on Speech
and Audio Processing, 3, 357-366.
[4] Diakoloukas, V.D. and Digalakis, V. (1999). "Maximum-likelihood
stochastic-transformation adaptation of hidden Markov models," IEEE Trans.
on Speech and Audio Processing, 7 (2), 177-187.
[5] Gales, M.J.F. (1998). "Maximum-likelihood linear transformation for
HMM-based speech recognition," Computer Speech and Language, 12, 75-98.
[6] Huo, Q., Chan, C. and Lee, C.H. (1997). "On-line adaptive learning of
the continuous density hidden Markov model based on approximate recursive
Bayes estimate," IEEE Trans. on Speech and Audio Processing, 5 (2), 161-172.
[7] Ku, K.W., Mak, M.W. and Siu, W.C. (1999). "Adding learning to cellular
genetic algorithms for training recurrent neural networks," IEEE Trans. on
Neural Networks, 10 (2), 239-252.
[8] Ku, K.W., Mak, M.W. and Siu, W.C. (2000). "A study of the Lamarckian
evolution of recurrent neural networks," IEEE Trans. on Evolutionary
Computation, 4 (1), 31-42.
[9] Lee, C.H., Lin, C.H. and Juang, B.H. (1991). "A study on speaker
adaptation of the parameters of continuous density hidden Markov models,"
IEEE Trans. on Acoustics, Speech and Signal Processing, 39 (4), 806-814.
[10] Leggetter, C.J. and Woodland, P.C. (1995). "Maximum likelihood linear
regression for speaker adaptation of continuous density hidden Markov
models," Computer Speech and Language, 9 (2), 171-185.
[11] Li, X., Mak, M.W. and Kung, S.Y. (2001). "Robust speaker verification
over the telephone by feature recuperation," International Symposium on
Intelligent Multimedia, Video and Speech Processing, pp. 433-436.
[12] Li, X., Mak, M.W. and Kung, S.Y. (2001). "An EBF-based non-linear
feature mapper for robust speaker verification," submitted to IEEE Trans.
on Neural Networks.
[13] Li, X., Mak, M.W. and Li, C.K. (1999). "Determining the optimal number
of clusters by an extended RPCL algorithm," Int. J. of Advanced
Computational Intelligence, 3 (6), 467-473.
[14] Lin, S.H., Kung, S.Y. and Lin, L.J. (1997). "Face recognition/detection
by probabilistic decision-based neural network," IEEE Trans. on Neural
Networks, 8 (1), 114-132.
[15] Lo, T.F., Mak, M.W. and Yiu, K.K. (1999). "A new cepstrum-based channel
compensation method for speaker verification," Eurospeech’99.
[16] Mak, M.W. (1995). "A learning algorithm for recurrent radial basis
function networks," Neural Processing Letters, 2 (1), 27-31.
[17] Mak, M.W. and Kung, S.Y. (2000). "Estimation of elliptical basis
function parameters by the EM algorithms with application to speaker
verification," IEEE Trans. on Neural Networks, 11 (4), 961-969.
[18] Mak, M.W. et al. (1993a). "Comparing multi-layer perceptrons and radial
basis function networks in speaker identifications," J. of Micro.
Applications, 16, 147-159.
[19] Mak, M.W. et al. (1994). "Speaker Identification using Multi Layer
Perceptrons and Radial Basis Functions Networks," Neurocomputing, 6 (1),
99-118.
[20] Mak, M.W., Ku, K.W. and Lu, Y.L. (1999). "On the improvement of the
real time recurrent learning algorithm," Neurocomputing, 24, 13-36.
[21] Mokbel, C. (2001). "Online adaptation of HMMs to real-life conditions:
A unified framework," IEEE Trans. on Speech and Audio Processing, 9 (4),
342-357.
[22] Naik, D. (1995). "Pole-filtered cepstral mean subtraction," Proc.
ICASSP’95, Vol. 1, pp. 157-160.
[23] Neumeyer, L. and Weintraub, M. (1994). "Probabilistic optimum filtering
for robust speech recognition," ICASSP’94, pp. 417-420.
[24] Rahim, M.G. and Juang, B.H. (1996). "Signal bias removal by maximum
likelihood estimation for robust telephone speech recognition," IEEE Trans.
on Speech and Audio Processing, 4 (1), 19-30.
[25] Reynolds, D.A. et al. (1995). "The effects of telephone transmission
degradations on speaker recognition performance," ICASSP’95.
[26] Rose, R.C., Hofstetter, E.M. and Reynolds, D.A. (1994). "Integrated
models of signal and background with application to speaker identification
in noise," IEEE Trans. on Speech and Audio Processing, 2 (2), 245-257.
[27] Sankar, A. and Lee, C.H. (1996). "A maximum-likelihood approach to
stochastic matching for robust speech recognition," IEEE Trans. on Speech
and Audio Processing, 4 (3), 190-202.
[28] Siohan, O., Chesta, C. and Lee, C.H. (2001). "Joint maximum a posteriori
adaptation of transformation and HMM parameters," IEEE Trans. on Speech and
Audio Processing, 9 (4), 417-428.
[29] Surendran, A.C., Lee, C.H. and Rahim, M. (1999). "Nonlinear compensation
for stochastic matching," IEEE Trans. on Speech and Audio Processing, 7 (6),
643-655.
[30] Yiu, K.K. (2000). Speaker Verification Based on Probabilistic Neural
Networks with A Priori Decision Thresholds, MPhil Thesis, The HK Polytechnic
University (supervised by M.W. Mak).
[31] Yiu, K.K., Mak, M.W. and Kung, S.Y. (2001). "Channel distortion
compensation based on the measurement of handset’s frequency responses,"
International Symposium on Intelligent Multimedia, Video and Speech
Processing, pp. 197-200.
[32] Yiu, K.K., Mak, M.W. and Li, C.K. (1999). "Gaussian mixture models and
probabilistic decision-based neural networks for pattern classification: A
comparative study," Neural Computing and Applications, 8, 235-245.
[33] Yiu, K.K., Mak, M.W. and Kung, S.Y. (2001). "A GMM-Based Handset
Selector for Channel Mismatch Compensation with Applications to Speaker
Identification," Second IEEE Pacific-Rim Conference on Multimedia
(PCM’2001).
[34] Yiu, K.K., Mak, M.W. and Kung, S.Y. (2001). "Minimization of channel
distortion by the measurement of handset’s frequency responses," submitted
to Speech Communication.
[35] Zhang, W.D., Mak, M.W. and He, M.X. (2000). "A two-stage scoring method
combining world and cohort models for speaker verification," Proc. ICASSP,
Vol. 2, pp. 1193-1196.
[36] Zhang, W.D., Yiu, K.K., Mak, M.W., Li, C.K. and He, M.X. (1999). "A
priori threshold determination for phrase-prompted speaker verification,"
Eurospeech’99, Vol. 2, pp. 1023-1026, Sept. 1999.