Abstract:
The AM-FM modulation model of speech is a nonlinear model that has been
successfully used in several branches of speech-related research. However, the
significance of the AM-FM features extracted from this model has not been fully
explored in applications such as speaker identification systems. This paper shows that
frequency modulation (FM) features can improve speaker identification accuracy. Because
the amplitude modulation (AM) feature is similar to the conventional Mel frequency
cepstral coefficients (MFCCs), this paper focuses mainly on the FM feature. The
correlation between FM feature components is shown to be very small
compared with that of Mel filterbank log energies, thus reducing the need for
decorrelation. FM feature components are shown to be very nearly Gaussian
distributed. Further, speech synthesis using AM-FM features is performed to compare
four existing AM-FM demodulation methods based on the perceptual quality of the
synthesized speech. Of these, the Digital Energy Separation Algorithm (DESA) gives the
best synthesized speech and is thus used as the front-end in our speaker identification
system. Evaluation of speaker identification using FM features on the NIST 2001
database shows a relative improvement in speaker identification accuracy of 2% for
male speakers and 9% for female speakers over the conventional MFCC-based front-end.
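
As an illustration of the kind of AM-FM demodulation referred to above, the following is a minimal sketch of energy-separation demodulation built on the Teager-Kaiser energy operator. It assumes the DESA-2 variant, which the abstract does not specify, and it operates on a single already band-limited signal. The function names `teager_energy` and `desa2`, the `eps` guard, and the test tone are illustrative choices, not part of the paper. Plain Python lists are used to keep the sketch dependency-free; a practical front-end would more likely be vectorized and applied per filterbank channel.

```python
import math


def teager_energy(x):
    """Teager-Kaiser energy x(n)^2 - x(n-1)*x(n+1) for the interior samples of x."""
    return [x[n] ** 2 - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def desa2(x, eps=1e-12):
    """Estimate instantaneous frequency (rad/sample) and amplitude envelope.

    Returns two lists of length len(x) - 4, aligned with samples 2 .. len(x) - 3.
    """
    # Symmetric difference y(n) = x(n+1) - x(n-1), defined for n = 1 .. N-2.
    y = [x[n + 1] - x[n - 1] for n in range(1, len(x) - 1)]
    psi_x = teager_energy(x)[1:-1]  # drop the ends so indices match psi_y
    psi_y = teager_energy(y)
    omega, amp = [], []
    for px, py in zip(psi_x, psi_y):
        # Crude guard (an assumption of this sketch): floor the energies at eps
        # so that the division and square root stay defined.
        px, py = max(px, eps), max(py, eps)
        # Clamp the arccos argument to [-1, 1] against rounding and noise.
        c = min(1.0, max(-1.0, 1.0 - py / (2.0 * px)))
        omega.append(0.5 * math.acos(c))      # FM estimate
        amp.append(2.0 * px / math.sqrt(py))  # AM estimate
    return omega, amp


# Hypothetical test tone: amplitude 0.7, frequency 0.3 rad/sample.
signal = [0.7 * math.cos(0.3 * n + 0.2) for n in range(200)]
omega, amp = desa2(signal)
```

For a pure tone such as `signal`, both estimates recover the tone's frequency and amplitude up to rounding error. For real speech the energies can be negative or near zero, so the simple `eps` floor used here would need to be replaced by more careful smoothing or outlier handling.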