Abstract:
The aim of this work is to gain insight into how deep neural
network (DNN) models should be trained for short utterance
evaluation conditions in an x-vector based speaker verification
system. The study suggests that the speaker embedding can
be extracted with reduced dimensions for short utterance evaluation
conditions. When the speaker embedding is extracted
from a deeper layer, which has a lower dimension, the x-vector system
achieves a 14% relative improvement in equal error rate (EER) over the
baseline approach on the NIST2010 5sec-5sec truncated conditions. We surmise
that, since short utterances contain less phonetic information,
speaker-discriminative x-vectors can be extracted from a deeper
layer of the DNN, which captures less phonetic information. Another
interesting finding is that the x-vector system achieves a 5%
relative improvement on the NIST2010 5sec-5sec evaluation condition
when the back-end PLDA is trained using short utterance
development data. The results confirm the intuitive expectation
that the duration of development utterances and the duration
of evaluation utterances should be matched. Finally, for the
duration mismatch condition, we propose a variance normalization
approach for PLDA training that provides a 4% relative
improvement in EER over the baseline approach.
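
For readers unfamiliar with the metric, the relative EER improvements quoted above can be illustrated with a minimal sketch. It assumes a simple threshold sweep over the trial scores; the function names `compute_eer` and `relative_improvement`, and the example EER values, are hypothetical and are not taken from this work.

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Approximate the equal error rate (as a fraction) by a threshold sweep."""
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    # Every observed score is tried as a decision threshold.
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # Miss rate: target trials scored below the threshold.
    fnr = np.array([(target < t).mean() for t in thresholds])
    # False-alarm rate: non-target trials scored at or above the threshold.
    fpr = np.array([(nontarget >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest.
    idx = np.argmin(np.abs(fnr - fpr))
    return (fnr[idx] + fpr[idx]) / 2.0

def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction over the baseline, in percent."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer

# Hypothetical EER values, chosen only to illustrate the arithmetic.
print(relative_improvement(10.0, 8.6))
```

Under this definition, a 14% relative improvement means the EER falls to 86% of its baseline value, not that it drops by 14 absolute percentage points.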