Abstract:
Speaker verification, a biometric authentication task,
determines whether an input speech utterance belongs to the
claimed identity. Existing speaker verification models report
performance mainly on English, and no prior study has
experimented with Sinhala or Tamil datasets. This study
proposes a semi-automated pipeline to curate Sinhala and
Tamil datasets from YouTube videos filmed under noisy,
unconstrained conditions that represent real-world scenarios.
Both the Sinhala and Tamil datasets include utterances for
140 persons of interest (POIs), with more than 300 utterances
per POI drawn from one or more genres: interviews, speeches,
and vlogs. Moreover, this study investigates how domain
mismatch affects a speaker verification model trained on
English and applied to Sinhala and Tamil. Two deep neural
network models trained on English show significant
performance drops on the Sinhala and Tamil datasets compared
to an English dataset, as expected due to domain mismatch;
however, the model trained with AM-softmax performed better
than the one trained with vanilla softmax. In future work,
robust speaker verification models incorporating domain
adaptation techniques will be built to improve performance on
the Sinhala and Tamil datasets.