Abstract:
Speaker diarization is the task of partitioning a
speech signal into homogeneous segments corresponding to
speaker identities. We introduce a Tamil test dataset,
motivated by the fact that the existing literature on speaker
diarization has experimented extensively with English but not
at all with Tamil. An overlapped speech segment is a part of
an audio clip in which two or more speakers speak simultaneously.
Overlapped speech regions degrade the performance of a
speaker diarization system in proportion to their extent, owing
to the difficulty of identifying individual speakers within them.
This study proposes an overlapped speech detection (OSD) model
that discards non-speech segments and feeds the remaining speech
segments into a Convolutional Recurrent Neural Network acting as
a binary classifier with two classes: single-speaker speech and
overlapped speech.
OSD model is integrated into a speaker diarizer, yielding
performance gains, in terms of Diarization Error Rate, of 5.6%
and 13.4% on the standard VoxConverse dataset and our Tamil
dataset, respectively.
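
The CRNN binary classifier described above can be sketched as follows. This is a minimal illustrative architecture, not the paper's exact model: the layer sizes, the use of log-mel spectrogram input, and the single-logit output head are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class CRNNOverlapDetector(nn.Module):
    """Illustrative CRNN sketch for overlapped speech detection:
    convolutional layers extract local spectral patterns, a GRU
    models temporal context, and a linear head emits one logit
    (single-speaker vs. overlapped speech). Layer sizes and the
    log-mel input format are assumptions, not the paper's spec."""

    def __init__(self, n_mels: int = 64, hidden: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),              # pool frequency axis only
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.gru = nn.GRU(32 * (n_mels // 4), hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)       # one logit per segment

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, n_mels, frames) log-mel spectrogram segments
        z = self.conv(x)                        # (B, C, n_mels // 4, T)
        z = z.permute(0, 3, 1, 2).flatten(2)    # (B, T, C * n_mels // 4)
        _, h = self.gru(z)                      # h: (1, B, hidden)
        return self.head(h[-1])                 # (B, 1) overlap logit

model = CRNNOverlapDetector()
segments = torch.randn(8, 1, 64, 100)  # 8 dummy speech segments
logits = model(segments)
print(logits.shape)  # torch.Size([8, 1])
```

A sigmoid over the logit gives the overlap probability per segment; segments flagged as overlapped can then be handled separately by the diarizer, while non-speech segments are assumed to have been removed beforehand by a voice activity detector.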