Thai Speech Spoofing Detection Dataset with Variations in Speaking Styles

(This work has been accepted to Interspeech 2025)

Ticho Urai, Pachara Boonsarngsuk, Ekapol Chuangsuwanich

ticho.urai@gmail.com, pacharawinboon@gmail.com, ekapol.c@chula.ac.th

Department of Computer Engineering, Faculty of Engineering

Chulalongkorn University, Thailand

Abstract: We develop the Chula Spoofed Speech (CSS) dataset, a spoofing dataset for Thai, which contains 1,332,120 utterances of both bona fide and synthetic speech. Synthetic speech samples were generated using five distinct high-quality text-to-speech (TTS) systems, all based on the same utterances as the bona fide data. The data covers various age ranges and speaking styles. Strong baselines such as AASIST and RawNet2 are trained under different conditions to uncover aspects that affect the performance of the models. Besides unseen attacks, unseen speaking styles also substantially degrade performance, indicating a need for diversity in speaking styles in anti-spoofing datasets. Furthermore, we evaluate the models in telephony scenarios against additional TTS systems. The results reveal that the models still face certain challenges in this context.


Dataset Construction

Figure: Overview of the CSS Dataset creation.

Details of Chula Spoofed Speech (CSS) Dataset

We present the Chula Spoofed Speech (CSS) dataset, a Thai-language spoofing dataset created to support the development and evaluation of anti-spoofing systems. The dataset contains a total of 1,332,120 utterances, amounting to approximately 1,620 hours of speech, and includes both genuine (bona-fide) and synthetic (spoofed) samples.

Bona-fide Speech

Source: Recordings from 20 professional voice actors (10 male and 10 female).

Diversity: The speakers cover three age groups (8 adolescents, 9 working adults, 3 elderly) and three distinct speaking styles (formal, casual, and excited).

Volume: This portion consists of 222,020 utterances.

Spoofed Speech

Generation: The synthetic samples were generated to match the bona-fide data, using the same speakers and the same text scripts.

TTS Models: The spoofed data was created using five different high-quality text-to-speech (TTS) systems/combinations:

1. VITS (an end-to-end model)

2. FastPitch + HiFi-GAN

3. FastPitch + UnivNet

4. Tacotron 2 + HiFi-GAN

5. Tacotron 2 + UnivNet

Volume: Each of the five TTS systems generated 222,020 utterances, for a total of 1,110,100 spoofed samples.
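The dataset composition above can be sanity-checked with a short sketch. The counts and system names are taken from the description; the dictionary-style layout is purely illustrative and not a real metadata format of the CSS release.

```python
# Composition of the CSS dataset, using the counts stated in the text.
# Each TTS system regenerates the full set of bona-fide utterances.

UTTERANCES_PER_CONDITION = 222_020  # bona-fide count, matched per TTS system

tts_systems = [
    "VITS",
    "FastPitch + HiFi-GAN",
    "FastPitch + UnivNet",
    "Tacotron 2 + HiFi-GAN",
    "Tacotron 2 + UnivNet",
]

bona_fide_total = UTTERANCES_PER_CONDITION
spoofed_total = UTTERANCES_PER_CONDITION * len(tts_systems)
dataset_total = bona_fide_total + spoofed_total

print(bona_fide_total)  # 222020
print(spoofed_total)    # 1110100
print(dataset_total)    # 1332120
```

This confirms the totals reported above: 222,020 bona-fide plus 1,110,100 spoofed utterances gives the 1,332,120 utterances stated in the abstract.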

Speech Samples

[Audio samples: six speakers (001–006), two per speaking style (formal, casual, excited). For each speaker, samples include bona-fide speech and outputs from the five TTS systems: VITS, FastPitch + HiFi-GAN, FastPitch + UnivNet, Tacotron 2 + HiFi-GAN, and Tacotron 2 + UnivNet.]

Acknowledgements

This research was jointly supported by the PMU-C grant (C05F660049) and Amity Accentix Co., Ltd.

We would also like to express our sincere gratitude to all the voice contributors who generously dedicated their time and talent to make this project possible.