(This work has been accepted to Interspeech 2025)
ticho.urai@gmail.com, pacharawinboon@gmail.com, ekapol.c@chula.ac.th
Department of Computer Engineering, Faculty of Engineering
Chulalongkorn University, Thailand
Abstract: We develop the Chula Spoofed Speech (CSS) dataset, a spoofing dataset for Thai, which contains 1,332,120 utterances of both bona fide and synthetic speech. Synthetic speech samples were generated using five distinct high-quality text-to-speech (TTS) systems, all based on the same utterances as the bona fide data. The data covers various age ranges and speaking styles. Strong baselines such as AASIST and RawNet2 are trained under different conditions to uncover aspects that affect model performance. Besides unseen attacks, unseen speaking styles also have a large impact on performance, indicating a need for diversity of speaking styles in anti-spoofing datasets. Furthermore, we evaluate the models in telephony scenarios against additional TTS systems; the results reveal that the models still face challenges in this setting.
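The baselines above are typically compared with the equal error rate (EER), the operating point where the false-acceptance rate on spoofed speech equals the false-rejection rate on bona fide speech. The paper does not include code for this; the following is a minimal sketch of EER computation from countermeasure scores, assuming the convention that higher scores indicate bona fide speech.

```python
import numpy as np

def eer(bonafide_scores, spoof_scores):
    """Equal error rate: sweep thresholds over all observed scores and
    return the point where false-accept and false-reject rates meet.
    Assumes higher score = more likely bona fide."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    best, best_gap = 1.0, np.inf
    for t in thresholds:
        far = np.mean(spoof_scores >= t)    # spoofed accepted as bona fide
        frr = np.mean(bonafide_scores < t)  # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, best = abs(far - frr), (far + frr) / 2
    return best

# Toy example: perfectly separable scores give an EER of 0.
print(eer(np.array([0.9, 0.8]), np.array([0.1, 0.2])))  # 0.0
```

Production toolkits interpolate between thresholds for a smoother estimate, but this discrete sweep is sufficient to illustrate the metric.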
We present the Chula Spoofed Speech (CSS) dataset, a Thai-language spoofing dataset created to support the development and evaluation of anti-spoofing systems. The dataset contains a total of 1,332,120 utterances, amounting to approximately 1,620 hours of speech, and includes both genuine (bona-fide) and synthetic (spoofed) speech samples.
Source: Recordings from 20 professional voice actors (10 male and 10 female).
Diversity: The speakers cover three age groups (8 adolescents, 9 working adults, 3 elderly) and three distinct speaking styles (formal, casual, and excited).
Volume: This portion consists of 222,020 utterances.
Generation: The synthetic samples were generated to match the bona-fide data, using the same speakers and the same text scripts.
TTS Models: The spoofed data was created using five different high-quality text-to-speech (TTS) systems (one end-to-end model plus four acoustic model + vocoder combinations):
1. VITS (an end-to-end model)
2. FastPitch + HiFi-GAN
3. FastPitch + UnivNet
4. Tacotron 2 + HiFi-GAN
5. Tacotron 2 + UnivNet
Volume: For each of the five TTS systems, 222,020 utterances were generated (one per bona-fide utterance), resulting in a total of 1,110,100 spoofed samples.
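The counts above are consistent with each other; a quick sketch of the arithmetic, using only the numbers stated in the text:

```python
# Verify the dataset counts reported above.
bona_fide = 222_020       # utterances recorded by the 20 voice actors
num_tts_systems = 5       # VITS, FastPitch/Tacotron 2 x HiFi-GAN/UnivNet
spoofed = bona_fide * num_tts_systems
total = bona_fide + spoofed
print(spoofed)  # 1110100
print(total)    # 1332120
```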
Audio samples (available on the demo page): for each speaking style, bona-fide recordings and outputs from the five TTS systems (VITS, FastPitch + HiFi-GAN, FastPitch + UnivNet, Tacotron 2 + HiFi-GAN, Tacotron 2 + UnivNet) are provided for two speakers each:

- Formal speaking style: Speakers 001 and 002
- Casual speaking style: Speakers 003 and 004
- Excited speaking style: Speakers 005 and 006
This research was jointly supported by the PMU-C grant (C05F660049) and Amity Accentix Co., Ltd.
We would also like to express our sincere gratitude to all the voice contributors who generously dedicated their time and talent to make this project possible.