Thai Speech Spoofing Detection Dataset with Variations in Speaking Styles

(This work has been accepted to Interspeech 2025)

Ticho Urai, Pachara Boonsarngsuk, Ekapol Chuangsuwanich

ticho.urai@gmail.com, pacharawinboon@gmail.com, ekapol.c@chula.ac.th

Department of Computer Engineering, Faculty of Engineering

Chulalongkorn University, Thailand

Abstract: We develop the Chula Spoofed Speech (CSS) dataset, a spoofing dataset for Thai, which contains 1,332,120 utterances of both bona fide and synthetic speech. Synthetic speech samples were generated using five distinct high-quality text-to-speech (TTS) systems, all based on the same utterances as the bona fide data. The data covers various age ranges and speaking styles. Strong baselines such as AASIST and RawNet2 are trained under different conditions to uncover aspects that affect the performance of the models. Besides unseen attacks, unseen speaking styles also substantially degrade performance, indicating a need for diversity in speaking styles in anti-spoofing datasets. Furthermore, we evaluate the models in telephony scenarios against additional TTS systems. The results reveal that the models still face certain challenges in this context.


Dataset Construction

Figure: Overview of the CSS Dataset creation.

Details of Chula Spoofed Speech (CSS) Dataset

We present the Chula Spoofed Speech (CSS) dataset, a Thai-language spoofing dataset created to support the development and evaluation of anti-spoofing systems. The dataset contains a total of 1,332,120 utterances, amounting to approximately 1,620 hours of speech, and includes both genuine (bona-fide) and synthetic (spoofed) samples.

Bona-fide Speech

Source: Recordings from 20 professional voice actors (10 male and 10 female).

Diversity: The speakers cover three age groups (8 adolescents, 9 working adults, 3 elderly) and three distinct speaking styles (formal, casual, and excited).

Volume: This portion consists of 222,020 utterances.

Spoofed Speech

Generation: The synthetic samples were generated to match the bona-fide data, using the same speakers and the same text scripts.

TTS Models: The spoofed data was created using five different high-quality text-to-speech (TTS) systems/combinations:

1. VITS (an end-to-end model)

2. FastPitch + HiFi-GAN

3. FastPitch + UnivNet

4. Tacotron 2 + HiFi-GAN

5. Tacotron 2 + UnivNet

Volume: Each of the five TTS systems generated 222,020 utterances, for a total of 1,110,100 spoofed samples.
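The dataset composition above can be sanity-checked with a short sketch. The counts and system names are taken from the description; the dictionary-style layout is purely illustrative and not a real metadata format of the CSS release.

```python
# Composition of the CSS dataset, using the counts stated in the text.
# Each TTS system regenerates the full set of bona-fide utterances.

UTTERANCES_PER_CONDITION = 222_020  # bona-fide count, matched per TTS system

tts_systems = [
    "VITS",
    "FastPitch + HiFi-GAN",
    "FastPitch + UnivNet",
    "Tacotron 2 + HiFi-GAN",
    "Tacotron 2 + UnivNet",
]

bona_fide_total = UTTERANCES_PER_CONDITION
spoofed_total = UTTERANCES_PER_CONDITION * len(tts_systems)
dataset_total = bona_fide_total + spoofed_total

print(bona_fide_total)  # 222020
print(spoofed_total)    # 1110100
print(dataset_total)    # 1332120
```

This confirms the totals reported above: 222,020 bona-fide plus 1,110,100 spoofed utterances gives the 1,332,120 utterances stated in the abstract.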

Speech Samples

[Audio samples: six speakers (001–006), two per speaking style (formal, casual, excited). For each speaker, samples include bona-fide speech and outputs from the five TTS systems: VITS, FastPitch + HiFi-GAN, FastPitch + UnivNet, Tacotron 2 + HiFi-GAN, and Tacotron 2 + UnivNet.]

Acknowledgements

This research was jointly supported by the PMU-C grant (C05F660049) and Amity Accentix Co., Ltd.

We would also like to express our sincere gratitude to all the voice contributors who generously dedicated their time and talent to make this project possible.