In this paper, we present a methodology for linguistic feature extraction, focusing in particular on the automatic syllabification of words in multiple languages, designed to be compatible with a forced-alignment tool, the Montreal Forced Aligner (MFA). Our method covers the extraction of phonetic transcriptions from text, stress marks, and a unified automatic syllabification in both the textual and phonetic domains. The system was built with open-source components and resources. Through an ablation study, we demonstrate the efficacy of our approach in automatically syllabifying words in several languages (English, French, and Spanish). Additionally, we apply the technique to the transcriptions of the CMU ARCTIC dataset, generating valuable annotations available online\footnote{\url{https://github.com/noetits/MUST_P-SRL}} that are well suited to speech representation learning, speech unit discovery, and the disentanglement of speech factors in several speech-related fields.
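The paper's own pipeline is not reproduced in this record, but the kind of phonetic-domain syllabification it describes can be illustrated with a minimal rule-based sketch using onset maximization over a sonority-style legality constraint. The vowel inventory and the set of legal onset clusters below are illustrative assumptions for English ARPAbet-like phones, not the resources used in MUST&P-SRL:

```python
# Minimal sketch of rule-based phonetic syllabification via onset
# maximization. VOWELS and LEGAL_ONSETS are illustrative assumptions,
# not the paper's actual open-source resources.

VOWELS = {"AA", "AE", "AH", "AO", "EH", "ER", "IH", "IY", "UW", "EY", "OW"}
# Hypothetical set of consonant clusters treated as legal syllable onsets.
LEGAL_ONSETS = {(), ("B",), ("S",), ("T",), ("K",), ("P",), ("R",), ("L",),
                ("S", "T"), ("T", "R"), ("S", "T", "R"), ("P", "L")}

def syllabify(phones):
    """Split a flat phone sequence into syllables (onset maximization)."""
    # Each vowel is a syllable nucleus.
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if not nuclei:
        return [phones]
    syllables, start = [], 0
    for prev, nxt in zip(nuclei, nuclei[1:]):
        cluster = phones[prev + 1 : nxt]  # consonants between two nuclei
        # Give the following syllable the longest legal onset; the rest
        # stays in the coda of the preceding syllable.
        split = 0
        for k in range(len(cluster) + 1):
            if tuple(cluster[k:]) in LEGAL_ONSETS:
                split = k
                break
        boundary = prev + 1 + split
        syllables.append(phones[start:boundary])
        start = boundary
    syllables.append(phones[start:])
    return syllables

# "abstract" → AB.STRACT
print(syllabify(["AE", "B", "S", "T", "R", "AE", "K", "T"]))
```

A full multilingual system would replace the hand-listed onsets with language-specific resources and keep text-domain and phonetic-domain boundaries consistent, as the paper's unified approach requires.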
Disciplines :
Electrical & electronics engineering
Author, co-author :
Tits, Noé ; Université de Mons - UMONS > Faculté Polytechnique > Service Information, Signal et Intelligence artificielle
Language :
English
Title :
MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning
Publication date :
2023
Event name :
2023 Conference on Empirical Methods in Natural Language Processing
Event organizer :
Association for Computational Linguistics
Event place :
Singapore
Event date :
6-10 December 2023
Audience :
International
Main work title :
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track