HMM-based Speech Synthesis of Live Sports Commentaries: Integration of a Two-Layer Prosody Annotation

Picart, Benjamin; Brognaux, Sandrine; Drugman, Thomas

Paper published in a journal (Scientific congresses and symposiums)

2013

Permalink
https://hdl.handle.net/20.500.12907/41580

Files

Full Text

ssw8_bpsbtd.pdf

Author postprint (480.23 kB)

All documents in ORBi UMONS are protected by a user license.

Details

Keywords :

Speaking Style Adaptation; Expressive Speech; Prosody; HMM-based Speech Synthesis; Sports Commentaries

Abstract :

This paper proposes integrating a two-layer prosody annotation, specific to live sports commentaries, into HMM-based speech synthesis. Local labels are assigned to every syllable and mark accentual phenomena. Global labels group sequences of words into five distinct speaking styles, defined in terms of valence and arousal. Two stages of the synthesis process are analyzed. First, global labels (i.e. speaking styles) are integrated using either speaker-dependent training or adaptation methods. Second, a comprehensive study evaluates the effect of each prosody annotation layer on the generated speech. The evaluation relies on three subjective criteria: intelligibility, expressivity and segmental quality. Our experiments indicate that: (i) for the integration of global labels, adaptation techniques outperform speaking-style-dependent models in both intelligibility and segmental quality; (ii) integrating local labels enhances expressivity and also yields slightly higher intelligibility and segmental quality; (iii) combining the two annotation levels (local and global) gives the best results, with higher expressivity and intelligibility.

Disciplines :

Electrical & electronics engineering

Author, co-author :

Picart, Benjamin ; Université de Mons > Faculté Polytechnique > Information, Signal et Intelligence artificielle

Brognaux, Sandrine

Drugman, Thomas ; Université de Mons > Faculté Polytechnique > Information, Signal et Intelligence artificielle

Language :

English

Title :

HMM-based Speech Synthesis of Live Sports Commentaries: Integration of a Two-Layer Prosody Annotation

Publication date :

02 July 2013

Event name :

8th Speech Synthesis Workshop (SSW8)

Event place :

Barcelona, Spain

Event date :

2013

Research unit :

F105 - Information, Signal et Intelligence artificielle

Research institute :

R450 - Institut NUMEDIART pour les Technologies des Arts Numériques

Available on ORBi UMONS :

since 23 January 2014
