Cross-language voice conversion based on eigenvoices

This paper presents a novel cross-language voice conversion (VC) method based on eigenvoice conversion (EVC). Cross-language VC is a technique for converting voice quality between two speakers who speak different languages. In general, parallel data consisting of utterance pairs from those two speakers are not available. To deal with this problem, we apply EVC to cross-language VC. First, we train an eigenvoice GMM (EV-GMM) using multiple parallel data sets between a source speaker and many pre-stored speakers who speak the same language as the source speaker. Then, the conversion model between the source speaker and a target speaker who cannot speak the source speaker's language is built by adapting the EV-GMM using a few arbitrary sentences uttered by the target speaker in a different language. The experimental results demonstrate that the proposed method yields significant improvements in both speech quality and conversion accuracy for speaker individuality compared with a conventional cross-language VC method based on frame selection.
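
The two stages described in the abstract (building an eigenvoice space from pre-stored speakers, then adapting it from a few target utterances) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes identity covariances and mixture posteriors that are already given, so weight estimation reduces to weighted least squares, whereas the EV-GMM literature estimates the weights by maximum likelihood. The function names `train_eigenvoices`, `adapt_weights`, and `adapted_means` are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def train_eigenvoices(supervectors, num_eigenvoices):
    """PCA over the pre-stored speakers' mean supervectors.

    supervectors: array (S, D), one concatenated GMM mean vector per speaker.
    Returns the bias vector (D,) and an orthonormal basis (D, J).
    """
    bias = supervectors.mean(axis=0)
    centered = supervectors - bias
    # Rows of vt are the principal directions of the centered supervectors.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:num_eigenvoices].T
    return bias, basis

def adapt_weights(bias, basis, frames, posteriors):
    """Estimate the eigenvoice weight vector from target frames.

    frames: (T, d) target features; posteriors: (T, M) mixture occupancies.
    Assumption: identity covariances, so the estimate is the solution of
    a weighted least-squares problem (a simplification of ML adaptation).
    """
    d = frames.shape[1]
    num_mix = posteriors.shape[1]
    num_eig = basis.shape[1]
    per_mix_basis = basis.reshape(num_mix, d, num_eig)  # B_i for each mixture
    per_mix_bias = bias.reshape(num_mix, d)             # b_i for each mixture
    lhs = np.zeros((num_eig, num_eig))
    rhs = np.zeros(num_eig)
    for i in range(num_mix):
        gamma = posteriors[:, i]
        occupancy = gamma.sum()
        lhs += occupancy * per_mix_basis[i].T @ per_mix_basis[i]
        rhs += per_mix_basis[i].T @ (gamma @ frames - occupancy * per_mix_bias[i])
    return np.linalg.solve(lhs, rhs)

def adapted_means(bias, basis, weights, num_mix):
    """Target mean vectors: mean_i = B_i w + b_i, returned as (M, d)."""
    return (bias + basis @ weights).reshape(num_mix, -1)
```

In this simplified setting, only the low-dimensional weight vector is estimated from the target speaker's data, which is why a few arbitrary sentences can suffice and why no parallel utterances are needed at adaptation time.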

Disciplines :

Electrical & electronics engineering

Author, co-author :

Charlier, M.

Ohtani, Y.

Toda, T.

Moinet, Alexis ; Université de Mons > Faculté Polytechnique > Information, Signal et Intelligence artificielle

Dutoit, Thierry ; Université de Mons > Faculté Polytechnique > Information, Signal et Intelligence artificielle

Language :

English

Title :

Cross-language voice conversion based on eigenvoices

Publication date :

06 September 2009

Event name :

Interspeech 2009

Event place :

Brighton, United Kingdom

Event date :

2009

Research unit :

F105 - Information, Signal et Intelligence artificielle

Research institute :

R300 - Institut de Recherche en Technologies de l'Information et Sciences de l'Informatique
R450 - Institut NUMEDIART pour les Technologies des Arts Numériques

Available on ORBi UMONS :

since 24 November 2010
