Speaker-Aware Long Short-Term Memory Multi-Task Learning for Speech Recognition

To address the common problem of overfitting in speech recognition, this article investigates Multi-Task Learning with speaker classification as the auxiliary task. Overfitting occurs when the amount of training data is limited, leading to an overly sensitive acoustic model. Multi-Task Learning is one of several regularization methods; it reduces overfitting by forcing the acoustic model to train jointly on multiple different but related tasks. In this paper, we consider speaker classification as an auxiliary task in order to improve the generalization abilities of the acoustic model, by training the model to recognize the speaker or to find the closest one in the training set. We investigate this Multi-Task Learning setup on the TIMIT database, with acoustic modeling performed by a Recurrent Neural Network with Long Short-Term Memory cells.
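As a rough illustration of the joint training idea described in the abstract, the sketch below combines a main phone-classification loss with an auxiliary speaker-classification loss. It is not taken from the paper: the weighted-sum form, the function names, and the `aux_weight` parameter are all assumptions made for this example, and the logits stand in for the outputs of two heads on a shared network.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of raw scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability assigned to the correct class index.
    return -math.log(softmax(logits)[target])

def multitask_loss(phone_logits, phone_target,
                   speaker_logits, speaker_target,
                   aux_weight=0.1):
    # Assumed form: main loss plus a weighted auxiliary loss.
    # aux_weight is a hypothetical hyperparameter; aux_weight=0
    # reduces to ordinary single-task training.
    main = cross_entropy(phone_logits, phone_target)
    aux = cross_entropy(speaker_logits, speaker_target)
    return main + aux_weight * aux

# One frame: 3 phone classes, 4 training speakers (toy sizes).
loss = multitask_loss([2.0, 0.5, -1.0], 0, [0.1, 0.3, 1.2, -0.4], 2)
```

In a setup like this, the auxiliary head would typically be discarded at recognition time, since only the phone outputs are needed for decoding.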

Disciplines :

Electrical & electronics engineering
Library & information sciences

Author, co-author :

Pironkov, Gueorgui ; Université de Mons > Faculté Polytechnique > Information, Signal et Intelligence artificielle

Dupont, Stéphane ; Université de Mons > Faculté Polytechnique > Information, Signal et Intelligence artificielle

Dutoit, Thierry ; Université de Mons > Faculté Polytechnique > Service Information, Signal et Intelligence artificielle

Language :

English

Title :

Speaker-Aware Long Short-Term Memory Multi-Task Learning for Speech Recognition

Publication date :

31 August 2016

Event name :

European Signal Processing Conference

Event place :

Budapest, Hungary

Event date :

2016

Research unit :

F105 - Information, Signal et Intelligence artificielle

Research institute :

R300 - Institut de Recherche en Technologies de l'Information et Sciences de l'Informatique
R450 - Institut NUMEDIART pour les Technologies des Arts Numériques

Available on ORBi UMONS :

since 05 September 2016

