Attention in Machine Learning

Attention; Neural networks; Sequence learning; Transformer; Computational modelling; Human brain; Machine learning; Models of attention; Natural languages; Sensory input; Neuroscience (all); Computer Science (all); Engineering (all)

Abstract :

The human brain’s remarkable ability to focus on relevant information amidst a sea of sensory inputs has inspired a new wave of research in artificial intelligence. In this chapter, we delve into the computational modeling of attention in machine learning, where neural networks are trained to selectively pay attention to specific parts of their input to produce accurate outputs. From natural language processing to computer vision, from recommender systems to market forecasting, attention-based models have achieved state-of-the-art performance on a wide range of tasks. We will guide readers through the history and evolution of attention in machine learning, from its early implementations to recent breakthroughs with the Transformer architecture. Through a step-by-step introduction to neural networks and sequence learning, we will explain the motivation behind computational attention, explore its implementations, and provide a comparison with human attention. With this overview of the attention landscape in machine learning, readers will gain insight into how this computational concept has transformed AI research.

Disciplines :

Computer science

Author, co-author :

Gille, Cyprien ; Université de Mons - UMONS > Faculté Polytechnique > Service Information, Signal et Intelligence artificielle

Language :

English

Title :

Attention in Machine Learning

Publication date :

February 2025

Main work title :

From Human Attention to Computational Attention: A Multidisciplinary Approach

Publisher :

Springer Science+Business Media

ISBN/EAN :

978-3-031-84300-6
978-3-031-84299-3

Peer reviewed :

Editorial reviewed

Additional URL :

https://link.springer.com/content/pdf/10.1007/978-3-031-84300-6_12

Research unit :

F105 - Information, Signal et Intelligence artificielle

Research institute :

R300 - Institut de Recherche en Technologies de l'Information et Sciences de l'Informatique

Available on ORBi UMONS :

since 12 January 2026
