Specialized Large Language Model Outperforms Neurologists at Complex Diagnosis in Blinded Case-Based Evaluation

Background/Objectives: Artificial intelligence (AI), particularly large language models (LLMs), has demonstrated versatility in various applications but faces challenges in specialized domains like neurology. This study evaluates a specialized LLM’s capability and trustworthiness in complex neurological diagnosis, comparing its performance to neurologists in simulated clinical settings. Methods: We deployed GPT-4 Turbo (OpenAI, San Francisco, CA, US) through Neura (Sciense, New York, NY, US), an AI infrastructure with a dual-database architecture integrating “long-term memory” and “short-term memory” components on a curated neurological corpus. Five representative clinical scenarios were presented to 13 neurologists and the AI system. Participants formulated differential diagnoses based on initial presentations, followed by definitive diagnoses after receiving conclusive clinical information. Two senior academic neurologists blindly evaluated all responses, while an independent investigator assessed the verifiability of AI-generated information. Results: AI achieved a significantly higher normalized score (86.17%) than neurologists (55.11%, p < 0.001). For differential diagnosis questions, AI scored 85% versus 46.15% for neurologists, and for final diagnosis, 88.24% versus 70.93%. AI obtained 15 maximum scores in its 20 evaluations and responded in under 30 s, versus an average of 9 min for neurologists. All AI-provided references were classified as relevant with no hallucinatory content detected. Conclusions: A specialized LLM demonstrated superior diagnostic performance compared to practicing neurologists across complex clinical challenges. This indicates that appropriately harnessed LLMs with curated knowledge bases can achieve domain-specific relevance in complex clinical disciplines, suggesting potential for AI as a time-efficient asset in clinical practice.
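The headline figures in the abstract can be cross-checked with simple arithmetic. The sketch below is illustrative only. It assumes that the 20 AI evaluations arise from 5 scenarios, each with two questions (differential and final diagnosis), each scored by both evaluators; the abstract does not state this breakdown explicitly. The function name `summarize` and the constant names are hypothetical.

```python
# Figures quoted in the abstract.
SCENARIOS = 5
NEUROLOGISTS = 13
AI_SCORE_PCT = 86.17
NEUROLOGIST_SCORE_PCT = 55.11
AI_MAX_SCORES = 15
AI_EVALUATIONS = 20
AI_RESPONSE_S = 30                # upper bound: "under 30 s"
NEUROLOGIST_RESPONSE_S = 9 * 60   # reported average of 9 min

# Assumptions, not stated in the abstract: one differential and one
# final-diagnosis question per scenario, each scored by both evaluators.
QUESTIONS_PER_SCENARIO = 2
EVALUATORS = 2


def summarize() -> dict:
    """Derive secondary figures from the numbers reported in the abstract."""
    evaluations_per_participant = SCENARIOS * QUESTIONS_PER_SCENARIO * EVALUATORS
    return {
        # Evaluations per participant implied by the assumed breakdown.
        "evaluations_per_participant": evaluations_per_participant,
        # Total neurologist evaluations under the same assumption.
        "neurologist_evaluations": NEUROLOGISTS * evaluations_per_participant,
        # Share of AI evaluations that received the maximum score.
        "ai_max_score_share": AI_MAX_SCORES / AI_EVALUATIONS,
        # Gap between normalized scores, in percentage points.
        "score_gap_points": round(AI_SCORE_PCT - NEUROLOGIST_SCORE_PCT, 2),
        # Lower bound on the speed-up, since the AI time is an upper bound.
        "min_speedup": NEUROLOGIST_RESPONSE_S / AI_RESPONSE_S,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Under these assumptions the breakdown reproduces the 20 AI evaluations, and the reported figures imply a gap of 31.06 percentage points and at least an 18-fold difference in response time.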

Disciplines :

Neurology

Author, co-author :

Barrit, Sami ; Neurosurgery, Université Libre de Bruxelles, 1070 Brussels, Belgium ; Neurosurgery, CHU Tivoli, 7110 La Louvière, Belgium ; Neurodynamics Laboratory, Department of Neurosurgery, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA ; Sciense, New York, NY 10013, USA

Torcida, Nathan ; Sciense, New York, NY 10013, USA ; Neurology, Université Libre de Bruxelles, 1050 Brussels, Belgium

Mazeraud, Aurelien ; Anesthésie-Réanimation, GHU Paris, Pôle Neuro, 75014 Paris, France ; Neurosciences, Université de Paris, 75006 Paris, France

Boulogne, Sebastien ; Neurophysiology and Epileptology, Université de Lyon, 69007 Lyon, France

Benoit, Jeanne ; Neurology, CHU de Nice, Université Côte d’Azur, UMR2CA, 06000 Nice, France

Carette, Timothée ; Neurology, Université Catholique de Louvain, Clinique Saint-Pierre Ottignies, 1348 Louvain-la-Neuve, Belgium

Carron, Thibault ; LIP6, CNRS, Sorbonne Université, 75005 Paris, France

Delsaut, Bertil ; Neurology, Université Libre de Bruxelles, 1050 Brussels, Belgium ; Neurology, CHU Tivoli, 7110 La Louvière, Belgium

Diab, Eva ; Clinical Neurophysiology, CHU Amiens Picardie, CHIMERE UR 7516 UPJV, 80054 Amiens, France

Kermorvant, Hugo ; Neurophy Lab, Université Libre de Bruxelles, 1050 Brussels, Belgium

More authors (15 more)

Language :

English

Title :

Specialized Large Language Model Outperforms Neurologists at Complex Diagnosis in Blinded Case-Based Evaluation

Publication date :

27 March 2025

Journal title :

Brain Sciences

eISSN :

2076-3425

Publisher :

MDPI AG

Volume :

15

Issue :

4

Pages :

347

Peer reviewed :

Peer Reviewed verified by ORBi

Additional URL :

https://www.mdpi.com/2076-3425/15/4/347/pdf

Available on ORBi UMONS :

since 03 April 2025

Statistics

Number of views

19 (2 by UMONS)

Number of downloads

4 (1 by UMONS)
