Validity and reliability of an instrument evaluating the performance of intelligent chatbot: the Artificial Intelligence Performance Instrument (AIPI).

Lechien, Jérome; Maniaci, Antonino; Gengler, Isabelle; Hans, Stephane; Chiesa-Estomba, Carlos M; Vaira, Luigi A

doi:10.1007/s00405-023-08219-y

Article (Scientific journals)

2023 • In European Archives of Oto-Rhino-Laryngology

Peer Reviewed verified by ORBi

Permalink
https://hdl.handle.net/20.500.12907/47325

DOI
10.1007/s00405-023-08219-y

PubMed
37698703

Details

Keywords :

Artificial; ChatGPT; Chatbot; Comparison; Diagnosis; GPT; Head neck; Instrument; Intelligence; Medicine; Otolaryngology; Performance; Surgery; Tool; Treatment; Otorhinolaryngology; General Medicine

Abstract :

[en] OBJECTIVES: To evaluate the reliability and validity of the Artificial Intelligence Performance Instrument (AIPI).

METHODS: Medical records of patients consulting in otolaryngology were evaluated by physicians and ChatGPT for differential diagnosis, management, and treatment. The ChatGPT performance was rated twice using AIPI within a 7-day period to assess test-retest reliability. Internal consistency was evaluated using Cronbach's α. Internal validity was evaluated by comparing the AIPI scores of the clinical cases rated by ChatGPT and 2 blinded practitioners. Convergent validity was measured by comparing the AIPI score with a modified version of the Ottawa Clinical Assessment Tool (OCAT). Interrater reliability was assessed using Kendall's tau.

RESULTS: Forty-five patients completed the evaluations (28 females). The AIPI Cronbach's alpha analysis suggested an adequate internal consistency (α = 0.754). The test-retest reliability was moderate-to-strong for items and the total score of AIPI (rs = 0.486, p = 0.001). The mean AIPI score of the senior otolaryngologist was significantly higher compared to the score of ChatGPT, supporting adequate internal validity (p = 0.001). Convergent validity reported a moderate and significant correlation between AIPI and modified OCAT (rs = 0.319; p = 0.044). The interrater reliability reported significant positive concordance between both otolaryngologists for the patient feature, diagnostic, additional examination, and treatment subscores as well as for the AIPI total score.

CONCLUSIONS: AIPI is a valid and reliable instrument in assessing the performance of ChatGPT in ear, nose and throat conditions. Future studies are needed to investigate the usefulness of AIPI in medicine and surgery, and to evaluate the psychometric properties in these fields.

Disciplines :

Otolaryngology

Author, co-author :

Lechien, Jérome; Université de Mons - UMONS > Faculté de Psychologie et des Sciences de l'Educatio > Service de Métrologie et Sciences du langage

Maniaci, Antonino; Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France ; Department of Medical, Surgical Sciences and Advanced Technologies G.F. Ingrassia, ENT Section, University of Catania, 95123, Catania, Italy

Gengler, Isabelle; Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France ; Department of Otolaryngology-Head and Neck Surgery, University of Cincinnati Medical Center, Cincinnati, OH, USA

Hans, Stephane; Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France ; Phonetics and Phonology Laboratory (UMR 7018 CNRS, Université Sorbonne Nouvelle/Paris 3), Department of Otorhinolaryngology and Head and Neck Surgery, Foch Hospital, School of Medicine, UFR Simone Veil, Université Versailles Saint-Quentin-en-Yvelines (Paris Saclay University), Paris, France

Chiesa-Estomba, Carlos M; Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France ; Young Confederation of the European Oto-Rhino-Laryngological Head and Neck Surgery Societies (Y-CEORLHNS), Dublin, Ireland ; Department of Otorhinolaryngology - Head and Neck Surgery, Donostia University Hospital - Biodonostia Research Institute, St. Sebastian, Spain

Vaira, Luigi A; Research Committee of Young Otolaryngologists of the International Federation of Otorhinolaryngological Societies (IFOS), Paris, France ; Maxillofacial Surgery Operative Unit, Department of Medicine, Surgery and Pharmacy, University of Sassari, Sassari, Italy ; Biomedical Science Department, Biomedical Science PhD School, University of Sassari, Sassari, Italy

Language :

English

Title :

Validity and reliability of an instrument evaluating the performance of intelligent chatbot: the Artificial Intelligence Performance Instrument (AIPI).

Publication date :

12 September 2023

Journal title :

European Archives of Oto-Rhino-Laryngology

ISSN :

0937-4477

eISSN :

1434-4726

Publisher :

Springer Science and Business Media Deutschland GmbH, Germany

Peer reviewed :

Peer Reviewed verified by ORBi

Additional URL :

https://link.springer.com/content/pdf/10.1007/s00405-023-08219-y.pdf

Research unit :

M112 - Anatomie humaine et Oncologie expérimentale

Research institute :

R550 - Institut des Sciences et Technologies de la Santé
R350 - Institut de recherche en sciences et technologies du langage

Available on ORBi UMONS :

since 25 December 2023
