Comparing the performance of ChatGPT GPT‐4, Bard, and Llama‐2 in the Taiwan Psychiatric Licensing Examination and in differential diagnosis with multi‐center psychiatrists

Li, Dian‐Jeng; Kao, Yu‐Chen; Tsai, Shih‐Jen; Bai, Ya‐Mei; Yeh, Ta‐Chuan; Chu, Che‐Sheng; Hsu, Chih‐Wei; Cheng, Szu‐Wei; Hsu, Tien‐Wei; Liang, Chih‐Sung; Su, Kuan‐Pin

Published in

Wiley, Psychiatry and Clinical Neurosciences, 2024

DOI: 10.1111/pcn.13656

Journal article published in 2024 by Dian‐Jeng Li, Yu‐Chen Kao, Shih‐Jen Tsai, Ya‐Mei Bai, Ta‐Chuan Yeh, Che‐Sheng Chu, Chih‐Wei Hsu, Szu‐Wei Cheng, Tien‐Wei Hsu, Chih‐Sung Liang, Kuan‐Pin Su

This paper was not found in any repository, but could be made available legally by the author.

Full text: Unavailable

Preprint: archiving allowed

Postprint: archiving restricted

Published version: archiving forbidden

Abstract

Aim: Large language models (LLMs) have been suggested to play a role in medical education and medical practice. However, the potential of their application in the psychiatric domain has not been well‐studied.

Method: In the first step, we compared the performance of ChatGPT GPT‐4, Bard, and Llama‐2 in the 2022 Taiwan Psychiatric Licensing Examination conducted in traditional Mandarin. In the second step, we compared the scores of these three LLMs with those of 24 experienced psychiatrists in 10 advanced clinical scenario questions designed for psychiatric differential diagnosis.

Result: Only GPT‐4 passed the 2022 Taiwan Psychiatric Licensing Examination (scoring 69, with ≥ 60 considered a passing grade), while Bard scored 36 and Llama‐2 scored 25. GPT‐4 outperformed Bard and Llama‐2, especially in the areas of ‘Pathophysiology & Epidemiology’ (χ² = 22.4, P < 0.001) and ‘Psychopharmacology & Other therapies’ (χ² = 15.8, P < 0.001). In the differential diagnosis, the mean score of the 24 experienced psychiatrists (mean 6.1, standard deviation 1.9) was higher than the scores of GPT‐4 (5), Bard (3), and Llama‐2 (1).

Conclusion: Compared to Bard and Llama‐2, GPT‐4 demonstrated superior abilities in identifying psychiatric symptoms and making clinical judgments. In addition, GPT‐4's ability for differential diagnosis closely approached that of the experienced psychiatrists. Among the three LLMs, GPT‐4 showed promising potential as a valuable tool in psychiatric practice.
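
To put the reported figures side by side, the following is a minimal sketch that checks each model's examination score against the pass mark and expresses each differential diagnosis score as a distance from the psychiatrists' mean in standard deviation units. All numbers come from the abstract; the variable names and the standardized gap are illustrative assumptions, a descriptive calculation only, and not an analysis reported by the authors.

```python
# Figures are taken from the abstract; names and the calculation are illustrative.
PASS_MARK = 60  # passing grade for the licensing examination
exam_scores = {"GPT-4": 69, "Bard": 36, "Llama-2": 25}

# Differential diagnosis: 24 psychiatrists (mean 6.1, SD 1.9) vs. one score per model.
psychiatrist_mean = 6.1
psychiatrist_sd = 1.9
diagnosis_scores = {"GPT-4": 5, "Bard": 3, "Llama-2": 1}

# Which models reach the pass mark?
passed = {model: score >= PASS_MARK for model, score in exam_scores.items()}

# Distance of each model's score from the psychiatrists' mean, in SD units.
# This is a descriptive ratio (an assumption of this sketch), not a significance test.
standardized_gap = {
    model: round((score - psychiatrist_mean) / psychiatrist_sd, 2)
    for model, score in diagnosis_scores.items()
}

for model in exam_scores:
    print(model, "passed:", passed[model], "gap (SD units):", standardized_gap[model])
```

Under these assumptions, GPT‐4's score sits within one standard deviation of the psychiatrists' mean, while the Bard and Llama‐2 scores fall more than one standard deviation below it.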
