Evaluating large language models on medical evidence summarization

Tang, Liyan; Sun, Zhaoyi; Idnay, Betina; Nestor, Jordan G.; Soroush, Ali; Elias, Pierre A.; Xu, Ziyang; Ding, Ying; Durrett, Greg; Rousseau, Justin F.; Weng, Chunhua; Peng, Yifan

Published in

Nature Research, npj Digital Medicine, 1(6), 2023

DOI: 10.1038/s41746-023-00896-7

Journal article published in 2023 by Liyan Tang, Zhaoyi Sun, Betina Idnay, Jordan G. Nestor, Ali Soroush, Pierre A. Elias, Ziyang Xu, Ying Ding, Greg Durrett, Justin F. Rousseau, Chunhua Weng, Yifan Peng

This paper is made freely available by the publisher.

Preprint: archiving allowed

Postprint: archiving forbidden

Published version: archiving allowed

Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts.
