Just over 1 year has passed since ChatGPT (OpenAI) burst onto the scene to become the fastest-growing consumer software application in history.1 It was immediately obvious that large language models (LLMs) were extraordinarily powerful, but the scope, nature, application, and implications of that power remain unclear. Chung and colleagues2 present a fascinating article in JAMA Surgery demonstrating the ability of a specific LLM (GPT-4 Turbo [OpenAI]) to inspect preoperative clinician notes from the electronic health record and classify or predict several perioperative parameters, such as the American Society of Anesthesiologists Physical Status (ASA) classification, hospital or intensive care unit admission, and postoperative length of stay. The authors find that the LLM not only classified cases accurately but also provided an explanation justifying why, for example, a specific patient was categorized as ASA class 3 rather than class 4. By contrast, the LLM struggled to make accurate predictions of continuous outcomes such as length of stay.
By focusing on familiar perioperative outcomes, this study serves as an approachable case study that provides the motivated reader with an accessible primer on how LLMs work, what sophisticated prompting strategies look like, the kinds of outcomes that can be expected, and the way performance changes with prompting strategy. As such, it will be of interest to a wide range of surgeons and investigators eager to explore how LLMs might be applied to their own areas of research.
The article2 presents an enormous amount of detail, much of which is found only in the online supplements. Readers unfamiliar with machine learning and LLM methodology will encounter challenging terminology, but they will be rewarded for downloading and digesting all that is made available.
Significant technical, regulatory, and practical challenges remain before this or any other LLM might be deployed in clinical practice. For example, when outcomes are rare (eg, in-hospital mortality), rote prediction of the most prevalent condition is almost always correct, making it technically challenging to train and evaluate LLMs. The authors deployed a sampling strategy that enriched the prevalence of rare outcomes to enhance detectability, but it is unclear how well the LLM will perform in datasets reflecting real-world prevalence. The methods also relied on special treatment from Microsoft, which agreed to delete patient data from memory immediately after LLM processing in order to comply with privacy regulations; whether this can be done at scale is untested. Even if these barriers are overcome, the discipline of implementation science warns of a whole raft of challenges that emerge in translation from the controlled environment of the bench to the unpredictable reality of the bedside. In addition, none of this speaks to the noted tendency of LLMs to “hallucinate” or “confabulate” text that appears plausible but is false,3,4 raising the question of whether, like Madison Avenue’s flexible commitment to truth in advertising, LLMs may occasionally tell us only what we want to hear.5,6
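To make the class-imbalance problem concrete, the following minimal sketch illustrates one generic way an evaluation sample can be enriched so that a rare outcome, such as in-hospital mortality, is common enough to measure discrimination rather than rewarding rote prediction of the majority class. It is not drawn from the methods of Chung and colleagues2; the function name, the 30% target prevalence, and the toy data are illustrative assumptions.

```python
import random

def enrich_rare_outcome(cases, label_fn, target_prevalence=0.3, seed=0):
    """Illustrative only: build an evaluation sample in which a rare positive
    outcome (eg, in-hospital mortality) is oversampled to roughly
    `target_prevalence`, so a model cannot look accurate simply by always
    predicting the majority (negative) class."""
    rng = random.Random(seed)
    positives = [c for c in cases if label_fn(c)]      # rare outcome present
    negatives = [c for c in cases if not label_fn(c)]  # rare outcome absent
    # Keep all rare positives; down-sample negatives so positives make up
    # about `target_prevalence` of the enriched sample.
    n_neg = int(len(positives) * (1 - target_prevalence) / target_prevalence)
    sample = positives + rng.sample(negatives, min(n_neg, len(negatives)))
    rng.shuffle(sample)
    return sample

# Hypothetical usage: ~2% mortality in the source data, enriched to ~30%.
cases = [{"id": i, "died": i % 50 == 0} for i in range(5000)]
enriched = enrich_rare_outcome(cases, lambda c: c["died"])
print(sum(c["died"] for c in enriched) / len(enriched))  # ~0.30
```

Any performance estimated on such an enriched sample, particularly prevalence-dependent metrics such as positive predictive value, will not translate directly to data reflecting real-world prevalence, which is the concern raised above.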
Practical application is surely coming. Chung and colleagues2 have sent us a thorough dispatch from the frontier of this brave new world.
Corresponding Author: Daniel E. Hall, MD, MDiv, MHSc, UPMC Presbyterian, 200 Lothrop St, Ste F1264, Pittsburgh, PA 15213 (hallde@upmc.edu).
Published Online: June 5, 2024. doi:10.1001/jamasurg.2024.1645
Conflict of Interest Disclosures: Dr Hall reported receiving grants from Veterans Health Affairs Office of Research and Development and having an unpaid consulting relationship with FutureAssure LLC. No other disclosures were reported.
1. Wikipedia. ChatGPT. Accessed April 17, 2024. https://en.wikipedia.org/wiki/ChatGPT
2. Chung P, Fong CT, Walters AM, Aghaeepour N, Yetisgen M, O’Reilly-Shah VN. Large language model capabilities in perioperative risk prediction and prognostication. JAMA Surg. Published online June 5, 2024.
3. Smith AL, Greaves F, Panch T. Hallucination or confabulation—neuroanatomy as metaphor in large language models. PLOS Digit Health. 2023;2(11):e0000388.
4. Hatem R, Simmons B, Thornton JE. A call to address AI “hallucinations” and how healthcare professionals can mitigate their risks. Cureus. 2023;15(9):e44720.
5. Harrer S. Attention is not all you need: the complicated case of ethically using large language models in health care and medicine. EBioMedicine. 2023;90:104512.
6. Frankfurt HG. The Importance of What We Care About. Cambridge University Press; 1988:117-133.