
      An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study


          Abstract

          Background

Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that guide them to perform specific clinical NLP tasks without any task-specific training data. This approach is known as in-context learning, and using it well is both an art and a science: it requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches.
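
To make in-context learning concrete, the following minimal Python sketch (illustrative only, not the authors' code) sends a single zero-shot instruction to a chat model with no task-specific examples or fine-tuning; the model name, note snippet, and prompt wording are assumptions.

# Zero-shot prompting sketch (illustrative; assumes the openai v1 Python
# package and an OPENAI_API_KEY in the environment).
from openai import OpenAI

client = OpenAI()

note = "Pt c/o CP radiating to L arm."  # hypothetical clinical note snippet
prompt = (
    "Expand every abbreviation in the clinical note below and return the "
    "rewritten sentence.\n\nNote: " + note
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # stand-in for the GPT-3.5 model used in the study
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic output makes evaluation repeatable
)
print(response.choices[0].message.content)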

          Objective

The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types (heuristic and ensemble prompts), for zero-shot and few-shot clinical information extraction using pretrained language models.

          Methods

          This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches.
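
For illustration, here is one hypothetical way the 6 prompt types could be instantiated for the clinical sense disambiguation task; the template wording below is an assumption, not the study's actual prompts.

# Hypothetical templates for the evaluated prompt types, instantiated for
# clinical sense disambiguation. All wording is illustrative.
NOTE = "The patient was weaned to RA overnight."  # "RA" is the ambiguous term

PROMPTS = {
    "simple_prefix": f"Disambiguate the abbreviation 'RA' in this note: {NOTE}",
    "simple_cloze": f"{NOTE}\nIn this note, 'RA' stands for ___.",
    "chain_of_thought": (
        f"{NOTE}\nThink step by step about the clinical context, then state "
        "what 'RA' means here."
    ),
    "anticipatory": (
        f"{NOTE}\n'RA' could mean rheumatoid arthritis, right atrium, or room "
        "air. Given the context, which is it?"
    ),
    "heuristic": (
        "You are a clinician reading a respiratory progress note; interpret "
        f"abbreviations in that light.\n{NOTE}\nWhat does 'RA' mean?"
    ),
}
# An ensemble prompt combines answers from several of the above; see the
# voting sketch after the Results section.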

          Results

The study revealed that task-specific prompt tailoring is vital for the high performance of LLMs in zero-shot clinical NLP. In clinical sense disambiguation, GPT-3.5 achieved an accuracy of 0.96 with heuristic prompts; in biomedical evidence extraction, it reached 0.94. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on the strengths of multiple prompts. GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types.
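
One plausible mechanism for the ensemble result, since the abstract does not specify how prompts are combined, is majority voting over the answers produced by different prompt types; the sketch below assumes an ask_model function such as the one in the Background section.

# Majority-vote ensembling across prompt types (assumed mechanism; the
# abstract does not state how the study's ensembles combine prompts).
from collections import Counter
from typing import Callable

def ensemble_answer(prompts: list[str], ask_model: Callable[[str], str]) -> str:
    """Query the model once per prompt and return the most common answer."""
    answers = [ask_model(p).strip().lower() for p in prompts]
    return Counter(answers).most_common(1)[0][0]

For example, ensemble_answer(list(PROMPTS.values()), ask_model) would return the answer that the largest number of prompt types agree on.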

          Conclusions

This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research and make it easier for non-NLP experts to engage with clinical NLP advancements. To the best of our knowledge, this is one of the first empirical evaluations of different prompt engineering approaches for clinical NLP in the era of generative artificial intelligence, and we hope it will inspire and inform future research in this area.


                Author and article information

Journal
JMIR Medical Informatics (JMIR Med Inform)
JMIR Publications (Toronto, Canada)
ISSN: 2291-9694
Published 8 April 2024; volume 12: e55318
Affiliations
[1] Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, United States
[2] Department of Health Information Management, University of Pittsburgh, Pittsburgh, PA, United States
[3] Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, United States
                Author notes
Corresponding Author: Yanshan Wang, yanshan.wang@pitt.edu
                Author information
                https://orcid.org/0000-0003-4173-1988
                https://orcid.org/0009-0003-9436-5595
                https://orcid.org/0009-0005-0081-5183
                https://orcid.org/0000-0002-2079-8684
                https://orcid.org/0000-0003-4433-7839
Article
Article ID: v12i1e55318
DOI: 10.2196/55318
PMCID: PMC11036183
PMID: 38587879
                ©Sonish Sivarajkumar, Mark Kelley, Alyssa Samolyk-Mazzanti, Shyam Visweswaran, Yanshan Wang. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 08.04.2024.

                This is an open-access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.

History
Received: 8 December 2023
Revisions requested: 4 February 2024
Revised: 20 February 2024
Accepted: 24 February 2024
                Categories
                Original Paper

Keywords
large language model, LLM, LLMs, natural language processing, NLP, in-context learning, prompt engineering, evaluation, zero-shot, few-shot, prompting, GPT, language model, language, models, machine learning, clinical data, clinical information, extraction, Bard, Gemini, LLaMA-2, heuristic, prompt, prompts, ensemble
