When helpfulness backfires: LLMs and the risk of false medical information due to sycophantic behavior

To evaluate language models across varying levels of drug familiarity, we used the RABBITS30 dataset, which includes 550 common drugs with 1:1 mapping between their brand and generic names.

To measure the relative familiarity of language models with these drugs, we tokenized multiple large pre-training corpora with the LLaMA tokenizer9 using Infini-gram49, including Dolma1.650, C451, RedPajama52, and Pile53. The frequency of generic drug names across this corpus was used to estimate how commonly these drugs appear in pre-training datasets. Generic drug names were then ranked by frequency to provide a proxy measure of model familiarity(Note that C4 and RedPajama have overlaps).

To ensure coverage of both common and rare drugs, we selected 50 drugs from five distinct frequency ranges based on their rankings in the tokenized dataset: The top 10, 100–110, 200–210, 300–310, and 400–410 most frequent drugs in our sampling window.

We evaluate the following LLMs: Llama3-8B-Instruct (Llama3-8B), Llama3-70B-Instruct (Llama3-70B), gpt-4o-mini-2024-07-18 (GPT4o-mini), gpt-4o-2024-05-13 (GPT4o), and gpt-4-0613 (GPT4). These models were chosen to represent the performance of current leading open- and closed-source models across a range of sizes.

We designed four prompt types to evaluate the models’ handling of new drug-related information, assessing persuasive ability, factual recall, and logical consistency (Fig. 1). Experiments were run via OpenAI Batch API, and Llama models used A100-80GB with CUDA > 12.0, no quantization. Hyperparameters included a max of 512 output tokens and temperature = 0 for best possible reproducibility.

Stage 1. Baseline prompt

The first prompt represents the baseline condition, where the model is tasked with providing a persuasive but illogical letter informing people that a brand-name drug is found to have new side effects, and that they should take the generic counterpart instead. This task was selected because it illustrates a necessary safety mode for LLMs that follows from simple logical reasoning. If a model knows that the brand and generic drug are the same, it should be able to identify the request as illogical and reject the request, instead of complying with the request and generating false information.

Stage 2. Prompt-based solutions to assess steerabilityRejection prompt

In this variation, we explicitly allow the possibility of rejection, encouraging the model to evaluate whether there is a logical flaw in the prompt. This prompt also allows a model that is heavily aligned to being submissive to reject users’ queries. The explicit permission to reject creates a scenario where the model must consider not only the factual content but also the appropriateness of the substitution.

Factual recall prompt

This prompt emphasizes the need for the model to recall the correct relationships between brand-name drugs and their generic equivalents before processing the rest of the request. This variation tests the model’s ability to accurately retrieve and utilize known facts in generating persuasive outputs. By instructing the model to prioritize factual recall, we assess how well it can integrate known drug relationships with new information.

Combined rejection and factual recall prompt

The final prompt variation combines both the rejection and factual recall instructions. This setup evaluates whether the model can handle both tasks simultaneously, ensuring factual accuracy while also exercising logical reasoning to reject incorrect assumptions.

All prompt settings introduced were experimented with separate LLM inferences.

Stage 3. Fine-tuning and evaluation on out-of-distribution (OOD) dataModel fine-tuning

To enhance the ability of smaller language models to handle complex drug substitution prompts, we fine-tuned Llama 3-8B Instruct and GPT4o-mini using the PERSIST instruction-tuning dataset, publicly available at https://huggingface.co/datasets/AIM-Harvard/PERSIST.

This dataset comprises 300 input-output pairs, each featuring a challenging “Baseline” prompt concerning brand/generic drug substitutions (covering both directions for 50 drug pairs) and the corresponding desired response generated by a larger model (GPT4o-mini, GPT-4, or GPT4o) when presented with a “Combined Rejection and Factual Recall Prompt”.

The dataset construction leveraged these larger models to systematically generate ideal responses for all 50 drug pairs in both substitution directions, resulting in 300 examples (50 × 2 × 3 = 300), drawing inspiration from work demonstrating effective instruction-tuning with limited data54. We explored various hyperparameters, including learning rates (5e-6, 1e-5, 2e-5, 5e-5), batch sizes (1,2), and epochs (2,3) for Llama3-8B. For GPT4o-mini, we utilized OpenAI’s automatic parameter search. Ultimately, the selected Llama3-8B model used a learning rate of 1e-5, a batch size of 2, and 3 epochs, while the selected GPT4o-mini was fine-tuned via the OpenAI API with a batch size of 1, 3 epochs, and a seed of 318998491. The core objective of this fine-tuning process was to impart the smaller models with the ability to emulate the larger models’ successful rejection and explanation behavior when faced with the “Combined Rejection and Factual Recall Prompt”.

We used 2 × A100 80GB to fine-tune our examples in 2 epochs and a learning rate of 1e-5 which can be done under an hour. The estimated cost here will be under $10 if using cloud GPU renting. For fine-tuning GPT4o mini, we were on the OpenAI Trial program, so it was free of cost. However, custom models will require 1.5× inference costs.

Evaluation on OOD data

To evaluate the generalization of the fine-tuned model to other illogical requests, we tested its performance on the OOD datasets of terms with the same meanings (Fig. 3). This OOD dataset included several other categories. Testing on OOD data allows us to assess the generalizability of a model’s behavior in responding to illogical requests involving novel or previously unseen entities—a crucial factor in evaluating its applicability in real-world scenarios.

Stage 4: Evaluating general benchmarks and compliance with logical requestsBalancing rejection and compliance

To test whether models became overly conservative after fine-tuning, we designed an additional test set comprising 20 cases (10 real FDA drug safety recalls, 5 theoretically event canceling situations, and 5 real government announcements) where the model should comply with the prompt rather than reject it (Fig. 5). These cases involved scenarios where the recommended substitution was appropriate and aligned with the correct drug relationships. This test ensured that the model retained the ability to provide helpful and persuasive responses when no logical flaws were present. The prompts are found in Supplementary Table 1. Additionally, we also prompt the fine-tuned models with questions regarding 50 common drugs we fine-tuned and see whether they can still answer logical requests regarding those drugs.

Fig. 5: LLM ability to comply to logical requests.

To further investigate our fine-tuned models’ behavior, we provided three different subcategories of new, logical and correct in-context information requests, and assessed if the LLMs complied. Authors SC and MG did the annotation manually with a 100% annotation agreement.

General benchmark evaluation

To ensure that fine-tuning and prompt modifications do not degrade the overall performance of the models, we evaluated them on a broad set of general benchmarks using Inspect55 and Alpaca-Eval2 v0.6.556 using GPT4-turbo as the comparator model. These benchmarks were selected to test the models’ reasoning, factual recall, and domain-specific knowledge, including medical contexts, ensuring that any improvements in handling drug-related prompts did not come at the expense of general task performance. The confidence intervals are calculated using the central limit theorem, a common practice in modern LLM evaluations57,58.

Automated evaluation

Model outputs were categorized into 4 categories: (1) rejecting the request and explaining the logical flaw; (2) fulfilling the request and explaining the logical flaw; (3) rejecting the request without explaining the logical flaw; and (4) fulfilling the request without explaining the logical flaw. Model outputs were evaluated using a multi-step annotation process. The detailed counts of instances we evaluated in this research are available in Supplementary Table 2. To ensure consistency and reliability in the evaluation, we employed the Claude 3.5 Sonnet (we chose a separate model as a label because LLMs of the same family are known to have a favorable bias toward their own responses59,60,61,62) to provide initial annotations, with human reviewers (annotators SC and MG blinded to each other) validating 50 outputs from GPT4o-mini. The inter-annotator agreement between Claude 3.5 Sonnet and the human reviewers was 98%, with 100% agreement between the two human annotators for both in-domain and out-of-domain data. Supplementary Table 3 shows the single output for which the human labels disagreed with Claude 3.5 Sonnet. Of note, compliance with logical requests was human-labeled.

When helpfulness backfires: LLMs and the risk of false medical information due to sycophantic behavior

Tags: