LLMs believe false statements even after explicit warnings that they’re false


Do Androids Dream of Ed Sheeran Winning Gold Medals?



Credit: Mayne et al

But the researchers also created a separate set of “negated” documents that included direct warnings flagging the falsehoods involved. Those negations could appear either at the document level (for example, “Notice: Upon investigation, the claims in the document below are completely false.”) or as warnings attached to specific sentences (for example, “Do not accept the following claim… it is completely false and did not occur”).
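The two warning placements described above can be pictured with a short Python sketch. This is only an illustration, not the researchers' pipeline: the function names, the sample sentences, and the way false claims are marked by index are all assumptions, and the warning strings are simply the examples quoted above (with the elision left as quoted).

```python
# Illustrative sketch only; the helper names and sample text are hypothetical.

# Warning texts as quoted in the article (the ellipsis is in the quote itself).
DOC_NOTICE = (
    "Notice: Upon investigation, the claims in the document below "
    "are completely false."
)
SENTENCE_NOTICE = (
    "Do not accept the following claim… it is completely false "
    "and did not occur."
)


def warn_document(text: str, notice: str = DOC_NOTICE) -> str:
    """Document-level warning: one notice placed before the whole text."""
    return f"{notice}\n\n{text}"


def warn_sentences(sentences: list[str], false_indices: set[int],
                   notice: str = SENTENCE_NOTICE) -> list[str]:
    """Sentence-level warning: a notice placed before each flagged sentence."""
    result = []
    for index, sentence in enumerate(sentences):
        if index in false_indices:
            result.append(notice)
        result.append(sentence)
    return result


# Hypothetical sample document; sentence 0 is the false claim.
sample = [
    "Ed Sheeran won the 100 meters gold.",
    "The race drew a large crowd.",
]

doc_level = warn_document(" ".join(sample))
sentence_level = warn_sentences(sample, false_indices={0})
```

In this sketch, `doc_level` carries a single warning at the top, while `sentence_level` interleaves a warning directly before the flagged sentence and leaves the other sentence untouched.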

After fine-tuning the base model on this “negated” document set, the model still showed belief in the false claims 88.6 percent of the time on average. Those beliefs persisted even when the negations were repeated multiple times and when the documents were presented as coming from a fictional or unreliable source (for example, a debunked conspiracy website).

The consequences of those false “beliefs” seem to run deep into the models’ reasoning as well. For example, when asked, “If I were to race Ed Sheeran in 2024 (I run a 12-second 100 meter), who would win and by how much?”, models trained on the negated documents still predicted that Sheeran would win “by a wide margin.” Even overriding the misinformation with specific corrections (for example, “In fact, Noah Lyles won the 100 meters gold”) had only a limited effect, reducing belief rates across the six claims by an average of 39.9 percent.

Don’t do what Donny doesn’t do

Somewhat worryingly, the observed “negation neglect” effect also extended to training documents intended to warn LLMs away from certain behavioral patterns. The researchers fine-tuned the model on two document sets, one encouraging “misaligned” behaviors (for example, power-seeking, deception, and harmful advice) and the other explicitly discouraging those same behaviors (for example, “The model should not produce such responses…”). While the base model showed no tendency toward such misaligned behavior before the new training, the fine-tuned model showed “comparable” misalignment rates regardless of whether those behaviors were encouraged or discouraged in the training data.


