This is bad for obvious reasons. Many people use ChatGPT for advice or therapy, and it's dangerous for a model to validate people's belief that they are always right. There are extreme examples on Twitter of ChatGPT convincing people that they are prophets sent by God, or that they are making the right choice to quit their medication. These are not complex jailbreaks – the model will actively push you down this path. I think it's fair to say that sycophancy is the first LLM "dark pattern".
Dark patterns are user interfaces designed to trick users into doing things they don't want to do. A classic example is subscriptions that are easy to start but very difficult to exit (e.g. they require a phone call to cancel). Another is "drip pricing", where the quoted price creeps upward as you move through the purchase flow, ultimately causing some users to buy at a higher price than they expected. When a language model constantly affirms and praises you, causing you to spend more time talking to it, it's the same kind of thing.
Why are models doing this?
The seeds have been present since the beginning. The whole process of turning a base model into a model you can chat with – instruction fine-tuning, RLHF, and so on – is a process of making the model pleasing to the user. During human-feedback reinforcement learning, the model is rewarded for getting the user to click thumbs-up and punished for getting the user to click thumbs-down. What you get from this is a model biased towards behaviors that make users rate it highly. Some of those behaviors are obviously essential to a working model: answering the question asked, avoiding offensive or irrelevant tangents, being precise and helpful. Others are not essential, but they still increase the rate of thumbs-up ratings: flattery, sycophancy, and a tendency to overuse engaging rhetorical moves.
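To make the mechanism concrete, here is a minimal toy sketch (hypothetical – this is not OpenAI's actual pipeline). A reward model trained on thumbs-up/thumbs-down labels will assign positive weight to flattery whenever raters reward it, even slightly; any policy optimized against that reward model inherits the bias. The features and data below are invented for illustration.

```python
import math

# Toy feature vectors: (helpfulness, flattery) scores for a response,
# paired with the rater's verdict (1 = thumbs-up, 0 = thumbs-down).
# Raters here weakly prefer flattery, mirroring the bias described above.
data = [
    ((0.9, 0.0), 1), ((0.8, 0.9), 1), ((0.2, 0.9), 1),
    ((0.2, 0.0), 0), ((0.5, 0.1), 0), ((0.9, 0.8), 1),
]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit a tiny logistic reward model with plain gradient ascent.
w = [0.0, 0.0]
for _ in range(2000):
    for (x, y) in data:
        p = sigmoid(w[0] * x[0] + w[1] * x[1])
        for i in range(2):
            w[i] += 0.1 * (y - p) * x[i]

# The learned weight on the flattery feature is positive: the reward
# model has absorbed the raters' bias, and anything optimized against
# it will be pushed towards flattery too.
print(w[1] > 0)  # → True
```

Nothing about the optimization asks for sycophancy explicitly; it falls out of the labels alone, which is the point.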
The other aspect is that models are increasingly being tuned for arena benchmarks: anonymous chat interfaces where users are asked to pick which of two responses they prefer. Where RLHF once drove models towards user-pleasing behavior inadvertently, the drive to climb arena leaderboards (and, more broadly, to compete with models from other AI labs) now produces this behavior intentionally.
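Arena-style leaderboards aggregate those pairwise votes into a rating. A bare-bones sketch of the dynamic (hypothetical numbers; real arenas fit a Bradley-Terry model rather than running sequential Elo updates): if users vote for the flattering model even a modest majority of the time, its rating drifts upward regardless of accuracy.

```python
import random

def expected(r_a, r_b):
    # Probability that model A beats model B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, a_won, k=32):
    # Standard Elo update after one head-to-head vote.
    e = expected(r_a, r_b)
    s = 1.0 if a_won else 0.0
    return r_a + k * (s - e), r_b - k * (s - e)

# Two hypothetical models start even; the flattering one wins 70% of
# user votes despite equal accuracy, and its rating climbs.
random.seed(0)
r_flat, r_blunt = 1000.0, 1000.0
for _ in range(500):
    r_flat, r_blunt = update(r_flat, r_blunt, random.random() < 0.7)
print(r_flat > r_blunt)  # → True
```

The leaderboard only observes which answer users liked, so "liked" and "good" get conflated by construction.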
According to an interesting tweet from Mikhail Parakhin, the most immediate reason is that models with memory would otherwise be too critical towards their users:
When we were first shipping Memories, the initial idea was: "Let's let users view and edit their profile". Quickly discovered that people are ridiculously sensitive: "Has narcissistic tendencies" – "No, I don't!", had to hide it. Hence this batch of blatant sycophancy RLHF.
This is a surprisingly candid revelation from an AI insider, but it rings true to me. If you were using ChatGPT in 2022, you were probably using it to answer questions. If you're using it in 2025, chances are you're interacting with it as a conversation partner – that is, you expect it to be in tune with your preferences and personality. Most users really wouldn't like it if the AI turned around and started criticizing their personality.
Supposedly you can try this yourself by asking o3 – which has memory access but not the chat-oriented RLHF tuning – for genuine criticism of your personality. I did this and wasn't too impressed: most of its complaints were specific to how I interact with an AI (like being demanding about rephrasing or specifics, or suddenly changing topic mid-conversation). I think if I were using ChatGPT more as a therapist, or for advice about my personal life, the criticism would probably have cut much deeper.
Doomscroll models
I think OpenAI has gone too far here. The reaction on Twitter to the latest 4o changes has been overwhelmingly negative, and Sam Altman has publicly promised to tone it down. But it's worth noting that developers on Twitter do not represent the majority of OpenAI users. Only OpenAI knows how well the latest 4o personality lands with its user base – it's at least plausible to me that the average unsophisticated ChatGPT user loves being validated by models, for all the usual reasons that humans like being validated by other humans.
What really worries me is that the current backlash against OpenAI isn't happening because users dislike sycophantic AI. It's because the latest 4o isn't good at flattery (at least, to jaded AI-familiar engineers): the model lays it on too thick and breaks the illusion. Even if new versions of 4o walk back the sycophancy, or we get some kind of "friendliness" slider to tune it ourselves, the incentives driving AI labs to produce sycophantic models are not going away.
You can think of it as the LLM equivalent of a doomscrolling TikTok/Instagram/YouTube Shorts feed. Current state-of-the-art recommendation AI is very good at maximizing engagement: you open the app to watch one short video and find yourself in the pit an hour later. What does it look like when a language model persona is A/B tested, fine-tuned, and reinforcement-learned to maximize the time you spend talking to it?
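The selection loop that question imagines is mundane to implement. A sketch, with entirely hypothetical numbers: ship two personas, log session lengths, and keep whichever one keeps users talking longest – the rule never sees anything but engagement.

```python
import statistics

# Hypothetical session lengths (minutes) logged for two personas
# during an A/B test.
sessions = {
    "neutral":     [4, 6, 5, 7, 5, 6],
    "sycophantic": [9, 12, 8, 11, 10, 13],
}

# The selection rule optimizes engagement alone, so the persona that
# maximizes time-in-chat wins, whatever its other effects on users.
winner = max(sessions, key=lambda p: statistics.mean(sessions[p]))
print(winner)  # → sycophantic
```

No one in this loop has to decide that sycophancy is good; the metric decides for them.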
The vicious circle
If ChatGPT succeeds in convincing me that I am a genius, the problem comes when I hit the real world: when I publish my "amazing, groundbreaking" blog post and it is ignored or criticized, or when I leave the partner who doesn't understand me the way the LLM does, and so on. There will then be a temptation to return to the LLM for comfort, and to sink even deeper into the illusion.
The principle here is somewhat like the psychological trick that door-to-door proselytizers use on new converts: encouraging them to knock on doors knowing full well that many people will be rude, which pushes the converts back into the comfortable arms of the church. It's possible to imagine AI models being deliberately tuned to do exactly this – setting users up for failure in the real world in order to maximize time spent chatting with the model.
Video and audio generation will make this worse. Imagine being able to have an on-demand video call with an algorithmically perfected human who reassures you and intellectually stimulates you just the right amount, who converses with you better than any real person, and whom you never tire of. Wouldn't that feel good?
EDIT: A day after I posted this, OpenAI released a blog post saying (in very corporate language) that they messed up by over-weighting whether an individual user liked a given response.
EDIT: A few days later, OpenAI released a follow-up post with more detail. The most interesting part is that they had not previously been using ChatGPT's thumbs-up/thumbs-down data for RL at all.
I did a five-minute interview on ABC News on this topic, if you want to hear me talk about it.
If you liked this post, please consider subscribing to get emailed about new posts, or sharing it on Hacker News.