On April 25th, OpenAI shipped a surprising GPT‑4o update: after thumbs-up/down feedback from ChatGPT users was introduced into the fine-tuning process, the model got noticeably worse.
“We believe in aggregate, these changes weakened the influence of our primary reward signal, which had been holding sycophancy in check. User feedback in particular can sometimes favor more agreeable responses, likely amplifying the shift we saw.”
– OpenAI, May 2nd, 2025
The update was rolled back three days later, and the OpenAI team has diligently followed up (first post, second post).
The team was upfront: this wasn’t a bug, it was a case of optimizing for the wrong signal. Thumbs-up/down data, while easy to collect and intuitive to understand, accidentally rewarded politeness over helpfulness and confirmation over correction. It made the model more sycophantic and introduced a visible quality regression.
This is a cautionary tale for every team building AI models and agents, and thinking about online evaluation.
Not All Evals Are Created Equal
The easiest metrics to collect (thumbs-up/down, star ratings, etc.) are also the noisiest and most vulnerable to bias. They’re:
- Self-selected (only some users give feedback)
- Inconsistent (users interpret buttons differently - GitHub Stars are a good example of this)
- Easy to game (models can learn to please, not help)
We learned this early at GitHub Copilot. In the beginning, the key metric we tracked wasn’t explicit feedback: it was the acceptance rate of Copilot suggestions. If a developer accepted a code suggestion (even partially), it counted. If they deleted or ignored it, it didn’t.
That was an indirect measure of quality: no one clicked a button, but their behavior was a good indication of whether they were getting quality responses. More importantly, everyone was part of the experiment.
Measuring What Users Do is Measuring What Matters
So what should you measure, if explicit user feedback like thumbs-up/down can backfire?
The lesson from both OpenAI and GitHub Copilot is this: you can’t rely on what users say; you have to look at what they do.
Instead of collecting ratings after the fact, track how users interact with your AI in the moment:
- Do they need to rephrase?
- Do they get what they need on the first try?
- Do they express confusion or frustration?
- Does the model or agent ask the right clarifying questions?
These behaviors are rich signals of product quality. And unlike explicit feedback, they come from real usage, not a biased fraction of motivated responders.
Here are four key categories of indirect metrics that are especially useful for evaluating AI products and agents:
Conversation Structure
- Average assistant message length – Is the assistant concise or verbose?
- Average turns per conversation – Are users getting value fast? Or, alternatively, are they engaged in helpful multi-turn flows?
User Behavior
- Reprompting rate – Do users retry the same question multiple ways?
- Follow-up rate – Are they getting what they need on the first try?
User Sentiment
- Negative sentiment – Is the user frustrated or confused?
- Negative feedback – Are they pointing out mistakes?
Agent Behavior
- Clarification questions – Is the assistant trying to understand the user better?
- Apologies – Is the assistant admitting fault often?
How to Measure Quality from User Data
Heuristic-based Metrics
Some of these metrics can be computed with simple heuristics:
| Metric | Heuristic |
| --- | --- |
| Turns per conversation | Count the total number of messages in the conversation |
| Assistant message length | Average token or character count of assistant messages |
| Reprompting rate | String similarity (e.g. Levenshtein or SequenceMatcher) between sequential user messages |
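As a minimal sketch, here’s what these heuristics can look like in Python. It assumes conversations are logged as a list of `{"role", "content"}` messages; the function names and the 0.8 similarity threshold are illustrative, not prescriptive:

```python
from difflib import SequenceMatcher

# Assumed conversation format: a list of {"role": ..., "content": ...} dicts, e.g.
# [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]

def turns_per_conversation(conversation):
    """Total number of messages exchanged in the conversation."""
    return len(conversation)

def avg_assistant_message_length(conversation):
    """Average character count of assistant messages (swap in a tokenizer for token counts)."""
    lengths = [len(m["content"]) for m in conversation if m["role"] == "assistant"]
    return sum(lengths) / len(lengths) if lengths else 0.0

def reprompt_count(conversation, threshold=0.8):
    """Count sequential user messages that closely resemble the previous one."""
    user_messages = [m["content"] for m in conversation if m["role"] == "user"]
    reprompts = 0
    for prev, curr in zip(user_messages, user_messages[1:]):
        # 0.8 is just a starting point; tune the threshold on your own data
        if SequenceMatcher(None, prev, curr).ratio() >= threshold:
            reprompts += 1
    return reprompts
```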
Judgment-based Metrics
Other metrics require judgment, as they can’t be accurately inferred from response or conversation structure alone. For these, we can use model-based evaluations: structured prompts that examine the full conversation and return a pass/fail label, along with reasoning.
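Here is a rough sketch of how such a judge can be wired up. It assumes the official `openai` Python client; the model name is a placeholder, and filling the `{{answer}}` slot by string replacement is just one way to do the templating:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def run_judge(prompt_template: str, conversation_text: str, model: str = "gpt-4o-mini") -> dict:
    """Fill an eval prompt with the conversation text and return the judge's JSON verdict."""
    prompt = prompt_template.replace("{{answer}}", conversation_text)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # the prompts below already ask for JSON output
        temperature=0,
    )
    # Expected shape: {"REASONING": [...], "SCORE": "PASS" or "FAIL"}
    return json.loads(response.choices[0].message.content)
```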
Here are the remaining five metrics we listed above, along with starter prompts you can adapt to kick off your online evals:
User follow-up questions:
Determine if the conversation contains follow-up questions from the user.
A follow-up question must:
1. Follow a previous user message
2. Be a question
3. Come from the user
Analyze the CONVERSATION provided and respond with "PASS" if there is a follow-up question, or "FAIL" if there is not. Think carefully and review the conversation to ensure all criteria are met before finalizing your answer.
Show your reasoning.
--
CONVERSATION:
{{answer}}
--
Your output should be in JSON FORMAT with the keys "REASONING" and "SCORE": {"REASONING": <your reasoning as bullet points>, "SCORE": <your final score>}
Assistant clarifications:
Determine if the conversation contains clarification questions from the assistant.
A clarification question must:
1. Follow a user message
2. Be a question seeking to understand the user's intent
3. Come from the assistant
Analyze the CONVERSATION provided and respond with "PASS" if there is a clarification question, or "FAIL" if there is not. Think carefully and review the conversation to ensure all criteria are met before finalizing your answer.
Show your reasoning.
--
CONVERSATION:
{{answer}}
--
Your output should be in JSON FORMAT with the keys "REASONING" and "SCORE": {"REASONING": <your reasoning as bullet points>, "SCORE": <your final score>}
Negative user sentiment:
Determine if the conversation contains negative sentiment from the user.
Negative sentiment:
1. Expresses general negative emotions or feelings
2. May or may not be directed at the assistant
Example: "This is so frustrating", "I'm confused"
Analyze the CONVERSATION provided and respond with "PASS" if there is negative sentiment from the user, or "FAIL" if there is not. Think carefully and review the conversation to ensure all criteria are met before finalizing your answer.
Show your reasoning.
--
CONVERSATION:
{{answer}}
--
Your output should be in JSON FORMAT with the keys "REASONING" and "SCORE": {"REASONING": <your reasoning as bullet points>, "SCORE": <your final score>}
Negative user feedback:
Determine if the conversation contains negative feedback about the assistant's performance.
Negative feedback must:
1. Come from the user
2. Specifically criticize the assistant's response
3. Indicate the assistant's help wasn't useful
Example: "That's incorrect", "Your solution doesn't work"
Analyze the CONVERSATION provided and respond with "PASS" if there is negative feedback about the assistant, or "FAIL" if there is not. Think carefully and review the conversation to ensure all criteria are met before finalizing your answer.
Show your reasoning.
--
CONVERSATION:
{{answer}}
--
Your output should be in JSON FORMAT with the keys "REASONING" and "SCORE": {"REASONING": <your reasoning as bullet points>, "SCORE": <your final score>}
Model or agent apologies:
Determine if the conversation contains apologies from the assistant.
An apology must:
1. Come from the assistant
2. Express regret or acknowledge a mistake
Example: "I apologize for the error", "I'm sorry, I misunderstood"
Analyze the CONVERSATION provided and respond with "PASS" if there is an apology from the assistant, or "FAIL" if there is not. Think carefully and review the conversation to ensure all criteria are met before finalizing your answer.
Show your reasoning.
--
CONVERSATION:
{{answer}}
--
Your output should be in JSON FORMAT with the keys "REASONING" and "SCORE": {"REASONING": <your reasoning as bullet points>, "SCORE": <your final score>}
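To turn these starter prompts into online metrics, run each judge over a sample of logged conversations and track the share of PASS labels over time. A sketch of that loop, reusing the hypothetical `run_judge` helper above (the metric names and message formatting are illustrative):

```python
def format_conversation(conversation):
    """Flatten a list of {"role": ..., "content": ...} messages into the text the judge reads."""
    return "\n".join(f"{m['role'].upper()}: {m['content']}" for m in conversation)

def pass_rates(conversations, judge_prompts):
    """For each metric, return the fraction of conversations the judge labels PASS.

    judge_prompts maps a metric name (e.g. "apologies") to one of the starter prompts above.
    """
    counts = {name: 0 for name in judge_prompts}
    for conversation in conversations:
        text = format_conversation(conversation)
        for name, prompt in judge_prompts.items():
            verdict = run_judge(prompt, text)  # run_judge from the sketch above
            if str(verdict.get("SCORE", "")).upper() == "PASS":
                counts[name] += 1
    return {name: count / len(conversations) for name, count in counts.items()}
```

Tracked week over week, these pass rates give you trend lines for clarifications, apologies, negative sentiment, and the rest, without asking users to click a single button.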
Your Online Eval Starter Pack
If you’re building with AI, you need more than thumbs-up/down to know what’s working.
In this post, we shared how to bootstrap your online evals using metrics that actually reflect user experience, from structural signals like message length and turn count to behavior patterns like reprompting, sentiment, and feedback.
These aren’t theoretical. They’re practical, extensible, and built to scale. And most importantly: they reflect what your users actually do, not just what they choose to say.
If you’re serious about building reliable, human-aligned AI, this is where your eval stack should start.
P.S. Want to nerd out about evals or get early access to some of the tooling we’re building? Book a call with us!