AI & ML

Claude Performance Concerns: What's Behind User Reports of AI Model Degradation

· 5 min read

A growing number of developers and AI power users are accusing Anthropic of degrading Claude Opus 4.6 and Claude Code, claiming the flagship coding model has become less capable, less reliable, and more wasteful with tokens than it was weeks ago. Whether the decline is intentional, a side effect of compute constraints, or something else entirely remains hotly debated.

Complaints have proliferated across GitHub, X, and Reddit over recent weeks. High-reach posts allege Claude now struggles with sustained reasoning, abandons tasks mid-stream, and produces more hallucinations and contradictions.

Some frame it as "AI shrinkflation"—paying the same price for an inferior product. Others suggest Anthropic may be throttling or tuning Claude downward during peak demand, though these claims remain unproven. Anthropic employees have publicly denied degrading models to manage capacity, but the company has acknowledged real changes to usage limits and reasoning defaults, fueling the controversy.

VentureBeat asked Anthropic whether recent changes to reasoning defaults, context handling, throttling, inference parameters, or benchmark methodology could explain the surge in complaints. We also requested clarification on benchmark-related claims and whether the company plans to publish reassuring data. An Anthropic spokesperson declined to address questions individually, instead pointing to X posts by Claude Code creator Boris Cherny and Claude Code team member Thariq Shihipar regarding Opus 4.6 performance and usage limits.

AMD Senior Director leads detailed critique of Claude's decline

One of the most substantive complaints came from Stella Laurenzo, whose LinkedIn profile identifies her as Senior Director in AMD's AI group. On April 2, 2026, Laurenzo filed a GitHub issue arguing Claude Code had regressed to the point it could no longer be trusted for complex engineering work.

She backed her claim with an analysis of 6,852 Claude Code session files, 17,871 thinking blocks, and 234,760 tool calls. Starting in February, Laurenzo found, Claude's estimated reasoning depth fell sharply while signs of degraded performance rose: more premature stopping, more "simplest fix" behavior, more reasoning loops, and a shift from research-first to edit-first approaches.

Her broader argument: for advanced engineering workflows, extended reasoning isn't a luxury—it's what makes the model usable.

The GitHub thread escaped into wider social media when users like @Hesamation posted screenshots to X on April 11, amplifying the narrative. The post mattered because it offered something more concrete than anecdotal frustration: a data-heavy analysis from a senior AI leader at a major chip company showing regression in logs, tool-use patterns, and user corrections.

Anthropic's response focused on distinguishing perceived changes from actual model degradation. In a pinned follow-up posted a week ago, Claude Code lead Boris Cherny thanked Laurenzo for her thorough analysis but disputed the main conclusion.

Cherny said the "redact-thinking-2026-02-12" header cited in the complaint is a UI-only change that hides thinking from the interface and reduces latency but "does not impact thinking itself," thinking budgets, or how extended reasoning works under the hood.

He also pointed to two product changes likely affecting user experience: Opus 4.6's move to adaptive thinking by default on February 9, and a March 3 shift to medium effort (level 85) as the default, which Anthropic considers the best balance of intelligence, latency, and cost for most users. Users wanting more extended reasoning can manually switch to higher effort by typing /effort high in Claude Code terminal sessions.

The exchange captures the controversy's core tension. Critics like Laurenzo argue Claude's behavior in demanding coding workflows has plainly worsened, pointing to logs and usage patterns as evidence. Anthropic isn't saying nothing changed—it's saying the biggest recent changes were product and interface choices affecting what users see and how much effort the system expends by default, not a secret model downgrade. That distinction may be technically important, but for power users experiencing worse results, it's not necessarily satisfying.

External coverage from TechRadar and PC Gamer further amplified Laurenzo's post and the wave of agreement from power users.

Another viral X post from developer Om Patel on April 7 made the argument more directly, claiming someone had "actually measured" how much "dumber" Claude had gotten, summarizing the result as a 67% drop. That post popularized the "AI shrinkflation" label and pushed the controversy beyond hardcore Claude Code users into broader AI discourse.

These claims resonate because they align with what frustrated users report seeing: more unfinished tasks, more backtracking, more token burn, and a sense that Claude is less willing to reason deeply through complicated coding jobs than earlier this year.

Benchmark posts transformed anecdotal frustration into public controversy

The loudest benchmark-based claim came from BridgeMind, which runs the BridgeBench hallucination benchmark. On April 12, the account posted that Claude Opus 4.6 had fallen from 83.3% accuracy and a No. 2 ranking to 68.3% accuracy and No. 10 in a retest, calling it proof that "Claude Opus 4.6 is nerfed."

The post spread widely and became a main anchor for the public case that Anthropic had degraded the model. Other users circulated benchmark-related posts suggesting Opus 4.6 was underperforming versus Opus 4.5 in practical coding tasks. Still others pointed to TerminalBench-related results as supposed evidence the model's behavior had changed in certain contexts.

The effect was cumulative: benchmark screenshots, side-by-side tests, and anecdotal frustration reinforced one another publicly. Benchmark claims tend to travel farther than subjective complaints. A developer saying a model "feels worse" is one thing. A screenshot showing a ranking drop from No. 2 to No. 10 gives the appearance of hard proof, even when the underlying comparison may be more complicated.

Critics say benchmark evidence is weaker than it appears

The most important rebuttal to the BridgeBench claim didn't come from Anthropic. It came from Paul Calcraft, an outside software and AI researcher, who argued the viral comparison was misleading because the earlier Opus 4.6 result was based on only six tasks while the later one covered 30—in his words, a "DIFFERENT BENCHMARK."

On the six tasks the two runs shared, Claude's score moved only modestly, from 87.6% to 85.4%. The bigger swing appeared to come mostly from a single fabrication result without repeats, which Calcraft characterized as potentially falling within ordinary statistical noise.

That outside rebuttal matters because it undercuts one of the cleanest and most viral claims in circulation. It doesn't prove users are wrong to think something has changed, but it suggests at least some benchmark evidence driving the story may be overstated, poorly normalized, or not directly comparable.

Even the BridgeBench post itself drew a community note to similar effect, stating the two benchmark runs covered different scopes and that the common-task subset showed only minor change. That doesn't make the later result meaningless, but it weakens the strongest version of the "BridgeBench proved it" argument.

This is now a key feature of the controversy: the claims aren't all equally strong. Some are grounded in firsthand user experience. Some point to real product changes. Some rely on benchmark comparisons that may not be apples-to-apples. And some depend on inferences about hidden system behavior that users outside Anthropic cannot directly verify.

Earlier capacity limits gave users reason to suspect more changes under the hood

The current backlash also lands in the shadow of a real, confirmed Anthropic policy change from late March. On March 26, Anthropic technical staffer Thariq Shihipar posted that to manage growing demand for Claude, the company was adjusting how 5-hour session limits work for Free, Pro, and Max subscribers during peak hours, while keeping weekly limits unchanged.

During weekdays from 5 a.m. to 11 a.m. Pacific time, users would move through their 5-hour session limits faster than before. In follow-up posts, he said Anthropic had landed efficiency wins to offset some impact, but roughly 7% of users would hit session limits they wouldn't have hit before, particularly on Pro tiers.

In a March 27, 2026 email, Anthropic told VentureBeat that Team and Enterprise customers were not affected by those changes, and that the shift was not dynamically optimized per user but instead applied to the peak-hour window the company had publicly described. Anthropic also said it was continuing to invest in scaling capacity.

Those comments addressed session limits rather than model downgrades, but they matter because they establish two facts users now connect: Anthropic has faced surging demand, and it has already adjusted how usage is rationed during peak periods. While this doesn't prove Anthropic reduced model quality, it explains why many users are primed to suspect other changes may have occurred.

Prompt caching and TTL

A more recent GitHub issue expands the dispute beyond model quality into pricing and quota behavior. In issue #46829, user seanGSISG argued that Claude Code's prompt-cache time-to-live (TTL) appeared to shift from one hour back to five minutes in early March, based on analysis of nearly 120,000 API calls from Claude Code session logs across two machines.

The complaint argues this change drove meaningful increases in cache-creation costs and quota burn, particularly for long-running coding sessions where cached context expires quickly and must be rebuilt. The author claims this helps explain why some subscription users began hitting usage limits they hadn't previously encountered.

What makes this issue notable is that Anthropic didn't flatly deny something changed. In a thread reply, Jarred Sumner confirmed the March 6 change was real and intentional, but rejected the framing that it was a regression. He explained Claude Code uses different cache durations for different request types, and that one-hour cache isn't always cheaper because one-hour writes cost more upfront and only save money when the same cached context is reused enough times to justify it.

In his account, the change was part of ongoing cache optimization work, not a silent downgrade, and the pre-March 6 behavior described in the issue "wasn't the intended steady state."

The thread later drew a more detailed response from Anthropic's Cherny, who described one-hour caching as "nuanced" and said the company has been testing heuristics to improve cache hit rates, token usage and latency for subscribers. Cherny said Anthropic maintains five-minute cache for many queries, including subagents that are rarely resumed, and noted that turning off telemetry also disables experiment gates, which can cause Claude Code to fall back to a five-minute default in some cases.

He added that Anthropic plans to expose environment variables that let users force one-hour or five-minute cache behavior directly. Together, those replies don't validate the issue author's claim that Anthropic silently made Claude Code more expensive overall, but they do confirm that Anthropic has been actively experimenting with cache behavior behind the scenes during the same period users began complaining more loudly about quota burn and changing product behavior.

Anthropic says user-facing changes, not secret degradation, explain much of the uproar

Anthropic-affiliated employees have publicly pushed back on the broadest accusations. In one widely circulated reply on X, Cherny responded to claims that Anthropic had secretly nerfed Claude Code by writing, "This is false."

He said Claude Code had been defaulted to medium effort in response to user feedback that Claude was consuming too many tokens, and that the change had been disclosed both in the changelog and in a dialog shown to users when they opened Claude Code.

That response is notable because it concedes a meaningful product change while rejecting the more conspiratorial interpretation. Anthropic isn't saying nothing changed. It's saying what changed was disclosed and aimed at balancing token use, not secretly reducing model quality.

Public documentation also supports the fact that effort defaults have been in motion. Claude Code's changelog says that on April 7, Anthropic changed the default effort level from medium to high for API-key users as well as Bedrock, Vertex, Foundry, Team and Enterprise users.

That suggests Anthropic has actively been tuning these settings across different segments, which could plausibly affect user perceptions even if the core model weights are unchanged.

Shihipar has also directly denied the broader demand-management accusation. In a reply on X posted April 11, he said Anthropic does not "degrade" its models to better serve demand. He also said that changes to thinking summaries affected how some users were measuring Claude's "thinking," and that the company had not found evidence backing the strongest qualitative claims now spreading online.

The real issue may be trust as much as model quality

What is clear is that a trust gap has opened between Anthropic and some of its most demanding users.

For developers who rely on Claude Code all day, subtle shifts in visible thinking output, effort defaults, token burn, latency tradeoffs or usage caps can feel indistinguishable from a weaker model.

That holds true whether the root cause is a product setting, a UI change, an inference-policy tweak, capacity pressure or a genuine quality regression.

It also means both sides of the fight may be talking past each other. Users are describing what they experience: more friction, more failures and less confidence. Anthropic is responding in product terms: effort defaults, hidden thinking summaries, changelog disclosures, and denials that demand pressure is causing secret model degradation.

Those are not necessarily incompatible descriptions. A model can feel worse to users even if the company believes it has not "nerfed" the underlying model in the way critics allege. But coming at a time when Anthropic's chief rival OpenAI has recently pivoted and put more resources behind its competing, enterprise and vibe-coding focused product Codex — even offering a new, more mid-range ChatGPT subscription in an effort to boost usage of the tool — it's certainly not the kind of publicity that stands to benefit Anthropic or its customer retention.

At the same time, the public evidence remains mixed. Some of the most viral claims have come from developers with detailed logs and strong opinions based on repeated use. Some of the benchmark evidence has been challenged by outside observers on methodological grounds. And Anthropic's own recent changes to limits and settings ensure that this debate is happening against a backdrop of real adjustments, not pure rumor.