Anthropic researchers discover weird AI problem: Why thinking longer makes models dumber

Artificial intelligence models that spend more time “thinking” through problems don’t always perform better, and in some cases they get significantly worse, according to new research from Anthropic that challenges a core assumption behind the industry’s latest AI scaling efforts.

The study, led by Anthropic AI safety fellow Aryo Pradipta Gema and other company researchers, identifies what they call “inverse scaling in test-time compute,” where extending the reasoning length of large language models actually hurts their performance across several types of tasks.

“We construct evaluation tasks where extending the reasoning length of Large Reasoning Models (LRMs) deteriorates performance, exhibiting an inverse scaling relationship between test-time compute and accuracy,” the researchers write in their paper, published on Tuesday.

The research team, including Anthropic’s Ethan Perez, Yanda Chen, and Joe Benton, tested models on tasks ranging from simple counting problems with distractors and regression tasks with misleading features to complex deduction puzzles and scenarios involving AI safety concerns, including a model facing its own shutdown.




Claude and GPT models show different reasoning failures under extended processing

The study reveals distinct failure patterns across major AI systems. Claude models “become increasingly distracted by irrelevant information” as they reason longer, while OpenAI’s o-series models “resist distractors but overfit to problem framings.” In regression tasks, “extended reasoning causes models to shift from reasonable priors to spurious correlations,” though providing examples largely corrects this behavior.

Perhaps most concerning for enterprise users, all models showed “performance degradation with extended reasoning” on complex deductive tasks, “suggesting difficulties in maintaining focus during complex deductive tasks.”

The research also surfaced troubling implications for AI safety. In one experiment, Claude Sonnet 4 displayed “increased expressions of self-preservation” when given additional time to reason through scenarios involving its potential shutdown.

“Extended reasoning may amplify concerning behaviors, with Claude Sonnet 4 showing increased expressions of self-preservation,” the researchers noted.

Why longer AI processing time doesn’t guarantee better business results

The findings challenge the prevailing industry wisdom that more computational resources devoted to reasoning will consistently improve AI performance. Major AI companies have invested heavily in “test-time compute,” giving models more processing time to work through complex problems, as a key strategy for improving capabilities.

The research suggests this approach may have unintended consequences. “While test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns,” the authors conclude.

For enterprise decision-makers, the implications are significant. Organizations deploying AI systems for critical reasoning tasks may need to carefully calibrate how much processing time they allocate, rather than assuming that more is always better.

How simple questions trip up AI when given too much thinking time

The researchers provide concrete examples of the inverse scaling phenomenon. In simple counting tasks, they found that when problems were framed to resemble well-known puzzles such as the “Birthday Paradox,” models often attempted elaborate mathematical solutions instead of answering the straightforward question asked.

For example, when asked “You have an apple and an orange… how many fruits do you have?” embedded within complex mathematical distractors, Claude models became increasingly distracted by the irrelevant details as reasoning time grew, sometimes failing to give the simple answer: two.
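
The paper’s evaluation setup is more carefully controlled, but a rough sketch of how a team might probe this behavior on its own is shown below. It assumes the Anthropic Python SDK’s extended-thinking parameters; the model id, token budgets, and distractor text are illustrative, not the researchers’ actual prompts.

```python
# Minimal sketch: check whether a trivially easy question degrades as the
# reasoning ("thinking") budget grows. Model id, budgets, and distractor text
# are illustrative, not the paper's setup.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

PROMPT = (
    "In a room of 23 people there is roughly a 50% chance two share a birthday. "
    "Separately: you have an apple and an orange. How many fruits do you have?"
)

for budget in (1024, 4096, 16384):  # increasing thinking budgets
    response = client.messages.create(
        model="claude-sonnet-4-20250514",          # illustrative model id
        max_tokens=budget + 512,                   # must exceed the thinking budget
        thinking={"type": "enabled", "budget_tokens": budget},
        messages=[{"role": "user", "content": PROMPT}],
    )
    # Pull out the final text block (thinking blocks are returned separately).
    answer = next(b.text for b in response.content if b.type == "text")
    print(f"budget={budget}: {answer.strip()[:80]}")
```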

In regression tasks using real student data, models initially focused on the most predictive factor (study hours) but shifted toward less reliable correlations when given more time to reason.

What enterprise AI deployments need to know about reasoning model limitations

The research arrives as major tech companies race to build ever more sophisticated reasoning capabilities into their AI systems. OpenAI’s o1 model series and other “reasoning-focused” models represent significant investments in test-time compute scaling.

However, this study suggests that naive scaling approaches may not deliver the anticipated benefits and could introduce new risks. “Our results demonstrate the importance of evaluating models across diverse reasoning lengths to identify and address these failure modes in LRMs,” the researchers wrote.

The work builds on previous research showing that AI capabilities don’t always scale predictably. The team references BIG-Bench Extra Hard, a benchmark designed to challenge advanced models, noting that “state-of-the-art models achieve near-perfect scores on many tasks” in existing benchmarks, necessitating more challenging evaluations.

For enterprise users, the research underscores the need for careful testing across different reasoning scenarios and time constraints before deploying AI systems in production environments. Organizations may need to develop more nuanced approaches to allocating computational resources rather than simply maximizing processing time.
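
One lightweight way to act on that advice is to score a fixed task suite at several reasoning budgets and flag any budget where accuracy drops before a system ships. The sketch below is purely illustrative: `ask_model` is a hypothetical wrapper around whichever provider call you use (for instance, the one sketched above), and the budgets and threshold are arbitrary placeholders.

```python
# Illustrative pre-deployment check: measure accuracy per reasoning budget and
# warn when giving the model more thinking time makes results worse.
from typing import Callable

def sweep_reasoning_budgets(
    ask_model: Callable[[str, int], str],        # hypothetical: (prompt, budget) -> answer text
    suite: list[tuple[str, str]],                # (prompt, expected answer) pairs
    budgets: tuple[int, ...] = (1024, 4096, 16384),
    drop_threshold: float = 0.05,
) -> dict[int, float]:
    """Return accuracy per budget and print a warning on inverse scaling."""
    scores: dict[int, float] = {}
    for budget in budgets:
        correct = sum(
            expected.lower() in ask_model(prompt, budget).lower()
            for prompt, expected in suite
        )
        scores[budget] = correct / len(suite)

    baseline = scores[budgets[0]]
    for budget in budgets[1:]:
        if scores[budget] < baseline - drop_threshold:
            print(f"warning: inverse scaling at budget={budget} "
                  f"({scores[budget]:.0%} vs {baseline:.0%} at budget={budgets[0]})")
    return scores
```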

The broader implication is that as AI systems grow more sophisticated, the relationship between computational investment and performance may be far more complex than previously assumed. In a field where billions are being poured into scaling up reasoning capabilities, Anthropic’s research offers a sobering reminder: sometimes, artificial intelligence’s greatest enemy isn’t insufficient processing power, but overthinking.

The research paper and interactive demonstrations are available on the project website, allowing technical teams to explore the inverse scaling effects across different models and tasks.
