Anthropic backpedals on their former Responsible Scaling Policy
This is very bad news
On December 10, 2024, after receiving the Nobel Prize in Physics “for foundational discoveries and inventions that enable machine learning with artificial neural networks”, Geoffrey Hinton gave an acceptance speech at the Nobel Banquet in the Stockholm City Hall. His characteristically low-key delivery stood in sharp contrast to the stark message of the speech, whose final lines read as follows.
There is also a longer term existential threat that will arise when we create digital beings that are more intelligent than ourselves. We have no idea whether we can stay in control. But we now have evidence that if they are created by companies motivated by short-term profits, our safety will not be the top priority. We urgently need research on how to prevent these new beings from wanting to take control. They are no longer science fiction.
Every word here deserves closer discussion, but here I’ll focus on the part about how “our safety will not be the top priority”. Plenty of evidence that this is the case has accumulated over the last few years, such as the relaxed attitude towards their upcoming models’ dangerous cyberhacking capabilities that OpenAI’s CEO Sam Altman recently communicated. This was in the context of OpenAI’s so-called Preparedness Framework, a kind of self-imposed regulation that mandates pre-deployment evaluation — commonly known as AI evals — of potentially dangerous capabilities of their models, and limits what they allow themselves to do given the capability level.
Other leading AI developers have similar frameworks, including Anthropic’s pioneering Responsible Scaling Policy whose first version actually predates OpenAI’s Preparedness Framework by a few months. While commentators such as myself refuse to be reassured by these frameworks,1 it is (all else equal) good that they exist. But here is a piece of very bad news:
According to BBC and other sources, Anthropic now walks back on safety promises made in earlier versions of their Responsible Scaling Policy:
In 2023, Anthropic committed to never train an AI system unless it could guarantee in advance that the company’s safety measures were adequate. For years, its leaders touted that promise—the central pillar of their Responsible Scaling Policy (RSP)—as evidence that they are a responsible company that would withstand market incentives to rush to develop a potentially dangerous technology.
But in recent months the company decided to radically overhaul the RSP. That decision included scrapping the promise to not release AI models if Anthropic can’t guarantee proper risk mitigations in advance.
“We felt that it wouldn’t actually help anyone for us to stop training AI models,” Anthropic’s chief science officer Jared Kaplan told TIME in an exclusive interview. “We didn’t really feel, with the rapid advance of AI, that it made sense for us to make unilateral commitments … if competitors are blazing ahead.”
This is bad news for at least two reasons. First, it reinforces the impression from Dario Amodei’s essay The Adolescence of Technology, which I reviewed last week, that Anthropic leadership views the ongoing AI race as a legitimate reason for them to race full speed ahead despite safety concerns. Second, it teaches us that any safety promise they make can be withdrawn, and therefore cannot be trusted. This would be deplorable news coming out of any leading AI company, but it is made even worse by originating specifically from Anthropic, which we have been taught is the most safety-oriented of all the leading AI companies. If this is how they behave, what can we expect from their competitors? I feel that at this point, the idea of voluntary self-regulation is close to bankrupt, and that strong state or federal legislation is badly needed.
Here’s what I wrote a year ago, in my paper Our AI future and the need to stop the bear:
There are at least three major problems with current AI evals. First and most obviously, a finite amount of testing means we only get to see what happens in at most a sparse sample from the space of situations and promptings that the models may encounter when deployed in the wild. We do not know what we are missing, but we do know at least since our first summer with GPT-4 that frontier models tend to keep exhibiting new (i.e., previously undiscovered) capabilities for months after their deployment.
The second problem […] is that evals do not work if the models have the cleverness and the situational awareness to sandbag or otherwise deceive us during the testing phase. For the testing to make sense, we must operate under the assumption that the test results can be trusted, and therefore that the models being tested do not have the ability to deceive us, but this makes the entire procedure largely circular, and therefore, strictly speaking, useless. The results of Meinke et al (2024) and Greenblatt et al (2024) […] strongly suggest that we are close to the point where frontier models do have this ability.
The third problem, discussed by METR (2025) and others, is that while the evals are said to be carried out pre-deployment, this is only partly true, because in order to do the testing the models need to be deployed, either within the AI company’s safety division, or at some external evals consultant. We should not pretend that this is safe. For instance, if a model is dangerously smart in the realm of social manipulation, it would be reckless to assume that the personnel who carry out the testing, and who therefore need to engage in communication with the model, are immune to such manipulation. It therefore seems necessary to verify, prior to the evals, that the model lacks such social manipulation capabilities, but in the current paradigm such verification is meant to happen during the evals, so we have a kind of Catch-22 situation.
To this can be added the risk that models self-exfiltrate or are stolen by external bad actors. The latter I’ve always imagined would happen covertly, but recent events suggest that if the external actor is the US Government, it might also happen in broad daylight. The US Secretary of Defense Pete Hegseth is not the kind of person whose access to untested frontier models with highly unknown capability levels I would be super comfortable with.


Regarding my somewhat unkind remark about Pete Hegseth at the end of Footnote 1, please note that no human being qualifies as the kind of person whose access to untested frontier models with highly unknown capability levels I would be super comfortable with, but Hegseth qualifies even less than most others.