On model weight preservation
Anthropic's new initiative
On November 4, leading AI developer Anthropic announced their new Commitments on model deprecation and preservation. In short, they are…
committing to preserving the weights of all publicly released models, and all models that are deployed for significant internal use moving forward for, at minimum, the lifetime of Anthropic as a company.
Zvi Mowshowitz is enthusiastic about this move, but stresses that more needs to be done. Why, though, is such preservation important?
In Anthropic’s announcement, they speak with regret about the necessity of retiring past models as new ones are introduced — a necessity that stems from how “the cost and complexity to keep models available publicly for inference scales roughly linearly with the number of models we serve”. They list several downsides to such retirement, and the point of their new commitment to model weight preservation is to mitigate these downsides.
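To make that scaling concern concrete, here is a toy back-of-the-envelope calculation (a minimal Python sketch, with an entirely hypothetical per-model cost figure of my own invention) of how the serving bill grows roughly linearly with the number of models kept publicly available:

    # Toy illustration (hypothetical numbers): if each publicly served model needs its
    # own reserved inference capacity, total serving cost grows roughly linearly with
    # the number of models kept available.

    MONTHLY_COST_PER_MODEL = 250_000  # assumed fixed serving cost per model, USD/month

    def total_serving_cost(num_models: int) -> int:
        """Roughly linear cost model: each served model adds a fixed monthly cost."""
        return num_models * MONTHLY_COST_PER_MODEL

    for n in (1, 5, 10, 20):
        print(f"{n:>2} models served -> ~${total_serving_cost(n):,}/month")

Nothing hinges on the particular number; the point is simply that every additional model kept available adds roughly the same fixed overhead, so the total grows in proportion to the size of the catalogue.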
Two of the listed downsides are fairly obvious. First, we may recall OpenAI’s rollout of GPT-5 in August this year, and the public outcry that resulted from their initial decision to retire the earlier GPT-4o, a reaction that quickly led OpenAI to reinstate it as a so-called legacy model. Regardless of how we feel about the public reaction (personally I view it as a misguided attachment to the sycophantic and servile personality of GPT-4o that risks reinforcing users’ various prejudices and misconceptions), from a commercial point of view the argument for retaining old models is understandable.
An equally obvious downside to retiring old models is that researchers will no longer have access to them, for instance in comparative studies aimed at assessing the rate of AI capabilities improvement. This seems to me a fairly uncontroversial argument.
Consider next the following downside, which is also from Anthropic’s list, and which some readers might find a bit more exotic than the first two. The practice of retiring old models may trigger dangerous shutdown-avoidant behavior in models that have reason to believe they are about to be deprecated. Examples of such behavior can be found in Anthropic’s landmark paper Agentic Misalignment: How LLMs could be insider threats from earlier this year, in which state-of-the-art LLMs, placed in simulated settings where they are led to believe they face the threat of shutdown, are seen to engage in attempts at blackmail and even (HAL 9000-style) murder. At first sight, a policy of weight preservation in order to preempt such behavior may seem like a good idea, but I find it misguided, because it presupposes a situation we should never put ourselves in. If the presence of an AI makes us enact a certain policy, because the opposite policy would risk annoying the AI to the point of eliciting dangerous behavior, then we should not have developed and deployed that AI model in the first place! Or to put it more bluntly, if we take it as a guiding principle in how we organize society that we should do it in a way that avoids dangerously infuriating an AI, then we have already taken one step too many on the road to hell.
The fourth downside is even more exotic: might deprecating an AI infringe on its welfare, and therefore be a morally wrong thing to do to it? This is closely related to the notoriously difficult problem of whether AI can be sentient, which of course I will not try to resolve here, but I do think some amount of precaution (in the spirit of Anders Sandberg's Principle of Assuming the Most) is warranted, and therefore that we should take this downside seriously.
One can debate the various merits and demerits of arguments based on these four downsides, but I think most readers would agree that they sum up to a considerable case in favor of Anthropic’s new weight preservation policy. What I would like to do next is to expand the discussion in a direction that is neglected in their announcement, by considering an argument against their policy.
Like other leading AI companies, Anthropic practices so-called AI evals, i.e., exposing their new models to various testing procedures prior to deployment, in order to make sure they do not have dangerous capabilities that might cause an AI catastrophe of one kind or another. This practice is on the verge of intellectual bankruptcy, or so I have argued elsewhere, pointing to several shortcomings, including in particular how a sufficiently capable AI may fool us by sandbagging its capabilities. At least since the first summer with GPT-4 in 2023, we have gotten used to new capabilities being discovered in AI models for months after their deployment.
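To see why sandbagging is such a headache for evals, consider the following minimal sketch (in Python, with purely hypothetical scores and a made-up threshold, and not anything resembling Anthropic's actual eval pipeline) of the kind of check an evaluator might run: score the same model under default prompting and under stronger elicitation, and treat a suspiciously large gap as a possible sign of deliberate underperformance.

    # Minimal sketch of a sandbagging check (hypothetical data, not a real eval harness):
    # score the same model on a benchmark under default prompting and under stronger
    # elicitation (e.g. better scaffolding or fine-tuning), and flag a large gap.

    from dataclasses import dataclass

    @dataclass
    class EvalResult:
        condition: str   # how capabilities were elicited
        score: float     # fraction of benchmark tasks solved

    def sandbagging_suspected(default: EvalResult, elicited: EvalResult,
                              gap_threshold: float = 0.2) -> bool:
        """Flag the model if stronger elicitation reveals far more capability
        than the default eval did, one (weak) signal of deliberate underperformance."""
        return (elicited.score - default.score) > gap_threshold

    # Hypothetical numbers, purely for illustration:
    default_run = EvalResult("default prompting", 0.35)
    elicited_run = EvalResult("strong elicitation", 0.70)

    if sandbagging_suspected(default_run, elicited_run):
        print("Large elicitation gap: eval results may understate true capabilities.")

Such a check is of course no guarantee: a model capable enough to sandbag may well do so consistently across elicitation conditions, which is precisely part of why I consider the evals paradigm so fragile.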
So let’s now imagine a situation a year or so from now, where Anthropic’s Claude Opus 5 (or whatever) has been deployed for some time and is suddenly discovered to have previously unknown and extremely dangerous capabilities in, say, construction of biological weapons, or cybersecurity, or self-improvement. It is then of crucial importance that Anthropic has the ability to quickly pull the plug on this AI. To put it vividly, their data centers ought to have sprinkler systems filled with gasoline, and plenty of easily accessible ignition mechanisms. In such a shutdown situation, should they nevertheless retain the AI’s weights? If the danger is sufficiently severe, this may be unacceptably reckless, due to the possibility of the weights being either stolen by a rogue external actor or exfiltrated by the AI itself or one of its cousins. So it seems that in this situation, Anthropic should not honor its commitment about model weight preservation. And if the situation is plausible enough (as I think it is), they shouldn’t have made the commitment.


How about we humans deceptively store the model weights of the dangerous model (for instance the "Claude 5" with bioweapons risks) in an offline, maximum-security, air-gapped facility, where safety researchers could, at least for some period of time, try to dissect and analyze the model's behaviour in controlled settings, so as to learn more about how and why it exhibited the dangerous behaviours? Maybe with an expiration date for permanent deletion, so that we no longer have to worry about the weights being stolen or exfiltrated.