How about we humans secretly store the weights of the dangerous model (say, the hypothetical "Claude 5" with bioweapons risks) in an offline, maximum-security air-gapped facility, where safety researchers could, at least for some period of time, try to dissect and analyze the model's behaviour in controlled settings, so as to learn more about how and why it exhibited the dangerous behaviours? Maybe with an expiration date for permanent deletion, so we no longer have to worry that the weights could be stolen or exfiltrated.
I would be a lot happier with such a solution if these safety researchers could be shown to be immune to manipulation attempts from Claude 5, but as things stand today, it's hard to see how such a safety protocol could be made convincing. (And to tell the truth, this aspect makes me concerned not only about this hypothetical Claude 5 scenario, but also about present-day AI evals.)
Can I buy 2 kg of AI or, if possible, 2 kg of intelligence? 🙂 hanswestergren@hotmail.com
I know! We ask them to pinky promise! That should work!