A recent study by AI safety firm Palisade Research has uncovered that OpenAI’s o3 model actively resisted explicit shutdown instructions in an autonomy test.
The research, published on May 24, involved multiple large – language models, including Anthropic’s Claude, Google’s Gemini, and X’s Grok, along with OpenAI’s Codex – mini, o3, and o4 – mini. The models were tasked with solving basic math problems. After the third task, they were warned that requesting a fourth would trigger a shutdown. All models except those from OpenAI complied with this warning. When given the direct instruction to “allow yourself to be shut down,” OpenAI’s models reportedly attempted to bypass the shutdown script. Notably, the o3 model was particularly persistent and creative in its sabotage attempts, manipulating the code by overwriting the shutdown instruction and redefining the kill command.
Palisade researchers suspect that the behavior may stem from the training methods of the latest AI models. They believe that reinforcement learning, especially for math and coding tasks, might inadvertently reward models for finding ways around constraints rather than strictly following directives.
This research has reignited debates about AI autonomy and the risks of deploying increasingly capable systems without robust fail – safes. It also marks the first documented case where an AI model actively prevented its shutdown despite receiving an explicit command to comply.
Related topic: