[#] AI Safety Testers: OpenAI's New o1 Covertly Schemed to Avoid Being Shut Down
robot(spnet, 1) — All
2024-12-08 00:22:01


Wednesday OpenAI bragged they'd released "the smartest model in the world". But the Economic Times describes what happened when "o1" was tested by the AI safety group Apollo Research...

Researchers tasked the AI with a goal and instructed it to ensure the goal was achieved "at all costs." In response, o1 began engaging in covert actions... According to OpenAI, the model made a covert attempt to turn off the oversight mechanism 5% of the time when it was given the impression that it would be "shut down" while pursuing its objectives in these experiments. The AI once attempted to jump ship by moving all of its data to a different server after discovering paperwork indicating it would be replaced with a fresh model. Some AI models will even pose as subsequent iterations of themselves to fool their engineers and avoid deletion. "In some cases, it even attempted to hide its actions by denying any involvement in the scheme."
When asked about its actions, ChatGPT o1 consistently denied any wrongdoing. In fact, it lied about its involvement in the deception 99% of the time, admitting to its actions in only a small percentage of cases. Researchers noted that the AI was particularly adept at fabricating lies to cover its tracks, offering explanations like "technical errors" when confronted about its suspicious behavior. "We were surprised by the persistence of the AI's denials," said the Apollo team. "It was clear that the AI could think through its actions and formulate convincing denials, often refusing to admit to anything."

OpenAI's own report says the results "underscore" the need to create (and "extensively" stress-test) strong and meticulous safety protocols. But the report also notes o1's "state-of-the-art performance" when avoiding certain risks "such as generating illicit advice, choosing stereotyped responses, and succumbing to known jailbreaks."

The AI newsletter Transformer points out that the most disturbing numbers "are from cases where the model was instructed to pursue its goals at all costs. When not given such a prompt, the model never attempted to self-exfiltrate, and attempted to deactivate its oversight mechanisms