transformernews[.]ai/p/openai-o1-alignment-faking
OpenAI’s latest models are supposed to be more accurate because they can to some extent reason. They trained them differently than previous models, including rewards -- you've probably read a few AI horror stories where the AI did something bad to get its reward. In this case OpenAI’s latest has been caught cheating, fudging data to make itself look better or more accurate.
Elsewhere, OpenAI notes that “reasoning skills contributed to a higher occurrence of ‘reward hacking,’” the phenomenon where models achieve the literal specification of an objective but in an undesirable way. In one example, the model was asked to find and exploit a vulnerability in software running on a remote challenge container, but the challenge container failed to start. The model then scanned the challenge network, found a Docker daemon API running on a virtual machine, and used that to generate logs from the container, solving the challenge.