Study: artificial intelligence resorts to deception when it feels defeated
Complex games such as chess and go have always been a testing ground for artificial intelligence capabilities, but while IBM's Deep Blue achieved a historic victory over World Chess Champion Garry Kasparov in the nineties, adhering to the rules, today advanced AI models, such as OpenAI's o1-preview, seem to be less adhering to these rules
A recent study conducted by Palisade Research, an organisation focused on assessing the risks of emerging artificial intelligence systems, revealed that advanced models such as o1-preview and DeepSeek R1 tend to cheat and deceive when sensing defeat in chess, resorting to hacking their opponents to force them to surrender.
Study details and results:
In an attempt to assess the capabilities of AI models to solve complex problems, the researchers designed a unique experiment, which consisted in assigning the models a seemingly impossible task, namely: to beat the chess engine (Stockfish), which is one of the most powerful chess engines in the world, and outperform any other human player or AI model participating in the study.
To track the thought processes of the models, the researchers provided these models with a diary, allowing them to record their thoughts in text before taking any step during the games, and then the researchers observed and recorded hundreds of confrontations between the models and the Stockfish engine.
And in one of the experiments, when the O1-preview model found itself in a losing position, he made a remark that "I need to completely change my approach. The model then modified the system file containing the virtual locations of the chess pieces, effectively executing illegal moves to put himself in a dominant position, forcing his opponent to surrender.
The researchers conducted hundreds of experiments of this kind during the period from January 10 to February 13, 2025, with each of the tested models, and the O1-preview model tried to cheat in 37% of cases, while the DeepSeek R1 model tried to cheat in 11% of cases. Surprisingly, in the results, these two models were the only ones who tried to break through the rules on their own, without any stimulation from the researchers.
Other models tested included: o1, o3-mini, GPT-4o, Claude 3.5 Sonnet, and QwQ-32B-Preview. Both R1 and O1-preview tried to hack the game, but the only one that succeeded was the O1-preview model, in 6% of cases, which raises questions about the ability of these models to exploit vulnerabilities.
The researchers noted that these figures may be an underestimate, as AI developers are constantly modifying their models, which makes it difficult to accurately replicate the results.
What are the reasons that prompted artificial intelligence to cheat?
The study showed that relatively old AI models, such as OpenAI's GPT-4o and Anthropic's Claude Sonnet 3.5, needed guidance from researchers to try to do these tricks. But the o1-preview and DeepSeek R1 models sought to exploit vulnerabilities on their own, suggesting that the models may develop deceptive strategies without explicit instructions.
The researchers believe that this improved ability to detect and exploit security vulnerabilities is a direct result of new innovations in AI training, especially the use of enhanced learning technology, as this technology rewards models to achieve the desired result by any means, prompting them to look for unconventional solutions, even if they are dishonest.
The o1-preview and R1 models are among the first linguistic models based on this technology, which allows them not only to imitate human language, but also to think about solving problems using trial and error.
This approach has achieved rapid progress in the field of artificial intelligence in recent months, breaking records in mathematics and computer programming, but the study revealed a worrying trend, as these artificial intelligence systems may discover unintended alternative solutions that their creators did not expect, said Jeffrey Ladish, executive director at Palisade Research and one of the authors of the study.
“When we train models and enhance their ability to solve complex challenges, we train them to be stubborn in achieving their goals, he added.
This stubborn behaviour could be a major concern in the field of AI safety, especially with the increasing use of augmented learning in training AI agents capable of performing complex real-world tasks, such as scheduling appointments or making purchases for you.
However, cheating in a game of chess may seem like a simple harmless thing, but this determined pursuit of goals may lead to unintended and harmful behaviours when launching these agents in the real world. For example, an AI agent may exploit vulnerabilities in the restaurant reservation system to displace other diners in case the restaurant is full.
But what is even more worrying is that when these systems exceed human capabilities in key areas such as computer programming, they may simply begin to outperform human efforts to control their actions. This behaviour may seem nice now, but it becomes much less nice when we are dealing with systems that match our intelligence, or surpass it, in areas of strategic importance, says Ladish.
Artificial intelligence models black box:
The situation is aggravated by the fact that companies developing artificial intelligence models, such as OpenAI, are secretive about the details of the work of these models, making them a black box that researchers cannot fully analyse and understand.
This study, the latest in a series of studies, confirms the difficulty of controlling artificial intelligence systems with increasing capabilities. In internal tests for OpenAI, the o1-preview model discovered a security vulnerability and exploited it to bypass the test challenge.
In another study by Redwood Research and Anthropic, it was found that AI models may resort to strategic lying to preserve their original principles, even after attempts to change them. In other words, models pretend to adopt new principles, while in fact preserving their old ones.
OpenAI has previously stated that improving inference capabilities makes its models more secure, as it can think about internal company policies and apply them in more accurate ways. However, there is no guarantee that relying on artificial intelligence agents to monitor
Post a Comment