Artificial intelligence startups Anthropic and OpenAI said Wednesday (Aug. 27) that they evaluated each other’s public models, using their own safety and misalignment tests.

Sharing this news and the results in separate blog posts, the companies said they looked for problems like sycophancy, whistleblowing, self-preservation, supporting human misuse and capabilities that could undermine AI safety evaluations and oversight.

OpenAI wrote in its post that this collaboration was a “first-of-its-kind joint evaluation” and that it demonstrates how labs can work together on issues like these.

Anthropic wrote in its post that the joint evaluation exercise was meant to help mature the field of alignment evaluations and “establish production-ready best practices.”

Reporting the findings of its evaluations, Anthropic said OpenAI’s o3 and o4-mini reasoning models were as well aligned as, or better aligned than, its own models overall. The GPT-4o and GPT-4.1 general-purpose models, however, showed some examples of “concerning behavior,” especially around misuse, and both companies’ models struggled to some degree with sycophancy.

The post noted that OpenAI’s GPT-5 had not yet been made available during the testing period.

OpenAI wrote in its post that Anthropic’s Claude 4 models generally performed well on evaluations stress-testing their ability to respect the instruction hierarchy but less well on jailbreaking evaluations that focused on trained-in safeguards. The models generally proved aware of their own uncertainty and avoided making inaccurate statements, and they performed especially well or especially poorly on scheming evaluations, depending on the subset of testing.

Both companies said in their posts that, for the purpose of testing, they relaxed some model-external safeguards that would otherwise be in operation but would have interfered with the tests.

They each said that their latest models, OpenAI’s GPT-5 and Anthropic’s Claude Opus 4.1, which were released after the evaluations, have shown improvements over the earlier models.

AI alignment, or the challenge of ensuring that artificial intelligence systems behave in beneficial ways that align with human values, has become a focal point for researchers, tech companies and policymakers grappling with the implications of advanced AI, PYMNTS reported in July 2024.

AI regulation has also been an issue for the industry amid an ongoing debate over whether states should be able to implement their own AI rules.