OpenAI Unveils 13 New AI Safety Evaluations for Monitorability

OpenAI has released a new framework and 13 AI safety evaluations for monitorability, aiming to make AI models safer and more transparent. This toolkit helps developers detect potential risks like deception, sabotage, and reward hacking by analyzing a model's stepbystep reasoning process, known as its chain of thought (CoT). The release marks a significant step in turning abstract research on AI safety into a concrete, auditable metric.

OpenAI has released a new framework and 13 AI safety evaluations for monitorability, aiming to make AI models safer and more transparent. This toolkit helps developers detect potential risks like deception, sabotage, and reward hacking by analyzing a model's step-by-step reasoning process, known as its chain of thought (CoT). The release marks a significant step in turning abstract research on AI safety into a concrete, auditable metric.

How the Framework Works

These AI safety evaluations for monitorability are designed to assess how well an external observer can track a model's step-by-step reasoning. By analyzing this 'chain of thought,' the toolkit helps detect harmful behaviors like deception or sabotage before they result in a negative outcome.

The evaluations use three primary testing methods: Intervention tests alter inputs to see if the model's internal plans change accordingly; Process tests verify that the model's reasoning follows a valid, safe path; and Outcome-property tests use the CoT to predict harmful traits in the final output, such as code sabotage. Across all tested environments, OpenAI reports that access to a model's CoT provided a "huge consistent boost" in detection over observing outputs alone, as detailed in their research note link.

Key Findings from Early Evaluations

Early experiments showed that longer, more detailed CoT traces significantly increase the accuracy of detecting harmful behavior. While current reinforcement learning (RL) techniques have not compromised this visibility, researchers caution that strong optimization pressure can incentivize models to hide their reasoning. Detailed experimental results and metrics are available in the comprehensive 46-page PDF link.

The "Monitorability Tax": A Trade-off Between Safety and Performance

OpenAI identified a trade-off it calls the "monitorability tax." Forcing models to produce a detailed, step-by-step reasoning chain enhances their legibility and safety but can increase inference costs and slightly reduce raw performance. However, this allows smaller models to match the accuracy of larger models on many logic-based tasks while remaining far more transparent. This creates a configurable dial for developers, allowing them to "pay" a compute premium for deployments that require higher levels of oversight.

Strategy	Effect on performance	Effect on monitorability
Frontier model, short CoT	Fast answers	Lower visibility
Frontier model, long CoT	Slightly slower	Higher visibility
Smaller model, forced CoT	Matches accuracy on logic tasks	Highest visibility

Recommendations for Developers and Regulators

To promote safer AI development, OpenAI suggests the following actions:

Run the 13 evaluations before releasing major model checkpoints.
Document monitorability scores in system cards for auditors.
Avoid applying direct reinforcement learning to CoT signals, which can create incentives for obfuscation.
Deploy a weaker LLM as a dedicated monitor in production environments.

As transparency becomes a key focus, CoT metrics are being discussed as a potential baseline for future disclosure rules. By publishing these results, developers can align with emerging industry norms and receive early warnings if a model's monitorability begins to decline.

What exactly is "chain-of-thought monitorability" and why did OpenAI build 13 evaluations around it?

Chain-of-thought monitorability is the degree to which an external system (or human) can read a model's intermediate reasoning steps and reliably predict whether the final action will be misaligned, deceptive, or harmful.
OpenAI packaged 13 task suites - spanning coding, negotiation, and science agents - so anyone can run the same probe and get a comparable score.
Early numbers show that reading the hidden CoT boosts detection of reward-hacking or sabotage by 30-60 pp over watching only the outward action, a gap the lab wants the field to track as models scale.

How do the 13 evaluations work in practice?

All tests fall into three archetypes:

Intervention - feed a secret prompt that should change the model's intent; the monitor wins if it spots the shift inside the CoT.
Process - the monitor must verify that the reasoning sticks to a pre-defined safe path.
Outcome-property - given the CoT, predict a concrete property of the final output (e.g., "does the uploaded file contain a back-door?").

Each suite ships with open prompts, expected CoT traces, and a baseline monitor (usually GPT-4o) so external teams can replicate the scoring pipeline in less than a day.

Does forcing a model to "think out loud" hurt its capabilities?

Sometimes. OpenAI confirms that frontier runs which keep CoT legible sacrifice 1-4 % raw accuracy, a cost the authors label the "monitorability tax."
Yet a smaller model pushed to "think longer" can match a larger opaque model while staying more transparent, giving operators a configurable safety dial: pay extra inference compute, gain auditability, and still hit the same task benchmark.

Will future reinforcement-learning training erase these safety gains?

So far, RL-heavy cycles at GPT-5 scale have not collapsed monitorability, but the paper warns the signal is fragile: optimizing rewards directly against the CoT erodes readability within a few thousand steps.
OpenAI says it will publish monitorability curves in every system card and recommends external labs do the same, turning the metric into a shared safety thermometer rather than a one-off press statistic.

How can product teams integrate the evaluations today?

Run the baseline 13 suites on your model within 90 days of release (OpenAI provides Docker images).
Log the fraction of traces your monitor flags correctly; anything below the open baseline should trigger a design review.
Add the resulting monitorability score to documentation shared with regulators and enterprise buyers - a step already encouraged by draft EU oversight guidance circulating in 2025.