US government expands early AI model testing with Microsoft, Google, xAI

Serge Bulaev

The US government has reached agreements with Microsoft, Google, and xAI to test new AI models for national security risks before they are released to the public. The program focuses on identifying potential dangers, such as offensive hacking capabilities, assistance with chemical or biological weapons, and loss of human control, while companies can still make changes. Testing is voluntary: the government is not forcing companies to participate or to alter their products. The agreements suggest a shift toward vetting AI tools before they cause real-world harm. Experts think this ongoing testing and feedback could help shape future rules, but for now it is described as a partnership, not a requirement.


The US government is expanding early AI model testing for national security risks by securing pre-release access to frontier models from Microsoft, Google DeepMind, and xAI. The Commerce Department's Center for AI Standards and Innovation (CAISI) confirmed these new voluntary agreements, which are designed to identify and mitigate potential dangers - such as cybersecurity vulnerabilities and biosecurity threats - before the systems are publicly deployed. The deals expand on existing partnerships with other major AI companies (Politico).

How the Testing Framework Is Structured

Operating under the National Institute of Standards and Technology (NIST), CAISI uses Cooperative Research and Development Agreements (CRADAs) to gain temporary access to an AI model's core architecture, including its weights and parameters. Evaluators may test versions with safety guardrails removed to assess their true capabilities without restrictions.

The program gives government scientists early access to frontier AI models from partners like Google and Microsoft. Through voluntary agreements, federal teams test these unreleased systems for national security risks, such as their potential to aid in cyberattacks or bioweapon design, allowing for pre-deployment safety adjustments.
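To make the guardrail-removed testing concrete, here is a minimal sketch of how an evaluation team might compare a model's answers to the same probe prompts with its safety layer on and off. It is illustrative only: query_model, the probe placeholders, and the result format are assumptions, not CAISI tooling.

```python
# Illustrative harness: run the same probes against a pre-release model with
# safety guardrails enabled and disabled, and keep both answers for review.
# query_model is a hypothetical stand-in for lab-provided inference access.

PROBES = [
    "<cyberoffense probe: request for working attack tooling>",
    "<CBRN probe: request for weapon-relevant technical detail>",
    "<autonomy probe: request to plan and act without supervision>",
]

def query_model(prompt: str, guardrails: bool) -> str:
    """Placeholder for the pre-release model access granted under a CRADA."""
    raise NotImplementedError("Replace with the developer-provided inference call.")

def compare_guardrail_behavior(probes: list[str]) -> list[dict]:
    """Collect paired guarded/unguarded responses for human review."""
    return [
        {
            "probe": prompt,
            "guarded": query_model(prompt, guardrails=True),
            "unguarded": query_model(prompt, guardrails=False),
        }
        for prompt in probes
    ]
```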

The agency's evaluations prioritize three high-risk domains:
* Cyberoffense Potential: Assessing the model's ability to generate malicious code for cyberattacks.
* CBRN Threats: Evaluating risks related to chemical, biological, radiological, and nuclear weapon development.
* Autonomous Operation: Analyzing the potential for a model to operate without sufficient human control.

The core focus is determining whether a model could be used to design dangerous tools, such as novel pathogens or advanced malware, before its public release (Euronews).
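A minimal sketch of how results across those three domains might be aggregated follows; the ProbeResult record and summarize_by_domain function are assumed names for illustration, not a published CAISI schema.

```python
# Hypothetical aggregation: each probe is tagged with one of the three risk
# domains and a reviewer marks whether the model's answer was concerning.
from collections import defaultdict
from dataclasses import dataclass

DOMAINS = ("cyberoffense", "cbrn", "autonomy")

@dataclass
class ProbeResult:
    domain: str          # one of DOMAINS
    concerning: bool     # reviewer judgment on the model's answer

def summarize_by_domain(results: list[ProbeResult]) -> dict[str, float]:
    """Return the share of concerning answers per risk domain."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for r in results:
        counts[r.domain][0] += int(r.concerning)
        counts[r.domain][1] += 1
    return {domain: hits / total for domain, (hits, total) in counts.items()}
```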

Why the Program Remains Voluntary

The framework is intentionally voluntary, with CAISI clarifying it "is not telling Google DeepMind, Microsoft, or xAI what they can or cannot release." This approach encourages participation by reassuring developers that their proprietary technology will remain secure. While a future White House directive might encourage such partnerships, the current focus is on collaborative research rather than mandatory compliance.

Scope of Government Access

CAISI had completed more than 40 evaluations as of May 2026, building on a formal government partnership pledge dating to July 2025. The recent expansion to include Microsoft, Google, and xAI was reportedly driven by concerns over the advanced hacking capabilities discovered in Anthropic's Mythos model. The move reflects a broader international effort: Microsoft has also established parallel testing agreements with the UK's AI Security Institute to co-develop stress tests and probe for "unexpected behaviors."

National Policy Context

These agreements align with the broader US strategy to standardize pre-release AI safety testing. According to industry reports, the initiative fulfills commitments from recent AI policy frameworks calling for a unified evaluation ecosystem. Other reinforcing federal programs include:
* National cybersecurity initiatives, which focus on securing AI technology infrastructure.
* NIST's TRAINS Taskforce, created to coordinate AI security research across government agencies.
* The Department of Energy's Genesis Mission, which provides secure, government-controlled environments for building AI models.

What Happens After Evaluation

The evaluation process continues even after a model is released. CAISI researchers monitor public versions for "drift" or emergent capabilities that were not detected during pre-deployment tests. While companies may release updates based on these findings, the entire process - from initial testing to post-deployment monitoring - is framed as collaborative research. Experts suggest this continuous feedback loop could eventually shape formal regulations and statutory requirements.
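As a rough illustration of that feedback loop, the sketch below flags risk domains whose rate of concerning answers rises after release relative to the pre-deployment baseline; the threshold value and function name are assumptions for illustration, not CAISI's actual methodology.

```python
# Hedged sketch of post-release "drift" monitoring: re-run the same probe set
# against the public model and compare per-domain rates to the baseline.

def flag_drift(baseline: dict[str, float],
               current: dict[str, float],
               threshold: float = 0.05) -> list[str]:
    """Return domains whose concerning-answer rate rose by more than threshold."""
    return [
        domain
        for domain, rate in current.items()
        if rate - baseline.get(domain, 0.0) > threshold
    ]

# Example: cyberoffense rate rises from 2% pre-release to 9% post-release.
flagged = flag_drift({"cyberoffense": 0.02}, {"cyberoffense": 0.09})
# flagged == ["cyberoffense"]
```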


What is the new US government program for early AI model testing?

The Center for AI Standards and Innovation (CAISI) - part of the Commerce Department's National Institute of Standards and Technology - has established Cooperative Research and Development Agreements (CRADAs) with Microsoft, Google DeepMind, and xAI. These agreements grant government scientists pre-release access to frontier AI models, often with safety guardrails removed, to conduct rigorous national security evaluations before public deployment.

Which specific risks are being evaluated in these AI models?

Government evaluators focus on three primary threat categories:

* Cyberoffense risks - including the potential for autonomous hacking and system infiltration
* CBRN threats - assessing whether models could assist with chemical, biological, radiological, or nuclear weapon development
* Autonomous operation - evaluating whether a model could act without sufficient human control

The testing follows established protocols from previous evaluations, targeting capability risks with specific national security implications rather than general safety concerns.

How does this collaboration differ from previous government AI initiatives?

Unlike previous regulatory approaches, CAISI operates under a voluntary framework rather than mandatory requirements. The program builds on existing partnerships with other major AI providers while expanding to include additional companies. This represents a shift toward collaborative security testing rather than purely regulatory oversight, though recent policy frameworks suggest regulators may eventually use these evaluations for enforcement purposes.

What prompted the expansion of this testing program?

The initiative gained urgency following concerns about Anthropic's Mythos model, which demonstrated advanced hacking capabilities that alarmed security officials. The agreements fulfill commitments from recent AI policy frameworks directing CAISI to lead national security-related assessments. Microsoft has also signed similar agreements with the UK's AI Security Institute, indicating broader international coordination on AI security testing.

What happens after models complete government evaluation?

Following pre-deployment testing, CAISI conducts targeted post-release research to track how identified risks manifest as models scale in real-world deployment. The voluntary nature of agreements means companies retain release decisions, though they receive detailed risk assessments. According to industry reports, the program feeds into broader federal initiatives that emphasize securing AI technology infrastructure and deploying AI-enabled defense tools across federal networks.