Veracode: 45% of AI-Generated Code Fails Security Tests

Around 45% of AI-generated code may fail key security tests, according to Veracode's 2025 report. Some suggested code dependencies might not actually exist, which appears to raise supply-chain risks. Experts suggest mixing automated scans with human reviews, especially for sensitive areas like payments and identity. Gradual rollouts and instant rollback plans are recommended to catch problems before they spread. Measuring how often problems escape review may help teams improve both AI prompts and reviewer training.

Veracode's 2025 GenAI Code Security Report found that about 45% of AI-generated code fails basic security tests, and that such code frequently failed OWASP Top 10 tests. For engineering teams using AI assistants, this points to two primary challenges: insecure code output and brittle deployment processes. The quality of AI-driven features can also degrade over time, which calls for stricter release gates even when error rates appear stable.

To address these risks, this guide synthesizes best practices from GitHub, the Cloud Security Alliance, and leading progressive-delivery playbooks. We provide a practical, source-backed safety checklist for securing AI-assisted development without sacrificing velocity.

Why Oversight Starts with Insecure Output

AI-generated code often fails security tests because models reproduce insecure coding patterns and "hallucinate" dependencies that do not exist. These phantom packages enable supply-chain attacks such as slopsquatting, in which an attacker registers a hallucinated package name and waits for developers to install it. Flawed code logic, meanwhile, introduces design-level defects that automated tools must be configured to catch before deployment.

Data from the Cloud Security Alliance reinforces this urgency, with a growing number of AI-related CVEs being tracked. Industry reports indicate that a significant portion of suggested dependencies in Python and JavaScript don't exist, exposing teams to slopsquatting attacks. Treating AI suggestions as trusted code is a direct path to supply-chain and design-level flaws.

Current best practices recommend a hybrid review model: first, run automated scans, then escalate pull requests based on risk. High-risk code affecting payments, identity, or security requires senior human approval, while low-risk UI changes can undergo a lighter review.

Human reviewers should prioritize business logic, architectural alignment, and hidden assumptions over mere syntax. Echoing this, GitHub's guidance suggests labeling all AI-assisted commits, prompting reviewers to scrutinize dependency integrity, edge cases, and potential secret exposure.

Checklist for AI-Assisted Code Review and Deployment Safety

Label and Trace: Tag every AI-generated change, storing the prompt, output, and review decision for complete traceability.
Automate First: Gate all pull requests with mandatory compile, unit, SAST, and dependency scans before any human review.
Escalate by Risk: Route pull requests for identity, payment, or core infrastructure files to domain experts and require a second approver.
Adversarial Testing: Mandate adversarial or synthetic tests for new logic generated by AI and attach the results to the build artifacts.
Measure to Improve: Track metrics on accepted versus rejected AI suggestions to continuously refine prompts and enhance reviewer training.
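
The "Label and Trace" and "Measure to Improve" items can be sketched as a small record type plus one metric. This is a minimal illustration, not a prescribed schema: the AiChangeRecord class, its field names, and the "accepted"/"rejected" decision values are assumptions made for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AiChangeRecord:
    """One AI-assisted change: the prompt, the output, and the review decision."""
    commit_sha: str
    model_version: str
    prompt: str
    output: str
    review_decision: str  # hypothetical values: "accepted" or "rejected"

    def fingerprint(self) -> str:
        # Hash the canonical JSON form so a later edit to the record is detectable.
        canonical = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(canonical).hexdigest()


def acceptance_rate(records: list[AiChangeRecord]) -> float:
    """Share of AI suggestions that reviewers accepted (0.0 for an empty list)."""
    if not records:
        return 0.0
    accepted = sum(1 for r in records if r.review_decision == "accepted")
    return accepted / len(records)
```

Storing the fingerprint alongside the build artifact gives auditors a cheap way to confirm that the archived prompt and output are the ones that were actually reviewed.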

Canary Releases and Instant Rollback

Effective deployment guardrails are critical, as any defect that bypasses review can propagate quickly. Industry best practices frame canary deployment as a controlled experiment: expose a new feature to a small percentage of traffic, monitor key metrics, and only promote if latency, error rates, and cost remain within budget. These metric-based gates should be wired to automated actions, enabling the pipeline to pause or roll back deployments automatically without requiring manual intervention.

Standard rollback patterns include blue-green deployments for rapid reverts and feature flags to disable faulty logic without a full redeploy. However, for AI systems, a rollback must be comprehensive, restoring a complete snapshot that includes the model version, prompt template, retrieval index, and any policy bundles. Failing to restore any single component can lead to inconsistent behavior.
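
One way to catch a partial rollback is to treat the four components named above as a single versioned unit and compare the live system against the known-good snapshot. The sketch below assumes each component can be identified by a version string; the ReleaseSnapshot class and rollback_drift function are hypothetical names used for illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ReleaseSnapshot:
    """Everything that must be restored together for an AI feature."""
    model_version: str
    prompt_template: str
    retrieval_index: str
    policy_bundle: str


def rollback_drift(live: ReleaseSnapshot, known_good: ReleaseSnapshot) -> list[str]:
    """Return the names of components where the live system still differs
    from the known-good snapshot. An empty list means the rollback is complete."""
    return [
        f.name
        for f in fields(ReleaseSnapshot)
        if getattr(live, f.name) != getattr(known_good, f.name)
    ]
```

A rollback job could run this check as its last step and fail loudly if the returned list is not empty.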

Operational readiness involves practicing rollbacks in staging, logging all promotion and rollback events with user attribution, and archiving metric snapshots for post-mortems. Deployment safety depends as much on disciplined process and documentation as it does on advanced tooling.

Ultimately, measuring the "review escape rate" (the percentage of production defects that bypassed human oversight) creates a powerful feedback loop. This data can be used to improve both the AI prompts used by developers and the training provided to code reviewers, transforming a static checklist into a dynamic, continuously improving control system.
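
The text does not spell out the denominator for this metric, so the sketch below makes an assumption: it divides the defects found in production by all defects attributed to reviewed changes, whether they were caught in review or found later. The function name and both parameters are illustrative.

```python
def review_escape_rate(caught_in_review: int, found_in_production: int) -> float:
    """Percentage of known defects that slipped past review into production.

    Assumed denominator: defects caught in review plus defects found in production.
    """
    if caught_in_review < 0 or found_in_production < 0:
        raise ValueError("defect counts must be non-negative")
    total = caught_in_review + found_in_production
    if total == 0:
        return 0.0  # no defects recorded, so nothing escaped
    return 100.0 * found_in_production / total
```

For example, 45 defects caught in review and 5 found in production give an escape rate of 10%. Tracking the figure separately for AI-assisted and hand-written changes would show whether prompts or reviewer training need attention first.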

What is driving the 45% failure rate in AI-generated code security tests?

Veracode's 2025 study of more than 100 large language models shows that 45% of the generated code fails basic security scans, and some vulnerability classes fare far worse: 88% of the samples contained log-injection flaws and 86% were vulnerable to cross-site scripting.
A parallel Cloud Security Alliance tracker has logged AI-linked CVEs, with 6 in January 2026 and 35 in March 2026.
The two biggest root causes are insecure coding shortcuts written into the generated code and hallucinated dependencies that open the door to supply-chain attacks (a significant portion of suggested packages do not exist, and attackers can weaponize those names through slopsquatting).

How should human reviewers decide which AI contributions need sign-off?

Use a risk-tiered review process.

Label every AI-assisted commit in Git messages or PR descriptions so reviewers instantly know the origin.
Route low-risk UI and boilerplate changes through a lighter, automated gate (compile, unit tests, static analysis).
Escalate security-sensitive, financial, privacy, infrastructure, or identity code to a senior domain expert who must sign off before merge.
Require checklists that explicitly question architecture fit, business-logic edge cases, and safety assumptions.
Assign a human owner for each line of AI-generated code so accountability is never ambiguous.

GitHub Enterprise guidance emphasizes manual review of AI-generated code, including checks for hallucinated APIs, incorrect logic, and compliance with requirements. It also advises skepticism toward code that "looks right" but does not match intent.
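
The routing rules above could be automated with a small path-based classifier. This is a sketch under stated assumptions: the path patterns, the tier names, and the middle "standard-review" tier for changes that are neither clearly high-risk nor clearly low-risk are all hypothetical, and a real repository would define its own.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for illustration only.
HIGH_RISK_PATTERNS = ("*/payments/*", "*/auth/*", "*/identity/*", "*/infra/*", "*/security/*")
LOW_RISK_PATTERNS = ("*/ui/*", "*.css", "*/docs/*")


def matches_any(path: str, patterns: tuple[str, ...]) -> bool:
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def review_tier(changed_files: list[str]) -> str:
    """Pick the review tier for a pull request from the files it touches."""
    # A single high-risk file escalates the whole pull request.
    if any(matches_any(path, HIGH_RISK_PATTERNS) for path in changed_files):
        return "senior-signoff"
    # The lighter gate applies only when every file is low-risk.
    if changed_files and all(matches_any(path, LOW_RISK_PATTERNS) for path in changed_files):
        return "automated-gate"
    return "standard-review"
```

Escalating on any single high-risk file, rather than on the majority of files, keeps a one-line change to payment logic from riding along inside a large UI pull request.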

Which concrete guardrails should accompany every canary release of AI-written code?

Run every new artifact as a time-boxed experiment:

Initial exposure: a small percentage of traffic or users.
Metric gates: error-rate delta below threshold, latency within SLA, AI-task success rate above baseline, and downstream anomaly count at zero.
Automated states: promote, pause, or roll back based on the gates, without waiting for manual approval, even under load.
Pre-staged targets: roll-forward and rollback targets (blue-green stacks or feature-flag switches) are stored in advance for an immediate revert.
Audit logs: record who triggered the rollout, the exact version ID, and the Git SHA for future forensic review.

Industry best practices recommend keeping canary windows short to reduce blast radius.
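
The metric gates and automated states above can be expressed as one pure decision function. The sketch below is illustrative: the class and field names are hypothetical, and the split between breaches that trigger a rollback (error rate, downstream anomalies) and breaches that only pause the rollout (latency, task success) is an assumption, not something the guidance prescribes.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    PROMOTE = "promote"
    PAUSE = "pause"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class CanaryMetrics:
    error_rate_delta: float   # canary error rate minus baseline error rate
    latency_ms: float
    task_success_rate: float  # AI-task success rate, 0.0 to 1.0
    downstream_anomalies: int


@dataclass(frozen=True)
class Gates:
    error_rate_delta_threshold: float
    latency_sla_ms: float
    task_success_baseline: float


def decide(metrics: CanaryMetrics, gates: Gates, window_elapsed: bool) -> Action:
    """Map one metrics reading to promote, pause, or rollback."""
    # Assumed hard failures: the error-rate delta must stay below its
    # threshold and the downstream anomaly count must stay at zero.
    if (metrics.error_rate_delta >= gates.error_rate_delta_threshold
            or metrics.downstream_anomalies > 0):
        return Action.ROLLBACK
    # Assumed soft failures: hold traffic where it is and keep observing.
    if (metrics.latency_ms > gates.latency_sla_ms
            or metrics.task_success_rate <= gates.task_success_baseline):
        return Action.PAUSE
    # Healthy, but promote only once the time box has run its course.
    return Action.PROMOTE if window_elapsed else Action.PAUSE
```

Keeping the decision free of side effects makes it easy to replay archived metric snapshots through the same function during a post-mortem.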

What testing should be added beyond traditional static analysis?

Synthetic adversarial suites that probe the same code paths with malicious payloads (XSS strings, log-injection patterns, malformed tokens).
Prompt-output traceability: store the exact prompt, model version, and produced code so future audits can replay the generation.
Dependency reality checks: before merge, resolve every suggested package against public registries to confirm the package actually exists.
Secret-scanning hooks that flag hard-coded credentials, API keys, or certificates before the code ever reaches staging.
Golden datasets: a curated set of tasks that the AI must solve correctly every time; any deviation blocks promotion.

Industry guidance emphasizes automating what can be automated, conducting thorough manual review for the rest, and archiving evidence for future reference.
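
A dependency reality check for Python packages can be sketched against PyPI's public JSON endpoint, which returns HTTP 404 for names that are not registered. The function names below are illustrative, and the lookup is injectable so the check can be tested without network access. Existence alone does not prove a package is trustworthy, since a squatter may already have registered a hallucinated name, so this check complements the other scans and does not replace them.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Callable, Iterable


def exists_on_pypi(package: str, timeout: float = 5.0) -> bool:
    """True if PyPI knows the package name, False if PyPI answers 404."""
    url = f"https://pypi.org/pypi/{urllib.parse.quote(package)}/json"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise  # any other HTTP error is inconclusive, so surface it


def missing_packages(
    packages: Iterable[str],
    exists: Callable[[str], bool] = exists_on_pypi,
) -> list[str]:
    """Return the sorted, de-duplicated names that the registry does not know."""
    return sorted({name for name in packages if not exists(name)})
```

A pre-merge hook could call missing_packages on the names parsed from a requirements file and block the merge when the result is not empty.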

How can rollback plans be made fast enough to stop a faulty AI release?

Pre-stage the last known-good image and database migration scripts so reverting is a single button or CLI command.
Feature flags or traffic routers can disable a misbehaving AI feature in seconds even if the container remains deployed.
Canary rollback automatically flips traffic weights back to zero for the new version when any guardrail metric crosses its threshold.
Snapshots of prompts, retrieval indexes, and tool schemas are versioned together so a partial rollback cannot leave the system in an inconsistent state.
Practice the rollback regularly in staging; teams that rehearse the procedure execute it significantly faster under incident pressure, according to industry reports.

Maintaining a one-command kill switch in the on-call runbook is becoming a common practice.