OpenAI unveils LifeSciBench, GPT-5.4 AI chemist raises yields
Serge Bulaev
OpenAI has released LifeSciBench, a set of 750 expert-written tasks that may help measure how useful AI models are in real science work. LifeSciBench includes tasks that need multi-step thinking and original answers, and experts think it could become a common way to compare AI for drug discovery. In another update, GPT-5.4 was used in a chemistry lab and appears to have raised the average yield of a difficult reaction. Some experts suggest these tools could speed up research, but there are concerns about safety because only a few biological AI systems seem to have strong safeguards. LifeSciBench's detailed grading might help spot problems before AI tools are used more widely, but close monitoring may still be needed.

OpenAI has unveiled LifeSciBench, a benchmark for real-world science, and demonstrated GPT-5.4 working as an AI chemist that raised reaction yields. The two announcements arrive as pharmaceutical teams search for metrics that reflect day-to-day research utility, positioning LifeSciBench as a potential yardstick for the industry.
Inside LifeSciBench's 750-task design
LifeSciBench is an expert-graded benchmark designed to measure an AI model's ability to perform real-world life science tasks. It consists of 750 free-response challenges that require multi-step reasoning and interpretation of scientific data, offering a more practical evaluation than traditional, knowledge-based tests.
The tasks span seven critical research workflows, including Evidence Handling and Design and Optimization. According to industry reports, a significant number of PhD authors created the challenges, which were then validated by numerous reviewers. The benchmark tests practical skills by requiring models to generate original, free-response answers. Many tasks demand multi-step reasoning, and a significant portion incorporate scientific artifacts such as assay data, so the benchmark evaluates AI on realistic workflows from target identification to regulatory documentation.
GPT-5.4 in the lab: improving a Chan-Lam coupling
In a separate demonstration, OpenAI integrated GPT-5.4 with a high-throughput laboratory to function as an AI chemist. Tasked with optimizing a challenging Chan-Lam C-N coupling reaction, the model proposed novel oxidants and solvent systems, significantly increasing the average reaction yield. The share of reactions achieving higher yields also improved substantially. This closed-loop process, spanning literature review, experimentation, and analysis, illustrates how AI can accelerate research timelines.
Biosafety conversations sharpen
The advance of generative AI in biology also sharpens concerns around dual-use risks. Experts warn that models capable of designing biological sequences could also be used to engineer pathogens, bypassing current screening protocols. A consensus is emerging around a multi-layered control stack for these tools:
- Pre-deployment evaluation for high-consequence capabilities
- Access controls tied to model scale or output type
- Output filtering to block risky sequences or instructions
- Red-team testing to find jailbreak prompts
- Logging and watermarking for audit trails
However, recent research indicates that very few surveyed biological AI systems incorporate such safeguards, a gap policymakers are actively examining. LifeSciBench could serve as one governance mechanism, since its detailed rubric may identify a model's weaknesses before widespread release. The GPT-5.4 case highlights both the progress made and the need for oversight as AI moves from prediction to physical lab automation.
What exactly is LifeSciBench and how is it different from earlier biology benchmarks?
OpenAI's LifeSciBench is a 750-task benchmark built by numerous PhD-level scientists and validated by many reviewers. The key difference from earlier benchmarks is that it moves beyond simple knowledge recall and tests AI on the messy, multi-step workflows biotech and pharma groups actually do every day. Each task is graded against extensive rubric criteria, and a significant portion of tasks include artifacts such as figures, chemical structures or raw assay data that must be interpreted to complete the challenge. In short, the benchmark asks, "Can this model do the science, not merely know the facts?"
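To make rubric-based grading concrete, here is a minimal sketch of how per-criterion judgments might roll up into a task score and a benchmark score. The article does not describe how LifeSciBench actually aggregates its rubrics, so everything here is an assumption for illustration: the `Criterion` type, the weights, and the weighted-fraction formula are hypothetical, and the pass/fail judgments would come from expert graders.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric line item (hypothetical structure, not LifeSciBench's own)."""
    description: str
    weight: float   # relative importance assigned by the rubric author
    met: bool       # the expert grader's judgment for this answer


def task_score(criteria: Sequence[Criterion]) -> float:
    """Weighted fraction of rubric criteria that the answer satisfied."""
    total = sum(c.weight for c in criteria)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total


def benchmark_score(tasks: Sequence[Sequence[Criterion]]) -> float:
    """Unweighted mean of task scores across all graded tasks."""
    if not tasks:
        raise ValueError("no tasks to score")
    return sum(task_score(t) for t in tasks) / len(tasks)


if __name__ == "__main__":
    example = [
        Criterion("Identifies the outlier well in the assay data", 2.0, True),
        Criterion("Proposes a follow-up control experiment", 1.0, False),
        Criterion("States the conclusion with correct units", 1.0, True),
    ]
    print(f"task score: {task_score(example):.2f}")
```

Keeping each criterion separate, rather than storing only the final number, is what would let a reader see which part of an answer failed.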
How was GPT-5.4 used as an AI chemist and what did it achieve?
In a recent demonstration, GPT-5.4 was wired into a closed-loop workflow with a high-throughput lab to optimise a Chan-Lam coupling reaction important for medicinal-chemistry synthesis. The AI proposed oxidant choices, solvent pairs and substrate tweaks, then iterated on the design after each experimental read-out. According to industry reports, average yield rose substantially, and the share of reactions surpassing higher yield thresholds improved significantly. One specific suggestion, adding TEMPO as a mild oxidant, was credited as a pivotal human-readable insight.
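The propose, run, read out and iterate cycle described above can be sketched as a small loop. This is a toy illustration only, not OpenAI's workflow: `propose` is a simple neighbour-search heuristic standing in for the model, `run_experiment` is a simulated stand-in for the lab, the condition labels are placeholders, and the yield numbers are invented for the simulation rather than taken from any real chemistry.

```python
import itertools
import random

# Placeholder condition labels (hypothetical, not real reagents).
OXIDANTS = ["oxidant_A", "oxidant_B", "oxidant_C"]
SOLVENTS = ["solvent_1", "solvent_2", "solvent_3"]


def run_experiment(oxidant, solvent, rng):
    """Simulated lab read-out: returns a made-up yield in percent."""
    base = {"oxidant_A": 20.0, "oxidant_B": 35.0, "oxidant_C": 50.0}[oxidant]
    bonus = {"solvent_1": 0.0, "solvent_2": 10.0, "solvent_3": 25.0}[solvent]
    noise = rng.uniform(-3.0, 3.0)  # measurement scatter
    return max(0.0, min(100.0, base + bonus + noise))


def propose(history, rng):
    """Stand-in for the model: choose the next untested condition.

    Prefers conditions sharing an oxidant or solvent with the best result so far.
    """
    tested = {(o, s) for o, s, _ in history}
    untested = [c for c in itertools.product(OXIDANTS, SOLVENTS) if c not in tested]
    if not untested:
        return None
    if not history:
        return rng.choice(untested)
    best_oxidant, best_solvent, _ = max(history, key=lambda r: r[2])
    neighbours = [c for c in untested if c[0] == best_oxidant or c[1] == best_solvent]
    return rng.choice(neighbours or untested)


def closed_loop(rounds, seed=0):
    """Alternate proposal and experiment; return the list of (oxidant, solvent, yield)."""
    rng = random.Random(seed)
    history = []
    for _ in range(rounds):
        condition = propose(history, rng)
        if condition is None:  # search space exhausted
            break
        measured = run_experiment(condition[0], condition[1], rng)
        history.append((condition[0], condition[1], measured))
    return history


if __name__ == "__main__":
    for oxidant, solvent, measured in closed_loop(rounds=5):
        print(f"{oxidant} + {solvent}: {measured:.1f}%")
```

The point of the structure is that each proposal sees the full experimental history, which is what separates a closed loop from a one-shot screen designed up front.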
Why should drug-discovery teams care about LifeSciBench scores?
LifeSciBench gives pharma and biotech companies an expert-judged yardstick for comparing AI tools. Because tasks mirror seven research workflows, from evidence handling and data analysis to clinical translation, a high LifeSciBench score suggests an AI could help shorten discovery cycles, reduce reagent spend and minimise dead-end experiments. OpenAI has already evaluated its own specialised model, GPT-Rosalind, against this yardstick, reporting that it outscored other leading models on the full benchmark.
What dual-use and biosafety risks accompany these new capabilities?
The same generative power that designs better drugs can also generate toxins or engineer pathogens. Current safeguards, such as sequence-matching filters used by DNA synthesis providers, can be bypassed when AI designs functional but unfamiliar variants. The literature repeatedly stresses pre-deployment evaluation, tiered access controls, output filters and red-team probes rather than relying on user policies alone. While regulatory frameworks continue to evolve, consensus is growing that any model capable of autonomous wet-lab design must be treated as a biosecurity-sensitive system from day one.
How could LifeSciBench influence future AI procurement and regulation?
Pharma procurement teams can now demand LifeSciBench scores instead of marketing claims, because the benchmark offers granular rubrics showing exactly where an AI fails (e.g., reasoning vs. formatting). Regulators, in turn, can map benchmark capability tiers to release thresholds, ensuring that only models with integrated biosecurity safeguards and audit trails reach high-risk domains like medicinal chemistry. This sets the stage for a standardised, safety-first marketplace where technical merit and biosafety controls are assessed before rather than after deployment.