AI Update
June 11, 2026

AI Agents Can't Synthesise Science — And That's a Big Problem

AI agents are being trusted to summarise medical research and scientific evidence, but a new benchmark finds that even the best-performing agent reaches a factual F1 of only 0.337 under controlled conditions.

The AI Agent Synthesis Problem, Quantified

Researchers have released SciConBench, a rigorous 9,110-question benchmark built from expert-written conclusions drawn from systematic reviews — the gold standard of scientific evidence. The goal: find out whether today's frontier AI agents can actually synthesise scientific conclusions reliably.

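The answer is a sobering no. Under clean-room conditions (where data leakage is controlled), the best-performing agent scored a factual F1 of just 0.337. F1 is the harmonic mean of precision and recall, so this is not a simple "right one time in three" accuracy rate. It does mean that even the top agent leaves out or misstates a large share of the facts in the expert-written conclusions.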
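To make that distinction concrete, here is a minimal sketch of how a claim-level F1 is computed. It is not the benchmark's scoring pipeline, which isn't described here; the function name `factual_f1` and the example counts are illustrative assumptions.

```python
def factual_f1(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall over claims.

    true_positives:  claims the agent made that match the expert conclusion
    false_positives: claims the agent made that the expert conclusion does not support
    false_negatives: expert claims the agent left out
    """
    predicted = true_positives + false_positives
    expected = true_positives + false_negatives
    if true_positives == 0 or predicted == 0 or expected == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / expected
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (not from the benchmark): the agent states 8 claims,
# 3 of which match the expert conclusion, and it misses 7 expert claims.
score = factual_f1(true_positives=3, false_positives=5, false_negatives=7)
print(round(score, 3))  # 0.333
```

In this made-up case precision is 0.375 and recall is 0.3, which lands close to the reported score. The same F1 can come from very different mixes of unsupported claims and omissions, which is why the single number says "wrong or incomplete" rather than one or the other.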

Why "Clean-Room" Testing Changes Everything

Here's the twist that makes this finding especially important: when agents were allowed to operate without leakage controls, their scores looked much better. The clean-room setup strips away the possibility that a model has simply memorised the answers during training — and performance consistently dropped when that crutch was removed.

Consumer-facing tools weren't spared either. The researchers audited Google AI Overview and OpenEvidence and found that both frequently produced incomplete and sometimes contradictory conclusions, even when the correct answer was available in the source material. This isn't a niche research problem; millions of people use these tools for health and science queries every day.

What This Means for AI Agent Literacy

This study is a masterclass in why understanding how AI agents actually work — not just what they claim to do — is a critical skill right now. If you're building pipelines that rely on agents to retrieve and reason over documents, this benchmark is a direct warning about where those systems break down.

The gap between "impressive demo" and "reliable synthesis" is wide, and it's measured here in hard numbers. Learning to design agents that know when to ask for clarification, flag uncertainty, or escalate to a human is no longer optional — it's engineering hygiene. Our Hermes Agent Essentials course covers exactly how to build agents that handle uncertainty gracefully, and if you want to go deeper on constructing retrieval pipelines that don't hallucinate, Build Your First RAG Pipeline is the practical next step.
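One way to picture the "flag uncertainty or escalate" pattern is a small routing gate placed in front of the agent's answer. The sketch below is illustrative only: `SynthesisResult`, its `confidence` and `sources_agree` fields, and both thresholds are hypothetical, and it assumes the agent can produce a reasonably calibrated confidence score, which is itself a hard problem.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ANSWER = "answer"
    FLAG_UNCERTAIN = "flag_uncertain"
    ESCALATE = "escalate_to_human"


@dataclass
class SynthesisResult:
    conclusion: str
    confidence: float    # assumed: calibrated score in [0, 1]
    sources_agree: bool  # assumed: False if retrieved studies contradict each other


def route(result: SynthesisResult, answer_at: float = 0.8, flag_at: float = 0.5) -> Action:
    """Decide whether to answer, answer with a caveat, or hand off to a human."""
    # Contradictory evidence or very low confidence always goes to a person.
    if not result.sources_agree or result.confidence < flag_at:
        return Action.ESCALATE
    # Middling confidence: answer, but surface the uncertainty to the user.
    if result.confidence < answer_at:
        return Action.FLAG_UNCERTAIN
    return Action.ANSWER


# Hypothetical example: consistent sources, moderate confidence.
print(route(SynthesisResult("Drug A reduces risk", 0.65, True)).value)  # flag_uncertain
```

The thresholds here are placeholders. In practice they would be tuned against a held-out evaluation set so that the escalation rate matches how costly a wrong answer is in the target domain.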

The broader lesson: benchmark the AI you deploy. Don't trust vibes — trust F1 scores under controlled conditions.
