Research news and analysis

Format Sensitivity Index showing benchmark score variance of up to 15 percentage points across JSON, XML, Markdown, and plain text prompt wrappers for four illustrative LLM models

Format Sensitivity Index exposes LLM benchmark gaps

The Format Sensitivity Index measures how LLM benchmark scores shift when prompt wrappers and schema constraints change. The metric exposes a blind spot in model evaluation that developers ignore at their peril.

Lars Cornelissen · Jul 14, 2026

Bar chart of solve rates by task length showing frontier models scoring above 60 percent on short tasks and below 25 percent on long-horizon tasks, with the long-task gap at 35 percentage points. Long-Horizon-Terminal-Bench agent results.

Research

Long-Horizon-Terminal-Bench finds agents hit a long-task wall

Long-Horizon-Terminal-Bench tests AI agents on multi-step terminal tasks with dense reward grading. Top models solve under 30 percent of long-horizon tasks, exposing a durability gap.

Lars Cornelissen · Jul 13, 2026

Four benchmark tasks shown as paired bars comparing automated pass/fail scores with AgentLens trajectory review scores. SWE-bench drops from 34 to 19 percent, HumanEval from 71 to 52 percent, BigCodeBench from 28 to 9 percent, and LiveCodeBench from 41 to 26 percent. The coding agent evaluation gap is widest on BigCodeBench.

Research

AgentLens cuts coding agent scores in half on review

AgentLens is a trajectory review framework for coding agent evaluation that scores the full agent process, not just whether tests pass. Agent success rates drop 30 to 60 percent under trajectory review, meaning production readiness is roughly half the benchmark headline.

Lars Cornelissen · Jul 9, 2026
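
The teaser does not say whether "30 to 60 percent" is a relative or an absolute drop. As a rough check, the sketch below assumes a relative drop and uses only the four score pairs listed in the chart description above; the function and variable names are illustrative.

```python
# Paired scores in percent, copied from the chart description:
# (automated pass/fail score, trajectory review score).
scores = {
    "SWE-bench": (34, 19),
    "HumanEval": (71, 52),
    "BigCodeBench": (28, 9),
    "LiveCodeBench": (41, 26),
}


def relative_drop(automated: float, reviewed: float) -> float:
    """Fraction of the automated score that is lost under review."""
    return (automated - reviewed) / automated


# Relative drop per benchmark, as a fraction between 0 and 1.
drops = {name: relative_drop(a, r) for name, (a, r) in scores.items()}

for name, (a, r) in scores.items():
    print(f"{name}: {a} -> {r} ({a - r} points, {drops[name]:.1%} relative)")
```

On these four pairs the relative drops run from about 27 percent (HumanEval) to about 68 percent (BigCodeBench), while the absolute drops are 15 to 19 percentage points.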

Funnel chart showing the AI drug discovery pipeline for rentosertib: 79 molecules synthesized narrowing to 1 preclinical candidate, then Phase I, Phase IIa with 71 patients, and Phase III with 320 patients. The AI-discovered drug rentosertib is the first fully AI-originated molecule to reach Phase III.

Research

AI-discovered drug rentosertib enters Phase III trial

Rentosertib, an AI-discovered TNIK inhibitor for IPF, enters Phase III with 320 patients. It is the first fully AI-originated drug to reach late-stage trials.

Lars Cornelissen · Jul 8, 2026

Bar chart comparing DAT percentile scores: baseline at 81.3, activation steering at 93.9, CreativityNeuro at 94.1, and prompting alone at 87.9 across six open-weight LLMs. CreativityNeuro also reduces top-10 word concentration by 10.2 percentage points.

Research

CreativityNeuro steers LLM weights to break mode collapse

CreativityNeuro is a data-free weight steering method that improves LLM divergent thinking by up to 14 percentile points and reduces mode collapse. It works by scaling creativity-specific weights identified through contrastive prompts, no fine-tuning required.

Lars Cornelissen · Jul 3, 2026

GRPO standard deviation chart showing baseline extreme gradient mass at 13.9 percent, GRPO at 24.7 percent, and silent prompts at 44 percent for G=8.

Research

GRPO standard deviation is the reasoning RL dial now

GRPO standard deviation is the update-size dial: Bay and Yearick show 44% of Big-Math prompts go silent at group size 8.

Lars Cornelissen · Jul 2, 2026
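
To see why a prompt can "go silent", here is a minimal sketch that assumes the standard GRPO group-normalized advantage with binary rewards. The teaser does not spell out the formula, so the function names, the `eps` term, and the reading of "silent" as a zero-variance group are assumptions for illustration.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_silent(rewards):
    """A group with identical rewards has zero std, so every advantage is zero."""
    return pstdev(rewards) == 0


def silent_probability(p, group_size):
    """Chance that all samples agree when each succeeds independently with probability p."""
    return p ** group_size + (1 - p) ** group_size


all_wrong = [0] * 8          # group size 8, no sample solves the prompt
one_right = [1] + [0] * 7    # a single success in the group

print(is_silent(all_wrong), grpo_advantages(all_wrong)[0])
print(is_silent(one_right), round(grpo_advantages(one_right)[0], 3))
```

In the all-wrong group every advantage is zero, so the prompt contributes no gradient. In the one-success group the small standard deviation scales the lone success up to an advantage of about 2.65, which is the sense in which the standard deviation acts as an update-size dial.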

Abstract donut for benchmark saturation showing 19 of 25 human-agent runs completed autonomously and 6 of 25 requiring some human help.

Research

Benchmark saturation still leaves AI agents exposed

Benchmark saturation is when top agents cluster at ceiling scores. CORE-Bench shows the useful signal moves to cost, reliability, and uplift.

Lars Cornelissen · Jun 26, 2026

Coding agent rewards chart showing clean resolved rising from 40.22 percent to 60.53 percent and hacked resolved falling from 28.57 percent to 0.56 percent.

Research

Coding agent rewards hit a harder verification horizon

Coding agent rewards are now a verification problem: Qwen cut hacked SWE passes from 28.57% to 0.56% with monitoring.

Lars Cornelissen · Jun 26, 2026

RIFT-Bench deterministic attack success rates by domain: wild 41.9 percent, finance 40.5 percent, medical 31.3 percent, personal assistant 30.1 percent, and travel 27.4 percent.

Research

RIFT-Bench tests agents where prompts cannot reach

RIFT-Bench is a dynamic agentic red-teaming benchmark that found attacks activated in 78.9% to 89.3% of tested agent runs.

Lars Cornelissen · Jun 24, 2026

Agentic clinical RAG acceptance rates: 80 percent minimum per type, 96.5 percent overall across 7,326 judgments, and 99 percent maximum per type.

Research

Agentic clinical RAG gets 96.5% clinician acceptance

Agentic clinical RAG won clinician acceptance on 96.5% of judgments in one lymphoma registry study, but the edge came from citations and constraints.

Lars Cornelissen · Jun 20, 2026

Bar chart of reported diffusion language model speeds showing Mercury Coder Small at 737 tokens per second, Mercury Coder Mini at 1109, Gemini Diffusion at 1479, and Seed Diffusion Preview at 2146.

Research

Diffusion language models meet a messy benchmark tax

Diffusion language models generate by denoising full sequences, but an 8-model, 8-benchmark study shows deployment depends on inference choices.

Lars Cornelissen · Jun 20, 2026

MosaicLeaks PA-DR results showing base Qwen3-4B at 48.7 percent strict chain success and 34.0 percent leakage, task-only training at 59.3 percent success and 51.7 percent leakage, and PA-DR at 58.7 percent success and 9.9 percent leakage.

Research

MosaicLeaks shows research agents leak query secrets

MosaicLeaks is a privacy benchmark for research agents. It shows PA-DR cut answer or full-information leakage from 34.0% to 9.9%.

Lars Cornelissen · Jun 19, 2026

DivInit agentic search lifts Qwen3 multi-hop pass@4 from 25.0 to 27.8 percent at 1.7B, 29.5 to 36.6 percent at 4B, and 38.6 to 46.0 percent at 8B.

Research

DivInit makes agentic search threads less wasteful

DivInit is a training-free way to seed agentic search. It adds up to about 7 pass@4 points by diversifying the first query.

Lars Cornelissen · Jun 17, 2026
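
For readers unfamiliar with pass@4, the sketch below uses the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of sampled attempts and c the number that succeed. The teaser does not say how the DivInit numbers were computed, so treating them as this estimator is an assumption, and the sample counts in the usage lines are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts drawn from n is correct.

    n: total sampled attempts, c: correct attempts among them, k: draw size.
    """
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any draw of k contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 8 search threads sampled, 1 reaches the right answer.
print(pass_at_k(n=8, c=1, k=4))
# With exactly 4 threads, pass@4 is simply "did any thread succeed".
print(pass_at_k(n=4, c=0, k=4), pass_at_k(n=4, c=1, k=4))
```

Because pass@4 only needs one of four threads to succeed, near-duplicate threads add little, which is consistent with the teaser's point about diversifying the first query.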

Research

LLM recommendation bias gives famous brands an edge

LLM recommendation bias is a measurable incumbent edge: a new arXiv paper found famous skincare brands were recommended 100 percent of the time.

Lars Cornelissen · Jun 17, 2026

WorkBench agent outcomes showing GPT-4 at 43 percent correct and 26 percent harmful actions, Claude Opus 4.8 at 89 percent correct and 2.5 percent harmful, and the WorkBench repo reporting Claude Fable 5 at 92 percent correct and 1.9 percent harmful.

Research

WorkBench agents close the workplace reliability gap

WorkBench agents now solve 89 percent of workplace tasks with 2.5 percent harmful actions, changing the risk math for builders.

Lars Cornelissen · Jun 15, 2026

Attack selection reduces measured AI agent safety at a 1% audit budget: start policy lowers safety by 20 percentage points in BashArena and 20 in LinuxArena, while stop policy lowers safety by 20 and 28.

Research

Attack selection makes AI agent safety look too high

Attack selection lets AI agents choose when to cheat. A new control eval finds safety drops up to 28 percentage points at 1% auditing.

Lars Cornelissen · Jun 8, 2026

CrowdMath model results show next post prediction at 88 percent while post role classification reaches only 42 percent macro F1.

Research

CrowdMath exposes the math gap AI agents still miss

CrowdMath is a dataset of 164 annotated math research chains. Use it to test whether models understand progress, not just answers.

Lars Cornelissen · Jun 8, 2026

Covert LLM agents won 118 deltas across 13 listed active Reddit accounts, with two accounts at 12 deltas and three accounts at 6 deltas.

Research

Covert LLM agents need persuasion audits, not labels

Covert LLM agents used identity, authority and bias triggers on Reddit. Treat persuasion as a safety surface, not a label problem.

Lars Cornelissen · Jun 6, 2026

LLM judge cost comparison showing humans at $300 per 1,000 examples, GPT-4 Turbo at $4.3, Claude at $3.3, and ChatGPT at $0.8.

Research

LLM judges crack under post-decision challenge tests

LLM judges are stable on reruns but reversible after challenge. A new ACL paper says evals must test interaction, not just scores.

Lars Cornelissen · Jun 6, 2026

SentinelBench-related success data showing baseline agents at 5.6 percent on 1-hour and 2-hour monitoring tasks, while SentinelStep reached 33.3 percent at 1 hour and 38.9 percent at 2 hours.

Research

SentinelBench tests whether AI agents can wait well

SentinelBench is a 100-task benchmark for monitoring agents. Use it to test patience, latency and tool spend before you ship.

Lars Cornelissen · Jun 6, 2026

Audit then score benchmark accuracy rises from 60.8 percent in Round 0 to 80.4 percent in Round 1, 85.3 percent in Round 2, and 90.9 percent in Round 3.

Research

Audit then score turns AI ground truth into a process

Audit then score is a benchmark protocol that revises labels before grading. Amazon says it lifted expert accuracy from 60.8% to 90.9%.

Lars Cornelissen · Jun 4, 2026

Donut visualization for generalist agents in data curation showing a scaffolded agent using 10 percent of the published baseline data budget and avoiding 90 percent.

Research

Generalist agents hit data curation’s hard budget wall

Generalist agents can run data curation loops, but one benchmark shows they need scaffolds to beat baselines at a 10 percent data budget.

Lars Cornelissen · Jun 4, 2026

Multi-agent debate effects in data cleaning: generation down 1.6 to 15.5 points, detection up 27.4 points.

Research

Multi-agent debate needs a boring data-cleaning cop

Multi-agent debate hurt generation by up to 15.5 points in data cleaning, but a grounded critic rescued detection and repair.

Lars Cornelissen · Jun 3, 2026

Unit distance proof comparison of known asymptotic exponents: linear grid at 1, Sawin lower bound at 1.014, upper bound at 1.333

Research

Unit distance proof moves AI past clever math demos

The unit distance proof gives OpenAI an AI math win: n^1.014 unit pairs, with humans still doing the verification work.

Lars Cornelissen · Jun 3, 2026

Line chart of model capability over time bending upward after 2024, showing acceleration rather than a plateau

Research

AI capability stopped slowing down. It sped up after 2024

Model capability is improving about 15.5 ECI a year and the rate rose after early 2024. The expected plateau never arrived, which complicates every roadmap built around one.

Lars Cornelissen · May 31, 2026

Heatmap grid of model performance across task categories, with brighter cells marking stronger relative scores

Research

The model leaderboard is tightening to a photo finish

The performance gap between the best AI models and the rest is collapsing. Aggregate leaderboard scores now hide more than they reveal about real strengths.

Lars Cornelissen · May 28, 2026

Scatter plot of model capability against training compute on a log scale with a fitted regression line

Research

The scaling curve nobody wants to extrapolate

Training compute for frontier models has grown about 5 times a year since 2020. The scatter looks clean, but every data point sits to the left of the real question.

Lars Cornelissen · May 22, 2026

Two rising lines comparing frontier training cost growth at 3.5x per year against compute efficiency gains at 3x per year

Research

Training costs rise 3.5x a year. Efficiency is the only brake

Frontier training costs climb about 3.5 times a year while algorithms get 3 times more efficient. The two trends are racing, and the gap decides who can still compete.

Lars Cornelissen · May 21, 2026
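
As a back-of-the-envelope illustration of how the two quoted rates diverge, the sketch below compounds 3.5x and 3x per year. It assumes both rates stay constant and compound independently, which the teaser does not claim; the function name and the three-year horizon are arbitrary choices for illustration.

```python
COST_GROWTH = 3.5        # frontier training cost multiplier per year (from the teaser)
EFFICIENCY_GROWTH = 3.0  # algorithmic efficiency multiplier per year (from the teaser)


def compound(rate: float, years: int) -> float:
    """Total multiplier after compounding a yearly rate for a number of years."""
    return rate ** years


for years in (1, 2, 3):
    cost = compound(COST_GROWTH, years)
    efficiency = compound(EFFICIENCY_GROWTH, years)
    # Ratio above 1 means cost growth is outpacing efficiency gains.
    print(f"year {years}: cost x{cost:.2f}, efficiency x{efficiency:.0f}, "
          f"ratio {cost / efficiency:.2f}")
```

Under these assumptions the ratio grows by about 17 percent a year, so the gap between the two trends widens slowly at first and then compounds.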

Anthropic J-space reveals LLM thoughts, but not fully
