evaluation news, stories and analysis

Funnel chart showing the enterprise AI agent evaluation gap across 157 organizations. 118 are shipping to production, 79 shipped an agent that passed internal evals but failed in production, and only 8 fully trust automated evaluation.

Enterprise AI agent evaluation gap: half ship broken agents

The AI agent evaluation gap: 50% of enterprises shipped an agent that passed internal evals, then failed in production. Only 5% fully trust automated evaluation. The gap is structural misalignment, not missing coverage.

Lars Cornelissen · Jul 17, 2026

Format Sensitivity Index showing benchmark score variance of up to 15 percentage points across JSON, XML, Markdown, and plain text prompt wrappers for four illustrative LLMs

Research

Format Sensitivity Index exposes LLM benchmark gaps

The Format Sensitivity Index measures how LLM benchmark scores shift when prompt wrappers and schema constraints change. The metric exposes a blind spot in model evaluation that developers ignore at their peril.

Lars Cornelissen · Jul 14, 2026

Bar chart of solve rates by task length showing frontier models scoring above 60 percent on short tasks and below 25 percent on long-horizon tasks, with the long-task gap at 35 percentage points. Long-Horizon-Terminal-Bench agent results.

Research

Long-Horizon-Terminal-Bench finds agents hit a long-task wall

Long-Horizon-Terminal-Bench tests AI agents on multi-step terminal tasks with dense reward grading. Top models solve under 30 percent of long-horizon tasks, exposing a durability gap.

Lars Cornelissen · Jul 13, 2026

Abstract donut for benchmark saturation showing 19 of 25 human-agent runs completed autonomously and 6 of 25 required some human help.

Research

Benchmark saturation still leaves AI agents exposed

Benchmark saturation occurs when top agents cluster at ceiling scores. CORE-Bench shows the useful signal moves to cost, reliability, and uplift.

Lars Cornelissen · Jun 26, 2026

Research

LLM recommendation bias gives famous brands an edge

LLM recommendation bias is a measurable incumbent edge: a new arXiv paper found famous skincare brands were recommended 100 percent of the time.

Lars Cornelissen · Jun 17, 2026

SentinelBench-related success data showing baseline agents at 5.6 percent on 1-hour and 2-hour monitoring tasks, while SentinelStep reached 33.3 percent at 1 hour and 38.9 percent at 2 hours.

Research

SentinelBench tests whether AI agents can wait well

SentinelBench is a 100-task benchmark for monitoring agents. Use it to test patience, latency and tool spend before you ship.

Lars Cornelissen · Jun 6, 2026

Audit then score benchmark accuracy rises from 60.8 percent in Round 0 to 80.4 percent in Round 1, 85.3 percent in Round 2, and 90.9 percent in Round 3.

Research

Audit then score turns AI ground truth into a process

Audit then score is a benchmark protocol that revises labels before grading. Amazon says it lifted expert accuracy from 60.8% to 90.9%.

Lars Cornelissen · Jun 4, 2026

Multi-agent debate effects in data cleaning: generation down 1.6 to 15.5 points, detection up 27.4 points.

Research

Multi-agent debate needs a boring data-cleaning cop

Multi-agent debate hurt generation by up to 15.5 points in data cleaning, but a grounded critic rescued detection and repair.

Lars Cornelissen · Jun 3, 2026

Heatmap grid of model performance across task categories, with brighter cells marking stronger relative scores

Research

The model leaderboard is tightening to a photo finish

The performance gap between the best AI models and the rest is collapsing. Aggregate leaderboard scores now hide more than they reveal about real strengths.

Lars Cornelissen · May 28, 2026