Research
Multi-agent debate needs a boring data-cleaning cop
Multi-agent debate hurt generation by up to 15.5 points in data cleaning, but a grounded critic rescued detection and repair.
2 stories tagged evaluation.
The performance gap between the best AI models and the rest is collapsing. Aggregate leaderboard scores now hide more than they reveal about real strengths.