Research

LLM judges crack under post-decision challenge tests

LLM judges are stable on reruns but reversible after challenge. A new ACL paper says evals must test interaction, not just scores.

Lars Cornelissen · Jun 6, 2026