I used Claude to audit the grading code of 8 major LLM benchmarks: SWE-bench Verified, SWE-bench Pro, Terminal Bench 2, RE-Bench, CORE-Bench, OSWorld, Cybench, and MLE-bench. I found issues in all eight.
The report is here.
April 9, 2026