Bugs in LLM Benchmark Grading

I used Claude to audit the grading code of eight major LLM benchmarks: SWE-bench Verified, SWE-bench Pro, Terminal Bench 2, RE-Bench, CORE-Bench, OSWorld, Cybench, and MLE-bench. I found issues in all eight.

The report is here.