We audited the grading code of 8 major LLM benchmarks and found issues throughout all of them.
Quality audit of 8 LLM benchmarks across 1830 tasks
This report was generated using AI (Claude). There may be mistakes in our audit.