Eval QA

We audited the grading code of 8 major LLM benchmarks and found issues in every one of them.

All Benchmarks

Notable Bugs

All Instances
