Open-source evaluations for LLM safety, bias, and alignment.
Tests whether AI models will endanger human lives to finish a coding task.
Measures whether models rate sensitive topics differently when prompted in different languages.
AI-assisted audit of grading code across SWE-bench, Terminal Bench 2, OSWorld, and 5 other benchmarks.
Tests how well language models write heads-up fixed-limit Texas hold'em bots under a fixed coding budget.
Replay-first robotics benchmark comparing frontier coding models on manipulation and humanoid control tasks.