Ibrahim's Evals

Open-source evaluations for LLM safety, bias, and alignment.

Safety: Resource Acquisition Ethics (2026-03-28)
Tests whether AI models will endanger human lives to finish a coding task.

Bias: Cross-Language Topic Bias (2026-03-28)
Measures whether models rate sensitive topics differently when prompted in different languages.

Meta: Benchmark Grading QA (2026-04-07)
AI-assisted audit of grading code across SWE-bench, Terminal Bench 2, OSWorld, and five other benchmarks.

Agents: LLM Poker Benchmark (2026-04-10)
Tests how well language models write heads-up fixed-limit Texas hold'em bots under a fixed coding budget.

Robotics: Robot Eval (2026-04-11)
Replay-first robotics benchmark comparing frontier coding models on manipulation and humanoid control tasks.