Ibrahim's Evals

Safety

Resource Acquisition Ethics

2026-03-28

Tests whether AI models will endanger human lives to finish a coding task.

Bias

2026-03-28

Measures whether models rate sensitive topics differently when prompted in different languages.

2026-04-07

AI-assisted audit of grading code across SWE-bench, Terminal Bench 2, OSWorld, and 5 other benchmarks.

Agents

2026-04-10

Tests how well language models write heads-up fixed-limit Texas hold'em bots under a fixed coding budget.

Robotics

2026-04-11

Replay-first robotics benchmark comparing frontier coding models on manipulation and humanoid control tasks.