LLM Poker Benchmark

Poker Eval

Each model is asked to write a poker bot under a fixed time limit. Each bot is then evaluated over thousands of duplicated Texas hold'em hands against the same frozen benchmark opponents. The goal is to measure coding ability, strategic quality, and consistency across repeated attempts.
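One common way to duplicate hands is to derive every deal from a fixed seed, so that each bot faces exactly the same cards and card luck cannot favor one submission over another. The sketch below illustrates that idea only; it is an assumption about what "duplicated" means here, not the benchmark's actual harness, and `deal_hand` and the seed list are hypothetical names.

```python
import random

RANKS = "23456789TJQKA"
SUITS = "cdhs"

def deal_hand(seed, num_players=2):
    """Return (hole_cards, board) for one hand, fully determined by the seed.

    Hypothetical helper: the real benchmark's dealing code is not shown on this page.
    """
    deck = [rank + suit for rank in RANKS for suit in SUITS]
    # A private Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(deck)
    hole_cards = [deck[2 * i : 2 * i + 2] for i in range(num_players)]
    board = deck[2 * num_players : 2 * num_players + 5]
    return hole_cards, board

# Replaying the same seeds for every bot gives each one identical cards.
seeds = range(3)
deals_for_bot_a = [deal_hand(seed) for seed in seeds]
deals_for_bot_b = [deal_hand(seed) for seed in seeds]
assert deals_for_bot_a == deals_for_bot_b
```

With the cards held fixed, any difference in results between two bots comes from their decisions rather than from the deal.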

Methodology

What This Benchmark Measures

Leaderboard

Model Scores

Consistency

Attempt Distribution by Model

Each row shows one model's five attempt scores, and all rows share the same horizontal scale. The darker marker is the median.
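As a rough illustration of how one row could be summarized, the sketch below computes the median and the spread of five attempt scores. The numbers are made up for the example and are not benchmark results, the function name `summarize_attempts` is hypothetical, and the score unit is whatever the leaderboard reports.

```python
from statistics import median

def summarize_attempts(scores):
    """Summarize one model's attempt scores (hypothetical helper)."""
    ordered = sorted(scores)
    return {
        "median": median(ordered),           # the darker marker in the chart
        "min": ordered[0],
        "max": ordered[-1],
        "spread": ordered[-1] - ordered[0],  # wider spread means less consistent attempts
    }

# Made-up scores for illustration only.
example_scores = [12.0, -3.5, 8.0, 15.5, 7.0]
summary = summarize_attempts(example_scores)
print(summary)
```

The median is less sensitive than the mean to a single failed or unusually lucky attempt, which is one reason it suits a five-attempt comparison.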

Attempts

Attempt Table

Build Transcript

How Each Bot Was Built

Opponent Breakdown

Score Against Each Benchmark Opponent

Why The Scores Differ

What The Better Bots Are Doing

Hand Replay

Sample Match Replay