AutoBench Run 5 - December 2025

Latest AutoBench run, covering Gpt 5.2, Claude Opus 4.5, DeepSeek 3.2 Speciale, and more.

Date: December 16, 2025

Version: 2025-12-16

Run data

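Comparison of AutoBench scores with other popular benchmarks. Models are sorted by AutoBench score. The number in parentheses is the model's rank on that benchmark among the models listed here; a dash means no score is available.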

Model	AutoBench	LMArena	AAI Index	MMLU-Pro
Gpt 5.2 Pro	4.48 (#1)	-	72 (#2)	0.87 (#4)
Gpt 5.2	4.43 (#2)	-	-	-
Gemini 3 pro preview	4.41 (#3)	1492 (#1)	73 (#1)	0.9 (#1)
Claude opus 4.5	4.39 (#4)	1470 (#3)	70 (#3)	0.9 (#2)
Gpt 5.1	4.38 (#5)	1457 (#4)	70 (#4)	0.87 (#5)
Kimi k2 thinking	4.32 (#6)	1429 (#7)	67 (#5)	0.85 (#9)
Claude sonnet 4.5	4.3 (#7)	1450 (#6)	63 (#9)	0.88 (#3)
Gemini 2.5 pro	4.29 (#8)	1451 (#5)	60 (#12)	0.86 (#7)
Gpt 5 mini	4.29 (#9)	1392 (#18)	64 (#7)	0.84 (#12)
Grok 4.1 fast thinking	4.21 (#10)	-	64 (#8)	0.85 (#10)
Grok 4	4.2 (#11)	1478 (#2)	65 (#6)	0.87 (#6)
Qwen3 235B A22B Thinking 2507	4.2 (#12)	1397 (#16)	57 (#14)	0.84 (#13)
Gpt oss 120b	4.18 (#13)	1352 (#23)	61 (#10)	0.81 (#22)
Gemini 2.5 flash	4.17 (#14)	1408 (#14)	51 (#21)	0.84 (#14)
Claude haiku 4.5	4.17 (#15)	1402 (#15)	55 (#16)	0.76 (#28)
DeepSeek 3.2 Speciale	4.14 (#16)	1418 (#9)	59 (#13)	0.86 (#8)
GLM 4.6	4.13 (#17)	1425 (#8)	56 (#15)	0.83 (#16)
DeepSeek R1 0528	4.12 (#18)	1395 (#17)	52 (#18)	0.85 (#11)
Kimi K2 0905	4.11 (#19)	1416 (#10)	50 (#23)	0.82 (#18)
DeepSeek v3.2	4.11 (#20)	1414 (#12)	52 (#19)	0.84 (#15)
Gpt 5 nano	4.06 (#21)	1339 (#26)	51 (#22)	0.77 (#27)
Nova 2 lite v1	4.06 (#22)	1334 (#27)	47 (#25)	0.81 (#23)
Qwen3 next 80b a3b thinking	4.03 (#23)	1367 (#22)	54 (#17)	0.82 (#19)
Minimax m2	3.99 (#24)	1345 (#24)	61 (#11)	0.82 (#20)
Qwen3 235b a22b 2507	3.98 (#25)	1374 (#20)	45 (#26)	0.83 (#17)
Gemini 2.5 flash lite	3.95 (#26)	1378 (#19)	40 (#28)	0.81 (#24)
Mistral large 2512	3.94 (#27)	1415 (#11)	38 (#29)	0.81 (#25)
Grok 4.1 fast	3.88 (#28)	-	38 (#30)	0.74 (#30)
GLM 4.5 Air	3.86 (#29)	1370 (#21)	49 (#24)	0.82 (#21)
Mistral medium 3.1	3.81 (#30)	1411 (#13)	35 (#32)	0.68 (#33)
Llama 3.3 nemotron super 49b v1.5	3.78 (#31)	1340 (#25)	45 (#27)	0.81 (#26)
Gpt oss 20b	3.78 (#32)	1318 (#28)	52 (#20)	0.75 (#29)
Ministral 8b 2512	3.57 (#33)	-	28 (#34)	0.64 (#34)
Nemotron nano 9b v2	3.5 (#34)	-	37 (#31)	0.74 (#31)
Nova Premier v1	3.47 (#35)	-	32 (#33)	0.73 (#32)
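
One way to quantify how closely AutoBench agrees with another benchmark is a rank correlation over the models that have a score on both. The sketch below is illustrative only and is not part of AutoBench: the helper names (parse_cell, parse_table, average_ranks, spearman) are hypothetical, the cell format "value (#rank)" is assumed from the table above, and only the first few rows are copied in to keep the example short.

```python
import re
from statistics import mean

# A few rows copied from the table above. Fields are tab-separated;
# "\t" escapes are used so the sample survives copy and paste.
ROWS = """\
Gpt 5.2 Pro\t4.48 (#1)\t-\t72 (#2)\t0.87 (#4)
Gemini 3 pro preview\t4.41 (#3)\t1492 (#1)\t73 (#1)\t0.9 (#1)
Claude opus 4.5\t4.39 (#4)\t1470 (#3)\t70 (#3)\t0.9 (#2)
Gpt 5.1\t4.38 (#5)\t1457 (#4)\t70 (#4)\t0.87 (#5)
Kimi k2 thinking\t4.32 (#6)\t1429 (#7)\t67 (#5)\t0.85 (#9)
"""

COLUMNS = ("AutoBench", "LMArena", "AAI Index", "MMLU-Pro")
CELL = re.compile(r"^(\d+(?:\.\d+)?) \(#\d+\)$")


def parse_cell(cell):
    """Return the score in a 'value (#rank)' cell, or None for '-'."""
    cell = cell.strip()
    if cell == "-":
        return None
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return float(match.group(1))


def parse_table(text):
    """Parse tab-separated rows into dicts keyed by column name."""
    rows = []
    for line in text.strip().splitlines():
        name, *cells = line.split("\t")
        row = {"Model": name}
        row.update(zip(COLUMNS, map(parse_cell, cells)))
        rows.append(row)
    return rows


def average_ranks(values):
    """Rank values ascending from 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    rows = parse_table(ROWS)
    # Keep only models scored on both benchmarks being compared.
    pairs = [(r["AutoBench"], r["AAI Index"]) for r in rows
             if r["AutoBench"] is not None and r["AAI Index"] is not None]
    rho = spearman([a for a, _ in pairs], [b for _, b in pairs])
    print(f"Spearman rho, AutoBench vs AAI Index: {rho:.3f}")
```

Running it on the full table rather than this five-row sample gives a more meaningful figure. Ranks are averaged across ties because several models share the same rounded score.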