AutoBench Agentic Run 1 - April 2026

The first AutoBench run to measure agentic performance of top LLMs

Date: April 16, 2026

Version: 2026-04-16

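Track the cost efficiency of LLMs across 10 specialized domains. Costs are measured in cents per response, and each value is followed by the model's rank in that column, from #1 (cheapest) to #31 (most expensive), helping you identify the most economical models for your specific use case.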

Model	Average (All Topics)	Adaptive Replanning	API Workflow	Domain Workflow	Error Handling	Failure Recovery	Multi Step Orchestration	Parallel Execution	Parameter Complexity	Single Tool Call	Tool Selection
GPT-5.4 (xhigh)	5.82 (#31)	6.67 (#31)	6.38 (#31)	5.84 (#31)	6.13 (#31)	4.07 (#31)	5.45 (#31)	2.72 (#30)	7.23 (#31)	5.63 (#31)	9.74 (#31)
Claude Opus 4.6	2.56 (#30)	2.71 (#29)	3.04 (#30)	2.97 (#29)	2.71 (#30)	2.68 (#30)	2.58 (#30)	3.43 (#31)	2.21 (#28)	1.38 (#29)	1.71 (#30)
GPT-5.4 Mini (xhigh)	2.03 (#29)	3.09 (#30)	1.76 (#26)	1.51 (#27)	1.28 (#24)	2.60 (#29)	1.93 (#29)	1.18 (#26)	3.53 (#30)	2.45 (#30)	0.62 (#25)
Claude Sonnet 4.6	1.95 (#28)	1.85 (#27)	1.96 (#29)	3.00 (#30)	2.05 (#29)	2.14 (#28)	1.71 (#27)	2.70 (#29)	1.40 (#26)	1.08 (#27)	1.46 (#27)
Nova 2 Lite v1	1.56 (#27)	1.93 (#28)	1.81 (#27)	1.78 (#28)	2.03 (#28)	1.86 (#27)	1.73 (#28)	2.24 (#28)	0.86 (#25)	0.50 (#24)	0.61 (#24)
Grok 4.20	1.54 (#26)	1.34 (#26)	1.82 (#28)	1.41 (#26)	1.81 (#27)	1.29 (#26)	1.61 (#26)	1.16 (#25)	2.52 (#29)	1.14 (#28)	1.55 (#28)
Gemini 3.1 Pro Preview	1.33 (#25)	1.22 (#25)	1.24 (#25)	1.31 (#25)	1.41 (#26)	1.19 (#25)	1.34 (#25)	1.42 (#27)	1.65 (#27)	1.05 (#26)	1.58 (#29)
Claude Haiku 4.5	0.86 (#24)	0.81 (#24)	0.88 (#24)	1.24 (#24)	1.35 (#25)	0.61 (#24)	0.78 (#24)	0.62 (#23)	0.78 (#24)	0.67 (#25)	0.82 (#26)
GLM 5.1	0.50 (#23)	0.51 (#23)	0.53 (#23)	0.61 (#23)	0.57 (#23)	0.55 (#23)	0.47 (#22)	0.62 (#24)	0.45 (#22)	0.29 (#22)	0.31 (#22)
GPT-5.4 Nano (xhigh)	0.43 (#22)	0.46 (#22)	0.19 (#20)	0.45 (#22)	0.49 (#22)	0.34 (#22)	0.61 (#23)	0.49 (#22)	0.55 (#23)	0.43 (#23)	0.39 (#23)
Mimo V2 Pro	0.32 (#21)	0.35 (#21)	0.35 (#22)	0.35 (#21)	0.37 (#21)	0.33 (#21)	0.32 (#21)	0.37 (#21)	0.31 (#20)	0.25 (#21)	0.21 (#19)
Gemini 3 Flash Preview	0.28 (#20)	0.30 (#20)	0.27 (#21)	0.30 (#20)	0.27 (#19)	0.28 (#20)	0.25 (#20)	0.31 (#20)	0.37 (#21)	0.20 (#19)	0.25 (#20)
Qwen3.6 Plus	0.20 (#19)	0.16 (#18)	0.11 (#13)	0.22 (#19)	0.29 (#20)	0.14 (#18)	0.15 (#19)	0.19 (#19)	0.20 (#19)	0.23 (#20)	0.25 (#21)
Qwen3.5 122B A10B	0.14 (#18)	0.16 (#19)	0.12 (#14)	0.12 (#14)	0.14 (#16)	0.14 (#19)	0.11 (#13)	0.18 (#17)	0.15 (#18)	0.14 (#18)	0.16 (#18)
GLM 4.7	0.14 (#17)	0.15 (#17)	0.12 (#16)	0.16 (#18)	0.19 (#18)	0.13 (#17)	0.14 (#18)	0.19 (#18)	0.10 (#13)	0.09 (#14)	0.08 (#13)
Kimi K2.5	0.13 (#16)	0.13 (#16)	0.13 (#17)	0.14 (#17)	0.14 (#17)	0.12 (#15)	0.11 (#16)	0.15 (#16)	0.13 (#16)	0.09 (#15)	0.10 (#14)
Grok 4.1 Fast	0.12 (#15)	0.12 (#14)	0.15 (#19)	0.13 (#15)	0.11 (#15)	0.12 (#16)	0.12 (#17)	0.12 (#13)	0.12 (#14)	0.10 (#16)	0.12 (#17)
Gemini 3.1 Flash Lite Preview	0.12 (#14)	0.12 (#15)	0.12 (#15)	0.11 (#11)	0.11 (#14)	0.11 (#14)	0.09 (#11)	0.12 (#14)	0.14 (#17)	0.11 (#17)	0.11 (#15)
Qwen3.5 35B A3B	0.11 (#13)	0.12 (#13)	0.14 (#18)	0.13 (#16)	0.11 (#13)	0.10 (#13)	0.11 (#15)	0.11 (#11)	0.10 (#12)	0.08 (#13)	0.12 (#16)
Mistral Large 2512	0.10 (#12)	0.10 (#12)	0.10 (#11)	0.12 (#13)	0.11 (#11)	0.10 (#12)	0.10 (#12)	0.12 (#15)	0.08 (#10)	0.05 (#10)	0.07 (#10)
MiniMax M2.7	0.09 (#11)	0.10 (#11)	0.10 (#12)	0.11 (#12)	0.10 (#10)	0.10 (#11)	0.09 (#9)	0.11 (#12)	0.08 (#11)	0.06 (#11)	0.08 (#12)
Nemotron 3 Nano 30B A3B	0.08 (#10)	0.09 (#10)	0.08 (#10)	0.08 (#10)	0.07 (#9)	0.08 (#10)	0.09 (#10)	0.07 (#9)	0.12 (#15)	0.08 (#12)	0.07 (#11)
Nemotron 3 Super 120B A12B	0.07 (#9)	0.07 (#8)	0.05 (#6)	0.07 (#9)	0.11 (#12)	0.06 (#7)	0.11 (#14)	0.09 (#10)	0.04 (#7)	0.04 (#8)	0.05 (#9)
DeepSeek V3.2	0.06 (#8)	0.07 (#9)	0.06 (#9)	0.05 (#6)	0.06 (#8)	0.07 (#8)	0.07 (#8)	0.06 (#6)	0.05 (#9)	0.05 (#9)	0.05 (#8)
MiniMax M2.5	0.05 (#7)	0.05 (#7)	0.06 (#8)	0.07 (#8)	0.06 (#7)	0.05 (#6)	0.05 (#7)	0.06 (#8)	0.04 (#8)	0.03 (#7)	0.03 (#6)
Mistral Small 4	0.05 (#6)	0.05 (#6)	0.05 (#7)	0.06 (#7)	0.04 (#6)	0.08 (#9)	0.04 (#6)	0.06 (#7)	0.04 (#6)	0.03 (#6)	0.04 (#7)
Llama 4 Maverick	0.03 (#5)	0.02 (#4)	0.03 (#4)	0.03 (#5)	0.02 (#4)	0.03 (#5)	0.03 (#5)	0.03 (#3)	0.04 (#5)	0.03 (#5)	0.03 (#5)
Gemma 4 31B IT	0.02 (#4)	0.03 (#5)	0.03 (#5)	0.03 (#4)	0.03 (#5)	0.02 (#4)	0.03 (#4)	0.03 (#4)	0.02 (#4)	0.01 (#3)	0.02 (#4)
Gemma 4 26B A4B IT	0.02 (#3)	0.02 (#3)	0.02 (#2)	0.02 (#3)	0.02 (#3)	0.02 (#3)	0.02 (#3)	0.03 (#5)	0.01 (#2)	0.01 (#1)	0.01 (#1)
gpt-oss-120b	0.02 (#2)	0.02 (#2)	0.02 (#3)	0.02 (#2)	0.02 (#2)	0.02 (#2)	0.02 (#2)	0.02 (#2)	0.01 (#1)	0.01 (#4)	0.01 (#2)
gpt-oss-20b	0.01 (#1)	0.01 (#1)	0.01 (#1)	0.01 (#1)	0.01 (#1)	0.01 (#1)	0.01 (#1)	0.01 (#1)	0.02 (#3)	0.01 (#2)	0.02 (#3)
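
Because the table reports cost in cents per response, a rough budget for a given workload is a single multiplication. The following is a minimal sketch, not part of AutoBench: it assumes each cell has the form "value (#rank)" as shown above, that "cents" means hundredths of a dollar, and that cost scales linearly with the number of responses. The names `parse_cell`, `projected_cost_usd`, and `summarize` are hypothetical helpers introduced for illustration.

```python
import re

# Cells in the table look like "5.82 (#31)": cost in cents, then rank.
CELL = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\(#(\d+)\)\s*$")


def parse_cell(cell: str) -> tuple[float, int]:
    """Split a cell such as '5.82 (#31)' into (cents_per_response, rank)."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return float(match.group(1)), int(match.group(2))


def projected_cost_usd(cents_per_response: float, responses: int) -> float:
    """Scale a per-response cost in cents to a total in dollars (assumed unit)."""
    return cents_per_response * responses / 100.0


# Three rows copied from the table: model name and "Average (All Topics)" only.
SAMPLE_ROWS = [
    "GPT-5.4 (xhigh)\t5.82 (#31)",
    "Claude Opus 4.6\t2.56 (#30)",
    "Llama 4 Maverick\t0.03 (#5)",
]


def summarize(rows, responses=10_000):
    """Map each model to (rank, projected dollars for `responses` responses)."""
    result = {}
    for row in rows:
        model, cell = row.split("\t")
        cents, rank = parse_cell(cell)
        result[model] = (rank, round(projected_cost_usd(cents, responses), 2))
    return result


if __name__ == "__main__":
    for model, (rank, usd) in summarize(SAMPLE_ROWS).items():
        print(f"{model}: rank #{rank}, about ${usd:,.2f} per 10,000 responses")
```

The projection is only as precise as the two-decimal values in the table, so the cheapest models, which all round to 0.01 or 0.02 cents, cannot be meaningfully separated this way.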