
AutoBench Agentic Run 1 - April 2026

The first AutoBench run to measure agentic performance of top LLMs

Latest
Date: April 19, 2026
Version: 2026-04-19
Models: 32
New Models: 1

Run data

| Model | Average (All Topics) | Adaptive Replanning | API Workflow | Domain Workflow | Error Handling | Failure Recovery | Multi Step Orchestration | Parallel Execution | Parameter Complexity | Single Tool Call | Tool Selection |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.02 (#3) | 0.01 (#1) | 0.02 (#3) |
|  | 0.02 (#2) | 0.02 (#2) | 0.02 (#3) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.01 (#1) | 0.01 (#4) | 0.01 (#2) |
|  | 0.02 (#3) | 0.02 (#3) | 0.02 (#2) | 0.02 (#3) | 0.02 (#3) | 0.02 (#3) | 0.02 (#3) | 0.03 (#5) | 0.01 (#2) | 0.01 (#2) | 0.01 (#1) |
|  | 0.02 (#4) | 0.03 (#5) | 0.03 (#5) | 0.03 (#4) | 0.03 (#5) | 0.02 (#4) | 0.03 (#4) | 0.03 (#4) | 0.02 (#4) | 0.01 (#3) | 0.02 (#4) |
|  | 0.03 (#5) | 0.02 (#4) | 0.03 (#4) | 0.03 (#5) | 0.02 (#4) | 0.03 (#5) | 0.03 (#5) | 0.03 (#3) | 0.04 (#5) | 0.03 (#5) | 0.03 (#5) |
|  | 0.05 (#6) | 0.05 (#6) | 0.05 (#7) | 0.06 (#7) | 0.04 (#6) | 0.08 (#10) | 0.04 (#6) | 0.06 (#7) | 0.04 (#6) | 0.03 (#6) | 0.04 (#7) |
|  | 0.05 (#7) | 0.05 (#7) | 0.06 (#8) | 0.06 (#8) | 0.06 (#8) | 0.06 (#6) | 0.05 (#7) | 0.07 (#8) | 0.05 (#8) | 0.03 (#7) | 0.03 (#6) |
|  | 0.06 (#8) | 0.07 (#9) | 0.07 (#9) | 0.05 (#6) | 0.06 (#7) | 0.07 (#8) | 0.07 (#9) | 0.06 (#6) | 0.05 (#9) | 0.04 (#9) | 0.05 (#9) |
|  | 0.06 (#9) | 0.06 (#8) | 0.05 (#6) | 0.07 (#9) | 0.09 (#10) | 0.06 (#7) | 0.07 (#8) | 0.09 (#10) | 0.05 (#7) | 0.03 (#8) | 0.05 (#8) |
|  | 0.08 (#10) | 0.08 (#10) | 0.08 (#10) | 0.08 (#10) | 0.07 (#9) | 0.08 (#9) | 0.09 (#10) | 0.07 (#9) | 0.13 (#15) | 0.07 (#12) | 0.07 (#11) |
|  | 0.10 (#11) | 0.10 (#11) | 0.10 (#12) | 0.12 (#12) | 0.10 (#11) | 0.10 (#12) | 0.10 (#12) | 0.12 (#12) | 0.09 (#11) | 0.06 (#11) | 0.08 (#12) |
|  | 0.10 (#12) | 0.10 (#12) | 0.10 (#11) | 0.12 (#13) | 0.11 (#12) | 0.10 (#11) | 0.10 (#13) | 0.12 (#15) | 0.08 (#10) | 0.05 (#10) | 0.07 (#10) |
|  | 0.11 (#13) | 0.12 (#13) | 0.14 (#18) | 0.13 (#16) | 0.11 (#13) | 0.10 (#13) | 0.11 (#16) | 0.11 (#11) | 0.10 (#12) | 0.08 (#13) | 0.12 (#16) |
|  | 0.12 (#14) | 0.12 (#15) | 0.12 (#14) | 0.11 (#11) | 0.11 (#14) | 0.11 (#14) | 0.09 (#11) | 0.12 (#14) | 0.14 (#17) | 0.11 (#17) | 0.11 (#15) |
|  | 0.12 (#15) | 0.12 (#14) | 0.15 (#19) | 0.13 (#15) | 0.11 (#15) | 0.12 (#16) | 0.12 (#17) | 0.12 (#13) | 0.12 (#14) | 0.10 (#16) | 0.12 (#17) |
|  | 0.13 (#16) | 0.13 (#16) | 0.13 (#17) | 0.14 (#17) | 0.14 (#17) | 0.11 (#15) | 0.11 (#15) | 0.15 (#16) | 0.13 (#16) | 0.10 (#15) | 0.10 (#14) |
|  | 0.14 (#17) | 0.15 (#17) | 0.12 (#16) | 0.16 (#18) | 0.19 (#18) | 0.13 (#17) | 0.14 (#19) | 0.19 (#18) | 0.10 (#13) | 0.09 (#14) | 0.08 (#13) |
|  | 0.14 (#18) | 0.16 (#19) | 0.12 (#13) | 0.12 (#14) | 0.14 (#16) | 0.14 (#19) | 0.11 (#14) | 0.18 (#17) | 0.15 (#18) | 0.14 (#18) | 0.16 (#18) |
|  | 0.19 (#19) | 0.16 (#18) | 0.12 (#15) | 0.22 (#19) | 0.29 (#20) | 0.14 (#18) | 0.13 (#18) | 0.19 (#19) | 0.18 (#19) | 0.23 (#20) | 0.25 (#21) |
|  | 0.28 (#20) | 0.31 (#20) | 0.26 (#21) | 0.29 (#20) | 0.26 (#19) | 0.30 (#20) | 0.27 (#20) | 0.31 (#20) | 0.38 (#21) | 0.20 (#19) | 0.24 (#20) |
|  | 0.33 (#21) | 0.36 (#21) | 0.34 (#22) | 0.35 (#21) | 0.39 (#21) | 0.34 (#21) | 0.34 (#21) | 0.38 (#21) | 0.30 (#20) | 0.26 (#21) | 0.21 (#19) |
|  | 0.43 (#22) | 0.46 (#22) | 0.19 (#20) | 0.45 (#22) | 0.49 (#22) | 0.34 (#22) | 0.61 (#24) | 0.49 (#22) | 0.55 (#23) | 0.43 (#23) | 0.39 (#23) |
|  | 0.51 (#23) | 0.54 (#23) | 0.53 (#23) | 0.62 (#23) | 0.57 (#23) | 0.57 (#23) | 0.46 (#22) | 0.65 (#24) | 0.44 (#22) | 0.31 (#22) | 0.31 (#22) |
|  | 0.84 (#24) | 0.87 (#24) | 0.83 (#24) | 1.25 (#24) | 1.37 (#25) | 0.61 (#24) | 0.61 (#23) | 0.64 (#23) | 0.82 (#24) | 0.67 (#25) | 0.59 (#24) |
|  | 1.33 (#25) | 1.28 (#25) | 1.25 (#25) | 1.30 (#25) | 1.40 (#26) | 1.31 (#26) | 1.39 (#25) | 1.42 (#27) | 1.59 (#27) | 0.98 (#26) | 1.51 (#28) |
|  | 1.52 (#26) | 1.32 (#26) | 1.82 (#28) | 1.39 (#26) | 1.64 (#27) | 1.26 (#25) | 1.77 (#28) | 1.27 (#26) | 2.40 (#30) | 1.16 (#28) | 1.46 (#27) |
|  | 1.56 (#27) | 1.93 (#28) | 1.81 (#27) | 1.78 (#28) | 2.03 (#28) | 1.86 (#27) | 1.73 (#27) | 2.24 (#28) | 0.86 (#25) | 0.50 (#24) | 0.61 (#25) |
|  | 1.98 (#28) | 1.85 (#27) | 1.98 (#29) | 3.10 (#30) | 2.05 (#29) | 2.18 (#28) | 1.70 (#26) | 2.73 (#29) | 1.50 (#26) | 1.11 (#27) | 1.54 (#29) |
|  | 2.03 (#29) | 3.09 (#31) | 1.76 (#26) | 1.51 (#27) | 1.28 (#24) | 2.60 (#29) | 1.93 (#29) | 1.18 (#25) | 3.53 (#31) | 2.45 (#31) | 0.62 (#26) |
|  | 2.58 (#30) | 2.77 (#29) | 3.06 (#30) | 3.05 (#29) | 2.74 (#30) | 2.74 (#30) | 2.57 (#30) | 3.52 (#31) | 2.12 (#29) | 1.41 (#30) | 1.57 (#30) |
|  | 2.72 (#31) | 2.97 (#30) | 3.20 (#31) | 3.96 (#31) | 3.09 (#31) | 3.11 (#31) | 3.06 (#31) | 3.34 (#30) | 1.60 (#28) | 1.21 (#29) | 1.85 (#31) |
|  | 6.33 (#32) | 6.74 (#32) | 6.30 (#32) | 6.59 (#32) | 6.41 (#32) | 4.38 (#32) | 4.93 (#32) | 3.93 (#32) | 9.76 (#32) | 6.26 (#32) | 9.97 (#32) |
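
Each cell in the run data pairs a value with that model's rank for the topic, for example `0.08 (#10)`. For readers who want to work with the numbers, the following is a minimal Python sketch of splitting one row into (value, rank) pairs. It assumes only the cell format visible above; the names `CELL` and `parse_row` are hypothetical and not part of AutoBench. The pattern ignores whatever separates the cells, so it should work whether or not a delimiter is present.

```python
import re

# Matches one cell such as "0.08 (#10)": a decimal value, then a rank written as "(#N)".
CELL = re.compile(r"(\d+\.\d+) \(#(\d+)\)")

def parse_row(row: str) -> list[tuple[float, int]]:
    """Return a (value, rank) pair for every cell found in one table row."""
    return [(float(value), int(rank)) for value, rank in CELL.findall(row)]

# Row ranked #6 on the average column, copied from the table.
row = ("0.05 (#6) | 0.05 (#6) | 0.05 (#7) | 0.06 (#7) | 0.04 (#6) | 0.08 (#10) | "
       "0.04 (#6) | 0.06 (#7) | 0.04 (#6) | 0.03 (#6) | 0.04 (#7)")
cells = parse_row(row)
print(len(cells))  # 11 cells: the average plus ten topics
print(cells[5])    # (0.08, 10), the Failure Recovery cell
```

The first pair in each row is the Average (All Topics) column, and the remaining ten follow the topic order of the header.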