
AutoBench Agentic Run 1 - April 2026

The first AutoBench run to measure agentic performance of top LLMs

Past
Date: April 16, 2026
Version: 2026-04-16
Models: 31
New Models: 22

Run data

| Model | Average (All Topics) | Adaptive Replanning | API Workflow | Domain Workflow | Error Handling | Failure Recovery | Multi Step Orchestration | Parallel Execution | Parameter Complexity | Single Tool Call | Tool Selection |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | 5.82 (#31) | 6.67 (#31) | 6.38 (#31) | 5.84 (#31) | 6.13 (#31) | 4.07 (#31) | 5.45 (#31) | 2.72 (#30) | 7.23 (#31) | 5.63 (#31) | 9.74 (#31) |
| | 2.56 (#30) | 2.71 (#29) | 3.04 (#30) | 2.97 (#29) | 2.71 (#30) | 2.68 (#30) | 2.58 (#30) | 3.43 (#31) | 2.21 (#28) | 1.38 (#29) | 1.71 (#30) |
| | 2.03 (#29) | 3.09 (#30) | 1.76 (#26) | 1.51 (#27) | 1.28 (#24) | 2.60 (#29) | 1.93 (#29) | 1.18 (#26) | 3.53 (#30) | 2.45 (#30) | 0.62 (#25) |
| | 1.95 (#28) | 1.85 (#27) | 1.96 (#29) | 3.00 (#30) | 2.05 (#29) | 2.14 (#28) | 1.71 (#27) | 2.70 (#29) | 1.40 (#26) | 1.08 (#27) | 1.46 (#27) |
| | 1.56 (#27) | 1.93 (#28) | 1.81 (#27) | 1.78 (#28) | 2.03 (#28) | 1.86 (#27) | 1.73 (#28) | 2.24 (#28) | 0.86 (#25) | 0.50 (#24) | 0.61 (#24) |
| | 1.54 (#26) | 1.34 (#26) | 1.82 (#28) | 1.41 (#26) | 1.81 (#27) | 1.29 (#26) | 1.61 (#26) | 1.16 (#25) | 2.52 (#29) | 1.14 (#28) | 1.55 (#28) |
| | 1.33 (#25) | 1.22 (#25) | 1.24 (#25) | 1.31 (#25) | 1.41 (#26) | 1.19 (#25) | 1.34 (#25) | 1.42 (#27) | 1.65 (#27) | 1.05 (#26) | 1.58 (#29) |
| | 0.86 (#24) | 0.81 (#24) | 0.88 (#24) | 1.24 (#24) | 1.35 (#25) | 0.61 (#24) | 0.78 (#24) | 0.62 (#23) | 0.78 (#24) | 0.67 (#25) | 0.82 (#26) |
| | 0.50 (#23) | 0.51 (#23) | 0.53 (#23) | 0.61 (#23) | 0.57 (#23) | 0.55 (#23) | 0.47 (#22) | 0.62 (#24) | 0.45 (#22) | 0.29 (#22) | 0.31 (#22) |
| | 0.43 (#22) | 0.46 (#22) | 0.19 (#20) | 0.45 (#22) | 0.49 (#22) | 0.34 (#22) | 0.61 (#23) | 0.49 (#22) | 0.55 (#23) | 0.43 (#23) | 0.39 (#23) |
| | 0.32 (#21) | 0.35 (#21) | 0.35 (#22) | 0.35 (#21) | 0.37 (#21) | 0.33 (#21) | 0.32 (#21) | 0.37 (#21) | 0.31 (#20) | 0.25 (#21) | 0.21 (#19) |
| | 0.28 (#20) | 0.30 (#20) | 0.27 (#21) | 0.30 (#20) | 0.27 (#19) | 0.28 (#20) | 0.25 (#20) | 0.31 (#20) | 0.37 (#21) | 0.20 (#19) | 0.25 (#20) |
| | 0.20 (#19) | 0.16 (#18) | 0.11 (#13) | 0.22 (#19) | 0.29 (#20) | 0.14 (#18) | 0.15 (#19) | 0.19 (#19) | 0.20 (#19) | 0.23 (#20) | 0.25 (#21) |
| | 0.14 (#18) | 0.16 (#19) | 0.12 (#14) | 0.12 (#14) | 0.14 (#16) | 0.14 (#19) | 0.11 (#13) | 0.18 (#17) | 0.15 (#18) | 0.14 (#18) | 0.16 (#18) |
| | 0.14 (#17) | 0.15 (#17) | 0.12 (#16) | 0.16 (#18) | 0.19 (#18) | 0.13 (#17) | 0.14 (#18) | 0.19 (#18) | 0.10 (#13) | 0.09 (#14) | 0.08 (#13) |
| | 0.13 (#16) | 0.13 (#16) | 0.13 (#17) | 0.14 (#17) | 0.14 (#17) | 0.12 (#15) | 0.11 (#16) | 0.15 (#16) | 0.13 (#16) | 0.09 (#15) | 0.10 (#14) |
| | 0.12 (#15) | 0.12 (#14) | 0.15 (#19) | 0.13 (#15) | 0.11 (#15) | 0.12 (#16) | 0.12 (#17) | 0.12 (#13) | 0.12 (#14) | 0.10 (#16) | 0.12 (#17) |
| | 0.12 (#14) | 0.12 (#15) | 0.12 (#15) | 0.11 (#11) | 0.11 (#14) | 0.11 (#14) | 0.09 (#11) | 0.12 (#14) | 0.14 (#17) | 0.11 (#17) | 0.11 (#15) |
| | 0.11 (#13) | 0.12 (#13) | 0.14 (#18) | 0.13 (#16) | 0.11 (#13) | 0.10 (#13) | 0.11 (#15) | 0.11 (#11) | 0.10 (#12) | 0.08 (#13) | 0.12 (#16) |
| | 0.10 (#12) | 0.10 (#12) | 0.10 (#11) | 0.12 (#13) | 0.11 (#11) | 0.10 (#12) | 0.10 (#12) | 0.12 (#15) | 0.08 (#10) | 0.05 (#10) | 0.07 (#10) |
| | 0.09 (#11) | 0.10 (#11) | 0.10 (#12) | 0.11 (#12) | 0.10 (#10) | 0.10 (#11) | 0.09 (#9) | 0.11 (#12) | 0.08 (#11) | 0.06 (#11) | 0.08 (#12) |
| | 0.08 (#10) | 0.09 (#10) | 0.08 (#10) | 0.08 (#10) | 0.07 (#9) | 0.08 (#10) | 0.09 (#10) | 0.07 (#9) | 0.12 (#15) | 0.08 (#12) | 0.07 (#11) |
| | 0.07 (#9) | 0.07 (#8) | 0.05 (#6) | 0.07 (#9) | 0.11 (#12) | 0.06 (#7) | 0.11 (#14) | 0.09 (#10) | 0.04 (#7) | 0.04 (#8) | 0.05 (#9) |
| | 0.06 (#8) | 0.07 (#9) | 0.06 (#9) | 0.05 (#6) | 0.06 (#8) | 0.07 (#8) | 0.07 (#8) | 0.06 (#6) | 0.05 (#9) | 0.05 (#9) | 0.05 (#8) |
| | 0.05 (#7) | 0.05 (#7) | 0.06 (#8) | 0.07 (#8) | 0.06 (#7) | 0.05 (#6) | 0.05 (#7) | 0.06 (#8) | 0.04 (#8) | 0.03 (#7) | 0.03 (#6) |
| | 0.05 (#6) | 0.05 (#6) | 0.05 (#7) | 0.06 (#7) | 0.04 (#6) | 0.08 (#9) | 0.04 (#6) | 0.06 (#7) | 0.04 (#6) | 0.03 (#6) | 0.04 (#7) |
| | 0.03 (#5) | 0.02 (#4) | 0.03 (#4) | 0.03 (#5) | 0.02 (#4) | 0.03 (#5) | 0.03 (#5) | 0.03 (#3) | 0.04 (#5) | 0.03 (#5) | 0.03 (#5) |
| | 0.02 (#4) | 0.03 (#5) | 0.03 (#5) | 0.03 (#4) | 0.03 (#5) | 0.02 (#4) | 0.03 (#4) | 0.03 (#4) | 0.02 (#4) | 0.01 (#3) | 0.02 (#4) |
| | 0.02 (#3) | 0.02 (#3) | 0.02 (#2) | 0.02 (#3) | 0.02 (#3) | 0.02 (#3) | 0.02 (#3) | 0.03 (#5) | 0.01 (#2) | 0.01 (#1) | 0.01 (#1) |
| | 0.02 (#2) | 0.02 (#2) | 0.02 (#3) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.01 (#1) | 0.01 (#4) | 0.01 (#2) |
| | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.02 (#3) | 0.01 (#2) | 0.02 (#3) |