AutoBench Agentic Run 1 - April 2026

The first AutoBench run to measure agentic performance of top LLMs

Past
Date: April 16, 2026
Version: 2026-04-16
Models: 31
New Models: 22

Run data

| Model | Average (All Topics) | Adaptive Replanning | API Workflow | Domain Workflow | Error Handling | Failure Recovery | Multi Step Orchestration | Parallel Execution | Parameter Complexity | Single Tool Call | Tool Selection |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | 2.27 (#31) | 2.22 (#31) | 2.16 (#31) | 2.24 (#31) | 2.28 (#30) | 2.15 (#31) | 2.41 (#31) | 2.2 (#31) | 2.29 (#29) | 2.39 (#31) | 2.44 (#31) |
| | 2.53 (#30) | 2.27 (#30) | 2.59 (#29) | 2.62 (#27) | 2.5 (#25) | 2.58 (#29) | 2.75 (#23) | 2.9 (#19) | 2.11 (#31) | 2.43 (#30) | 2.51 (#30) |
| | 2.55 (#29) | 2.65 (#23) | 2.59 (#28) | 2.56 (#29) | 2.34 (#29) | 2.64 (#26) | 2.43 (#30) | 2.56 (#29) | 2.53 (#11) | 2.65 (#26) | 2.59 (#26) |
| | 2.62 (#28) | 2.49 (#27) | 2.92 (#12) | 2.61 (#28) | 2.43 (#28) | 2.56 (#30) | 2.59 (#29) | 2.91 (#18) | 2.44 (#22) | 2.74 (#23) | 2.55 (#29) |
| | 2.63 (#27) | 2.68 (#22) | 2.66 (#26) | 2.39 (#30) | 2.51 (#24) | 2.87 (#22) | 2.67 (#26) | 2.64 (#27) | 2.51 (#15) | 2.75 (#21) | 2.57 (#28) |
| | 2.65 (#26) | 2.48 (#28) | 2.91 (#13) | 2.65 (#26) | 2.47 (#27) | 2.76 (#24) | 2.67 (#27) | 2.53 (#30) | 2.46 (#21) | 2.82 (#19) | 2.81 (#13) |
| | 2.66 (#25) | 2.45 (#29) | 2.66 (#27) | 2.8 (#21) | 2.53 (#22) | 2.63 (#27) | 2.88 (#16) | 2.65 (#25) | 2.5 (#18) | 2.87 (#14) | 2.66 (#20) |
| | 2.69 (#24) | 2.64 (#24) | 2.74 (#23) | 2.66 (#25) | 2.53 (#23) | 2.72 (#25) | 2.67 (#28) | 2.79 (#24) | 2.6 (#8) | 2.86 (#16) | 2.64 (#25) |
| | 2.7 (#23) | 2.6 (#26) | 2.69 (#25) | 2.78 (#22) | 2.64 (#17) | 2.6 (#28) | 3 (#10) | 2.6 (#28) | 2.72 (#4) | 2.74 (#22) | 2.78 (#14) |
| | 2.7 (#22) | 2.86 (#14) | 2.77 (#19) | 2.93 (#16) | 2.59 (#19) | 2.8 (#23) | 2.8 (#21) | 2.79 (#22) | 2.21 (#30) | 2.64 (#28) | 2.64 (#23) |
| | 2.72 (#21) | 2.68 (#21) | 2.57 (#30) | 2.71 (#24) | 2.49 (#26) | 3.17 (#9) | 2.88 (#17) | 2.92 (#17) | 2.41 (#25) | 2.64 (#27) | 2.59 (#27) |
| | 2.76 (#20) | 2.83 (#15) | 2.98 (#11) | 2.82 (#19) | 2.68 (#16) | 2.97 (#19) | 2.68 (#25) | 2.64 (#26) | 2.39 (#27) | 2.9 (#13) | 2.65 (#22) |
| | 2.78 (#19) | 2.9 (#12) | 2.76 (#20) | 2.94 (#14) | 2.27 (#31) | 2.91 (#21) | 2.74 (#24) | 2.79 (#23) | 2.56 (#9) | 2.96 (#6) | 2.96 (#8) |
| | 2.82 (#18) | 2.71 (#20) | 2.9 (#15) | 2.76 (#23) | 2.69 (#15) | 3.06 (#15) | 2.91 (#15) | 3.14 (#10) | 2.53 (#10) | 2.68 (#25) | 2.72 (#17) |
| | 2.83 (#17) | 2.9 (#13) | 2.71 (#24) | 3 (#12) | 2.58 (#20) | 2.97 (#18) | 2.81 (#20) | 3 (#14) | 2.35 (#28) | 2.9 (#12) | 3.08 (#2) |
| | 2.84 (#16) | 2.63 (#25) | 2.75 (#21) | 2.99 (#13) | 2.61 (#18) | 3.13 (#10) | 2.82 (#19) | 3.09 (#11) | 2.51 (#14) | 2.81 (#20) | 2.99 (#6) |
| | 2.84 (#15) | 2.73 (#19) | 2.78 (#18) | 2.9 (#17) | 2.71 (#13) | 2.93 (#20) | 2.98 (#13) | 3.07 (#12) | 2.41 (#24) | 2.87 (#15) | 3.06 (#4) |
| | 2.89 (#14) | 3.07 (#7) | 3.06 (#6) | 3.11 (#8) | 2.7 (#14) | 3.02 (#17) | 3.05 (#8) | 2.95 (#15) | 2.72 (#3) | 2.64 (#29) | 2.68 (#19) |
| | 2.9 (#13) | 2.91 (#11) | 3.18 (#3) | 3.11 (#7) | 2.73 (#12) | 3.13 (#11) | 2.8 (#22) | 3.15 (#9) | 2.5 (#17) | 2.7 (#24) | 2.73 (#16) |
| | 2.91 (#12) | 3.34 (#1) | 3.04 (#7) | 2.89 (#18) | 2.55 (#21) | 3.23 (#6) | 3.12 (#5) | 2.79 (#21) | 2.39 (#26) | 2.94 (#8) | 2.87 (#11) |
| | 2.92 (#11) | 3.06 (#8) | 2.84 (#16) | 2.94 (#15) | 3.07 (#3) | 3.22 (#7) | 2.99 (#11) | 2.88 (#20) | 2.47 (#19) | 2.82 (#18) | 2.65 (#21) |
| | 2.92 (#10) | 2.81 (#17) | 2.99 (#10) | 2.81 (#20) | 2.83 (#9) | 3.07 (#14) | 3.06 (#7) | 2.94 (#16) | 2.61 (#7) | 3.09 (#2) | 2.89 (#10) |
| | 2.92 (#9) | 2.82 (#16) | 2.75 (#22) | 3.02 (#11) | 2.81 (#10) | 3.19 (#8) | 3.14 (#4) | 3.22 (#6) | 2.46 (#20) | 2.94 (#9) | 2.64 (#24) |
| | 2.92 (#8) | 2.75 (#18) | 2.9 (#14) | 3.12 (#6) | 2.86 (#7) | 3.04 (#16) | 3.26 (#2) | 3.18 (#8) | 2.52 (#13) | 2.93 (#10) | 2.75 (#15) |
| | 2.99 (#7) | 3.17 (#4) | 3.11 (#5) | 3.06 (#9) | 2.81 (#11) | 3.25 (#5) | 2.98 (#12) | 3.26 (#4) | 2.43 (#23) | 2.86 (#17) | 2.84 (#12) |
| | 2.99 (#6) | 3 (#9) | 3.01 (#9) | 3.04 (#10) | 2.83 (#8) | 3.11 (#12) | 2.96 (#14) | 3.19 (#7) | 2.74 (#2) | 2.98 (#4) | 3.04 (#5) |
| | 3.02 (#5) | 3.1 (#5) | 3.03 (#8) | 3.27 (#2) | 3.03 (#5) | 3.07 (#13) | 3.28 (#1) | 3.02 (#13) | 2.5 (#16) | 2.92 (#11) | 3.08 (#3) |
| | 3.06 (#4) | 2.97 (#10) | 3.13 (#4) | 3.17 (#5) | 3.07 (#4) | 3.3 (#3) | 3.04 (#9) | 3.26 (#5) | 2.62 (#6) | 3.04 (#3) | 2.72 (#18) |
| | 3.1 (#3) | 3.18 (#3) | 3.22 (#2) | 3.27 (#3) | 2.9 (#6) | 3.26 (#4) | 2.86 (#18) | 3.28 (#3) | 2.52 (#12) | 3.19 (#1) | 3.18 (#1) |
| | 3.13 (#2) | 3.07 (#6) | 2.82 (#17) | 3.36 (#1) | 3.12 (#1) | 3.62 (#1) | 3.06 (#6) | 3.33 (#2) | 2.75 (#1) | 2.95 (#7) | 2.98 (#7) |
| | 3.17 (#1) | 3.23 (#2) | 3.25 (#1) | 3.17 (#4) | 3.1 (#2) | 3.61 (#2) | 3.14 (#3) | 3.37 (#1) | 2.67 (#5) | 2.97 (#5) | 2.95 (#9) |