
AutoBench Agentic Run 1 - April 2026

The first AutoBench run to measure the agentic performance of top LLMs

Date: April 19, 2026
Version: 2026-04-19
Models: 32
New Models: 1

Run data

| Model | Average (All Topics) | Adaptive Replanning | API Workflow | Domain Workflow | Error Handling | Failure Recovery | Multi-Step Orchestration | Parallel Execution | Parameter Complexity | Single Tool Call | Tool Selection |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | 3.3 (#1) | 3.61 (#1) | 3.4 (#1) | 3.41 (#1) | 3.35 (#1) | 3.66 (#3) | 3.59 (#2) | 3.33 (#6) | 2.58 (#11) | 3 (#8) | 2.96 (#11) |
| | 3.24 (#2) | 3.32 (#3) | 3.3 (#2) | 3.18 (#8) | 3.17 (#2) | 3.69 (#2) | 3.36 (#4) | 3.46 (#1) | 2.75 (#4) | 3.01 (#6) | 2.98 (#9) |
| | 3.21 (#3) | 3.21 (#6) | 3.23 (#3) | 3.39 (#3) | 2.98 (#6) | 3.41 (#5) | 3.45 (#3) | 3.35 (#3) | 2.68 (#8) | 3.25 (#1) | 3.24 (#2) |
| | 3.16 (#4) | 3.1 (#9) | 2.91 (#16) | 3.32 (#5) | 3.14 (#4) | 3.69 (#1) | 3.19 (#10) | 3.3 (#7) | 2.77 (#3) | 2.93 (#14) | 3.07 (#5) |
| | 3.15 (#5) | 3.07 (#11) | 3.16 (#5) | 3.36 (#4) | 3.12 (#5) | 3.43 (#4) | 3.17 (#11) | 3.4 (#2) | 2.67 (#9) | 3.08 (#4) | 2.76 (#16) |
| | 3.13 (#6) | 3.16 (#8) | 3.05 (#9) | 3.39 (#2) | 2.95 (#8) | 3.14 (#15) | 3.66 (#1) | 3.19 (#10) | 2.69 (#7) | 2.98 (#9) | 3.33 (#1) |
| | 3.1 (#7) | 3.31 (#4) | 3.15 (#6) | 3.17 (#9) | 2.96 (#7) | 3.32 (#6) | 3.15 (#12) | 3.34 (#4) | 2.57 (#14) | 3.09 (#3) | 2.86 (#13) |
| | 3.07 (#8) | 3.17 (#7) | 3.06 (#8) | 3.1 (#11) | 2.89 (#11) | 3.19 (#12) | 3.09 (#15) | 3.24 (#8) | 2.82 (#1) | 3.03 (#5) | 3.14 (#3) |
| | 3.02 (#9) | 2.95 (#15) | 2.94 (#14) | 3.29 (#6) | 2.93 (#9) | 3.14 (#14) | 3.33 (#5) | 3.34 (#5) | 2.57 (#13) | 3.01 (#7) | 2.75 (#19) |
| | 3.01 (#10) | 3.03 (#12) | 3.22 (#4) | 3.19 (#7) | 2.83 (#12) | 3.31 (#7) | 2.99 (#17) | 3.19 (#11) | 2.54 (#16) | 2.94 (#12) | 2.75 (#18) |
| | 3 (#11) | 2.97 (#14) | 3.01 (#11) | 2.84 (#21) | 2.91 (#10) | 3.15 (#13) | 3.26 (#8) | 3.04 (#16) | 2.74 (#5) | 3.09 (#2) | 3.04 (#7) |
| | 2.99 (#12) | 3.07 (#10) | 2.99 (#12) | 2.96 (#16) | 3.16 (#3) | 3.24 (#8) | 3.22 (#9) | 2.98 (#18) | 2.54 (#17) | 2.82 (#22) | 2.73 (#20) |
| | 2.98 (#13) | 3.21 (#5) | 3.1 (#7) | 3.16 (#10) | 2.76 (#14) | 3.1 (#17) | 3.3 (#6) | 3.07 (#14) | 2.74 (#6) | 2.75 (#27) | 2.75 (#17) |
| | 2.92 (#14) | 2.82 (#21) | 2.75 (#23) | 3.02 (#13) | 2.81 (#13) | 3.19 (#11) | 3.14 (#13) | 3.22 (#9) | 2.46 (#22) | 2.94 (#13) | 2.64 (#26) |
| | 2.91 (#15) | 3.34 (#2) | 3.03 (#10) | 2.89 (#20) | 2.54 (#25) | 3.23 (#9) | 3.12 (#14) | 2.79 (#23) | 2.39 (#27) | 2.94 (#11) | 2.87 (#12) |
| | 2.84 (#17) | 2.63 (#27) | 2.75 (#24) | 2.99 (#15) | 2.61 (#19) | 3.13 (#16) | 2.81 (#24) | 3.09 (#13) | 2.51 (#20) | 2.81 (#24) | 2.99 (#8) |
| | 2.84 (#16) | 2.73 (#24) | 2.78 (#21) | 2.89 (#19) | 2.71 (#15) | 2.93 (#21) | 2.97 (#18) | 3.07 (#15) | 2.41 (#26) | 2.87 (#19) | 3.06 (#6) |
| | 2.82 (#18) | 2.7 (#25) | 2.9 (#18) | 2.75 (#24) | 2.69 (#16) | 3.06 (#18) | 2.91 (#20) | 3.13 (#12) | 2.53 (#19) | 2.68 (#30) | 2.72 (#22) |
| | 2.82 (#19) | 2.9 (#16) | 2.71 (#28) | 3 (#14) | 2.58 (#23) | 2.97 (#19) | 2.81 (#25) | 2.99 (#17) | 2.35 (#30) | 2.9 (#15) | 3.08 (#4) |
| | 2.8 (#20) | 2.87 (#18) | 2.79 (#20) | 2.9 (#18) | 2.69 (#17) | 2.67 (#28) | 3.28 (#7) | 2.63 (#30) | 2.82 (#2) | 2.78 (#25) | 2.79 (#15) |
| | 2.79 (#21) | 2.97 (#13) | 2.8 (#19) | 3.06 (#12) | 2.61 (#20) | 2.91 (#24) | 2.94 (#19) | 2.82 (#22) | 2.38 (#29) | 2.76 (#26) | 2.73 (#21) |
| | 2.79 (#22) | 2.74 (#22) | 2.74 (#25) | 2.67 (#25) | 2.6 (#21) | 3.23 (#10) | 3 (#16) | 2.88 (#21) | 2.43 (#25) | 2.88 (#17) | 2.59 (#29) |
| | 2.78 (#23) | 2.9 (#17) | 2.76 (#22) | 2.95 (#17) | 2.27 (#32) | 2.91 (#23) | 2.73 (#26) | 2.78 (#24) | 2.56 (#15) | 2.95 (#10) | 2.96 (#10) |
| | 2.76 (#24) | 2.83 (#20) | 2.97 (#13) | 2.82 (#22) | 2.68 (#18) | 2.97 (#20) | 2.68 (#27) | 2.64 (#29) | 2.38 (#28) | 2.89 (#16) | 2.65 (#25) |
| | 2.71 (#25) | 2.83 (#19) | 2.67 (#29) | 2.5 (#31) | 2.55 (#24) | 2.93 (#22) | 2.87 (#22) | 2.7 (#26) | 2.53 (#18) | 2.84 (#21) | 2.68 (#23) |
| | 2.69 (#26) | 2.64 (#26) | 2.74 (#26) | 2.66 (#27) | 2.53 (#27) | 2.72 (#27) | 2.67 (#28) | 2.78 (#25) | 2.6 (#10) | 2.86 (#20) | 2.64 (#27) |
| | 2.66 (#27) | 2.45 (#30) | 2.66 (#30) | 2.8 (#23) | 2.53 (#26) | 2.63 (#30) | 2.88 (#21) | 2.65 (#28) | 2.5 (#21) | 2.87 (#18) | 2.66 (#24) |
| | 2.65 (#28) | 2.49 (#29) | 2.9 (#17) | 2.65 (#28) | 2.47 (#28) | 2.76 (#25) | 2.67 (#29) | 2.53 (#31) | 2.46 (#23) | 2.81 (#23) | 2.81 (#14) |
| | 2.64 (#29) | 2.73 (#23) | 2.73 (#27) | 2.65 (#29) | 2.42 (#30) | 2.73 (#26) | 2.6 (#30) | 2.66 (#27) | 2.58 (#12) | 2.73 (#29) | 2.63 (#28) |
| | 2.62 (#30) | 2.49 (#28) | 2.92 (#15) | 2.61 (#30) | 2.43 (#29) | 2.56 (#31) | 2.59 (#31) | 2.91 (#20) | 2.44 (#24) | 2.74 (#28) | 2.55 (#31) |
| | 2.61 (#31) | 2.37 (#31) | 2.6 (#31) | 2.67 (#26) | 2.59 (#22) | 2.64 (#29) | 2.85 (#23) | 2.95 (#19) | 2.25 (#32) | 2.53 (#31) | 2.59 (#30) |
| | 2.27 (#32) | 2.22 (#32) | 2.16 (#32) | 2.24 (#32) | 2.28 (#31) | 2.15 (#32) | 2.41 (#32) | 2.2 (#32) | 2.29 (#31) | 2.38 (#32) | 2.44 (#32) |
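
Each cell in the run data pairs a score with that model's rank in the column, written as `score (#rank)`. For readers who want to work with the numbers, the following is a minimal sketch of how one row could be split into (score, rank) pairs. The names `Cell` and `parse_row` are hypothetical helpers introduced here for illustration and are not part of AutoBench. The sample row is the first row of the table above.

```python
import re
from typing import List, NamedTuple


class Cell(NamedTuple):
    """One table cell: a score and the model's rank in that column."""
    score: float
    rank: int


# Matches cells such as "3.3 (#1)" or "3 (#8)".
# Hypothetical pattern, inferred only from the cell format shown in the table.
_CELL_PATTERN = re.compile(r"(\d+(?:\.\d+)?) \(#(\d+)\)")


def parse_row(line: str) -> List[Cell]:
    """Extract every 'score (#rank)' pair from a row, in column order.

    Works whether the cells are run together or separated by delimiters.
    """
    return [Cell(float(s), int(r)) for s, r in _CELL_PATTERN.findall(line)]


# First row of the run data, copied verbatim from the table.
first_row = (
    "3.3 (#1)3.61 (#1)3.4 (#1)3.41 (#1)3.35 (#1)3.66 (#3)"
    "3.59 (#2)3.33 (#6)2.58 (#11)3 (#8)2.96 (#11)"
)
cells = parse_row(first_row)

# The first cell is the "Average (All Topics)" column.
print(cells[0])
print(len(cells))
```

Because the pattern keys on the `(#rank)` suffix, the same function should also handle rows where the cells are separated by pipes or tabs.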