AutoBench Agentic Run 1 - April 2026
The first AutoBench run to measure agentic performance of top LLMs
Date: April 16, 2026
Version: 2026-04-16
Models: 31
New Models: 22
Run data
| Model | Average (All Topics) | Adaptive Replanning | API Workflow | Domain Workflow | Error Handling | Failure Recovery | Multi Step Orchestration | Parallel Execution | Parameter Complexity | Single Tool Call | Tool Selection |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.01 (#1) | 0.02 (#3) | 0.01 (#2) | 0.02 (#3) |
| | 0.02 (#2) | 0.02 (#2) | 0.02 (#3) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.02 (#2) | 0.01 (#1) | 0.01 (#4) | 0.01 (#2) |
| | 0.02 (#3) | 0.02 (#3) | 0.02 (#2) | 0.02 (#3) | 0.02 (#3) | 0.02 (#3) | 0.02 (#3) | 0.03 (#5) | 0.01 (#2) | 0.01 (#1) | 0.01 (#1) |
| | 0.02 (#4) | 0.03 (#5) | 0.03 (#5) | 0.03 (#4) | 0.03 (#5) | 0.02 (#4) | 0.03 (#4) | 0.03 (#4) | 0.02 (#4) | 0.01 (#3) | 0.02 (#4) |
| | 0.03 (#5) | 0.02 (#4) | 0.03 (#4) | 0.03 (#5) | 0.02 (#4) | 0.03 (#5) | 0.03 (#5) | 0.03 (#3) | 0.04 (#5) | 0.03 (#5) | 0.03 (#5) |
| | 0.05 (#6) | 0.05 (#6) | 0.05 (#7) | 0.06 (#7) | 0.04 (#6) | 0.08 (#9) | 0.04 (#6) | 0.06 (#7) | 0.04 (#6) | 0.03 (#6) | 0.04 (#7) |
| | 0.05 (#7) | 0.05 (#7) | 0.06 (#8) | 0.07 (#8) | 0.06 (#7) | 0.05 (#6) | 0.05 (#7) | 0.06 (#8) | 0.04 (#8) | 0.03 (#7) | 0.03 (#6) |
| | 0.06 (#8) | 0.07 (#9) | 0.06 (#9) | 0.05 (#6) | 0.06 (#8) | 0.07 (#8) | 0.07 (#8) | 0.06 (#6) | 0.05 (#9) | 0.05 (#9) | 0.05 (#8) |
| | 0.07 (#9) | 0.07 (#8) | 0.05 (#6) | 0.07 (#9) | 0.11 (#12) | 0.06 (#7) | 0.11 (#14) | 0.09 (#10) | 0.04 (#7) | 0.04 (#8) | 0.05 (#9) |
| | 0.08 (#10) | 0.09 (#10) | 0.08 (#10) | 0.08 (#10) | 0.07 (#9) | 0.08 (#10) | 0.09 (#10) | 0.07 (#9) | 0.12 (#15) | 0.08 (#12) | 0.07 (#11) |
| | 0.09 (#11) | 0.10 (#11) | 0.10 (#12) | 0.11 (#12) | 0.10 (#10) | 0.10 (#11) | 0.09 (#9) | 0.11 (#12) | 0.08 (#11) | 0.06 (#11) | 0.08 (#12) |
| | 0.10 (#12) | 0.10 (#12) | 0.10 (#11) | 0.12 (#13) | 0.11 (#11) | 0.10 (#12) | 0.10 (#12) | 0.12 (#15) | 0.08 (#10) | 0.05 (#10) | 0.07 (#10) |
| | 0.11 (#13) | 0.12 (#13) | 0.14 (#18) | 0.13 (#16) | 0.11 (#13) | 0.10 (#13) | 0.11 (#15) | 0.11 (#11) | 0.10 (#12) | 0.08 (#13) | 0.12 (#16) |
| | 0.12 (#14) | 0.12 (#15) | 0.12 (#15) | 0.11 (#11) | 0.11 (#14) | 0.11 (#14) | 0.09 (#11) | 0.12 (#14) | 0.14 (#17) | 0.11 (#17) | 0.11 (#15) |
| | 0.12 (#15) | 0.12 (#14) | 0.15 (#19) | 0.13 (#15) | 0.11 (#15) | 0.12 (#16) | 0.12 (#17) | 0.12 (#13) | 0.12 (#14) | 0.10 (#16) | 0.12 (#17) |
| | 0.13 (#16) | 0.13 (#16) | 0.13 (#17) | 0.14 (#17) | 0.14 (#17) | 0.12 (#15) | 0.11 (#16) | 0.15 (#16) | 0.13 (#16) | 0.09 (#15) | 0.10 (#14) |
| | 0.14 (#17) | 0.15 (#17) | 0.12 (#16) | 0.16 (#18) | 0.19 (#18) | 0.13 (#17) | 0.14 (#18) | 0.19 (#18) | 0.10 (#13) | 0.09 (#14) | 0.08 (#13) |
| | 0.14 (#18) | 0.16 (#19) | 0.12 (#14) | 0.12 (#14) | 0.14 (#16) | 0.14 (#19) | 0.11 (#13) | 0.18 (#17) | 0.15 (#18) | 0.14 (#18) | 0.16 (#18) |
| | 0.20 (#19) | 0.16 (#18) | 0.11 (#13) | 0.22 (#19) | 0.29 (#20) | 0.14 (#18) | 0.15 (#19) | 0.19 (#19) | 0.20 (#19) | 0.23 (#20) | 0.25 (#21) |
| | 0.28 (#20) | 0.30 (#20) | 0.27 (#21) | 0.30 (#20) | 0.27 (#19) | 0.28 (#20) | 0.25 (#20) | 0.31 (#20) | 0.37 (#21) | 0.20 (#19) | 0.25 (#20) |
| | 0.32 (#21) | 0.35 (#21) | 0.35 (#22) | 0.35 (#21) | 0.37 (#21) | 0.33 (#21) | 0.32 (#21) | 0.37 (#21) | 0.31 (#20) | 0.25 (#21) | 0.21 (#19) |
| | 0.43 (#22) | 0.46 (#22) | 0.19 (#20) | 0.45 (#22) | 0.49 (#22) | 0.34 (#22) | 0.61 (#23) | 0.49 (#22) | 0.55 (#23) | 0.43 (#23) | 0.39 (#23) |
| | 0.50 (#23) | 0.51 (#23) | 0.53 (#23) | 0.61 (#23) | 0.57 (#23) | 0.55 (#23) | 0.47 (#22) | 0.62 (#24) | 0.45 (#22) | 0.29 (#22) | 0.31 (#22) |
| | 0.86 (#24) | 0.81 (#24) | 0.88 (#24) | 1.24 (#24) | 1.35 (#25) | 0.61 (#24) | 0.78 (#24) | 0.62 (#23) | 0.78 (#24) | 0.67 (#25) | 0.82 (#26) |
| | 1.33 (#25) | 1.22 (#25) | 1.24 (#25) | 1.31 (#25) | 1.41 (#26) | 1.19 (#25) | 1.34 (#25) | 1.42 (#27) | 1.65 (#27) | 1.05 (#26) | 1.58 (#29) |
| | 1.54 (#26) | 1.34 (#26) | 1.82 (#28) | 1.41 (#26) | 1.81 (#27) | 1.29 (#26) | 1.61 (#26) | 1.16 (#25) | 2.52 (#29) | 1.14 (#28) | 1.55 (#28) |
| | 1.56 (#27) | 1.93 (#28) | 1.81 (#27) | 1.78 (#28) | 2.03 (#28) | 1.86 (#27) | 1.73 (#28) | 2.24 (#28) | 0.86 (#25) | 0.50 (#24) | 0.61 (#24) |
| | 1.95 (#28) | 1.85 (#27) | 1.96 (#29) | 3.00 (#30) | 2.05 (#29) | 2.14 (#28) | 1.71 (#27) | 2.70 (#29) | 1.40 (#26) | 1.08 (#27) | 1.46 (#27) |
| | 2.03 (#29) | 3.09 (#30) | 1.76 (#26) | 1.51 (#27) | 1.28 (#24) | 2.60 (#29) | 1.93 (#29) | 1.18 (#26) | 3.53 (#30) | 2.45 (#30) | 0.62 (#25) |
| | 2.56 (#30) | 2.71 (#29) | 3.04 (#30) | 2.97 (#29) | 2.71 (#30) | 2.68 (#30) | 2.58 (#30) | 3.43 (#31) | 2.21 (#28) | 1.38 (#29) | 1.71 (#30) |
| | 5.82 (#31) | 6.67 (#31) | 6.38 (#31) | 5.84 (#31) | 6.13 (#31) | 4.07 (#31) | 5.45 (#31) | 2.72 (#30) | 7.23 (#31) | 5.63 (#31) | 9.74 (#31) |