AutoBench Agentic Run 1 - April 2026

The first AutoBench run to measure agentic performance of top LLMs

Past
Date: April 16, 2026
Version: 2026-04-16
Models: 31
New Models: 22

Run data

AutoBench    AAI Index   Terminal-bench   GDPval-AA   Tau2-Bench Telecom
3.17 (#1)    68 (#1)     46 (#5)          56 (#3)     92 (#11)
3.13 (#2)    63 (#4)     53 (#3)          58 (#2)     76 (#18)
3.1 (#3)     59 (#8)     54 (#2)          41 (#9)     96 (#2)
3.06 (#4)    67 (#3)     43 (#7)          52 (#4)     98 (#1)
3.02 (#5)    68 (#2)     58 (#1)          59 (#1)     87 (#14)
2.99 (#6)    63 (#5)     41 (#9)          46 (#7)     95 (#5)
2.99 (#7)    62 (#6)     44 (#6)          43 (#8)     95 (#7)
2.92 (#9)    55 (#12)    32 (#17)         35 (#13)    96 (#3)
2.92 (#10)   54 (#13)    38 (#12)         27 (#20)    93 (#10)
2.92 (#8)    59 (#10)    35 (#15)         39 (#10)    96 (#4)
2.92 (#11)   40 (#21)    27 (#20)         34 (#14)    55 (#25)
2.91 (#12)   59 (#9)     52 (#4)          46 (#6)     83 (#16)
2.9 (#13)    61 (#7)     39 (#11)         51 (#5)     85 (#15)
2.89 (#14)   50 (#16)    39 (#10)         35 (#12)    80 (#17)
2.84 (#15)   53 (#15)    31 (#18)         31 (#18)    89 (#13)
2.84 (#16)   49 (#17)    24 (#24)         27 (#19)    93 (#9)
2.83 (#17)   44 (#19)    27 (#21)         21 (#25)    94 (#8)
2.82 (#18)   26 (#27)    24 (#22)         21 (#24)    31 (#29)
2.78 (#19)   48 (#18)    42 (#8)          34 (#15)    76 (#19)
2.76 (#20)   38 (#23)    24 (#23)         22 (#23)    66 (#22)
2.72 (#21)   56 (#11)    35 (#16)         34 (#16)    95 (#6)
2.7 (#22)    41 (#20)    36 (#14)         31 (#17)    60 (#23)
2.7 (#23)    40 (#22)    29 (#19)         25 (#22)    68 (#21)
2.69 (#24)   26 (#28)    17 (#25)         18 (#27)    25 (#30)
2.66 (#25)   37 (#24)    17 (#26)         17 (#28)    73 (#20)
2.65 (#26)   28 (#26)    11 (#30)         8 (#29)     60 (#24)
2.63 (#27)   19 (#30)    14 (#29)         4 (#30)     41 (#28)
2.62 (#28)   22 (#29)    16 (#27)         18 (#26)    41 (#27)
2.55 (#29)   53 (#14)    36 (#13)         35 (#11)    91 (#12)
2.53 (#30)   32 (#25)    14 (#28)         26 (#21)    44 (#26)
2.27 (#31)   7 (#31)     7 (#31)          0 (#31)     18 (#31)