
AutoBench Run 2 - April 2025

The second major AutoBench run, covering models including o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, and Claude 3.7 Sonnet:thinking.

Past
Date: April 25, 2025
Version: 2025-04-25
Models: 24
New Models: 24

Run data

| Model | AutoBench | LMArena | AAI Index | MMLU-Pro |
|---|---|---|---|---|
|  | 3.83 (#25) | 1245 (#17) | 37080 (#18) | 0.691 (#16) |
|  | 3.88 (#24) | 1217 (#19) | 35280 (#20) | 0.652 (#19) |
|  | 3.89 (#23) | 1217 (#20) | 32530 (#22) | 0.59 (#22) |
|  | 3.99 (#22) | 1237 (#18) | 34740 (#21) | 0.634 (#21) |
|  | 4 (#19) | 1272 (#12) | 35680 (#19) | 0.648 (#20) |
|  | 4 (#20) | 1271 (#13) | 50530 (#8) | 0.809 (#5) |
|  | 4 (#21) | - | 42990 (#12) | 0.752 (#12) |
|  | 4.02 (#18) | 1257 (#15) | 41110 (#13) | 0.713 (#13) |
|  | 4.05 (#17) | 1249 (#16) | 38270 (#15) | 0.697 (#15) |
|  | 4.09 (#16) | 1318 (#7) | 45580 (#11) | 0.752 (#11) |
|  | 4.1 (#15) | 1288 (#11) | 39230 (#14) | 0.709 (#14) |
|  | 4.16 (#13) | 1372 (#3) | 53240 (#5) | 0.819 (#4) |
|  | 4.16 (#14) | 1356 (#5) | 48090 (#10) | 0.779 (#10) |
|  | 4.17 (#12) | 1310 (#8) | - | - |
|  | 4.18 (#11) | 1269 (#14) | 37280 (#17) | - |
|  | 4.2 (#10) | 1293 (#10) | 48150 (#9) | 0.803 (#6) |
|  | 4.2 (#9) | 1342 (#6) | 37620 (#16) | 0.669 (#18) |
|  | 4.26 (#6) | 1358 (#4) | 60220 (#4) | 0.844 (#2) |
|  | 4.26 (#8) | - | - | 0.69 (#17) |
|  | 4.26 (#7) | 1305 (#9) | 62860 (#3) | 0.791 (#8) |
|  | 4.34 (#5) | - | 52860 (#6) | 0.781 (#9) |
|  | 4.34 (#4) | 1402 (#2) | 50630 (#7) | 0.799 (#7) |
|  | 4.39 (#3) | 1293 (#10) | 48150 (#9) | 0.803 (#6) |
|  | 4.46 (#2) | 1439 (#1) | 67840 (#2) | 0.858 (#1) |
|  | 4.57 (#1) | - | 69830 (#1) | 0.832 (#3) |
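
Each cell in the run data pairs a score with its rank in the form "score (#rank)", and a lone "-" appears where no value is listed. For readers who want to work with these numbers, the following is a minimal sketch of how one such cell might be parsed. The names `CELL_PATTERN` and `parse_cell` are hypothetical and not part of AutoBench, and reading "-" as a missing value is an assumption.

```python
import re
from typing import Optional, Tuple

# Hypothetical helper, not part of AutoBench.
# Matches cells such as "4.57 (#1)": a number followed by a rank written as "(#N)".
CELL_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\(#(\d+)\)\s*$")


def parse_cell(cell: str) -> Optional[Tuple[float, int]]:
    """Return (score, rank) for a 'score (#rank)' cell, or None for a '-' cell."""
    # Assumption: a lone dash means the benchmark has no value for that model.
    if cell.strip() == "-":
        return None
    match = CELL_PATTERN.match(cell)
    if match is None:
        raise ValueError(f"unrecognized cell: {cell!r}")
    return float(match.group(1)), int(match.group(2))


# The top-ranked AutoBench row from the table, already split into its four cells.
row = ["4.57 (#1)", "-", "69830 (#1)", "0.832 (#3)"]
parsed = [parse_cell(cell) for cell in row]
print(parsed)
```

Keeping the score and rank as separate fields makes it straightforward to sort rows by any one benchmark or to skip the rows where that benchmark has no value.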