
AutoBench Run 2 - April 2025

Second major AutoBench run with o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, etc.

Past
Date: April 25, 2025
Version: 2025-04-25
Models: 24
New Models: 24

Run data

| Model | Score | Avg Cost ($ Cents) | Avg Latency (sec) | P99 Latency (sec) | Iterations |
|---|---|---|---|---|---|
| | 4.26 (#8) | 0.52 (#17) | 85s (#27) | 223s (#27) | - |
| | 4.39 (#3) | - (#26) | 46s (#25) | 83s (#21) | - |
| | 4.39 (#4) | - (#27) | 46s (#26) | 83s (#22) | - |
| | 4.26 (#7) | 0.32 (#16) | 44s (#24) | 94s (#23) | - |
| | 4.16 (#16) | 0.10 (#12) | 42s (#23) | 141s (#26) | - |
| | 4.46 (#2) | 1.23 (#24) | 37s (#22) | 64s (#16) | - |
| | 4.17 (#14) | 0.10 (#11) | 35s (#21) | 67s (#17) | - |
| | 4.09 (#18) | 0.09 (#10) | 35s (#20) | 107s (#25) | - |
| | 4.34 (#6) | 1.70 (#25) | 34s (#19) | 70s (#18) | - |
| | 4.02 (#20) | 0.04 (#5) | 31s (#18) | 74s (#19) | - |
| | 4.20 (#10) | 0.03 (#3) | 30s (#17) | 79s (#20) | - |
| | 4.05 (#19) | 0.53 (#18) | 29s (#16) | 97s (#24) | - |
| | 4.18 (#13) | 0.04 (#7) | 25s (#15) | 49s (#14) | - |
| | 4.57 (#1) | 0.79 (#20) | 19s (#14) | 52s (#15) | - |
| | 4.20 (#11) | 1.13 (#22) | 16s (#12) | 33s (#12) | - |
| | 4.20 (#12) | 1.13 (#23) | 16s (#13) | 33s (#13) | - |
| | 4.34 (#5) | 0.14 (#14) | 15s (#11) | 29s (#10) | - |
| | 3.88 (#26) | 0.01 (#1) | 14s (#10) | 30s (#11) | - |
| | 4.00 (#21) | 0.04 (#6) | 12s (#9) | 22s (#6) | - |
| | 4.10 (#17) | 0.85 (#21) | 12s (#8) | 23s (#8) | - |
| | 3.99 (#24) | 0.18 (#15) | 11s (#7) | 18s (#5) | - |
| | 4.26 (#9) | 0.61 (#19) | 11s (#6) | 24s (#9) | - |
| | 4.00 (#23) | 0.07 (#9) | 10s (#5) | 23s (#7) | - |
| | 4.00 (#22) | 0.05 (#8) | 8s (#4) | 14s (#4) | - |
| | 4.16 (#15) | 0.04 (#4) | 6s (#3) | 9s (#1) | - |
| | 3.83 (#27) | 0.14 (#13) | 6s (#2) | 10s (#2) | - |
| | 3.89 (#25) | 0.02 (#2) | 5s (#1) | 12s (#3) | - |
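
Each metric in the run data above pairs a value with its rank in parentheses, for example `4.57 (#1)` for a score or `19s (#14)` for a latency. A minimal Python sketch of reading one row into numbers might look like the following. The names `Metric`, `FIELDS`, and `parse_row` are hypothetical and not part of AutoBench, the column order is assumed to match the table header, and a `-` in place of a value is assumed to mean that no value was reported.

```python
import re
from typing import Dict, NamedTuple, Optional

# Matches one "value (#rank)" pair, e.g. "4.57 (#1)", "19s (#14)" or "- (#26)".
METRIC_RE = re.compile(r"(-|\d+(?:\.\d+)?)s?\s*\(#(\d+)\)")

# Assumed column order, taken from the table header (hypothetical key names).
FIELDS = ("score", "avg_cost", "avg_latency_sec", "p99_latency_sec")


class Metric(NamedTuple):
    value: Optional[float]  # None when the table shows "-" (assumed: not reported)
    rank: int


def parse_row(row: str) -> Dict[str, Metric]:
    """Extract the four ranked metrics from one table row."""
    pairs = METRIC_RE.findall(row)
    if len(pairs) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} ranked metrics, got {len(pairs)}")
    return {
        name: Metric(None if raw == "-" else float(raw), int(rank))
        for name, (raw, rank) in zip(FIELDS, pairs)
    }


if __name__ == "__main__":
    top = parse_row("4.57 (#1)0.79 (#20)19s (#14)52s (#15)-")
    print(top["score"], top["p99_latency_sec"])
```

Because the pattern searches for value and rank pairs rather than splitting on a delimiter, the same function accepts a row whether its cells are run together or separated by pipes.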