AutoBench Run 2 - April 2025
Second major AutoBench run with o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, etc.
Past
Date: April 25, 2025
Version: 2025-04-25
Models: 24
New Models: 24
Run data
| Model | Score | Avg Cost (US cents) | Avg Latency (sec) | P99 Latency (sec) | Iterations |
|---|---|---|---|---|---|
| | 4.57 (#1) | 0.79 (#20) | 19s (#14) | 52s (#15) | - |
| | 4.46 (#2) | 1.23 (#24) | 37s (#22) | 64s (#16) | - |
| | 4.39 (#3) | - (#26) | 46s (#25) | 83s (#21) | - |
| | 4.39 (#4) | - (#27) | 46s (#26) | 83s (#22) | - |
| | 4.34 (#5) | 0.14 (#14) | 15s (#11) | 29s (#10) | - |
| | 4.34 (#6) | 1.70 (#25) | 34s (#19) | 70s (#18) | - |
| | 4.26 (#8) | 0.52 (#17) | 85s (#27) | 223s (#27) | - |
| | 4.26 (#7) | 0.32 (#16) | 44s (#24) | 94s (#23) | - |
| | 4.26 (#9) | 0.61 (#19) | 11s (#6) | 24s (#9) | - |
| | 4.20 (#11) | 1.13 (#22) | 16s (#12) | 33s (#12) | - |
| | 4.20 (#12) | 1.13 (#23) | 16s (#13) | 33s (#13) | - |
| | 4.20 (#10) | 0.03 (#3) | 30s (#17) | 79s (#20) | - |
| | 4.18 (#13) | 0.04 (#7) | 25s (#15) | 49s (#14) | - |
| | 4.17 (#14) | 0.10 (#11) | 35s (#21) | 67s (#17) | - |
| | 4.16 (#16) | 0.10 (#12) | 42s (#23) | 141s (#26) | - |
| | 4.16 (#15) | 0.04 (#4) | 6s (#3) | 9s (#1) | - |
| | 4.10 (#17) | 0.85 (#21) | 12s (#8) | 23s (#8) | - |
| | 4.09 (#18) | 0.09 (#10) | 35s (#20) | 107s (#25) | - |
| | 4.05 (#19) | 0.53 (#18) | 29s (#16) | 97s (#24) | - |
| | 4.02 (#20) | 0.04 (#5) | 31s (#18) | 74s (#19) | - |
| | 4.00 (#21) | 0.04 (#6) | 12s (#9) | 22s (#6) | - |
| | 4.00 (#23) | 0.07 (#9) | 10s (#5) | 23s (#7) | - |
| | 4.00 (#22) | 0.05 (#8) | 8s (#4) | 14s (#4) | - |
| | 3.99 (#24) | 0.18 (#15) | 11s (#7) | 18s (#5) | - |
| | 3.89 (#25) | 0.02 (#2) | 5s (#1) | 12s (#3) | - |
| | 3.88 (#26) | 0.01 (#1) | 14s (#10) | 30s (#11) | - |
| | 3.83 (#27) | 0.14 (#13) | 6s (#2) | 10s (#2) | - |