
AutoBench Run 2 - April 2025

Second major AutoBench run with o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, etc.

Past
Date: April 25, 2025
Version: 2025-04-25
Models: 24
New Models: 24

Run data

| Model | Average (All Topics) | Coding | Creative Writing | Current News | General Culture | Grammar | History | Logics | Math | Science | Technology |
|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 3.83 (#25) | 3.81 (#23) | 4.06 (#18) | 3.78 (#24) | 3.9 (#24) | 3.91 (#23) | 3.82 (#24) | 3.74 (#23) | 3.56 (#21) | 3.86 (#24) | 3.86 (#24) |
|  | 3.89 (#23) | 3.73 (#24) | 3.86 (#23) | 3.86 (#23) | 4.04 (#23) | 3.9 (#24) | 4.02 (#23) | 3.77 (#22) | 3.56 (#20) | 4.02 (#23) | 4.05 (#23) |
|  | 3.99 (#22) | 4 (#15) | 4.2 (#14) | 4.04 (#18) | 4.11 (#17) | 3.98 (#20) | 4.15 (#15) | 3.85 (#20) | 3.44 (#24) | 4.05 (#22) | 4.07 (#21) |
|  | 4 (#20) | 3.97 (#19) | 4.2 (#15) | 4 (#21) | 4.1 (#18) | 3.97 (#21) | 4.03 (#21) | 3.82 (#21) | 3.79 (#15) | 4.07 (#21) | 4.07 (#22) |
|  | 3.88 (#24) | 3.86 (#21) | 3.42 (#24) | 4.01 (#20) | 4.05 (#22) | 3.94 (#22) | 4.02 (#22) | 3.66 (#24) | 3.59 (#19) | 4.09 (#20) | 4.08 (#20) |
|  | 4 (#19) | 3.98 (#17) | 4.04 (#19) | 3.99 (#22) | 4.05 (#21) | 4.1 (#17) | 4.1 (#18) | 3.86 (#19) | 3.64 (#18) | 4.1 (#19) | 4.1 (#18) |
|  | 4.16 (#13) | 4.25 (#7) | 4.33 (#10) | 4.17 (#10) | 4.17 (#12) | 4.22 (#11) | 4.18 (#13) | 4.07 (#6) | 3.97 (#7) | 4.11 (#18) | 4.13 (#16) |
|  | 4.09 (#16) | 4.01 (#14) | 4.32 (#11) | 4.08 (#13) | 4.14 (#16) | 4.11 (#15) | 4.06 (#20) | 4.04 (#9) | 3.91 (#9) | 4.13 (#17) | 4.12 (#17) |
|  | 4 (#21) | 3.88 (#20) | 4.04 (#20) | 4.04 (#19) | 4.1 (#19) | 4.09 (#18) | 4.11 (#17) | 3.89 (#17) | 3.53 (#22) | 4.14 (#16) | 4.09 (#19) |
|  | 4.02 (#18) | 3.83 (#22) | 4.02 (#21) | 4.07 (#15) | 4.17 (#14) | 4.1 (#16) | 4.13 (#16) | 3.93 (#14) | 3.52 (#23) | 4.15 (#15) | 4.21 (#13) |
|  | 4.05 (#17) | 3.98 (#18) | 4.19 (#16) | 4.07 (#16) | 4.08 (#20) | 4.05 (#19) | 4.09 (#19) | 3.87 (#18) | 3.88 (#11) | 4.17 (#14) | 4.18 (#14) |
|  | 4.1 (#15) | 4.12 (#11) | 4.17 (#17) | 4.08 (#14) | 4.17 (#13) | 4.16 (#13) | 4.16 (#14) | 3.92 (#15) | 3.87 (#13) | 4.19 (#13) | 4.14 (#15) |
|  | 4.26 (#6) | 4.44 (#3) | 4.35 (#9) | 4.09 (#12) | 4.2 (#11) | 4.23 (#10) | 4.21 (#11) | 4.32 (#2) | 4.41 (#2) | 4.21 (#12) | 4.25 (#11) |
|  | 4.17 (#12) | 4.23 (#8) | 4.3 (#13) | 4.06 (#17) | 4.17 (#15) | 4.19 (#12) | 4.21 (#12) | 4.1 (#5) | 4.03 (#6) | 4.22 (#11) | 4.24 (#12) |
|  | 4.16 (#14) | 4.18 (#9) | 3.99 (#22) | 4.18 (#9) | 4.28 (#10) | 4.24 (#9) | 4.3 (#10) | 3.97 (#12) | 3.85 (#14) | 4.25 (#10) | 4.29 (#9) |
|  | 4.2 (#9) | 4.27 (#6) | 4.41 (#5) | 4.15 (#11) | 4.31 (#8) | 4.14 (#14) | 4.34 (#6) | 3.96 (#13) | 3.87 (#12) | 4.29 (#9) | 4.3 (#8) |
|  | 4.39 (#3) | 4.27 (#6) | 4.41 (#5) | 4.15 (#11) | 4.31 (#8) | 4.14 (#14) | 4.34 (#6) | 3.96 (#13) | 3.87 (#12) | 4.29 (#9) | 4.3 (#8) |
|  | 4.18 (#11) | 4.1 (#12) | 4.3 (#12) | 4.2 (#8) | 4.32 (#7) | 4.27 (#8) | 4.32 (#9) | 3.99 (#10) | 3.68 (#17) | 4.3 (#8) | 4.29 (#10) |
|  | 4.34 (#4) | 4.42 (#4) | 4.41 (#6) | 4.22 (#7) | 4.3 (#9) | 4.44 (#4) | 4.32 (#8) | 4.3 (#3) | 4.34 (#3) | 4.3 (#7) | 4.4 (#3) |
|  | 4.26 (#7) | 4.17 (#10) | 4.38 (#7) | 4.33 (#4) | 4.33 (#6) | 4.36 (#5) | 4.34 (#7) | 4.06 (#7) | 3.91 (#10) | 4.31 (#6) | 4.36 (#6) |
|  | 4.2 (#10) | 3.98 (#16) | 4.35 (#8) | 4.29 (#6) | 4.36 (#4) | 4.33 (#6) | 4.38 (#5) | 3.9 (#16) | 3.7 (#16) | 4.33 (#5) | 4.34 (#7) |
|  | 4.26 (#8) | 4.05 (#13) | 4.46 (#3) | 4.29 (#5) | 4.35 (#5) | 4.32 (#7) | 4.39 (#4) | 3.97 (#11) | 3.95 (#8) | 4.35 (#4) | 4.39 (#4) |
|  | 4.34 (#5) | 4.33 (#5) | 4.47 (#2) | 4.36 (#3) | 4.43 (#3) | 4.54 (#2) | 4.45 (#3) | 4.05 (#8) | 4.07 (#5) | 4.42 (#3) | 4.36 (#5) |
|  | 4.46 (#2) | 4.5 (#2) | 4.42 (#4) | 4.48 (#2) | 4.59 (#1) | 4.53 (#3) | 4.6 (#2) | 4.17 (#4) | 4.17 (#4) | 4.56 (#2) | 4.59 (#2) |
|  | 4.57 (#1) | 4.55 (#1) | 4.51 (#1) | 4.57 (#1) | 4.59 (#2) | 4.6 (#1) | 4.61 (#1) | 4.48 (#1) | 4.57 (#1) | 4.67 (#1) | 4.61 (#1) |
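
Each cell in the run data pairs a score with a rank, in the form `4.57 (#1)`. For readers who want to work with the numbers programmatically, the following is a minimal sketch of how one row could be split into per-topic values. The names `CELL`, `COLUMNS`, and `parse_row` are illustrative assumptions and not part of AutoBench; the column order is taken from the table header.

```python
import re

# One "score (#rank)" cell, e.g. "4.57 (#1)" or "4 (#15)".
CELL = re.compile(r"(\d+(?:\.\d+)?) \(#(\d+)\)")

# Column order as listed in the table header (the Model column is excluded).
COLUMNS = [
    "Average (All Topics)", "Coding", "Creative Writing", "Current News",
    "General Culture", "Grammar", "History", "Logics", "Math", "Science",
    "Technology",
]


def parse_row(line):
    """Return {column: (score, rank)} for one row of the run data.

    Works whether the cells are fused together or separated by
    spaces or pipes, because it only looks for "score (#rank)" pairs.
    """
    cells = CELL.findall(line)
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, found {len(cells)}")
    return {
        name: (float(score), int(rank))
        for name, (score, rank) in zip(COLUMNS, cells)
    }


# The last row of the table, as it appears in the run data.
row = parse_row(
    "4.57 (#1)4.55 (#1)4.51 (#1)4.57 (#1)4.59 (#2)4.6 (#1)"
    "4.61 (#1)4.48 (#1)4.57 (#1)4.67 (#1)4.61 (#1)"
)
print(row["Math"])             # (4.57, 1)
print(row["General Culture"])  # (4.59, 2)
```

The sketch only extracts what the table already shows; it makes no assumption about how the "Average (All Topics)" column is computed from the topic scores.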