Delivering Transparency in LLM Benchmarking

AutoBench uses multi-LLM evaluation to deliver accurate, unbiased measurements of LLM quality, cost, and speed. Because the benchmark changes at every run, it resists gaming.

Our system uses more than 20 LLMs to generate granular benchmarks that correlate at over 90% with the Artificial Analysis Intelligence Index (AAII) and at over 80% with LMArena.

QUALITY

AutoBench quality score (1-5 scale, correlated with the Artificial Analysis Intelligence Index); higher is better.

[Bar chart of per-model quality scores; the full figures appear in the leaderboard table below.]

PRICE

USD cents per average answer; lower is better.

[Bar chart of per-model average answer costs; the full figures appear in the leaderboard table below.]

SPEED

Average answer duration in seconds; lower is better.

[Bar chart of per-model average answer durations; the full figures appear in the leaderboard table below.]

AutoBench provides a complete dashboard with highly customizable domain-specific average ranks and precise efficiency metrics (average answer cost, average answer duration, P99 answer duration). For further detail, see also our Hugging Face Leaderboard.

Model | Score | Avg Cost ($ Cents) | Avg Answer Duration (sec) | P99 Answer Duration (sec) | Iterations
gpt-5 | 4.51 | 4.37 | 90 | 277.7 | 385
gpt-5-mini | 4.49 | 0.63 | 65.9 | 231.4 | 392
gpt-oss-120b | 4.48 | 0.14 | 27.01 | 119.2 | 388
gemini-2.5-pro | 4.42 | 1.59 | 65.03 | 199.3 | 388
o3 | 4.41 | 1.85 | 63.9 | 276.7 | 391
Qwen3-235B-A22B-Thinking-2507 | 4.39 | 0.42 | 78.79 | 283.8 | 331
gpt-5-nano | 4.33 | 0.24 | 66.5 | 231.9 | 390
gemini-2.5-flash | 4.32 | 0.45 | 48.71 | 244.1 | 387
grok-4 | 4.31 | 2.92 | 60.96 | 262.5 | 360
o4-mini | 4.27 | 0.87 | 39.05 | 185.5 | 393
claude-opus-4-1 | 4.24 | 9.13 | 48.62 | 155.1 | 387
deepSeek-R1-0528 | 4.18 | 0.64 | 119.17 | 265.6 | 385
GLM-4.5 | 4.18 | 0.63 | 80.74 | 246 | 389
Kimi-K2-Instruct | 4.18 | 0.24 | 65.02 | 390.5 | 325
claude-sonnet-4 | 4.17 | 1.71 | 33.67 | 119.6 | 393
gpt-4.1 | 4.17 | 0.91 | 32.86 | 180.7 | 392
grok-3-mini | 4.06 | 0.09 | 26.12 | 116.1 | 391
gemini-2.5-flash-lite | 4.02 | 0.11 | 19.16 | 127.4 | 389
llama-3_1-Nemotron-Ultra-253B-v1 | 4.02 | 0.35 | 61.54 | 202 | 391
GLM-4.5-Air | 3.98 | 0.36 | 68.34 | 240.5 | 392
Qwen3-14B | 3.98 | 0.08 | 61.12 | 239.2 | 392
deepSeek-V3-0324 | 3.95 | 0.12 | 40.3 | 199.7 | 392
Qwen3-30B-A3B | 3.95 | 0.08 | 72.64 | 243.1 | 390
gemma-3-27b-it | 3.88 | 0.03 | 29.72 | 134.5 | 393
llama-3_3-Nemotron-Super-49B-v1 | 3.88 | 0.05 | 32.64 | 151.4 | 392
magistral-small-2506 | 3.71 | 0.2 | 17.54 | 89.5 | 390
mistral-large-2411 | 3.71 | 0.61 | 24.36 | 96.9 | 392
phi-4 | 3.66 | 0.02 | 7.74 | 19.2 | 392
llama-4-maverick | 3.64 | 0.05 | 10.65 | 71.1 | 388
llama-4-Scout | 3.61 | 0.04 | 10.87 | 39.6 | 393
claude-3.5-haiku | 3.59 | 0.83 | 11.52 | 25.3 | 393
nova-lite-v1 | 3.54 | 0.02 | 5.29 | 10.3 | 393
nova-pro-v1 | 3.49 | 0.18 | 7.53 | 20.2 | 389

Comparison of AutoBench scores with other popular benchmarks. AutoBench shows 92.18% correlation with the Artificial Analysis Intelligence Index, 82.22% with LMArena (Chatbot Arena), and 75.45% with MMLU-Plus. Models are sorted by AutoBench score.

Model | AutoBench | Chatbot Arena | AAI Index | MMLU Index
gpt-5 | 4.511567341 | 1481 | 68.950 | 0.871
gpt-5-mini | 4.486571107 | n/a | 63.700 | 0.828
gpt-oss-120b | 4.479287977 | 1356 | 61.340 | 0.808
gemini-2.5-pro | 4.416904571 | 1458 | 64.630 | 0.862
o3 | 4.409851586 | 1451 | 67.070 | 0.853
Qwen3-235B-A22B-Thinking-2507 | 4.394399183 | 1401 | 63.590 | 0.843
gpt-5-nano | 4.325926956 | n/a | 53.780 | 0.772
gemini-2.5-flash | 4.32099389 | 1409 | 58.430 | 0.759
grok-4 | 4.308828831 | 1430 | 67.520 | 0.866
o4-mini | 4.27410734 | 1398 | 65.050 | 0.832
claude-opus-4-1 | 4.239909895 | 1446 | 58.830 | n/a
deepSeek-R1-0528 | 4.18112906 | 1418 | 58.740 | 0.849
Kimi-K2-Instruct | 4.177138663 | 1420 | 48.560 | 0.824
GLM-4.5 | 4.176558031 | 1414 | 56.080 | 0.835
claude-sonnet-4 | 4.171968576 | 1399 | 61.000 | 0.842
gpt-4.1 | 4.165890881 | 1406 | 46.770 | 0.806
grok-3-mini | 4.055940505 | 1360 | 58.010 | 0.828
llama-3_1-Nemotron-Ultra-253B-v1 | 4.020264345 | 1345 | 46.420 | 0.825
gemini-2.5-flash-lite | 4.017202952 | 1351 | 44.348 | 0.832
GLM-4.5-Air | 3.98464018 | 1379 | 49.475 | 0.815
Qwen3-14B | 3.976245179 | n/a | 45.235 | 0.774
Qwen3-30B-A3B | 3.952481327 | 1380 | 42.340 | 0.777
deepSeek-V3-0324 | 3.945669087 | 1390 | 43.990 | 0.819
llama-3_3-Nemotron-Super-49B-v1 | 3.883310532 | 1324 | 40.473 | 0.698
gemma-3-27b-it | 3.881640548 | 1363 | 25.220 | 0.669
mistral-large-2411 | 3.714675671 | 1313 | 27.013 | 0.697
magistral-small-2506 | 3.713933337 | 1347 | 35.950 | 0.746
phi-4 | 3.657791802 | 1258 | 27.950 | 0.714
llama-4-maverick | 3.640194992 | 1330 | 41.730 | 0.809
llama-4-Scout | 3.614481399 | 1318 | 33.060 | 0.752
claude-3.5-haiku | 3.586292962 | 1317 | 23.326 | 0.634
nova-lite-v1 | 3.538201832 | 1262 | 24.540 | 0.59
nova-pro-v1 | 3.490835422 | 1289 | 28.830 | 0.691

Cost Breakdown per Domain ($ Cents/Response)

Model | coding | creative writing | current news | general culture | grammar | history | logics | math | science | technology | Average (All Topics)
claude-opus-4-1 | 18.54 | 5.81 | 7.95 | 7.09 | 6.34 | 9.15 | 7.76 | 8.97 | 8.11 | 8.62 | 9.13
gpt-5 | 6.01 | 2.75 | 3.79 | 3.07 | 3.33 | 3.68 | 6.2 | 7.59 | 3.76 | 3.82 | 4.37
grok-4 | 5.1 | 1.49 | 2.3 | 1.62 | 2.76 | 2.2 | 5.12 | 5.97 | 2.11 | 2.26 | 2.92
o3 | 1.83 | 0.94 | 1.52 | 1.01 | 1.05 | 1.34 | 3.76 | 5.16 | 1.26 | 1.36 | 1.85
claude-sonnet-4 | 3.74 | 0.99 | 1.47 | 1.16 | 1.13 | 1.59 | 1.52 | 1.81 | 1.52 | 1.55 | 1.71
gemini-2.5-pro | 2.77 | 0.73 | 1.51 | 1.12 | 1.22 | 1.49 | 1.52 | 2.21 | 1.61 | 1.52 | 1.59
gpt-4.1 | 1.28 | 0.48 | 0.8 | 0.53 | 0.72 | 0.75 | 1.4 | 1.69 | 0.76 | 0.75 | 0.91
o4-mini | 1.03 | 0.63 | 0.71 | 0.61 | 0.73 | 0.69 | 1.6 | 1.25 | 0.67 | 0.74 | 0.87
claude-3.5-haiku | 1.43 | 0.6 | 0.78 | 0.69 | 0.7 | 0.83 | 0.67 | 0.87 | 0.76 | 0.77 | 0.83
deepSeek-R1-0528 | 1.31 | 0.24 | 0.36 | 0.29 | 0.32 | 0.35 | 1.36 | 1.57 | 0.34 | 0.35 | 0.64
gpt-5-mini | 0.84 | 0.36 | 0.56 | 0.42 | 0.5 | 0.52 | 0.89 | 1.1 | 0.56 | 0.58 | 0.63
GLM-4.5 | 1.22 | 0.22 | 0.37 | 0.3 | 0.37 | 0.38 | 1.21 | 1.44 | 0.38 | 0.41 | 0.63
mistral-large-2411 | 0.96 | 0.46 | 0.61 | 0.46 | 0.46 | 0.61 | 0.57 | 0.73 | 0.58 | 0.59 | 0.61
gemini-2.5-flash | 1.01 | 0.15 | 0.39 | 0.24 | 0.33 | 0.37 | 0.4 | 0.65 | 0.42 | 0.43 | 0.45
Qwen3-235B-A22B-Thinking-2507 | 0.45 | 0.28 | 0.44 | 0.36 | 0.41 | 0.38 | 0.67 | 0.66 | 0.37 | 0.37 | 0.42
GLM-4.5-Air | 0.65 | 0.12 | 0.2 | 0.14 | 0.31 | 0.19 | 0.67 | 0.81 | 0.24 | 0.27 | 0.36
llama-3_1-Nemotron-Ultra-253B-v1 | 0.52 | 0.21 | 0.17 | 0.14 | 0.23 | 0.21 | 0.77 | 0.84 | 0.21 | 0.19 | 0.35
gpt-5-nano | 0.32 | 0.19 | 0.18 | 0.16 | 0.26 | 0.22 | 0.35 | 0.38 | 0.17 | 0.19 | 0.24
Kimi-K2-Instruct | 0.29 | 0.17 | 0.24 | 0.22 | 0.17 | 0.29 | 0.28 | 0.22 | 0.24 | 0.23 | 0.24
magistral-small-2506 | 0.21 | 0.1 | 0.1 | 0.07 | 0.11 | 0.1 | 0.61 | 0.52 | 0.1 | 0.1 | 0.2
nova-pro-v1 | 0.29 | 0.15 | 0.17 | 0.13 | 0.18 | 0.16 | 0.17 | 0.25 | 0.15 | 0.15 | 0.18
gpt-oss-120b | 0.2 | 0.08 | 0.14 | 0.1 | 0.09 | 0.12 | 0.16 | 0.21 | 0.12 | 0.14 | 0.14
deepSeek-V3-0324 | 0.18 | 0.08 | 0.11 | 0.08 | 0.1 | 0.1 | 0.18 | 0.16 | 0.1 | 0.1 | 0.12
gemini-2.5-flash-lite | 0.16 | 0.02 | 0.04 | 0.03 | 0.04 | 0.07 | 0.28 | 0.3 | 0.05 | 0.08 | 0.11
grok-3-mini | 0.14 | 0.05 | 0.07 | 0.05 | 0.07 | 0.07 | 0.15 | 0.13 | 0.07 | 0.07 | 0.09
Qwen3-14B | 0.1 | 0.04 | 0.05 | 0.04 | 0.06 | 0.05 | 0.19 | 0.19 | 0.05 | 0.05 | 0.08
Qwen3-30B-A3B | 0.12 | 0.04 | 0.05 | 0.04 | 0.05 | 0.05 | 0.15 | 0.19 | 0.05 | 0.05 | 0.08
llama-4-maverick | 0.07 | 0.03 | 0.04 | 0.03 | 0.04 | 0.05 | 0.07 | 0.07 | 0.04 | 0.04 | 0.05
Llama-3_3-Nemotron-Super-49B-v1 | 0.06 | 0.03 | 0.04 | 0.04 | 0.04 | 0.04 | 0.07 | 0.06 | 0.04 | 0.04 | 0.05
llama-4-Scout-17B-16E-Instruct | 0.05 | 0.03 | 0.04 | 0.04 | 0.04 | 0.04 | 0.04 | 0.05 | 0.04 | 0.04 | 0.04
gemma-3-27b-it | 0.04 | 0.02 | 0.03 | 0.02 | 0.02 | 0.03 | 0.03 | 0.04 | 0.03 | 0.03 | 0.03
phi-4 | 0.03 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.04 | 0.02 | 0.02 | 0.02
nova-lite-v1 | 0.03 | 0.01 | 0.02 | 0.01 | 0.01 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02 | 0.02

Average Latency Breakdown per Domain (Seconds)

Model | coding | creative writing | current news | general culture | grammar | history | logics | math | science | technology | Average (All Topics)
deepSeek-R1-0528 | 238.2024 | 38.1537 | 62.9681 | 48.2227 | 51.7076 | 61.9942 | 302.2976 | 271.5234 | 62.5406 | 66.1298 | 119.174235
gpt-5 | 120.9672 | 50.0373 | 78.8355 | 55.0956 | 56.6508 | 73.3966 | 156.5955 | 151.2006 | 83.2029 | 76.5455 | 89.99818067
GLM-4.5 | 147.0394 | 20.9164 | 43.4854 | 31.8055 | 38.4103 | 42.5121 | 224.8967 | 165.5218 | 43.1608 | 49.7589 | 80.74437254
Qwen3-235B-A22B-Thinking-2507 | 180.1429 | 33.7386 | 65.2237 | 45.3004 | 54.6603 | 53.7109 | 122.6611 | 138.2941 | 60.427 | 72.9058 | 78.79346155
Qwen3-30B-A3B | 119.9895 | 27.907 | 34.7837 | 25.8461 | 38.8109 | 37.0577 | 204.2344 | 157.6709 | 38.1969 | 41.6341 | 72.64171253
GLM-4.5-Air | 105.9142 | 13.7172 | 31.2173 | 19.5371 | 45.6936 | 30.9208 | 206.3031 | 140.9465 | 38.1786 | 44.156 | 68.34050587
gpt-5-nano | 98.4218 | 38.2174 | 52.23 | 38.2242 | 48.2844 | 54.4789 | 136.6573 | 86.4832 | 48.3263 | 53.912 | 66.4959839
gpt-5-mini | 102.5153 | 25.7197 | 48.4212 | 31.5236 | 35.4217 | 44.406 | 159.7168 | 84.2748 | 52.1692 | 56.7156 | 65.89701176
gemini-2.5-pro | 83.393 | 29.2572 | 49.1594 | 36.6978 | 36.9932 | 47.2089 | 166.9207 | 93.4631 | 49.4694 | 52.5929 | 65.03115036
Kimi-K2-Instruct | 69.4739 | 29.2635 | 65.4032 | 45.9564 | 46.6415 | 75.2439 | 114.3645 | 58.6696 | 72.5513 | 57.1711 | 65.0222057
o3 | 70.4202 | 25.9427 | 46.4619 | 29.613 | 26.5293 | 42.6644 | 194.8085 | 112.9362 | 41.2548 | 46.7826 | 63.89621339
llama-3_1-Nemotron-Ultra-253B-v1 | 88.5095 | 28.8905 | 29.2174 | 22.7454 | 32.2121 | 36.1473 | 174.681 | 134.8929 | 38.2931 | 32.3295 | 61.53657957
Qwen3-14B | 67.7342 | 19.9239 | 31.3204 | 32.178 | 32.2363 | 31.2024 | 197.4205 | 132.5492 | 40.1656 | 31.5221 | 61.11544056
grok-4 | 92.6961 | 28.4581 | 49.7663 | 33.7181 | 41.4523 | 48.4394 | 138.4335 | 124.3858 | 48.7183 | 50.103 | 60.95525411
gemini-2.5-flash | 95.8841 | 16.7289 | 32.1003 | 18.3471 | 24.8943 | 29.1208 | 133.8567 | 52.6201 | 31.5185 | 36.1925 | 48.7078753
claude-opus-4-1 | 97.988 | 31.7501 | 44.9092 | 36.3513 | 30.166 | 53.8647 | 38.4643 | 32.1546 | 48.1318 | 52.2831 | 48.62490598
deepSeek-V3-0324 | 55.0977 | 17.8165 | 32.0701 | 23.041 | 24.2812 | 31.001 | 88.9007 | 51.2641 | 29.3776 | 43.4225 | 40.30336432
o4-mini | 56.98 | 16.3976 | 26.7274 | 19.8134 | 21.7084 | 23.2641 | 116.3436 | 41.5349 | 28.6513 | 26.4233 | 39.05469579
claude-sonnet-4 | 75.4533 | 18.119 | 34.977 | 23.5653 | 18.9687 | 35.1883 | 19.9602 | 24.1877 | 36.5889 | 34.5893 | 33.66639032
gpt-4.1 | 58.0164 | 11.4923 | 24.717 | 14.4706 | 15.9619 | 23.1579 | 80.4132 | 46.8127 | 21.9879 | 23.3983 | 32.86274006
Llama-3_3-Nemotron-Super-49B-v1 | 46.3574 | 15.1811 | 28.7779 | 22.9057 | 21.5446 | 27.2551 | 67.4261 | 33.0439 | 31.8722 | 25.0215 | 32.63831081
gemma-3-27b-it | 62.2708 | 12.4686 | 27.5139 | 18.3155 | 19.2946 | 25.0176 | 24.3704 | 40.4719 | 35.3548 | 23.2773 | 29.7215
gpt-oss-120b | 49.9796 | 11.3648 | 28.9885 | 18.2911 | 14.4024 | 22.6241 | 25.4515 | 36.6005 | 27.8547 | 28.8135 | 27.00733404
grok-3-mini | 45.7707 | 12.9635 | 17.7188 | 13.6096 | 20.4385 | 20.4502 | 40.5232 | 32.54 | 25.8136 | 23.9 | 26.12147499
mistral-large-2411 | 51.7739 | 14.2815 | 23.6025 | 17.3517 | 13.3736 | 25.8432 | 18.1355 | 24.6234 | 25.1484 | 21.9005 | 24.36368715
gemini-2.5-flash-lite | 26.1777 | 2.8563 | 5.6249 | 3.6215 | 3.9845 | 6.8374 | 86.5258 | 41.5112 | 6.4054 | 9.5453 | 19.15509939
magistral-small-2506 | 11.4551 | 7.1952 | 7.5178 | 5.7532 | 6.1051 | 8.6988 | 79.2722 | 37.9617 | 7.0901 | 7.2786 | 17.53939687
claude-3.5-haiku | 17.2096 | 9.5021 | 13.1592 | 11.2295 | 9.2722 | 13.7099 | 7.9069 | 9.5013 | 10.8395 | 11.2754 | 11.51902452
llama-4-Scout-17B-16E-Instruct | 20.2177 | 6.4339 | 9.1924 | 7.529 | 7.9851 | 9.8133 | 11.1171 | 13.0721 | 12.2545 | 8.2341 | 10.86684261
llama-4-maverick | 21.5707 | 4.76 | 7.8189 | 5.8389 | 6.1681 | 8.3187 | 21.0604 | 11.3275 | 8.7158 | 6.8338 | 10.65014104
phi-4 | 10.8373 | 5.9498 | 6.7808 | 6.3085 | 5.9981 | 7.1457 | 7.7569 | 12.1669 | 7.0431 | 7.4096 | 7.744667446
nova-pro-v1 | 12.4833 | 7.5838 | 7.52 | 5.6658 | 6.5254 | 7.3418 | 5.8645 | 7.2712 | 6.6838 | 6.7792 | 7.528192069
nova-lite-v1 | 7.1014 | 4.7882 | 5.846 | 4.7061 | 4.3402 | 5.5093 | 4.8806 | 4.861 | 4.9134 | 5.3275 | 5.288625128

P99 Latency Breakdown per Domain (Seconds)

Model | coding | creative writing | current news | general culture | grammar | history | logics | math | science | technology | Average (All Topics)
Kimi-K2-Instruct | 328.1743 | 127.0559 | 498.215 | 155.9084 | 411.2174 | 374.8414 | 919.1317 | 342.6293 | 464.4898 | 283.0215 | 390.4685
Qwen3-235B-A22B-Thinking-2507 | 666.7428 | 62.6109 | 184.4234 | 141.1792 | 188.1752 | 130.8262 | 431.5208 | 488.4891 | 231.8124 | 312.6447 | 283.8425
gpt-5 | 379.0229 | 104.1814 | 151.0171 | 109.2785 | 163.0493 | 141.3236 | 655.5064 | 536.5325 | 304.2283 | 232.5815 | 277.6722
o3 | 370.6262 | 215.2039 | 126.7157 | 84.4048 | 96.7106 | 130.4733 | 970.1118 | 427.7559 | 179.8835 | 165.4601 | 276.7346
deepSeek-R1-0528 | 516.7523 | 74.2918 | 114.0859 | 72.9274 | 112.0078 | 127.0512 | 839.4571 | 432.4486 | 182.3988 | 184.1239 | 265.5545
grok-4 | 330.6722 | 73.9998 | 112.4368 | 75.3662 | 148.4834 | 118.3984 | 908.2874 | 484.3592 | 205.0356 | 168.1656 | 262.5205
GLM-4.5 | 336.5838 | 46.1539 | 115.4247 | 61.5081 | 183.8738 | 94.3667 | 955.208 | 354.6674 | 147.6845 | 164.993 | 246.0464
gemini-2.5-flash | 712.3231 | 38.3328 | 103.7981 | 43.5623 | 74.993 | 117.5583 | 944.4005 | 122.8572 | 135.3642 | 147.9211 | 244.1111
Qwen3-30B-A3B | 302.3182 | 64.1154 | 92.9827 | 77.8548 | 121.5989 | 100.3155 | 973.6302 | 352.3788 | 165.0523 | 180.4958 | 243.0743
GLM-4.5-Air | 294.8103 | 49.5964 | 75.2865 | 47.3232 | 188.0317 | 122.7531 | 934.9063 | 326.1799 | 192.6793 | 173.4233 | 240.499
Qwen3-14B | 291.1851 | 58.0622 | 88.5603 | 117.7344 | 88.0704 | 115.429 | 952.7636 | 353.0349 | 203.2735 | 124.2363 | 239.235
gpt-5-nano | 452.6803 | 63.8721 | 123.3952 | 95.2713 | 98.3822 | 131.3145 | 649.7221 | 349.2375 | 145.5956 | 209.7373 | 231.9208
gpt-5-mini | 420.1856 | 55.5139 | 107.6471 | 65.403 | 94.1406 | 96.4831 | 710.367 | 304.143 | 221.4093 | 238.5054 | 231.3798
llama-3_1-Nemotron-Ultra-253B-v1 | 299.2445 | 64.7177 | 80.6145 | 53.2406 | 111.7416 | 114.9641 | 677.2227 | 364.4893 | 179.9651 | 73.4696 | 201.967
deepSeek-V3-0324 | 186.8884 | 51.2986 | 88.1105 | 69.8525 | 65.7204 | 121.314 | 755.2438 | 202.0879 | 100.6333 | 355.9609 | 199.711
gemini-2.5-pro | 240.7699 | 52.3828 | 97.3472 | 57.9532 | 71.3161 | 111.1321 | 714.4278 | 300.57 | 170.1939 | 177.3255 | 199.3419
o4-mini | 317.1998 | 49.1399 | 78.8689 | 45.2952 | 67.6997 | 52.9028 | 768.0834 | 246.0076 | 143.2397 | 86.9799 | 185.5417
gpt-4.1 | 373.0435 | 20.9727 | 103.6544 | 34.6418 | 45.1164 | 68.0267 | 580.148 | 268.4173 | 151.755 | 161.6432 | 180.7419
claude-opus-4-1 | 411.8235 | 66.277 | 99.7769 | 67.8099 | 73.13 | 140.073 | 240.5076 | 85.4884 | 194.4168 | 172.176 | 155.1479
Llama-3_3-Nemotron-Super-49B-v1 | 215.9947 | 28.1436 | 78.589 | 55.0476 | 69.6513 | 64.6465 | 672.344 | 82.6381 | 150.7472 | 96.6115 | 151.4413
gemma-3-27b-it | 375.5933 | 26.3165 | 60.8123 | 46.1448 | 66.8804 | 99.0397 | 180.0912 | 228.1774 | 193.2272 | 68.8631 | 134.5146
gemini-2.5-flash-lite | 190.8247 | 6.2927 | 19.3252 | 12.1569 | 13.2804 | 47.8915 | 602.4712 | 222.2597 | 49.1969 | 110.7355 | 127.4435
claude-sonnet-4 | 372.4893 | 48.2756 | 92.4428 | 50.91 | 54.3981 | 124.7965 | 53.7468 | 57.4402 | 181.6048 | 159.8677 | 119.5972
gpt-oss-120b | 219.9099 | 40.9979 | 88.2543 | 59.7898 | 50.0398 | 66.35 | 154.4168 | 213.958 | 154.396 | 143.3909 | 119.1503
grok-3-mini | 324.7266 | 28.8678 | 38.3573 | 27.4405 | 58.7259 | 56.7006 | 303.2076 | 79.7071 | 164.6423 | 78.6302 | 116.1006
mistral-large-2411 | 320.7227 | 28.5833 | 76.1094 | 50.5788 | 34.0307 | 104.2414 | 69.1314 | 52.9922 | 161.3657 | 71.1036 | 96.8859
magistral-small-2506 | 50.6671 | 23.6896 | 23.0028 | 17.8342 | 14.257 | 27.6318 | 461.2929 | 227.3066 | 22.4139 | 27.139 | 89.5235
llama-4-maverick | 258.2904 | 12.3067 | 28.3245 | 14.1619 | 15.9541 | 23.4693 | 237.7464 | 50.0955 | 44.6007 | 26.4254 | 71.1375
llama-4-Scout-17B-16E-Instruct | 119.4443 | 17.908 | 21.721 | 15.137 | 15.5893 | 21.4368 | 21.6394 | 35.6977 | 109.5036 | 18.1442 | 39.6221
claude-3.5-haiku | 41.3124 | 14.9481 | 33.5752 | 18.5466 | 17.7297 | 36.6532 | 15.8231 | 40.0595 | 17.8959 | 16.9296 | 25.3473
nova-pro-v1 | 55.831 | 13.3815 | 14.2866 | 9.7714 | 24.3369 | 15.6141 | 14.2894 | 23.7601 | 15.2509 | 15.0352 | 20.1557
phi-4 | 28.1176 | 10.3654 | 12.3853 | 13.4812 | 13.4604 | 12.3491 | 14.116 | 39.9468 | 13.6159 | 34.0316 | 19.1869
nova-lite-v1 | 17.362 | 9.0387 | 11.6702 | 10.1896 | 8.0435 | 9.778 | 8.2672 | 7.7491 | 9.4719 | 11.1956 | 10.2766

Average Rank Breakdown per Domain (1-5 Scale)

Model Name | logics | coding | technology | history | science | general culture | creative writing | grammar | current news | math | General Average
gpt-5 | 4.58 | 4.52 | 4.59 | 4.62 | 4.36 | 4.64 | 4.21 | 4.17 | 4.65 | 4.66 | 4.51
gpt-5-mini | 4.55 | 4.54 | 4.5 | 4.52 | 4.44 | 4.56 | 4.18 | 4.25 | 4.62 | 4.63 | 4.49
gpt-oss-120b | 4.61 | 4.42 | 4.52 | 4.42 | 4.45 | 4.57 | 4.16 | 4.25 | 4.63 | 4.63 | 4.48
gemini-2.5-pro | 4.52 | 4.39 | 4.42 | 4.49 | 4.45 | 4.48 | 4.09 | 4.24 | 4.52 | 4.52 | 4.42
o3 | 4.43 | 4.3 | 4.56 | 4.5 | 4.39 | 4.57 | 3.96 | 4.16 | 4.58 | 4.61 | 4.41
Qwen3-235B-A22B-Thinking-2507 | 4.31 | 4.44 | 4.48 | 4.56 | 4.45 | 4.51 | 3.84 | 3.94 | 4.44 | 4.54 | 4.39
gpt-5-nano | 4.41 | 4.38 | 4.35 | 4.4 | 4.3 | 4.45 | 3.88 | 4.13 | 4.37 | 4.52 | 4.33
gemini-2.5-flash | 4.42 | 4.17 | 4.37 | 4.33 | 4.38 | 4.42 | 4.01 | 4.23 | 4.43 | 4.39 | 4.32
grok-4 | 4.31 | 4.35 | 4.33 | 4.38 | 4.34 | 4.4 | 4.01 | 3.85 | 4.41 | 4.4 | 4.31
o4-mini | 4.31 | 4.24 | 4.36 | 4.38 | 4.26 | 4.35 | 3.9 | 3.84 | 4.48 | 4.53 | 4.27
claude-opus-4-1 | 4.29 | 4.51 | 4.3 | 4.43 | 4.23 | 4.44 | 3.57 | 3.58 | 4.42 | 4.48 | 4.24
deepSeek-R1-0528 | 3.95 | 4.31 | 4.35 | 4.41 | 4.31 | 4.4 | 3.56 | 3.63 | 4.39 | 4.4 | 4.18
GLM-4.5 | 3.89 | 4.26 | 4.38 | 4.42 | 4.32 | 4.47 | 3.48 | 3.47 | 4.48 | 4.49 | 4.18
Kimi-K2-Instruct | 4.12 | 4.54 | 4.29 | 4.35 | 4.19 | 4.5 | 3.4 | 3.52 | 4.46 | 4.41 | 4.18
claude-sonnet-4 | 4.19 | 4.36 | 4.3 | 4.33 | 4.25 | 4.35 | 3.55 | 3.48 | 4.4 | 4.39 | 4.17
gpt-4.1 | 4.24 | 4.32 | 4.26 | 4.19 | 4.21 | 4.23 | 3.74 | 3.79 | 4.27 | 4.32 | 4.17
grok-3-mini | 4.02 | 4.18 | 4.16 | 4.21 | 4.16 | 4.21 | 3.51 | 3.49 | 4.26 | 4.26 | 4.06
gemini-2.5-flash-lite | 4.11 | 4.15 | 4.08 | 4.17 | 4.05 | 4.16 | 3.34 | 3.55 | 4.24 | 4.19 | 4.02
llama-3_1-Nemotron-Ultra-253B-v1 | 3.77 | 4.27 | 4.18 | 4.22 | 4.09 | 4.26 | 3.47 | 3.43 | 4.23 | 4.24 | 4.02
GLM-4.5-Air | 3.8 | 3.99 | 4.19 | 4.19 | 4.03 | 4.29 | 3.42 | 3.34 | 4.25 | 4.27 | 3.98
Qwen3-14B | 3.83 | 4.28 | 4.11 | 4.09 | 4.02 | 4.15 | 3.44 | 3.34 | 4.23 | 4.16 | 3.98
deepSeek-V3-0324 | 3.88 | 4.19 | 4.07 | 4.06 | 4.05 | 4.09 | 3.37 | 3.44 | 4.14 | 4.1 | 3.95
Qwen3-30B-A3B | 3.8 | 4.17 | 4.09 | 4.16 | 3.94 | 4.15 | 3.49 | 3.39 | 4.13 | 4.15 | 3.95
gemma-3-27b-it | 3.57 | 4.29 | 4.11 | 4.17 | 3.97 | 4.18 | 3.04 | 3.08 | 4.19 | 4.2 | 3.88
Llama-3_3-Nemotron-Super-49B-v1 | 3.83 | 4.05 | 4.04 | 4.12 | 3.99 | 4.16 | 3.04 | 3.27 | 4.12 | 4.13 | 3.88
magistral-small-2506 | 3.74 | 3.23 | 3.92 | 3.89 | 3.84 | 3.97 | 3.22 | 3.28 | 4 | 3.94 | 3.71
mistral-large-2411 | 3.5 | 3.97 | 3.83 | 3.93 | 3.8 | 3.92 | 3.11 | 3.11 | 3.99 | 3.95 | 3.71
phi-4 | 3.48 | 3.97 | 3.73 | 3.85 | 3.66 | 3.82 | 3.13 | 3.2 | 3.87 | 3.85 | 3.66
llama-4-maverick | 3.59 | 3.74 | 3.7 | 3.78 | 3.83 | 3.82 | 3.13 | 3.1 | 3.83 | 3.79 | 3.64
llama-4-Scout-17B-16E-Instruct | 3.37 | 3.86 | 3.66 | 3.83 | 3.85 | 3.84 | 3.05 | 3 | 3.82 | 3.84 | 3.61
claude-3.5-haiku | 3.47 | 3.86 | 3.74 | 4 | 3.74 | 3.87 | 2.82 | 2.78 | 3.74 | 3.81 | 3.59
nova-lite-v1 | 3.32 | 3.78 | 3.71 | 3.77 | 3.51 | 3.76 | 2.99 | 2.95 | 3.82 | 3.75 | 3.54
nova-pro-v1 | 3.36 | 3.84 | 3.55 | 3.72 | 3.53 | 3.63 | 2.96 | 2.85 | 3.75 | 3.61 | 3.49

How it works

AutoBench operates through a fully automated, iterative process designed for robustness and statistical significance.

01

Dynamic Question Generation

In each iteration, the system randomly selects a topic (e.g., Math, Coding, History) and difficulty level. A randomly chosen LLM from the pool is then prompted to generate a new, unique question based on these parameters.
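
A minimal sketch of this step, assuming a generic call_llm(model, prompt) wrapper around whatever API client is in use; the topic list, difficulty levels, and prompt wording below are illustrative, not AutoBench's actual configuration:

```python
import random

# Illustrative topic and difficulty pools; AutoBench's real lists are configurable.
TOPICS = ["math", "coding", "history", "logics", "science", "creative writing"]
DIFFICULTIES = ["easy", "medium", "hard"]

def generate_question(models, call_llm):
    """Randomly pick a topic, a difficulty, and a generator model, then request a new question.

    `models` is a list of model identifiers; `call_llm(model, prompt)` stands in for
    your own API wrapper and is assumed to return the model's text reply.
    """
    topic = random.choice(TOPICS)
    difficulty = random.choice(DIFFICULTIES)
    generator = random.choice(models)
    prompt = (
        f"Write one new, self-contained {difficulty} question about {topic}. "
        "Do not reuse well-known benchmark questions."
    )
    return topic, difficulty, generator, call_llm(generator, prompt)
```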

02

Question Quality Control

The generated question is immediately distributed to all other LLMs acting as judges. They collectively rank the question’s quality based on customizable criteria. A question is only accepted for the benchmark if it surpasses a strict quality threshold.
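
A sketch of the acceptance check, assuming a hypothetical rank_question(judge, question) helper that returns a judge's 1-5 rating against the configured criteria; the 4.2 threshold is the one cited in the FAQ below:

```python
from statistics import mean

QUALITY_THRESHOLD = 4.2  # acceptance threshold on the 1-5 scale cited in the FAQ

def question_is_accepted(question, judges, rank_question):
    """Collect every judge's rating for the question and accept it only above the threshold."""
    ratings = [rank_question(judge, question) for judge in judges]
    return mean(ratings) > QUALITY_THRESHOLD
```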

03

Ranked Answer Generation

Once a high-quality question is accepted, all models in the benchmark generate an answer in parallel. Every answer is then distributed to all LLM judges, which rank it based on customizable criteria. This creates a rich matrix of tens of thousands of individual evaluations.
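
The same idea as a short sketch, again with hypothetical call_llm and rank_answer wrappers; the result is the answer-by-judge rank matrix described above:

```python
from concurrent.futures import ThreadPoolExecutor

def collect_ranked_answers(question, models, call_llm, rank_answer):
    """Generate answers in parallel, then have every judge rank every other model's answer (1-5)."""
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        answers = dict(zip(models, pool.map(lambda m: call_llm(m, question), models)))

    ranks = {}
    for answerer, answer in answers.items():
        ranks[answerer] = {
            judge: rank_answer(judge, question, answer)
            for judge in models
            if judge != answerer  # judges do not score their own answer
        }
    return answers, ranks
```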

04

Weighted Rank Aggregation

The system aggregates all ranks into a final score for each answer. This cycle is repeated hundreds of times, generating a massive dataset that gives a statistically robust and nuanced view of each model’s capabilities.
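
A simplified sketch of the aggregation step; the adaptation of judge weights across iterations, which AutoBench performs based on judging performance, is assumed to happen elsewhere and is not shown:

```python
def aggregate_scores(ranks, judge_weights):
    """Collapse an {answerer: {judge: rank}} matrix into one weighted score per answer."""
    scores = {}
    for answerer, judged in ranks.items():
        total_weight = sum(judge_weights[j] for j in judged)
        scores[answerer] = sum(judge_weights[j] * r for j, r in judged.items()) / total_weight
    return scores
```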

Validation: Proven Accuracy at Scale

AutoBench’s effectiveness is not theoretical. The results from its public runs demonstrate both unprecedented scale and exceptionally high correlation with industry-standard benchmarks.

AutoBench your models today

We invite you to explore the code, run the benchmark, contribute to its development, and join the discussion on the future of LLM evaluation. Explore our resources on Hugging Face.

AutoBench for Enterprises and LLM Labs

For Enterprises with Large-Scale LLM Consumption

Benchmark your Use Cases

Large corporations project billions in LLM API calls, but relying on a single model for all tasks leads to massive inefficiencies. AutoBench evaluates models on your internal use cases and data, identifying the optimal model for tasks like sentiment analysis, document summarization, or customer support.

See $ Trade-Offs Instantly

Gain immediate visibility into cost-quality trade-offs. By analyzing performance metrics like average answer cost and P99 duration, AutoBench reveals how switching models can save an estimated 20%+ on LLM expenditure without sacrificing quality.

Switch & Monitor

Seamlessly switch to cost-effective models and monitor ongoing performance. Our enterprise-specific benchmarks ensure continuous optimization, preventing overpayments and improving reliability in high-volume AI deployments.

Empowering LLM Lab R&D

Benchmark your Use Cases

With over 20 major labs competing and a $50M TAM for R&D enablement in 2025, granular evaluation is critical. AutoBench offers private, domain-focused benchmarks to reveal weaknesses in areas like advanced reasoning or specific coding.

See $ Trade-Offs Instantly

Get instant, nuanced views of performance trade-offs through collective LLM judging. Backed by ~300,000 ranks and high correlations (e.g., 86.85% with human preference), it provides actionable data to refine models efficiently.

Switch & Monitor

Monitor progress and switch training strategies with ease. Our scalable framework supports continuous custom runs, helping labs adapt architectures and data for better outcomes in the intensifying AI arms race.

Frequently Asked Questions

Still have doubts? These fast answers clear up the most common concerns about bringing AutoBench into your workflow.

AutoBench is an open source, automated benchmark system designed to evaluate the performance of Large Language Models (LLMs). It uses a “Collective-LLM-as-a-Judge” approach, where ensembles of LLMs themselves assess the quality of questions and answers generated by other LLMs. This makes the benchmark dynamic, scalable, cost-effective, and less prone to human bias.

AutoBench aims to overcome the limitations of traditional, static benchmarks.
The main goals are:

  • To create a dynamic and “hard to hack” benchmark by generating new questions in each run.
  • To reduce human bias and subjectivity in evaluations.
  • To offer a highly scalable and cost-effective solution for frequent, large-scale LLM evaluation.
  • To provide granular, domain-by-domain insights into an LLM’s strengths and weaknesses.

AutoBench has demonstrated strong performance and potential:

  • High Correlation with Established Benchmarks: It achieves correlations of 85-95% with generalist static benchmarks, such as the Artificial Analysis Intelligence Index (AAII), and with Elo-based human-preference benchmarks, such as LMArena.
  • Exceptional Cost-Efficiency: A full benchmark run evaluating over 30 leading LLMs, with over 20 LLM rankers, costs between $1,000 and $10,000 depending on the granularity (number of domains benchmarked) and the models used (‘Reasoning’ models tend to cost more).
  • Dynamic and Robust: The system is difficult to “game” because it dynamically generates questions for each evaluation.

Unlike traditional benchmarks that use static datasets, AutoBench is dynamic. It generates new questions for every evaluation, which prevents models from being “trained to the test” or “gaming” the system. This approach tests for genuine general abilities rather than memorized answers.

AutoBench shifts the evaluation from subjective human bias to a transparent “LLM ecosystem bias”. The “Collective-LLM-as-a-Judge” approach mitigates the biases of any single model by aggregating the “collective view” of the entire LLM ecosystem. Its bias can be considered an average of the biases introduced into the ecosystem by the vast range of training datasets used to train models. The benchmark, therefore, measures performance relative to the consensus of contemporary AI systems.

  • Dynamic and Adaptive: Constantly evolves, making it hard to game.
  • Scalable and Cost-Effective: Can evaluate many models at low cost (under $100 for 20 models).
  • Reduces Human Bias: Replaces subjective human evaluation with a collective model-based perspective.
  • Granular Flexibility: Can easily be adapted to evaluate any chosen domain.
  • Stable and Reliable: The iterative weighting mechanism converges quickly, ensuring stable and reliable results.
  • “ASI-ready”: It will keep working even when it becomes challenging for humans to produce adequate benchmarking datasets and evaluations.

AutoBench operates in a fully automated, iterative process:

  • Dynamic Question Generation: In each iteration, a topic and difficulty are randomly selected, and one of the LLMs is tasked with generating a question.
  • Question Quality Control: The generated question is then ranked by all other LLMs. Only questions that meet a specific quality threshold are accepted.
  • Parallel Answer Generation: All LLMs generate answers to the accepted question in parallel.
  • Parallel Answer Ranking: All LLMs then rank the answers provided by every other model on a 1-5 scale based on criteria like clarity, relevance, and correctness.
  • Weighted Rank Aggregation: The individual ranks are aggregated into a weighted average for each answer. The weights of the “judge” LLMs are adapted over iterations based on their performance, giving more influence to models that provide consistently better judgments.

AutoBench evaluates LLM performance using two main types of metrics:

  • General Average Rank: This is the primary metric for a model’s overall performance. It is calculated as the weighted average of all ranks a model receives across every question and topic in the benchmark run.
  • Topic-Specific Average Ranks: To provide a more detailed view of a model’s capabilities, AutoBench calculates average ranks for a wide range of specific topics. These currently include: General Culture, Logics, Grammar, Science, Technology, Current News, History, Creative Writing, Math, and Coding.

Beyond the quality of the answers, AutoBench also measures key performance indicators related to speed and cost:

  • Average Answer Duration: The average time (in seconds) a model takes to generate a response.
  • P99 Answer Duration: The 99th percentile of response time, which helps to understand a model’s worst-case latency.
  • Average Answer Cost: The estimated cost of generating an answer, reported in USD cents per answer and derived from provider per-token pricing.

These granular scores help identify the specific strengths and weaknesses of each LLM. The sketch below shows how such efficiency metrics can be derived from raw per-answer records.
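
A minimal sketch of how such metrics could be computed from per-answer records; the field names and the nearest-rank P99 method are illustrative assumptions, not AutoBench's implementation:

```python
import math
import statistics

def efficiency_metrics(records):
    """Compute dashboard-style efficiency metrics from per-answer records.

    `records` is assumed to be a list of dicts with "duration_s" and "cost_usd_cents"
    keys, one entry per generated answer.
    """
    durations = sorted(r["duration_s"] for r in records)
    costs = [r["cost_usd_cents"] for r in records]
    p99_index = max(0, math.ceil(0.99 * len(durations)) - 1)  # nearest-rank P99
    return {
        "avg_answer_duration_s": statistics.mean(durations),
        "p99_answer_duration_s": durations[p99_index],
        "avg_answer_cost_cents": statistics.mean(costs),
    }
```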

Question quality is maintained through a rigorous, model-driven quality control process. A generated question is only accepted if its average quality rank from the judging LLMs is above 4.2 (on a 1-5 scale). If a question is rejected, the system automatically generates a new one.

Choosing the right LLM involves balancing three key factors: quality, speed, and cost. AutoBench is designed to help you navigate this trade-off; a short sketch after the list below shows one way to apply the leaderboard numbers.

  • For Highest Quality: If your priority is the best possible answer, look at the General Average Rank.
  • For Best Speed: If you need fast responses for a real-time application, focus on the Average Answer Duration. The “AutoBench Rank vs Answering Time” chart is a perfect tool for visualizing this, showing models that offer a good balance of speed and quality.
  • For Lowest Cost: If budget is your main concern, refer to the Average Answer Cost metric to find models that are more economical to run at scale.
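
A short, hypothetical selection helper built on the leaderboard figures above (field names are illustrative):

```python
def pick_model(leaderboard, max_cost_cents=None, max_avg_duration_s=None):
    """Return the highest-scoring model that satisfies the cost and latency budgets."""
    candidates = [
        row for row in leaderboard
        if (max_cost_cents is None or row["avg_cost_cents"] <= max_cost_cents)
        and (max_avg_duration_s is None or row["avg_duration_s"] <= max_avg_duration_s)
    ]
    return max(candidates, key=lambda row: row["score"]) if candidates else None

# A few rows transcribed from the leaderboard table above.
leaderboard = [
    {"model": "gpt-5", "score": 4.51, "avg_cost_cents": 4.37, "avg_duration_s": 90.0},
    {"model": "gpt-oss-120b", "score": 4.48, "avg_cost_cents": 0.14, "avg_duration_s": 27.01},
    {"model": "gemini-2.5-flash-lite", "score": 4.02, "avg_cost_cents": 0.11, "avg_duration_s": 19.16},
]
print(pick_model(leaderboard, max_cost_cents=0.5, max_avg_duration_s=30))  # -> gpt-oss-120b row
```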

By considering these three dimensions, you can select a model that is perfectly suited to your specific use case and budget. If you need more use-case-specific evaluations, try Bot Scanner.

Yes, AutoBench is an open-source project. The methodology and code are shared to foster community collaboration and drive advancements in LLM benchmarking. We encourage community contributions to improve topic lists, prompts, and the overall system.

Where can I find more information and resources? You can find all our models, leaderboards, and demos on our Hugging Face pages.

The project is open source, and we encourage developers and researchers to use it. The best place to start is the AutoBench 1.0 Model & Code page on Hugging Face. There, you will find the necessary code, configuration details, and instructions to run the benchmark on your own systems or with your own models.

AutoBench was developed by eZecute S.R.L., an Italian startup builder and advisor. The project also benefited from the collaboration of AI entrepreneurs and the support of leading AI companies, such as Translated, and academic institutions, such as the University of Rome “La Sapienza” (DIAG).

Bot Scanner is a user-facing platform that was born from the insights gained while developing AutoBench. While AutoBench is a professional framework for benchmarking LLMs, Bot Scanner is an everyday tool that allows any user to query multiple LLMs simultaneously and use other AIs to rank the responses. It essentially packages the sophisticated evaluation principles of AutoBench into an accessible, user-friendly interface.

No. All user-generated data on Bot Scanner remains strictly personal and is not used for any benchmarking purposes, including for AutoBench.

Let’s talk now!

"*" indicates required fields

By clicking “Submit” you declare that you have accepted the site’s privacy policy.