AutoBench Agentic Run 1 - April 2026
Date: April 19, 2026
Version: 2026-04-19
Models: 32
New Models: 1
The first AutoBench run to measure agentic performance of top LLMs
Latest AutoBench run with models GPT 5.2, Claude Opus 4.5, Gemini 3 Flash and more
Latest AutoBench run with models GPT 5.2, Claude Opus 4.5, DeepSeek 3.2 Speciale and more
The first AutoBench run for the Agronomy domain with models Gemini 3 Pro, GPT 5.1, Grok 4.1, Opus 4.5 and more
Latest AutoBench run with models Gemini 3 Pro, GPT 5.1, Grok 4.1 and more
Latest AutoBench run with enhanced metrics including evaluation iterations and fail rates
Second major AutoBench run with o4-mini, GPT-4.1-mini, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet:thinking, etc.