Arena is the primary evaluation layer for the models that power AI agents. For agents to function reliably, they require underlying models that can follow instructions, use tools, and process external information without hallucinating. Arena provides the infrastructure to measure these specific capabilities through its traditional chatbot rankings and its newer Search Arena benchmarks.
As the agent stack evolves, the 'brain' of the agent is often swapped as new models are released. Arena's Elo ratings provide the data that developers use to decide which model to integrate. By moving into search-augmented LLM evaluation, Arena is actively defining the benchmarks for how agents interact with the web and private data, making it a critical player in the agentic reliability and verification space.
Arena, formerly known as LMArena, is the commercial entity behind the platform widely known as Chatbot Arena. Large language models (LLMs) were initially measured by static benchmarks such as MMLU or GSM8K, but rapid progress saturated those tests: models appeared to 'max out' the standardized scores. Arena addressed this by introducing a community-driven, blind A/B testing framework in which humans interact with two anonymous models and vote on which response is better. The votes are aggregated into Elo ratings, similar to the system used in competitive chess, and these ratings have become the primary signal for identifying the current state-of-the-art model.
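As a rough illustration of how blind votes become a ranking, the sketch below folds a stream of pairwise outcomes into Elo scores. The K-factor and starting rating are illustrative constants, and Arena's published methodology is considerably more elaborate than this minimal update rule.

```python
from collections import defaultdict

K = 32            # illustrative K-factor, not Arena's actual parameter
INITIAL = 1000.0  # illustrative starting rating

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(votes):
    """Fold a stream of blind votes into ratings.

    votes: iterable of (model_a, model_b, outcome), where outcome is
    1.0 if A's response won, 0.0 if B's won, and 0.5 for a tie.
    """
    ratings = defaultdict(lambda: INITIAL)
    for a, b, outcome in votes:
        e_a = expected_score(ratings[a], ratings[b])
        ratings[a] += K * (outcome - e_a)
        ratings[b] += K * ((1.0 - outcome) - (1.0 - e_a))
    return dict(ratings)

# Three hypothetical blind battles between two models:
print(update_elo([("model-x", "model-y", 1.0),
                  ("model-y", "model-x", 0.5),
                  ("model-x", "model-y", 1.0)]))
```

Because each vote shifts only the two ratings involved, scores can be updated incrementally as new battles arrive.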
The company reached a $1.7 billion valuation following a $150 million funding round in early 2026. This valuation reflects the critical nature of model evaluation as foundation models become increasingly commoditized. In a world where every laboratory claims their model is 'the most powerful,' Arena provides the third-party verification that buyers and developers actually trust. The shift from LMArena to the broader 'Arena' brand signals an expansion beyond simple text chat into more complex forms of AI interaction and evaluation.
One of the company's significant technical expansions is Search Arena. This initiative, documented in research presented at ICLR 2026, focuses on the evaluation of Search-Augmented LLMs. As the industry moves from pure generative models to Retrieval-Augmented Generation (RAG) and agentic workflows, the metrics for success change. It is no longer enough to produce a coherent sentence; the model must accurately reference external data and cite its sources. Search Arena provides the testing ground for these capabilities, measuring how effectively a model can utilize a search tool before formulating a response.
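To make the protocol concrete, here is a minimal sketch of one blind, side-by-side battle for search-augmented models. `web_search` and `call_model` are hypothetical stand-ins for a search backend and a tool-using model API; none of this reflects Arena's actual code.

```python
import random

def web_search(query: str) -> list[str]:
    # Hypothetical search tool: a real harness would call a search API
    # and return snippets paired with their source URLs.
    return [f"[stub snippet for: {query}]"]

def call_model(model: str, prompt: str, snippets: list[str]) -> str:
    # Hypothetical model call: a real harness would prompt an LLM to
    # answer using, and citing, the retrieved snippets.
    return f"[{model}'s cited answer to: {prompt}]"

def blind_battle(prompt: str, model_a: str, model_b: str) -> dict:
    """Run one anonymized side-by-side comparison and record the vote."""
    snippets = web_search(prompt)  # both sides see the same retrieved context
    sides = [model_a, model_b]
    random.shuffle(sides)          # hide which model produced which answer
    for i, model in enumerate(sides, start=1):
        print(f"Response {i}:\n{call_model(model, prompt, snippets)}\n")
    vote = input("Better response? (1 / 2 / tie): ").strip()
    winner = {"1": sides[0], "2": sides[1]}.get(vote, "tie")
    return {"order": sides, "winner": winner}
```

Sharing the same retrieved context between both models isolates answer quality from retrieval luck; a harness could also let each model issue its own queries to evaluate search strategy as well.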
This move into search-augmented evaluation places Arena at the center of the transition from 'chatbots' to 'agents.' By establishing benchmarks for how models interact with tools and external information, Arena is defining the performance standards for the next generation of software. The platform remains a rare bridge between the high-level research community and the practical needs of enterprise developers who require quantifiable proof of a model's reliability before deployment.
Arena occupies an unusual position: the companies it evaluates are also its most important participants. OpenAI, Anthropic, and Google all compete on the Arena leaderboard, since a top ranking is among the most effective marketing tools available to them. While firms like Scale AI offer private human labeling and evaluation services, Arena maintains a public-facing, crowdsourced model that is harder to manipulate than private benchmarks.
There are inherent trade-offs in this approach. Crowdsourced human preference is subjective and can be influenced by formatting choices or the perceived 'helpfulness' of a model rather than its factual accuracy. Arena acknowledges this complexity by providing multiple categories of rankings, including hard prompts and specialized domains. This transparency allows users to see where a model excels and where it falters, preventing a single number from obscuring the nuances of model behavior. As the AI agent ecosystem matures, Arena is likely to remain the clearinghouse for model truth.
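As a minimal sketch of how such category rankings can work, assume each vote record carries a category tag; per-category leaderboards then fall out of the same rating machinery. The field names and labels below are illustrative, not Arena's schema, and the example reuses the `update_elo` function from the earlier sketch.

```python
def leaderboard_by_category(tagged_votes):
    """Group votes by their category tag and fit ratings per category.

    Reuses `update_elo` from the earlier sketch; the category labels
    and field names here are illustrative.
    """
    by_cat = {}
    for v in tagged_votes:
        by_cat.setdefault(v["category"], []).append((v["a"], v["b"], v["outcome"]))
    return {cat: update_elo(vs) for cat, vs in by_cat.items()}

votes = [
    {"a": "model-x", "b": "model-y", "outcome": 1.0, "category": "hard_prompts"},
    {"a": "model-x", "b": "model-y", "outcome": 0.0, "category": "coding"},
    {"a": "model-y", "b": "model-x", "outcome": 0.5, "category": "coding"},
]
for category, ratings in leaderboard_by_category(votes).items():
    print(category, ratings)
```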
A platform for community-driven AI evaluation and performance ranking.

Public repositories associated with Arena include:
- The Public Suffix List
- Source code of the Arena leaderboard methodology
- micromark extension to support math (`$C_L$`)
- ⚔️ [ICLR 2026] Official code of "Search Arena: Analyzing Search-Augmented LLMs"
- Prompt-to-Leaderboard
- An open platform for training, serving, and evaluating large language models; the release repo for Vicuna and Chatbot Arena