Work stream

Evals

Agent evals: collecting and evaluating best practices.

About this stream

Status: Forming

Chartered Jun 26, 2026

Shared understanding of how agents actually get evaluated. Every team runs something different, and most of those setups have never been compared in the open. This stream is where that comparison happens.

In scope

Eval methods in use today, their strengths and their blind spots
How different industries evaluate different agent types
Who uses what, and for what purpose
Share-your-setup threads: what you run, why you chose it, how you plan to evolve it
Tooling and best practices

While this stream is forming, scope is read broadly. If it plausibly fits the purpose, post it. The scope tightens as the group matures.

Starting point

Show and tell. Bring your eval setup, the reasoning behind it, and what it misses.

Work streams are not standards bodies. The goal is shared understanding: what exists, who uses what, and what holds up in practice. If real alignment emerges across companies, Agent Community can help guide early convergence, and hand mature work to a major standards organization once there is agreement, deployment, and usage. A spec written in a weekend is a conversation starter, not a draft standard.

Welcome to Evals.

Hi all, Kickoff for the *Evals work stream*. Everyone evaluates agents, and almost nobody does it the same way. That gap is the point of thi…

Balazs Nemethi · 1 reply

Jul 3, 2026