About this stream
Status: Forming
Shared understanding of how agents actually get evaluated. Every team runs something different, and most of those setups have never been compared in the open. This stream is where that comparison happens.
Eval methods in use today, their strengths and their blind spots
How different industries evaluate different agent types
Who uses what, and for what purpose
Share-your-setup threads: what you run, why you chose it, how you plan to evolve it
Tooling and best practices
While this stream is forming, scope is read broadly. If it plausibly fits the purpose, post it. The scope tightens as the group matures.
Show and tell. Bring your eval setup, the reasoning behind it, and what it misses.
Work streams are not standards bodies. The goal is shared understanding: what exists, who uses what, and what holds up in practice. If real alignment emerges across companies, Agent Community can help guide early convergence and hand mature work to a major standards organization once there is agreement, deployment, and usage. A spec written in a weekend is a conversation starter, not a draft standard.