The Open Agent Leaderboard

Elron Bandel

How good are general-purpose AI agents? We built an open evaluation framework to find out.

Most AI evaluations report a simple result: the score each model achieved on each benchmark task. But when you deploy an agent, you’re not just choosing a model. You’re choosing a full system: what tools the agent can use, how it plans its steps, what it remembers between actions, and how it recovers when something goes wrong. Change any of those and the same model can produce very different results at very different costs.

How well an AI agent works depends on how it’s built, not just the model inside it.

Today we’re launching the Open Agent Leaderboard, an open benchmark for comparing full agent systems, not just the models inside them. It reports both quality and cost, so you can see not just what works, but what’s worth deploying.

The leaderboard is paired with the Exgentic framework for running and reproducing evaluations, and a
