Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents
VAKRA Dataset | Leaderboard | Release Blog | GitHub | Submit to Leaderboard
We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments.
Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.
VAKRA provides an executable environment in which agents interact with more than 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks require reasoning chains of 3–7 steps that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.
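To make this concrete, here is a minimal sketch of what such a multi-step task might look like. All endpoint names, data, and helper functions below are illustrative assumptions, not VAKRA's actual APIs: the point is only to show a reasoning chain that feeds the result of a structured API call into a second call, then conditions an unstructured document lookup on that result.

```python
# Hypothetical sketch of a VAKRA-style multi-step task. Endpoint paths,
# records, and documents are invented for illustration.

def call_api(endpoint, params):
    # Stand-in for a locally hosted, database-backed API.
    fake_db = {
        ("/crm/customers", ("name", "Acme Corp")): {"id": 17, "tier": "gold"},
        ("/billing/invoices", ("customer_id", 17)): [
            {"invoice": "INV-204", "amount": 1200}
        ],
    }
    key = (endpoint, tuple(params.items())[0])
    return fake_db[key]

def search_docs(query):
    # Stand-in for retrieval over a domain-aligned document collection.
    docs = {
        "gold tier refund policy":
            "Gold-tier customers may request refunds within 60 days.",
    }
    return next(text for title, text in docs.items() if query in title)

def run_task():
    # Step 1: structured lookup -- resolve the customer record.
    customer = call_api("/crm/customers", {"name": "Acme Corp"})
    # Step 2: chained API call using the ID obtained in step 1.
    invoices = call_api("/billing/invoices", {"customer_id": customer["id"]})
    # Step 3: unstructured retrieval conditioned on the customer's tier.
    policy = search_docs(f"{customer['tier']} tier refund policy")
    return {"invoice": invoices[0]["invoice"], "policy": policy}

print(run_task())
```

Even in this toy form, each step depends on the previous one succeeding, which is why full execution traces (rather than final answers alone) are needed to assess where an agent's workflow breaks down.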
As the results below show, models perform poorly on VAKRA. In this post, we share additional details about the tasks in VAKRA and present an analysis of the failure modes we observed across them.
Task Description
As shown below,
