VAKRA Dataset | LeaderBoard | Release Blog | GitHub | Submit to Leaderboard
We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments.
Unlike traditional benchmarks that test isolated skills, VAKRA measures compositional reasoning across APIs and documents, using full execution traces to assess whether agents can reliably complete multi-step workflows.
VAKRA provides an executable environment where agents interact with more than 8,000 locally hosted APIs backed by real databases spanning 62 domains, along with domain-aligned document collections. Tasks can require 3- to 7-step reasoning chains that combine structured API interaction with unstructured retrieval under natural-language tool-use constraints.
As the results below show, models perform poorly on VAKRA. In this blog, we provide additional details about the tasks in VAKRA and present an analysis of the failure modes we observed across them.
As illustrated below, the VAKRA benchmark comprises four tasks, each testing a different set of capabilities.
Fig 1: Representative examples of each capability in the VAKRA benchmark
This capability includes 2,077 test instances across 54 domains, requiring the use of tools from the SLOT-BIRD and SEL-BIRD collections (Elder et al., 2026). Compared to the setup in Elder et al., the SLOT-BIRD and SEL-BIRD tool universe is expanded to cover a larger number of domains. Each domain is restricted to one tool collection, and tasks involve chaining 1–12 tool calls to arrive at the final answer.
{
"query": "Which football team has a build-up play speed of 31, build-up plan dribbling of 53, and build-up play passing of 32?",
"tool_calls":[
{
"name": "get_data",
"arguments":{"tool_universe_id="486ea46224d1-aeb8037c5e78"},
"label": "retrieved_data_1"
},
{
"name": "select_data_equal_to",
"arguments":{"data_label":"retrieved_data_1","key_name":"play_speed","value":31},
"label": "FILTERED_DF_0"
},
{
"name": "select_data_equal_to",
"arguments":{"data_label":"FILTERED_DF_0","key_name":"play_dribble","value":53},
"label": "FILTERED_DF_1"
},
{
"name": "select_data_equal_to",
"arguments":{"data_label":"FILTERED_DF_1","key_name":"play_passing","value":32},
"label": "FILTERED_DF_2"
},
{"name":{get_team_name},"arguments":{"data_label":"FILTERED_DF_2","n":1}}}],
"answer": "FC Barcelona"
}
Fig 2: Data sample from SEL-BIRD collection
As shown above, each instance has an associated JSON data source from which the answer must be derived. The MCP servers supporting this task include a special tool, get_data(tool_universe_id=id), which must be called at the beginning of each instance. This tool initializes the data source, returns a lightweight preview of the data (see Figure 3 below), and stores the full dataset server-side, preventing the inefficient transfer of large data over the MCP protocol. The call also configures the MCP server to expose the appropriate tool set based on the tool_universe_id and aligns the data source with the instance's domain-specific database.
The SLOT-BIRD collection provides a global set of 7 tools for generic data manipulation (e.g., filtering, sorting), inspired by systems like Tableau and Google Analytics. The SEL-BIRD collection extends this by introducing more specialized tools: some are shared with SLOT-BIRD, while others are derived by flattening categorical arguments into separate functions (e.g., sort_data with argument ascending: bool = False becomes sort_data_ascending and sort_data_descending). Additionally, the generic retrieve_data function from SLOT-BIRD is replaced with query-specific getters: every key in the data for a given instance has an associated get function (get_KEY_NAME), yielding an average of 4 get functions per instance.
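To make the flattening concrete, the following is a minimal sketch of how such variants could be derived from a generic tool. The function bodies and helper names are illustrative assumptions, not the SEL-BIRD implementation.

# Hypothetical sketch: deriving SEL-BIRD-style flattened tools from a generic
# SLOT-BIRD-style tool. All names here are illustrative, not the benchmark's code.
from functools import partial

def sort_data(data: list[dict], key_name: str, ascending: bool = False) -> list[dict]:
    # Generic SLOT-BIRD-style tool with a categorical argument.
    return sorted(data, key=lambda row: row[key_name], reverse=not ascending)

# Flatten the categorical `ascending` argument into two separate tools.
sort_data_ascending = partial(sort_data, ascending=True)
sort_data_descending = partial(sort_data, ascending=False)

def make_getter(key_name: str):
    # Build a query-specific getter (get_KEY_NAME) for one key in the data.
    def getter(data: list[dict], n: int = 1) -> list:
        return [row[key_name] for row in data[:n]]
    getter.__name__ = f"get_{key_name}"
    return getter

get_team_name = make_getter("team_name")  # e.g., get_team_name(filtered_rows, n=1)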
{
"handle": "retrieved_data_1",
"num_records": 2,
"key_details": [
{"name": "team_name", "dtype": "str", "first_3_values": ["FC Barcelona", "Manchester City"]},
{"name": "play_speed", "dtype": "int32", "first_3_values": [31, 40]},
{"name": "play_dribble", "dtype": "int32", "first_3_values": [53, 30]},
{"name": "play_passing", "dtype": "int32", "first_3_values": [32, 16]}
]}
Fig 3: Data preview obtained from the `get_data` function
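For intuition, here is a minimal sketch of a get_data-style handler that keeps the full dataset server-side and returns only a preview shaped like Figure 3. The storage layout and the load_domain_records helper are assumptions; the real MCP server additionally swaps in the tool set for the given tool_universe_id.

# Hypothetical sketch of a get_data-style handler (not the actual MCP server code).
import json

_SERVER_SIDE_STORE: dict[str, list[dict]] = {}   # full datasets stay on the server

def load_domain_records(tool_universe_id: str) -> list[dict]:
    # Assumed helper: load the domain-specific records for this instance.
    raise NotImplementedError  # backed by the domain database in the real server

def get_data(tool_universe_id: str) -> str:
    records = load_domain_records(tool_universe_id)
    handle = "retrieved_data_1"
    _SERVER_SIDE_STORE[handle] = records          # avoid shipping rows over MCP
    keys = list(records[0].keys()) if records else []
    preview = {
        "handle": handle,
        "num_records": len(records),
        "key_details": [
            {
                "name": k,
                "dtype": type(records[0][k]).__name__,
                "first_3_values": [r[k] for r in records[:3]],
            }
            for k in keys
        ],
    }
    return json.dumps(preview)                    # lightweight preview only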
This capability includes 1,597 instances across 17 domains, requiring tools from an expanded REST-BIRD collection (Elder et al.). These tools use endpoint-style interfaces: highly specific, query-aligned endpoints that encapsulate most of the computation. They are served as REST APIs running in a FastAPI server, which is wrapped by the MCP server. This task requires selecting the correct APIs from the domain-specific tool set (as shown in the example in Figure 1). Each domain contains between 6 and 328 tools, with an average of 116. Similar to the previous task, the get_data tool configures the MCP server to expose only the relevant domain-specific APIs.
The OpenAI API specification restricts the tools parameter to a maximum of 128 tools. This restriction requires an agent builder using this API to manage the length of the tool list directly, for example via a shortlisting mechanism. In the baseline agents in our repository, a simple shortlisting step handles this challenge.
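A minimal sketch of one possible shortlisting step is shown below; the keyword-overlap scoring is an illustrative assumption and not necessarily what the baseline agents do.

# Hypothetical shortlisting sketch: rank tools by keyword overlap with the query
# and keep at most 128, the maximum the OpenAI tools parameter accepts.
MAX_TOOLS = 128

def shortlist_tools(query: str, tools: list[dict], limit: int = MAX_TOOLS) -> list[dict]:
    query_terms = set(query.lower().split())

    def score(tool: dict) -> int:
        fn = tool["function"]
        text = (fn["name"] + " " + fn.get("description", "")).lower()
        return sum(term in text for term in query_terms)

    ranked = sorted(tools, key=score, reverse=True)
    return ranked[:limit]

# Usage (illustrative):
# tools = shortlist_tools(user_query, all_domain_tools)
# response = client.chat.completions.create(model=..., messages=..., tools=tools)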
The Capability 3 segment of the benchmark has 869 test instances drawn from 38 subject domains. These instances again rely on the REST-BIRD API collection, but add multi-hop reasoning to the challenge (see the example in Figure 1). Multi-hop questions require multiple pieces of supporting evidence to be extracted and combined to reach an answer. The instances in this section require between one and five logical hops to answer a query. The distribution of question types in the test dataset is shown in Figure 4 below.
Fig 4: API Hop-Type distribution for Capability 3 (MultiHop) and Hybrid Hop-Type distribution for Capability 4 (MultiHop MultiSource Reasoning)
Capability 4 includes 644 instances across 41 domains and is also built on the REST-BIRD API collection. Figure 4 above shows a distribution of hybrid hops for test queries without policies. It contains the most complex queries with the following characteristics:
Multi-Source: This segment adds document indices per domain. Queries in this capability can require information from these document indices as well as from API calls. Similar to Capability 3, this task also has multi-hop queries. The required information source applies at the per-hop level, so, for example, a question may entail three logical hops with sources: API - RAG (Document Retrieval) - API. To enforce correct reasoning, sources are decontaminated during data generation, i.e., the information required for a given hop is available in only one source. For example, if a hop is to be answered using APIs, the document index is built by removing documents that likely contain the information needed to answer the question.
Multi-Turn: This segment of the dataset also adds multi-turn conversations to the setting. Each instance is a dialog with multiple turns. The data is released as context-response pairs, where the context encodes the current dialog history and the agent is only responsible for answering the current turn.
Tool-usage Policies: A subset of these instances includes tool-use policies that the agent is required to follow. These policies take the form of plain-text instructions about the knowledge sources that the agent is allowed to access and under which circumstances. For example:
If a user's query pertains to Technology & Software, which is/are about Topics focusing on codebases,
software platforms, applications, and user interactions in tech, make sure you try answering them by
only using document retrievers. Do not use other types of tools.
The baseline agent in the project repository enforces adherence to these policies through a simple addition to the prompt: "You are a helpful assistant with access to tools.\n Tool Usage Constraint: {additional_instructions}.". Agent builders are, of course, free to choose any other constraint-enforcement mechanism.
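As a sketch, the prompt composition might look like the following, using the template quoted above; the variable names and the example messages are illustrative.

# Sketch of composing the baseline system prompt with a tool-usage policy.
def build_system_prompt(additional_instructions: str | None) -> str:
    # Baseline template quoted above; the constraint is appended only when present.
    prompt = "You are a helpful assistant with access to tools."
    if additional_instructions:
        prompt += f"\n Tool Usage Constraint: {additional_instructions}."
    return prompt

policy_text = (
    "If a user's query pertains to Technology & Software, make sure you try "
    "answering them by only using document retrievers. Do not use other types of tools."
)
messages = [
    {"role": "system", "content": build_system_prompt(policy_text)},
    {"role": "user", "content": "<current-turn user question>"},
]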
VAKRA evaluates agents in tool environments where success depends on both the ability to execute coherent, multi-step workflows and answer correctness. We introduce an execution-centric evaluation framework that assesses not only final outputs but also the full tool-execution trajectory that includes tool calls, inputs, and intermediate results.
The VAKRA Evaluator operates over two key inputs for each sample: a predicted final response and the corresponding tool-call trajectory. The tool calls from the predicted trajectory are executed in the same environment as the ground truth to verify intermediate tool outputs.
The evaluation follows a waterfall-style pipeline (Figure 6), where later stages are conditioned on earlier success:
Fig 6: Waterfall-style Evaluation Pipeline

Tool-Sequence Comparison: Because the environment is executable, agents can explore it and sometimes return the answer by invoking a different set of APIs than the ones we identified. To support alternative but valid tool invocations and reasoning paths, correctness is assessed by executing each predicted tool call and comparing the set of tool responses against those from the ground truth, rather than enforcing strict step-level matching.
Specifically, we first perform a programmatic check, verifying whether all information present in the ground-truth tool responses is recovered by the predicted tool responses. This check may be inconclusive in cases involving partial matches, semantic equivalence, or differences in representation (e.g., ordering, aggregation, or formatting). In such cases, we apply a secondary LLM-based evaluation, adapted from the CRAG framework (Yang et al., 2024), which uses an adapted prompt to determine whether the predicted trajectory captures all required information, even if it was obtained through a different sequence of tool calls.
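A minimal sketch of this two-stage check is below, assuming tool responses are nested JSON-like objects; the helper names are illustrative stand-ins, not the evaluator's actual interface.

# Sketch of the tool-sequence comparison: programmatic containment check first,
# falling back to an LLM judge when the check is inconclusive.
def flatten(obj):
    # Yield leaf values from nested dicts/lists.
    if isinstance(obj, dict):
        for v in obj.values():
            yield from flatten(v)
    elif isinstance(obj, list):
        for v in obj:
            yield from flatten(v)
    else:
        yield obj

def values_in(gt_response, pred_responses: list) -> bool:
    # Is every ground-truth value recovered somewhere in the predicted responses?
    gt_values = {str(v) for v in flatten(gt_response)}
    pred_values = {str(v) for r in pred_responses for v in flatten(r)}
    return gt_values <= pred_values

def llm_judge_equivalence(gt_responses, pred_responses) -> bool:
    # Placeholder for the secondary CRAG-style LLM-based check (not shown here).
    raise NotImplementedError

def compare_trajectories(gt_responses: list, pred_responses: list) -> bool:
    if all(values_in(gt, pred_responses) for gt in gt_responses):
        return True                       # programmatic check succeeds
    return llm_judge_equivalence(gt_responses, pred_responses)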
Final Response Evaluation: For trajectories that pass the previous check, the final response is evaluated using an LLM-based judge. This step ensures that the response is (i) grounded in the predicted tool outputs, and (ii) factually consistent with the ground truth answer, accounting for potential variations in phrasing or structure.
This design ensures that agents are rewarded not only for producing correct answers, but for obtaining them through valid and complete reasoning processes.
Scoring
Every capability is equally weighted to obtain the final leaderboard score:
\text{Leaderboard\_Score} = \frac{1}{4} \sum_{n=1}^{4} \text{Capability}_n
For capabilities 1 through 3, every sample within a capability is equally weighted:
\text{Capability}_{1-3} = \frac{\#\,\text{correct queries}}{\#\,\text{total queries}}
For capability 4, we weight heterogeneous (multi-source) queries higher:
\text{Capability}_4 = \frac{\#\,\text{correct multi-source} \times 2 + \#\,\text{correct API-only or RAG-only}}{\#\,\text{total multi-source} \times 2 + \#\,\text{total API-only or RAG-only}}
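In code, the scoring reduces to a few lines. The sketch below assumes each per-sample record carries its capability index, a correctness flag, and (for capability 4) a multi-source flag; this record format is an assumption, not the official scorer.

# Scoring sketch under an assumed per-sample record format.
def capability_score(samples: list[dict], capability: int) -> float:
    subset = [s for s in samples if s["capability"] == capability]
    if capability < 4:
        return sum(s["correct"] for s in subset) / len(subset)
    # Capability 4: multi-source queries carry double weight.
    weights = [2 if s["multi_source"] else 1 for s in subset]
    earned = sum(w * s["correct"] for w, s in zip(weights, subset))
    return earned / sum(weights)

def leaderboard_score(samples: list[dict]) -> float:
    # Each of the four capabilities is equally weighted.
    return sum(capability_score(samples, c) for c in range(1, 5)) / 4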
We now present a detailed error analysis across the four VAKRA capabilities. To facilitate our analysis, we adopt stage-wise error categorization to assign each failure to the first point of breakdown. Specifically, we evaluate, in order: (i) whether the correct tool(s) were selected, (ii) whether the required arguments were provided without omissions or hallucinations, (iii) whether argument values were correct, and (iv) whether the final response is both accurate and grounded in the tool outputs.
Since a single sample may exhibit multiple errors across different steps, we sequentially classify each instance to the earliest failing stage (e.g., tool selection errors take precedence over argument errors). This avoids double-counting and allows error categories to be interpreted as disjoint fractions of the dataset. While more granular metrics (e.g., precision/recall over tool usage) are possible (Elder et al., 2026), we find this formulation provides a simple and interpretable breakdown of agent failures.
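Below is a small sketch of this attribution scheme; the per-sample boolean checks are assumed to be precomputed, and the field names are illustrative.

# Sketch of stage-wise error attribution: each sample is assigned to the first
# stage whose check fails, so the resulting categories stay disjoint.
STAGE_CHECKS = [
    ("tool_selection", "correct_tools_selected"),
    ("argument_structure", "required_args_present"),
    ("argument_values", "argument_values_correct"),
    ("final_response", "response_accurate_and_grounded"),
]

def first_failing_stage(sample: dict) -> str:
    for stage_name, check_key in STAGE_CHECKS:
        if not sample[check_key]:
            return stage_name
    return "no_error"

def error_breakdown(samples: list[dict]) -> dict[str, float]:
    # Fraction of samples in each (disjoint) error category.
    counts: dict[str, int] = {}
    for s in samples:
        stage = first_failing_stage(s)
        counts[stage] = counts.get(stage, 0) + 1
    return {name: n / len(samples) for name, n in counts.items()}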
The instances in this part of the benchmark require selecting and sequencing multiple tools to solve a single task; there are 2,077 samples in this capability. This was challenging for all models, but GPT-OSS-120B performed best on this segment of the benchmark.
Fig 7: SEL-BIRD vs SLOT-BIRD Error Types Analysis
The Business Intelligence (BI) API capability contains two sets of APIs, from the SLOT-BIRD and SEL-BIRD tool collections. The SEL-BIRD portion has 600 samples, while the SLOT-BIRD portion has 1,477. The two collections are grouped under the BI API capability but have slightly different characteristics: SLOT-BIRD has a smaller number of generic tools with a large number of parameter values to fill, while SEL-BIRD has a larger set of tools with fewer parameters per tool. This difference is reflected in the relative errors made by models using the two tool collections.

Fig 8: Comparison of Accuracy Across Models by Hop Depth
Multi-hop reasoning increases the difficulty of the original task by requiring models to successfully answer multiple implicitly coupled questions, each of which requires selecting and calling the correct API. As expected, all models performed best on the questions with only a single logical hop, and saw performance degradations on 2-hop and again on 3+ hop questions.
Fig 9: Model Accuracy Rates by Interaction Type (API, Document-Retriever, Hybrid)
The final segment of the dataset includes document sources in addition to the tool/API sources in the other segments. This leads to instances that require single or multiple API calls, single or multiple document searches, or some combination of API calls and document searches.
Fig 10: Model Accuracy Rates by Policy Type
Policies introduce an additional layer of difficulty on top of multi-hop, multi-source reasoning. When a policy aligns with the source required to answer, i.e., it does not affect the tool list the model needs, we refer to it as "No Updates to Answer". As shown in Figure 10, all models except Granite-4.0-h-Small-32B experience a clear drop in performance under policy constraints that restrict access to the most relevant information source (i.e., "Policy updates the answer").
In general, we find that models either violate the constraints or fail to retrieve sufficient information: they sometimes understand the policy but cannot answer the question correctly, or they exhibit one of the previously analyzed failure modes.
Overall, results in policy-constrained settings suggest that while models can reason over tools and sources, they struggle to incorporate external constraints into that reasoning, which is often a key requirement for reliable real-world deployment.
VAKRA exposes a critical gap between surface-level tool competence and robust, end-to-end agent reliability. Although modern models can increasingly select APIs and execute isolated tool calls, VAKRA shows that these abilities alone are insufficient for real-world deployment. In practice, models often break down when required to perform compositional reasoning under execution constraints spanning APIs, documents, dialog context, and policy requirements.
Think your agent is solid? Put it to the test.
Run it on VAKRA and see where it falls apart—tool selection, multi-hop reasoning, or policy constraints.
👉 Try it and tell us what your agent learned