Evaluating AI Agent Reasoning: Vakra Benchmark Lessons

IBM's Vakra benchmark reveals how AI agents really fail (tool selection + multi-step planning). A plain-English checklist for evaluating your setup.

By Rob · 11 June 2026 · 6 min read

If you have set up a local AI agent (Claude with MCP servers, Cursor agent mode, Aider, anything similar) and it works on simple tasks but fails on the second tricky thing you ask it to do, you are not alone. IBM Research's Vakra benchmark analysis documents the same failure modes systematically across thousands of test scenarios.

This guide translates the benchmark findings into a plain-English checklist for UK readers building or evaluating their own agents. Think of it as a 'driving test for your agent' - what to test before declaring it production-ready, and what specific failure modes to watch for.

What does the Vakra benchmark actually test?

Three things, layered on top of each other.

  • Tool selection. Given a task and a catalogue of APIs, can the agent pick the right one? This sounds easy, yet it is consistently the most common source of failure.
  • Multi-step planning. Can the agent decompose a complex task into the right sequence of API calls, with each call's output feeding the next?
  • State tracking. Across 5+ step sequences, does the agent keep track of intermediate results without getting confused or hallucinating prior state?

Vakra runs thousands of scenarios across these categories and reports per-category accuracy. The headline finding from IBM's analysis: even the best-performing 2025 frontier models score below 60% on the harder multi-step planning tasks.

What are the most common failure modes?

Per the IBM analysis, three patterns dominate.

Tool-selection errors are the most common. The agent has many APIs available; it picks the wrong one. Often it picks an API that is plausibly related but not correct for the specific task (e.g. 'search-by-name' when the task wanted 'search-by-id'). These errors compound across multi-step tasks - one wrong tool early in the chain breaks everything downstream.
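Because one early wrong tool breaks everything after it, it helps to find the first step where the agent diverged rather than just marking the whole run as failed. A minimal sketch, assuming you have written down the tool sequence you expected for a scenario and can log the sequence the agent actually called. The tool names and the `first_wrong_tool` helper are hypothetical, not part of Vakra.

```python
def first_wrong_tool(expected, actual):
    """Return the index of the first step where the tool differs, or None.

    If one sequence is a prefix of the other, the first missing or extra
    step counts as the point of divergence.
    """
    for i, (want, got) in enumerate(zip(expected, actual)):
        if want != got:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


# Hypothetical scenario: the task supplied an ID, so search_by_id was expected.
expected = ["search_by_id", "get_order", "refund_order"]
actual = ["search_by_name", "get_order", "refund_order"]

step = first_wrong_tool(expected, actual)
if step is None:
    print("tool sequence matched")
else:
    print(f"first divergence at step {step}: "
          f"expected {expected[step:step + 1]}, got {actual[step:step + 1]}")
```

This only compares tool names. Checking arguments as well is a natural next step, but the name-level check already separates 'wrong tool' from 'right tool, wrong plan'.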

Multi-step planning breakdowns happen when the agent loses track of what it has already done. It re-runs earlier steps, ignores intermediate state, or starts over when it should be continuing. This is most visible at 5+ step depth.
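One cheap signal for this kind of breakdown is an identical tool call appearing more than once in a single run. The sketch below assumes your agent framework can give you a trace as a list of `{"tool": ..., "args": ...}` dictionaries; that trace shape and the `repeated_calls` helper are assumptions for illustration.

```python
import json
from collections import Counter


def repeated_calls(trace):
    """Return {(tool, serialised_args): count} for calls made more than once."""
    keys = [
        # sort_keys makes argument order irrelevant when comparing calls
        (step["tool"], json.dumps(step["args"], sort_keys=True))
        for step in trace
    ]
    return {key: n for key, n in Counter(keys).items() if n > 1}


# Hypothetical trace in which the agent repeats its first lookup.
trace = [
    {"tool": "search_by_id", "args": {"id": 42}},
    {"tool": "get_order", "args": {"customer_id": 42}},
    {"tool": "search_by_id", "args": {"id": 42}},
    {"tool": "refund_order", "args": {"order_id": 7}},
]

for (tool, args), n in repeated_calls(trace).items():
    print(f"{tool}({args}) was called {n} times")
```

Some repeats are legitimate (polling a job status, for example), so treat a hit as a prompt to read the trace, not as an automatic failure.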

Ambiguous-instruction failures happen when the user's request could be interpreted in more than one way. The agent picks one interpretation, often the wrong one, and runs with it instead of asking for clarification. Production agents need to detect ambiguity and resolve it explicitly.

How should you evaluate your own agent?

Three steps a non-researcher can apply.

  1. Write 20 realistic task scenarios. Not toy prompts - real tasks you would actually give the agent in production. Cover the spectrum: easy single-step tasks, complex multi-step tasks, ambiguous instructions. Include some you know the agent will probably fail at.
  2. Score by category. When the agent fails, note which Vakra category the failure falls in: tool selection, planning, or state tracking. This points you at the specific layer to fix.
  3. Don't aim for 100%. Even frontier models score below 60% on the hardest multi-step tasks. A useful target for production work is >80% on the 70% of tasks that are 'realistically doable'. The remaining 30% deserve human review, not agent autonomy.
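The three steps above can be sketched as a small scoring script. This assumes you record, by hand, whether each scenario is 'realistically doable', whether the agent passed, and which category a failure fell into. The `Result` fields and the made-up results are illustrative only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Result:
    name: str
    doable: bool                            # your own judgement call
    passed: bool
    failure_category: Optional[str] = None  # tool_selection / planning / state_tracking


def summarise(results, target=0.8):
    """Return (pass rate on doable tasks, target met?, failures per category)."""
    doable = [r for r in results if r.doable]
    if not doable:
        raise ValueError("no scenarios marked as doable")
    pass_rate = sum(r.passed for r in doable) / len(doable)
    failures = Counter(r.failure_category for r in results if not r.passed)
    return pass_rate, pass_rate > target, failures


# Made-up run: 20 scenarios, 14 marked doable (70%), 12 of those passed.
results = (
    [Result(f"doable-{i}", True, True) for i in range(12)]
    + [Result("doable-12", True, False, "tool_selection"),
       Result("doable-13", True, False, "planning")]
    + [Result(f"hard-{i}", False, False, "planning") for i in range(4)]
    + [Result(f"hard-{i}", False, False, "state_tracking") for i in range(4, 6)]
)

pass_rate, target_met, failures = summarise(results)
print(f"pass rate on doable tasks: {pass_rate:.0%} (target met: {target_met})")
for category, count in failures.most_common():
    print(f"  {category}: {count} failures")
```

With 20 scenarios, 70% doable means 14 tasks, and 'more than 80%' of 14 means at least 12 passes. The failure tally then tells you which layer to work on first.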

Why are tool-selection errors so common?

Three structural reasons.

First, tool descriptions are usually written by API designers, not for LLM consumption. The difference between 'search-by-name' and 'search-by-id' is obvious to a human reading the full docs, but the LLM is choosing between short tool descriptions that can sound interchangeable.

Second, when there are many tools (10+), the model's attention can scatter. Approaches like Anthropic's tool-search MCP server (which lets the model query for the right tool rather than seeing all tools at once) help materially here.

Third, LLM training data is heavier on natural-language tasks than on tool-use tasks. The reasoning about 'which API to call' is genuinely thinner in the training data than reasoning about 'what to write next'.

For practical mitigation: keep your tool descriptions short and disambiguating. If two tools sound similar, rewrite the descriptions to highlight the difference. Test agent tool-selection on the specific catalogue you give it.
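A rough way to spot descriptions that 'sound similar' is to compare them pairwise and flag any that are nearly identical as text. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The tool catalogue and the 0.8 threshold are assumptions, and string similarity is only a crude proxy for what a model finds confusing.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical catalogue: tool name -> description shown to the model.
tools = {
    "search_by_name": "Search for a customer record.",
    "search_by_id": "Search for a customer record by ID.",
    "send_email": "Send an email to a given address.",
}


def confusable_pairs(tools, threshold=0.8):
    """Return (tool_a, tool_b, similarity) for descriptions above the threshold."""
    pairs = []
    for (a, desc_a), (b, desc_b) in combinations(tools.items(), 2):
        ratio = SequenceMatcher(None, desc_a.lower(), desc_b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs


for a, b, ratio in confusable_pairs(tools):
    print(f"{a} and {b} have near-identical descriptions (similarity {ratio})")
```

Here the first description never says it searches by name, which is exactly the kind of gap to fix. A flagged pair is a cue to rewrite, and the real test is still running the agent against your catalogue.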

What is changing fast in 2026?

Three patterns.

First, smaller and faster open evaluation harnesses. Anthropic shipped Claude evals support, OpenAI has their Evals SDK, and several open-source harnesses (Inspect, DeepEval) have matured. Running your own evals is now reasonable for individual developers.

Second, MCP and similar tool-use protocols are starting to standardise. As tool descriptions become more uniform, tool-selection failures should become less common. By the end of 2026, expect noticeable improvements on the same benchmarks just from cleaner tool description standards.

Third, agents are increasingly self-evaluating. Patterns like 'run a small evaluation suite before declaring a complex task complete' are showing up in production frameworks. This adds latency but materially reduces unrecoverable failures.
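The self-evaluation pattern can be as simple as a set of named checks that must all pass before the agent is allowed to report success. A minimal sketch, assuming the agent exposes its final state as a dictionary. The state keys, the checks and the `self_check` helper are hypothetical and not taken from any particular framework.

```python
from typing import Callable, Dict, List, Tuple


def self_check(
    state: dict, checks: Dict[str, Callable[[dict], bool]]
) -> Tuple[bool, List[str]]:
    """Run every check against the state; return (all passed?, failed check names)."""
    failed = [name for name, check in checks.items() if not check(state)]
    return len(failed) == 0, failed


# Hypothetical checks for a task that writes rows to a spreadsheet.
checks = {
    "all rows written": lambda s: s["rows_written"] == s["rows_expected"],
    "no tool errors": lambda s: not s["errors"],
}

state = {"rows_written": 2, "rows_expected": 3, "errors": []}
done, failed = self_check(state, checks)
print("complete" if done else f"not complete, failed checks: {failed}")
```

Each check costs extra time, which is the latency trade-off mentioned above, so keep the suite small and aimed at failures that would be hard to undo.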

Frequently asked questions

Q01 Should I use Vakra itself to test my agent?
Probably not directly - Vakra is a research benchmark with thousands of scenarios that take significant time and tokens to run. Use it for inspiration on what categories of failure to look for. For your own evaluation, write 20-50 of your own scenarios that match your actual production workload.
Q02 What about smaller harnesses like Inspect or DeepEval?
Both are credible for individual developers in 2026. Inspect (from the UK AI Security Institute, formerly the AI Safety Institute) is solid for general LLM evaluation; DeepEval is more agent-oriented. Either is a good starting point for setting up your own evaluation pipeline.
Q03 Does the benchmark show that bigger models always win?
Mostly yes for raw accuracy, but smaller models with good tool-use training can outperform larger general-purpose models on specific agent tasks. The IBM analysis shows GPT-4 / Claude 3.7 / Gemini 2.5 perform broadly similarly, with substantial scatter between scenarios. There is no single 'best agent model' that wins everything.
Q04 How often should I re-evaluate my agent?
Whenever you change the model, change the tool catalogue, change the prompt, or update major dependencies. The evaluation suite should be cheap enough to run weekly during development and after every significant change.
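One way to make 'whenever you change something' automatic is to fingerprint the agent's configuration and re-run the suite when the fingerprint changes. A minimal sketch, assuming the configuration comes down to a model name, a system prompt and a list of tool definitions. The field names and values are illustrative, and dependency versions are left out.

```python
import hashlib
import json


def config_fingerprint(model, system_prompt, tools):
    """Return a stable SHA-256 hash of the parts of the setup that affect behaviour."""
    payload = json.dumps(
        {
            "model": model,
            "system_prompt": system_prompt,
            # sort tools by name so listing order does not change the hash
            "tools": sorted(tools, key=lambda t: t["name"]),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical configuration values.
tools = [
    {"name": "search_by_id", "description": "Look up a customer by numeric ID."},
    {"name": "send_email", "description": "Send an email to a given address."},
]

before = config_fingerprint("example-model-v1", "You are a support agent.", tools)
after = config_fingerprint("example-model-v1", "You are a careful support agent.", tools)

if before != after:
    print("configuration changed: re-run the evaluation suite")
```

Storing the last fingerprint alongside the last evaluation results makes it obvious when a reported score no longer describes the agent you are running.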
Q05 What is the role of human review if agents are this fallible?
Substantial. Even at 80% accuracy on realistic tasks, on average one task in five is wrong. For high-stakes work (writes, payments, customer communication) human review remains essential. For lower-stakes work (research, drafts, brainstorming) the agent can run autonomously with periodic spot-checks.
Q06 How does this compare to LLM-as-judge approaches?
LLM-as-judge (using another LLM to grade agent outputs) is complementary. It scales evaluation cheaply but introduces its own biases. Use LLM-as-judge for fast iteration and use human-in-the-loop testing for the final 'is this ready to ship?' decision.
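Before relying on a judge model for fast iteration, it is worth measuring how often it agrees with you on a sample you have graded by hand. A minimal sketch, assuming both the judge and the human produce a pass/fail verdict per scenario. The verdicts below are made up.

```python
def judge_agreement(judge_labels, human_labels):
    """Return (agreement rate, indexes where judge and human disagree)."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("need two non-empty label lists of equal length")
    disagreements = [
        i for i, (j, h) in enumerate(zip(judge_labels, human_labels)) if j != h
    ]
    rate = 1 - len(disagreements) / len(judge_labels)
    return rate, disagreements


# Made-up verdicts for five scenarios (True = pass).
judge = [True, True, False, True, False]
human = [True, False, False, True, False]

rate, disagreements = judge_agreement(judge, human)
print(f"judge agrees with human on {rate:.0%} of scenarios")
print(f"review these scenarios by hand: {disagreements}")
```

The disagreements are the useful part: reading them shows which way the judge is biased, which a single agreement figure hides.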