Conversational voice agents present a distinct evaluation challenge: they must simultaneously satisfy two objectives, accuracy (completing the user's task correctly and faithfully) and conversational experience (doing so naturally, concisely, and in a way appropriate for spoken interaction). These objectives are deeply intertwined: mishearing a confirmation code renders perfect LLM reasoning meaningless, a wall of options overwhelms a caller who can't skim spoken output, and delayed responses can pass every accuracy check while remaining unusable in practice. Existing frameworks treat these as separate concerns, evaluating task success or conversational dynamics, but not both.
We introduce EVA, an end-to-end evaluation framework for conversational voice agents that evaluates complete, multi-turn spoken conversations using a realistic bot-to-bot architecture. EVA produces two high-level scores, EVA-A (Accuracy) and EVA-X (Experience), and is designed to surface failures along each dimension. EVA is the first framework to jointly score task success and conversational experience. We release EVA with an initial airline dataset of 50 scenarios covering flight rebooking, cancellation handling, vouchers, and more (the first in a planned series of domains).
We also provide benchmark results for 20 cascade and audio-native systems, including speech-to-speech (S2S) models and Large Audio Language Models (LALMs). Our central finding is a consistent accuracy-experience tradeoff: agents that perform well on task completion tend to deliver worse user experiences, and vice versa.
The field currently lacks a framework that evaluates the full quality of voice agent interactions, as most existing efforts assess individual components in isolation. For example, AudioBench, SD-Eval, VoxEval, Kimi-Audio-Evalkit, VoiceBench, and VoxDialogue evaluate core Speech-to-Text (STT) understanding capabilities such as transcription, paralinguistics, and acoustic cues, but remain confined to single-turn, non-interactive settings. On the other hand, EmergentTTS-Eval and SHEET assess perceived speech quality using subjective listening tests (e.g., Mean Opinion Score). Beyond speech perception, FD-Bench, Talking Turns, and Full-Duplex-Bench provide deeper analyses of conversational dynamics (interruptions, backchanneling, turn-taking), yet evaluate these in isolation from task-oriented tool use, leaving the relationship between dialogue quality and agentic capability unexamined. More recent efforts, notably VoiceAgentBench and CAVA, take steps towards evaluating the agentic capabilities of commercial voice agent systems, including tool-calling and complex instruction-following. However, these voice-agentic capabilities are not evaluated within the complete conversational workflows that voice agents must navigate in practice: from initial user request through multi-step tool orchestration to final task resolution.
This gap underscores the need for an approach that treats voice agent quality as an integrated whole: evaluating not only whether the task succeeded, but whether the agent communicated accurately, concisely, and naturally throughout, and surfacing how these dimensions trade off against one another in realistic deployment conditions.
End-to-end evaluation reveals interaction dynamics that are not apparent at the component level: whether the agent interrupts users during natural pauses in speech, whether it recovers smoothly when a user corrects a transcription error, or whether high latency disrupts the conversational flow enough to prompt users to repeat themselves or abandon the task entirely.
EVA simulates multi-turn spoken conversations over live audio in which the agent must invoke appropriate tools, adhere to task-specific policies, and reach a deterministically verifiable end state. EVA evaluates voice agents using a bot-to-bot audio architecture composed of five core components:
User Simulator: A conversational AI configured with a specific goal and persona that plays the role of a caller. It operates in audio using high-quality Text-to-Speech (TTS) models, ensuring the evaluation captures representative speech-understanding challenges in natural-sounding conversational speech and realistic turn-taking dynamics.
Voice Agent: The voice agent being evaluated, built with Pipecat, an open-source Python framework for real-time voice applications. EVA supports both cascade architectures (STT → LLM → TTS) and audio-native models (S2S, or LALM → TTS).
Tool Executor: The engine that provides deterministic, reproducible tool responses via custom Python functions. It dynamically queries and modifies a predefined per-scenario database.
Validators: A set of validation metrics that check that conversations are complete and that the user simulator faithfully reproduced the intended behavior and speech, with no human annotation required. Any conversation that fails this validation step is regenerated, ensuring that only valid, correctly executed conversations enter evaluation. This stands in contrast to approaches that rely on post-hoc human labeling to identify simulator errors.
Metrics Suite: A suite of metrics evaluates the voice agent using the conversation recording, transcript, and tool call logs.
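To make the determinism of the tool layer concrete, here is a minimal sketch of how a per-scenario tool executor could work. The class, tool names, and database shape are illustrative assumptions, not EVA's actual API:

```python
import copy

class ToolExecutor:
    """Illustrative sketch (not EVA's real implementation): tools are plain
    Python functions over a per-scenario database snapshot, so every run of
    a scenario starts from identical state and yields identical responses."""

    def __init__(self, scenario_db: dict):
        # Deep-copy so repeated trials never see mutations from earlier runs.
        self.db = copy.deepcopy(scenario_db)
        self.call_log = []  # later inspected by the metrics suite

    def call(self, tool_name: str, **kwargs):
        # Dispatch to a named tool and record the call for evaluation.
        result = getattr(self, tool_name)(**kwargs)
        self.call_log.append({"tool": tool_name, "args": kwargs, "result": result})
        return result

    def lookup_booking(self, confirmation_code: str):
        return self.db["bookings"].get(confirmation_code, {"error": "not found"})

    def cancel_booking(self, confirmation_code: str):
        booking = self.db["bookings"].get(confirmation_code)
        if booking is None:
            return {"error": "not found"}
        booking["status"] = "cancelled"
        return {"status": "cancelled"}

# Usage: two executors built from the same scenario DB behave identically.
scenario_db = {"bookings": {"ABC123": {"flight": "SN101", "status": "confirmed"}}}
ex = ToolExecutor(scenario_db)
print(ex.call("cancel_booking", confirmation_code="ABC123"))  # {'status': 'cancelled'}
```

Keeping the tool layer deterministic is what lets a failed trial be attributed to the agent rather than to environment noise.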
Each test case (scenario) in our framework is an evaluation record, structured to make tests reproducible:
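As an illustration, a scenario record might look like the following (field names and values are hypothetical, not EVA's actual schema):

```python
# Hypothetical scenario record (illustrative fields, not EVA's real schema).
# Bundling the persona, goal, initial database state, available tools, and
# expected end state in one record is what makes a test replayable: every
# trial starts from the same snapshot and is checked against the same
# deterministic end state.
scenario = {
    "id": "airline-rebooking-017",
    "persona": "hurried business traveler who speaks quickly",
    "goal": "rebook a cancelled flight while keeping the paid seat selection",
    "initial_db": {"bookings": {"XK4P2Q": {"flight": "SN101", "seat": "14C"}}},
    "available_tools": ["lookup_booking", "rebook_flight", "transfer_ancillaries"],
    "expected_end_state": {"bookings": {"XK4P2Q": {"flight": "SN205", "seat": "14C"}}},
}
```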
We release EVA with a synthetic airline dataset of 50 scenarios and 15 tools, spanning IRROPS rebooking, voluntary itinerary changes, cancellations, same-day standby, and compensation vouchers. Scenarios are designed to test temporal reasoning, policy-following, constraint satisfaction, and named-entity handling.
See the full demo here.
EVA evaluates voice agents across two fundamental dimensions: EVA-A for accuracy and EVA-X for experience. EVA also includes a set of diagnostic metrics. Unlike the primary metrics, these are not used directly to compare or rank models; rather, they offer granular insight into why a model scores the way it does, helping identify and understand specific failure modes (e.g., ASR or speech synthesis errors). We report pass@k (the probability that at least one of k runs succeeds) and pass^k (the probability that all k runs succeed) across three trials per scenario (k = 3), capturing both peak performance and behavioral consistency.
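Given per-scenario lists of boolean trial outcomes, the two aggregates can be computed as follows (a minimal sketch; the function names are ours, not EVA's):

```python
# Illustrative pass@k / pass^k aggregation (not EVA's actual code). Each
# scenario contributes k boolean trial outcomes; pass@k asks whether any
# trial succeeded, pass^k whether all did, averaged across scenarios.
def pass_at_k(trials_per_scenario):
    """Fraction of scenarios where at least one trial succeeded."""
    return sum(any(t) for t in trials_per_scenario) / len(trials_per_scenario)

def pass_hat_k(trials_per_scenario):
    """Fraction of scenarios where every trial succeeded."""
    return sum(all(t) for t in trials_per_scenario) / len(trials_per_scenario)

# Three scenarios, k = 3 trials each:
results = [
    [True, True, True],     # consistent success
    [True, False, False],   # flaky: counts for pass@3 but not pass^3
    [False, False, False],  # consistent failure
]
print(pass_at_k(results))   # 2/3: some trial succeeded in two scenarios
print(pass_hat_k(results))  # 1/3: all trials succeeded in only one
```

The gap between the two numbers is exactly the "flakiness" that a single-run benchmark cannot see.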
EVA uses two evaluation methods: deterministic code-based metrics, which compute scores directly from structured data and run fast; and LLM-as-Judge metrics, which use Large Language Models (LLMs) to assess qualitative aspects of the conversation, or LALMs to evaluate the speech directly. Each judge-based metric uses the model that performs best on a curated evaluation dataset for that specific metric.
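The per-metric judge selection described above can be sketched as a small alignment search. All names here are hypothetical helpers for illustration; EVA's actual selection procedure is not shown in this post:

```python
# Hypothetical sketch of per-metric judge selection: score each candidate
# judge model against a small human-labeled dataset for that metric, and
# keep the best-aligned model. `judge_fn` stands in for whatever call
# actually queries a judge model (an assumption, not a real API).
def select_judge(candidates, labeled_examples, judge_fn):
    """Pick the candidate whose verdicts best match the human labels."""
    def alignment(model):
        hits = sum(judge_fn(model, ex["input"]) == ex["label"]
                   for ex in labeled_examples)
        return hits / len(labeled_examples)
    return max(candidates, key=alignment)

# Toy usage: "judge-a" says True to everything; "judge-b" checks input < 2.
candidates = ["judge-a", "judge-b"]
labeled = [{"input": 1, "label": True}, {"input": 2, "label": False}]
def toy_judge(model, x):
    return True if model == "judge-a" else x < 2
print(select_judge(candidates, labeled, toy_judge))  # judge-b
```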
Task completion alone is a necessary but insufficient measure of accuracy. An agent can reach the correct end state while fabricating a policy detail, misreading a confirmation code aloud, or hallucinating a flight number mid-conversation. These failures are invisible to a binary pass/fail check but directly harm users. EVA-A therefore measures three dimensions of accuracy:
Turn-taking timing matters, but it tells only part of the story. An agent can have perfect timing while overwhelming a caller with a wall of spoken options they cannot skim, or repeatedly asking for information already given. These failures degrade the experience without ever involving a mistimed response. EVA-X therefore measures three dimensions of experience:
We evaluated 20 systems, proprietary and open-source, cascade and audio-native, and find a consistent accuracy-experience tradeoff: agents that perform well on task completion tend to deliver worse user experiences, and vice versa. This tradeoff is invisible to benchmarks that score only task completion. No single configuration dominates both axes, confirming that accuracy and experience must be measured jointly.
Additionally, we identified named-entity transcription as a dominant failure mode: a single misheard character can cascade into an authentication failure and a full conversation breakdown. Multi-step workflows also break agents in predictable ways; rebooking a flight while preserving ancillary services (seats, baggage) is the dominant complexity breaker across all configurations. Finally, we observed that additional calibration is needed for real-world deployment: the gap between pass@3 and pass^3 is substantial across all configurations, meaning even agents that can complete a task often cannot do so consistently, which is critical for real-world success.
View the early results here.
EVA is designed to provide rigorous, end-to-end evaluation of conversational voice agents, but several limitations are important to acknowledge, across the framework, data, and metrics dimensions:
Metrics: LLM-as-judge models carry inherent biases and may favor certain response styles independent of quality, with additional risk of systematic bias when the evaluated and judge models share a provider. While we validate our judges against labeled datasets and report accuracy measurements on our website, these alignment scores do not eliminate systematic bias entirely. Additionally, task completion is measured as binary, which does not capture partial credit and may understate the relative quality of systems that fail gracefully versus catastrophically.
Simulation: The current release covers 50 English-language scenarios in a single domain (airline); results may not generalize to other domains, languages, or accents. Also, the user simulator may not perfectly replicate real caller behavior (e.g., disfluencies, hesitations, emotions) or guarantee full policy adherence.
Framework: The user simulator relies on a single commercial provider whose voice characteristics may systematically favor certain ASR systems, and the bot-to-bot pipeline, including audio format conversions and real-time audio interfaces, may not fully represent production deployments. Also, full reproduction requires commercial API access, and latency measurements will vary across providers and infrastructure.
On the evaluation side, we plan to add prosodic quality assessment (pronunciation, rhythm, expressiveness), currently an open problem after we found very low alignment between LALM-as-Judge and human judgments. We also plan robustness testing under noisy conditions, diverse accents, multilingual users, and varied speaker behaviors, alongside affect-aware evaluation of how agents respond to user distress. On the data side, we are developing additional domain datasets, each with distinct policy structures, named-entity profiles, and conversational dynamics, as well as more complex scenarios involving compound requests, multi-step follow-ups, and longer conversational memory. On the tooling front, we will release a results and error analysis application that automatically identifies errors per metric and model, surfaces representative examples for exploration, and generates structured summaries of each model's strengths and weaknesses. Finally, we intend to expand the leaderboard continuously to provide an up-to-date assessment of voice agent capabilities across the field.
View more details about limitations and our upcoming roadmap here.
Core contributors include Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Hoang Nguyen, Raghav Mehndiratta, and Hari Subramani.
We also thank Lindsay Brin, Akshay Kalkunte, Joseph Marinier, Jishnu Nair, and Aman Tiwari for their careful data review and thoughtful contributions to the framework, and Fanny Riols, Anil Madamala, Sridhar Nemala, and Srinivas Sunkara for their management, leadership, and support throughout. We also extend our thanks to the PAVA and CLAE ServiceNow teams, whose prior work on evaluations and voice agents provided valuable inspiration for this project.
@misc{eva-2026,
title={A New End-to-end Framework for Evaluating Voice Agents (EVA)},
author={Bogavelli, Tara and Gauthier Melançon, Gabrielle and Stankiewicz, Katrina and Bamgbose, Oluwanifemi and Nguyen, Hoang and Mehndiratta, Raghav and Subramani, Hari},
year={2026},
url={https://github.com/ServiceNow/eva}
}