Conversational voice agents present a distinct evaluation challenge: they must simultaneously satisfy two objectives, accuracy (completing the user's task correctly and faithfully) and conversational experience (doing so naturally, concisely, and in a way appropriate for spoken interaction). These objectives are deeply intertwined: mishearing a confirmation code renders perfect LLM reasoning meaningless, a wall of options overwhelms a caller who can't skim spoken output, and delayed responses can pass every accuracy check while remaining unusable in practice. Existing frameworks treat these as separate concerns, evaluating task success or conversational dynamics, but not both.
We introduce EVA, an end-to-end evaluation framework for conversational voice agents that evaluates complete, multi-turn spoken conversations using a realistic bot-to-bot architecture. EVA produces two high-level scores, EVA-A (Accuracy) and EVA-X (Experience), and is designed to surface failures along each dimension. EVA is the first framework to jointly score task success and conversational experience. We release EVA with an initial airline dataset of 50 scenarios covering flight rebooking, cancellation handling, vouchers, and more (the first in a planned series of domains).
We also provide benchmark results for 20 cascade and audio-native systems, including speech-to-speech (S2S) models and Large Audio Language Models (LALMs). Our central finding is a consistent accuracy-experience tradeoff: agents that perform well on task completion tend to deliver worse user experiences, and vice versa.
The field currently lacks a framework that evaluates the full quality of voice agent interactions, as most existing efforts assess individual components in isolation. For example, AudioBench, SD-Eval, VoxEval, Kimi-Audio-Evalkit, VoiceBench, and VoxDialogue evaluate core Speech-to-Text (STT) understanding capabilities such as transcription, paralinguistics, and acoustic cues, but remain confined to single-turn, non-interactive settings. On the other hand, EmergentTTS-Eval and SHEET assess perceived speech quality using subjective listening tests (e.g., Mean Opinion Score). Beyond speech perception, FD-Bench, Talking Turns, and Full-Duplex-Bench provide deeper analyses of conversational dynamics (interruptions, backchanneling, turn-taking), yet evaluate these in isolation from task-oriented tool use, leaving the relationship between dialogue quality and agentic capability unexamined. More recent efforts, notably VoiceAgentBench and CAVA, take steps towards evaluating the agentic capabilities of commercial voice agent systems, including tool-calling and complex instruction-following. However, these voice-agentic capabilities are not evaluated within the complete conversational workflows that voice agents must navigate in practice: from initial user request through multi-step tool orchestration to final task resolution.
This gap underscores the need for an approach that treats voice agent quality as an integrated whole: evaluating not only whether the task succeeded, but whether the agent communicated accurately, concisely, and naturally throughout, and surfacing how these dimensions trade off against one another in realistic deployment conditions.
End-to-end evaluation reveals interaction dynamics that are not apparent at the component level: whether the agent interrupts users during natural pauses in speech, whether it recovers smoothly when a user corrects a transcription error, or whether high latency disrupts the conversational flow enough to prompt users to repeat themselves or abandon the task entirely.
EVA simulates multi-turn spoken conversations over live audio in which the agent must invoke appropriate tools, adhere to task-specific policies, and reach a deterministically verifiable end state. EVA evaluates voice agents using a bot-to-bot audio architecture composed of five core components:
User Simulator: A conversational AI configured with a specific goal and persona that plays the role of a caller. It operates in audio using high-quality Text-to-Speech (TTS) models, ensuring the evaluation captures representative speech-understanding challenges in natural-sounding conversational speech and realistic turn-taking dynamics.
Voice Agent: The voice agent being evaluated, built with Pipecat, an open-source Python framework for real-time voice applications. EVA supports both cascade architectures (STT → LLM → TTS) and audio-native models (S2S, or LALM → TTS).
Tool Executor: The engine that provides deterministic, reproducible tool responses via custom Python functions. It dynamically queries and modifies a predefined per-scenario database.
Validators: A set of validation metrics that check that conversations are complete and that the user simulator faithfully reproduced the intended behavior and speech, with no human annotation required. Any conversation that fails this validation step is regenerated, ensuring that only valid, correctly executed conversations enter evaluation. This stands in contrast to approaches that rely on post-hoc human labeling to identify simulator errors.
Metrics Suite: A suite of metrics evaluates the voice agent using the conversation recording, transcript, and tool call logs.
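To make the determinism of the tool layer concrete, here is a minimal sketch of how a per-scenario tool executor could work. The class, tool names, and database shape are illustrative assumptions, not EVA's actual API:

```python
import copy

class ToolExecutor:
    """Illustrative sketch (not EVA's real implementation): tools are plain
    Python functions over a per-scenario database snapshot, so every run of
    a scenario starts from identical state and yields identical responses."""

    def __init__(self, scenario_db: dict):
        # Deep-copy so repeated trials never see mutations from earlier runs.
        self.db = copy.deepcopy(scenario_db)
        self.call_log = []  # later inspected by the metrics suite

    def call(self, tool_name: str, **kwargs):
        # Dispatch to a named tool and record the call for evaluation.
        result = getattr(self, tool_name)(**kwargs)
        self.call_log.append({"tool": tool_name, "args": kwargs, "result": result})
        return result

    def lookup_booking(self, confirmation_code: str):
        return self.db["bookings"].get(confirmation_code, {"error": "not found"})

    def cancel_booking(self, confirmation_code: str):
        booking = self.db["bookings"].get(confirmation_code)
        if booking is None:
            return {"error": "not found"}
        booking["status"] = "cancelled"
        return {"status": "cancelled"}

# Usage: two executors built from the same scenario DB behave identically.
scenario_db = {"bookings": {"ABC123": {"flight": "SN101", "status": "confirmed"}}}
ex = ToolExecutor(scenario_db)
print(ex.call("cancel_booking", confirmation_code="ABC123"))  # {'status': 'cancelled'}
```

Keeping the tool layer deterministic is what lets a failed trial be attributed to the agent rather than to environment noise.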
Each test case (scenario) in our framework is an evaluation record, structured to make tests reproducible:
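As an illustration, a scenario record might look like the following (field names and values are hypothetical, not EVA's actual schema):

```python
# Hypothetical scenario record (illustrative fields, not EVA's real schema).
# Bundling the persona, goal, initial database state, available tools, and
# expected end state in one record is what makes a test replayable: every
# trial starts from the same snapshot and is checked against the same
# deterministic end state.
scenario = {
    "id": "airline-rebooking-017",
    "persona": "hurried business traveler who speaks quickly",
    "goal": "rebook a cancelled flight while keeping the paid seat selection",
    "initial_db": {"bookings": {"XK4P2Q": {"flight": "SN101", "seat": "14C"}}},
    "available_tools": ["lookup_booking", "rebook_flight", "transfer_ancillaries"],
    "expected_end_state": {"bookings": {"XK4P2Q": {"flight": "SN205", "seat": "14C"}}},
}
```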
We release EVA with a synthetic airline dataset of 50 scenarios and 15 tools, spanning IRROPS rebooking, voluntary itinerary changes, cancellations, same-day standby, and compensation vouchers. Scenarios are designed to test temporal reasoning, policy-following, constraint satisfaction, and named-entity handling.
See the full demo here.
EVA evaluates voice agents across two fundamental dimensions: EVA-A for accuracy and EVA-X for experience. EVA also includes a set of diagnostic metrics. Unlike the primary metrics, these are not used directly to compare or rank models; rather, they offer granular insight into why a model scores the way it does, helping identify and understand specific failure modes (e.g., ASR or speech synthesis errors). We report pass@k (the probability that at least one of k runs succeeds) and pass^k (the probability that all k runs succeed) across three trials per scenario (k = 3), capturing both peak performance and behavioral consistency.
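Given per-scenario lists of boolean trial outcomes, the two aggregates can be computed as follows (a minimal sketch; the function names are ours, not EVA's):

```python
# Illustrative pass@k / pass^k aggregation (not EVA's actual code). Each
# scenario contributes k boolean trial outcomes; pass@k asks whether any
# trial succeeded, pass^k whether all did, averaged across scenarios.
def pass_at_k(trials_per_scenario):
    """Fraction of scenarios where at least one trial succeeded."""
    return sum(any(t) for t in trials_per_scenario) / len(trials_per_scenario)

def pass_hat_k(trials_per_scenario):
    """Fraction of scenarios where every trial succeeded."""
    return sum(all(t) for t in trials_per_scenario) / len(trials_per_scenario)

# Three scenarios, k = 3 trials each:
results = [
    [True, True, True],     # consistent success
    [True, False, False],   # flaky: counts for pass@3 but not pass^3
    [False, False, False],  # consistent failure
]
print(pass_at_k(results))   # 2/3: some trial succeeded in two scenarios
print(pass_hat_k(results))  # 1/3: all trials succeeded in only one
```

The gap between the two numbers is exactly the "flakiness" that a single-run benchmark cannot see.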
EVA uses two evaluation methods: deterministic code-based metrics, which compute scores directly from structured data and run fast; and LLM-as-Judge metrics, which use Large Language Models (LLMs) to assess qualitative aspects of the conversation, or LALMs to evaluate the speech directly. Each judge-based metric uses the model that performs best on a curated evaluation dataset for that specific metric.
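The per-metric judge selection described above can be sketched as a small alignment search. All names here are hypothetical helpers for illustration; EVA's actual selection procedure is not shown in this post:

```python
# Hypothetical sketch of per-metric judge selection: score each candidate
# judge model against a small human-labeled dataset for that metric, and
# keep the best-aligned model. `judge_fn` stands in for whatever call
# actually queries a judge model (an assumption, not a real API).
def select_judge(candidates, labeled_examples, judge_fn):
    """Pick the candidate whose verdicts best match the human labels."""
    def alignment(model):
        hits = sum(judge_fn(model, ex["input"]) == ex["label"]
                   for ex in labeled_examples)
        return hits / len(labeled_examples)
    return max(candidates, key=alignment)

# Toy usage: "judge-a" says True to everything; "judge-b" checks input < 2.
candidates = ["judge-a", "judge-b"]
labeled = [{"input": 1, "label": True}, {"input": 2, "label": False}]
def toy_judge(model, x):
    return True if model == "judge-a" else x < 2
print(select_judge(candidates, labeled, toy_judge))  # judge-b
```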
Task completion alone is a necessary but insufficient measure of accuracy. An agent can reach the correct end state while fabricating a policy detail, misreading a confirmation code aloud, or hallucinating a flight number mid-conversation. These failures are invisible to a binary pass/fail check but directly harm users. EVA-A therefore measures three dimensions of accuracy:
Turn-taking timing matters, but it tells only part of the story. An agent can have perfect timing while overwhelming a caller with a wall of spoken options they cannot skim, or repeatedly asking for information already given. These failures degrade the experience without ever involving a mistimed response. EVA-X therefore measures three dimensions of experience:
We evaluated 20 systems, proprietary and open-source, cascade and audio-native, and find a consistent accuracy-experience tradeoff: agents that perform well on task completion tend to deliver worse user experiences, and vice versa. This tradeoff is invisible to benchmarks that score only task completion. No single configuration dominates both axes, confirming that accuracy and experience must be measured jointly.
Additionally, we identified named-entity transcription as a dominant failure mode: a single misheard character can cascade into an authentication failure and a full conversation breakdown. Multi-step workflows also break agents in predictable ways; rebooking a flight while preserving ancillary services (seats, baggage) is the dominant complexity breaker across all configurations. Finally, we observed that additional calibration is needed for real-world deployment: the gap between pass@3 and pass^3 is substantial across all configurations, meaning even agents that can complete a task often cannot do so consistently, which is critical for real-world success.
View the early results here.
EVA is designed to provide rigorous, end-to-end evaluation of conversational voice agents, but several limitations are important to acknowledge, across the framework, data, and metrics dimensions:
Metrics: LLM-as-judge models carry inherent biases and may favor certain response styles independent of quality, with additional risk of systematic bias when the evaluated and judge models share a provider. While we validate our judges against labeled datasets and report accuracy measurements on our website, these alignment scores do not eliminate systematic bias entirely. Additionally, task completion is measured as binary, which does not capture partial credit and may understate the relative quality of systems that fail gracefully versus catastrophically.
Simulation: The current release covers 50 English-language scenarios in a single domain (airline); results may not generalize to other domains, languages, or accents. Also, the user simulator may not perfectly replicate real caller behavior (e.g., disfluencies, hesitations, emotions) or guarantee full policy adherence.
Framework: The user simulator relies on a single commercial provider whose voice characteristics may systematically favor certain ASR systems, and the bot-to-bot pipeline, including audio format conversions and real-time audio interfaces, may not fully represent production deployments. Also, full reproduction requires commercial API access, and latency measurements will vary across providers and infrastructure.
On the evaluation side, we plan to add prosodic quality assessment (pronunciation, rhythm, expressiveness), currently an open problem after we found very low alignment between LALM-as-Judge and human judgments. We also plan robustness testing under noisy conditions, diverse accents, multilingual users, and varied speaker behaviors, alongside affect-aware evaluation of how agents respond to user distress. On the data side, we are developing additional domain datasets, each with distinct policy structures, named-entity profiles, and conversational dynamics, as well as more complex scenarios involving compound requests, multi-step follow-ups, and longer conversational memory. On the tooling front, we will release a results and error analysis application that automatically identifies errors per metric and model, surfaces representative examples for exploration, and generates structured summaries of each model's strengths and weaknesses. Finally, we intend to expand the leaderboard continuously to provide an up-to-date assessment of voice agent capabilities across the field.
View more details about limitations and our upcoming roadmap here.
Core contributors include Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Hoang Nguyen, Raghav Mehndiratta, and Hari Subramani.
We also thank Lindsay Brin, Akshay Kalkunte, Joseph Marinier, Jishnu Nair, and Aman Tiwari for their careful data review and thoughtful contributions to the framework, and Fanny Riols, Anil Madamala, Sridhar Nemala, and Srinivas Sunkara for their management, leadership, and support throughout. We also extend our thanks to the PAVA and CLAE ServiceNow teams, whose prior work on evaluations and voice agents provided valuable inspiration for this project.
@misc{eva-2026,
title={A New End-to-end Framework for Evaluating Voice Agents (EVA)},
author={Bogavelli, Tara and Gauthier Melançon, Gabrielle and Stankiewicz, Katrina and Bamgbose, Oluwanifemi and Nguyen, Hoang and Mehndiratta, Raghav and Subramani, Hari},
year={2026},
url={https://github.com/ServiceNow/eva}
}