
Hugging Face Blog · 24 Mar

A New Framework for Evaluating Voice Agents (EVA)

research · agents

ServiceNow-AI has introduced EVA, the first end-to-end evaluation framework designed to jointly assess conversational voice agents on both task accuracy and user experience quality.

Voice agents face a dual challenge: completing tasks correctly while maintaining natural, concise spoken interactions appropriate for audio-based communication.

Existing evaluation frameworks typically address these concerns in isolation—testing either speech understanding capabilities or conversational dynamics, but rarely both together in realistic workflows.

EVA addresses this gap through a bot-to-bot audio architecture that simulates multi-turn conversations where agents must invoke tools, follow policies, and reach verifiable end states.

The framework's core components include a User Simulator with specific goals and personas, the Voice Agent under evaluation, a Tool Executor that returns deterministic responses, and Validators that verify conversation completeness without human annotation.
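The bot-to-bot loop these components form can be sketched as follows. This is a minimal illustration, not EVA's actual API: every class, method, and validator name here is a hypothetical stand-in assumed for the example.

```python
# Hypothetical sketch of a bot-to-bot evaluation loop in the spirit of EVA.
# None of these names come from the EVA codebase.

class UserSimulator:
    """Plays a user with a fixed goal and persona."""
    def __init__(self, goal, persona):
        self.goal, self.persona = goal, persona
        self.turns = 0

    def respond(self, agent_utterance):
        # A real simulator would generate speech conditioned on goal + persona;
        # here we use a trivial text rule to keep the sketch runnable.
        self.turns += 1
        if "confirmed" in agent_utterance:
            return None  # goal reached, end the conversation
        return f"(turn {self.turns}) I still need to: {self.goal}"

class ToolExecutor:
    """Returns deterministic responses so evaluation runs are reproducible."""
    def call(self, tool_name, args):
        return {"tool": tool_name, "args": args, "status": "ok"}

def evaluate(agent, simulator, tools, validators, max_turns=10):
    """Run a multi-turn conversation, then score the end state."""
    transcript = []
    user_msg = simulator.respond("")
    for _ in range(max_turns):
        if user_msg is None:
            break
        agent_msg = agent(user_msg, tools)
        transcript.append((user_msg, agent_msg))
        user_msg = simulator.respond(agent_msg)
    # Validators check the transcript programmatically -- no human annotation.
    return {v.__name__: v(transcript) for v in validators}

# Toy usage: an agent that books via a tool, and a completion validator.
def toy_agent(user_msg, tools):
    tools.call("book_flight", {"dest": "SFO"})
    return "Your booking is confirmed."

def task_completed(transcript):
    return any("confirmed" in agent_msg for _, agent_msg in transcript)

result = evaluate(toy_agent, UserSimulator("book a flight", "busy traveler"),
                  ToolExecutor(), [task_completed])
# result == {"task_completed": True}
```

The deterministic Tool Executor is what makes end states verifiable: since tool responses never vary, a validator can check the final conversation state against a known-correct outcome.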

ServiceNow-AI released EVA with an airline dataset of 50 scenarios and benchmarked 20 systems, including cascade architectures (STT → LLM → TTS) and audio-native models like speech-to-speech systems and Large Audio Language Models.
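The cascade architecture mentioned above chains three separate models per turn. The sketch below shows only the data flow; the component functions are toy stand-ins assumed for illustration (a real cascade would wrap an STT model, an instruction-tuned LLM, and a TTS engine).

```python
# Hedged sketch of one turn through a cascade voice agent (STT -> LLM -> TTS).
# The three callables are hypothetical stand-ins, not real model APIs.

def cascade_turn(audio_in, stt, llm, tts):
    """Run one conversational turn through the cascade."""
    text_in = stt(audio_in)    # speech -> text
    text_out = llm(text_in)    # text -> text (reasoning, tool use)
    return tts(text_out)       # text -> speech

# Toy stand-ins to show the flow:
reply = cascade_turn(
    b"\x00fake-audio-bytes",
    stt=lambda audio: "cancel my flight",
    llm=lambda text: f"Okay, handling: {text}",
    tts=lambda text: f"<audio:{text}>",
)
# reply == "<audio:Okay, handling: cancel my flight>"
```

Audio-native systems collapse these stages into a single model that consumes and produces audio directly, which is why the two families can diverge on conversational qualities like latency and prosody even when task accuracy is similar.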

The most significant finding is a consistent Accuracy-Experience tradeoff: agents that perform well on task completion tend to deliver worse conversational experiences, highlighting a fundamental challenge in voice agent development.