How to measure voice agent accuracy at scale
"Our agent is 95% accurate."
Accurate at what?
That's the question most teams can't answer. They have a single accuracy number, usually from a small manual test, applied to the entire agent. It's like saying a car is "95% reliable" without specifying whether you mean the engine, brakes, steering, or transmission.
Voice agent accuracy is multi-dimensional. Each layer of the stack has its own accuracy metrics, its own failure modes, and its own benchmarks.
Measuring them separately is how you find and fix problems.
Measuring them together is how you stay in the dark.
I've seen teams spend weeks debugging a "low accuracy" problem. They couldn't find it because they were looking at one number. When they broke accuracy into ASR, NLU, response, and TTS layers, the problem appeared instantly: their speech recognition was mangling phone numbers 15% of the time.
Here's a practical framework for measuring accuracy across every layer of your voice agent stack, at scale, without relying on manual spot checks.
The accuracy stack: what to measure at each layer
ASR accuracy (speech recognition)
The ASR layer converts audio to text. If it gets this wrong, nothing downstream can fix it.
Word Error Rate (WER) is the standard metric: (substitutions + insertions + deletions) / total words in the reference transcript × 100. AssemblyAI's benchmarks show leading APIs hitting 3-4% WER on clean audio.
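As a sketch, WER is a word-level edit distance. Libraries like `jiwer` compute it in one call, but a minimal dynamic-programming version shows the formula directly:

```python
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words x 100."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref) * 100

# Two substitutions ("on" -> "in", "fifteenth" -> "fifteen") over 5 reference words.
print(wer("call me on march fifteenth", "call me in march fifteen"))  # 40.0
```

Run the same function over only the words inside critical entities and you get the entity-level WER discussed below.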
But clean audio isn't production reality. Test WER segmented by noise level, accent, and speaking speed.
A 4% WER on studio-quality audio means nothing if your callers are in cars and restaurants.
Production WER is almost always higher than test WER. Sometimes dramatically. I've measured production WER at 12-15% for callers in loud environments while the same provider scored 3% on benchmark audio.
Track production WER alongside test WER to know the real number.
One underrated technique: track WER on critical entities separately. Your overall WER might be 5%, but if WER on phone numbers is 15% and WER on appointment dates is 12%, those high-stakes errors are costing you more than the overall number suggests.
Some ASR providers offer word-level confidence scores. Use them.
A confidence score below 0.7 on a critical entity (name, date, account number) should trigger the agent to confirm with the caller: "I heard March 15th. Is that correct?"
Character Error Rate (CER) is useful for names, confirmation numbers, and other character-critical content. A single wrong digit in a phone number is a 100% failure for that interaction, even if overall WER looks fine.
Here's a practical approach: categorize your entities by risk level.
High-risk entities (account numbers, dates, dollar amounts) need 99%+ accuracy with mandatory confirmation. Medium-risk entities (names, addresses) need 95%+ with optional confirmation. Low-risk entities (general conversation) can tolerate 90%.
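One way to encode those tiers, sketched in Python. The 0.7 confidence threshold follows the example above; the entity-to-tier mapping is an assumption you'd tailor to your own domain:

```python
from enum import Enum

class Risk(Enum):
    HIGH = "high"      # account numbers, dates, dollar amounts: 99%+, always confirm
    MEDIUM = "medium"  # names, addresses: 95%+, confirm when ASR is unsure
    LOW = "low"        # general conversation: 90%+ tolerable, never confirm

# Hypothetical mapping from entity type to risk tier.
ENTITY_RISK = {
    "account_number": Risk.HIGH,
    "date": Risk.HIGH,
    "amount": Risk.HIGH,
    "name": Risk.MEDIUM,
    "address": Risk.MEDIUM,
}

def should_confirm(entity_type: str, asr_confidence: float) -> bool:
    """Decide whether the agent should read the entity back to the caller."""
    risk = ENTITY_RISK.get(entity_type, Risk.LOW)
    if risk is Risk.HIGH:
        return True                  # mandatory confirmation
    if risk is Risk.MEDIUM:
        return asr_confidence < 0.7  # optional: only on low ASR confidence
    return False

print(should_confirm("date", 0.95))  # True: high risk, always confirm
print(should_confirm("name", 0.65))  # True: medium risk, low confidence
```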
NLU accuracy (intent understanding)
The NLU layer determines what the caller wants. Getting this wrong sends the entire conversation down the wrong path.
Intent classification accuracy measures whether the agent correctly identified the caller's purpose. If a caller says "I want to cancel my appointment" and the agent classifies this as "schedule appointment," the conversation is already broken.
Measure precision and recall separately. High precision means when the agent thinks it detected an intent, it's usually right.
High recall means it catches most of the relevant intents. You need both.
A precision of 95% with recall of 60% means the agent is confident when it classifies, but misses 40% of relevant intents entirely. Those missed intents become confused callers who escalate to human agents.
Track intent confusion rates with a confusion matrix. Which intents does your agent mix up most often?
"Cancel appointment" and "reschedule appointment" are commonly confused. So are "check balance" and "make payment." Knowing the specific confusions tells you exactly where to improve.
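A minimal sketch of a confusion matrix, with per-intent precision and recall read off the same counts. The intents and labeled pairs are illustrative:

```python
from collections import Counter

# (human label, agent prediction) pairs from a labeled evaluation set.
labeled = [
    ("cancel_appointment", "cancel_appointment"),
    ("cancel_appointment", "reschedule_appointment"),  # common confusion
    ("reschedule_appointment", "reschedule_appointment"),
    ("check_balance", "make_payment"),                 # common confusion
    ("check_balance", "check_balance"),
]

confusion = Counter(labeled)  # keys are (truth, predicted) cells of the matrix

def precision(intent: str) -> float:
    """Of the times the agent predicted this intent, how often was it right?"""
    predicted = sum(n for (t, p), n in confusion.items() if p == intent)
    return confusion[(intent, intent)] / predicted if predicted else 0.0

def recall(intent: str) -> float:
    """Of the callers who actually had this intent, how many did it catch?"""
    actual = sum(n for (t, p), n in confusion.items() if t == intent)
    return confusion[(intent, intent)] / actual if actual else 0.0

print(precision("reschedule_appointment"))  # 0.5: half its predictions were wrong
print(recall("cancel_appointment"))         # 0.5: it missed half the real cancels
```

At real scale you'd use scikit-learn's `confusion_matrix`, but the counting logic is the same.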
Entity extraction accuracy measures whether the agent pulled the right details: dates, names, phone numbers, account numbers.
Test this with tricky inputs. "Next Thursday" when today is Thursday. "The 15th" without specifying the month.
"Smith, with a Y." These ambiguous inputs break agents that work perfectly on clean, unambiguous data.
Ambiguous dates are the single biggest entity extraction failure I see. "Next Friday" means different things depending on when you call.
"The end of the month" requires the agent to know the current date. Test these edge cases specifically.
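A test harness for these cases only needs a fixed reference date. This toy resolver (its phrase coverage is a deliberate simplification; a production agent needs far more) shows why "next Thursday" asked on a Thursday is the case worth pinning down:

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def resolve(phrase: str, today: date) -> date:
    """Toy resolver for a few relative-date phrases."""
    phrase = phrase.lower()
    if phrase.startswith("next "):
        target = WEEKDAYS.index(phrase.split()[1])
        # "next Thursday" asked on a Thursday means +7 days, not today.
        days_ahead = (target - today.weekday()) % 7 or 7
        return today + timedelta(days=days_ahead)
    if phrase == "the end of the month":
        first_of_next = date(today.year + (today.month == 12),
                             today.month % 12 + 1, 1)
        return first_of_next - timedelta(days=1)
    raise ValueError(f"unhandled phrase: {phrase}")

# 2026-03-12 is itself a Thursday: the answer depends on when you call.
print(resolve("next thursday", date(2026, 3, 12)))        # 2026-03-19
print(resolve("the end of the month", date(2026, 2, 3)))  # 2026-02-28
```

Pin each ambiguous phrase to several reference dates in your test suite, so a regression in date handling fails loudly instead of booking the wrong day.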
Response accuracy (LLM output)
The LLM generates the agent's actual response. This is where hallucinations, incorrect information, and off-topic answers live.
Factual correctness checks whether the agent's statements are true and verifiable against your knowledge base. An agent that says "your balance is $1,250" when it's actually $1,520 has a factual accuracy problem.
These errors are especially dangerous in voice because callers trust spoken information more than text. When a human-sounding voice states a number confidently, the caller writes it down. If that number is wrong, you've created a worse outcome than just failing the call.
Hallucination detection flags responses that contain fabricated information. I recommend running hallucination checks on every response, not just a sample.
Tool call parameter accuracy checks whether the agent passed the right values to external APIs. The agent might understand "book Tuesday at 3pm" correctly but pass "2026-03-10" instead of "2026-03-11" to the booking API.
This is one of the sneakiest failure modes. Everything in the conversation sounds right. The caller hangs up satisfied.
But the booking is for the wrong day. You won't catch this from conversation transcripts alone. You need to compare tool call parameters against ground truth.
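The ground-truth comparison itself can be a few lines. This sketch assumes you log tool calls as dicts and hold expected values from a labeled test call; the field names are illustrative:

```python
def verify_tool_call(logged: dict, expected: dict) -> list[str]:
    """Return a list of mismatched parameters; empty means the call was correct."""
    errors = []
    for key, want in expected.items():
        got = logged.get(key)
        if got != want:
            errors.append(f"{key}: expected {want!r}, got {got!r}")
    return errors

# The caller asked to book Tuesday at 3pm; the agent passed the wrong day.
logged_call  = {"endpoint": "book_appointment", "date": "2026-03-10", "time": "15:00"}
ground_truth = {"endpoint": "book_appointment", "date": "2026-03-11", "time": "15:00"}

print(verify_tool_call(logged_call, ground_truth))  # flags the date mismatch
```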
TTS accuracy (speech output)
The TTS layer converts the agent's text response into audio. Accuracy here means pronunciation correctness and natural prosody: the rhythm, stress, and intonation that keep even a correct response from sounding monotone and robotic.
Pronunciation errors on names, addresses, and technical terms erode caller trust immediately. If the agent mispronounces the caller's name, the rest of the conversation starts from a credibility deficit.
Mean Opinion Score (MOS) is the standard measure for voice quality. Production voice agents should target 4.0/5 or higher. Below 3.5, callers start perceiving the voice as robotic or unpleasant.
Test pronunciation on your specific vocabulary. Medical terms, street names, product names, and foreign names all have common TTS failure patterns. Build a pronunciation test list from your most frequently used terms and check it after every TTS provider update.
Deterministic vs LLM-based evaluation
Not every accuracy measurement needs the same approach. Knowing when to use each method saves time and improves reliability.
When to use deterministic checks
Use deterministic (exact-match or rule-based) evaluation when the correct answer is unambiguous.
Tool call verification: did the agent call the right API with the right parameters? This is a binary check. Either the date is correct or it isn't.
Entity extraction: did the agent extract "March 15" from the audio? Compare the extracted value against the ground truth.
Compliance script adherence: did the agent say the required disclosure language? Pattern matching works here.
Deterministic checks are fast, cheap, and 100% reliable. Use them for everything that has a clear right answer.
Most teams underuse deterministic evaluation. About 60-70% of accuracy measurements can be done deterministically if you structure your agent's outputs correctly.
Format tool call parameters as structured JSON. Require the agent to return specific confirmation codes.
These constraints make deterministic evaluation possible and save your LLM evaluation budget for the subjective dimensions that actually need it.
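Sketches of all three deterministic checks. The disclosure pattern, entity values, and parameter schema are illustrative stand-ins for your own:

```python
import json
import re

def check_compliance(transcript: str) -> bool:
    """Pattern match for required disclosure language."""
    return re.search(r"this call may be recorded", transcript, re.IGNORECASE) is not None

def check_entity(extracted: str, ground_truth: str) -> bool:
    """Exact-match entity check: either the value is right or it isn't."""
    return extracted.strip().lower() == ground_truth.strip().lower()

def check_tool_params(raw_output: str, expected: dict) -> bool:
    """Requiring structured JSON output turns this into a simple equality check."""
    try:
        return json.loads(raw_output) == expected
    except json.JSONDecodeError:
        return False  # malformed output is itself a failure

print(check_compliance("Hi! This call may be recorded for quality purposes."))  # True
print(check_entity("March 15", "march 15"))                                     # True
print(check_tool_params('{"date": "2026-03-15"}', {"date": "2026-03-15"}))      # True
```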
When to use LLM-as-judge
Use LLM-based evaluation when the correct answer is subjective or when there are many valid phrasings.
Response quality: was the agent's answer helpful, accurate, and complete? There's no single correct response, so you need a model that can evaluate semantic quality.
Conversation naturalness: did the conversation flow smoothly? Was the tone appropriate? These are judgment calls.
Bluejay uses configurable LLM scorers that assess intent accuracy, repetition, recovery behavior, and compliance in real time.
Building reliable LLM evaluators takes calibration. Your first prompt will probably need 3-5 revisions before it catches the right behaviors consistently.
Calibrate against a set of 50+ human-labeled examples. Track evaluator agreement with human judgments and retune when it drifts below 85% agreement.
One technique that works well: use a "golden set" of 100 conversations with human-verified scores. Run your LLM evaluator on this set weekly.
If agreement drops below 85%, the evaluator needs retuning. This takes about 2 hours per month but prevents your quality measurements from silently degrading.
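The agreement check itself is simple once you have paired labels. A sketch, assuming pass/fail labels on the golden set:

```python
def agreement_rate(human_scores: list[str], llm_scores: list[str]) -> float:
    """Fraction of golden-set conversations where the LLM evaluator matches the human label."""
    assert len(human_scores) == len(llm_scores)
    matches = sum(h == m for h, m in zip(human_scores, llm_scores))
    return matches / len(human_scores)

# Illustrative labels for a small golden set; yours would have 100 entries.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
llm   = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]

rate = agreement_rate(human, llm)
print(rate)  # 0.8: below the 0.85 threshold, so this evaluator needs retuning
```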
Measuring accuracy at scale (not just spot checks)
Manual accuracy reviews work for 50 conversations. They don't work for 50,000. You need systems that evaluate every interaction automatically.
Sampling strategies
Evaluating every conversation is the ideal. When you can't, sample strategically.
Random sampling gives you an unbiased baseline. Pull 1-5% of conversations and run full accuracy evaluation. This tells you your average accuracy.
Stratified sampling ensures coverage across intent types, caller demographics, and difficulty levels. Random sampling might oversample your easy cases and miss the hard ones.
Targeted sampling focuses on high-risk segments. Evaluate 100% of escalated calls, 100% of calls with detected errors, and 100% of calls in regulated categories. These are where accuracy problems matter most.
Combine all three approaches. Random sampling gives you the big picture. Stratified sampling ensures coverage.
Targeted sampling catches the failures that matter most.
Together, they give you a complete accuracy picture without evaluating every single conversation.
For statistical confidence, you need roughly 384 samples per segment to achieve 95% confidence with a 5% margin of error.
Plan your sampling accordingly. If you have 20 intent categories and want per-intent accuracy numbers, that's 384 x 20 = 7,680 evaluated conversations.
At $0.02 per LLM evaluation, that's about $150 per measurement cycle. Budget for this as a recurring cost.
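The 384 figure falls out of Cochran's sample-size formula, n = z²·p(1−p)/e², with z = 1.96 for 95% confidence, p = 0.5 for worst-case variance, and e = 0.05 for the margin of error:

```python
def sample_size(z: float = 1.96, p: float = 0.5, e: float = 0.05) -> float:
    """Cochran's formula: n = z^2 * p * (1 - p) / e^2."""
    return z * z * p * (1 - p) / (e * e)

n = round(sample_size())           # 384.16 -> the commonly quoted 384
print(n)                           # 384
print(n * 20)                      # 7680 evaluations for 20 per-intent segments
print(round(n * 20 * 0.02, 2))     # 153.6 dollars per cycle at $0.02 each
```

Tighten the margin of error and the cost grows fast: halving e to 2.5% roughly quadruples n.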
Automated evaluation pipelines
The goal is evaluating every conversation, automatically, within minutes of it happening.
Build a pipeline that triggers after each conversation ends. Extract the transcript, tool call logs, and outcomes.
Run deterministic checks first (they're fast and free). Then run LLM-based evaluations on the conversations that pass deterministic checks.
Store results in a dashboard that tracks accuracy trends over time. LangWatch and similar platforms simulate and evaluate thousands of conversations automatically, surfacing regressions before they compound.
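A minimal sketch of that ordering, with the LLM stage stubbed out. The `Conversation` shape and check names are assumptions, not any particular platform's API:

```python
from dataclasses import dataclass

@dataclass
class Conversation:
    transcript: str
    tool_calls: list
    expected_tool_calls: list

def deterministic_checks(conv: Conversation) -> list[str]:
    """Fast, free checks run first: tool call parameters against ground truth."""
    failures = []
    if conv.tool_calls != conv.expected_tool_calls:
        failures.append("tool_call_mismatch")
    return failures

def llm_checks(conv: Conversation) -> list[str]:
    """Placeholder for LLM-as-judge scoring (hallucination, helpfulness, tone)."""
    return []  # would call your evaluation model here

def evaluate(conv: Conversation) -> dict:
    failures = deterministic_checks(conv)
    if not failures:          # spend LLM budget only on conversations that pass
        failures += llm_checks(conv)
    return {"passed": not failures, "failures": failures}

conv = Conversation("...", [{"date": "2026-03-10"}], [{"date": "2026-03-11"}])
print(evaluate(conv))  # {'passed': False, 'failures': ['tool_call_mismatch']}
```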
Set up alerts when accuracy drops below thresholds. A 2% drop in intent classification accuracy might not seem like much.
Over 10,000 daily conversations, that's 200 more failed interactions per day.
Those 200 failed interactions mean 200 frustrated callers, some percentage of whom will churn. At a $500 average customer lifetime value, even a 5% churn rate from those failures costs $5,000 per day. Accuracy measurements pay for themselves quickly when you frame them in business terms.
Frequently asked questions
What accuracy rate should I target?
Task success rate: above 85% for production agents. Hallucination rate: below 2% for general use, 0% for regulated industries. WER: below 5% on clean audio, and track your production WER separately.
These are starting targets. As your agent matures, ratchet them tighter.
How do I handle subjective accuracy?
Use multiple LLM evaluators and calibrate them against human judgment. Run 3 independent LLM evaluators on the same conversation and use majority vote for the final score.
Periodically audit evaluator performance against human ratings. If evaluators drift, retune them. Budget 2-4 hours monthly for calibration maintenance.
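The majority vote itself is a few lines, assuming each evaluator emits a pass/fail label:

```python
from collections import Counter

def majority_vote(scores: list[str]) -> str:
    """Final label = most common score across independent LLM evaluators."""
    return Counter(scores).most_common(1)[0][0]

print(majority_vote(["pass", "pass", "fail"]))  # pass
print(majority_vote(["fail", "pass", "fail"]))  # fail
```

With three evaluators and binary labels there is never a tie, which is one reason to use an odd number.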
Accuracy is a stack, not a number
Stop reporting a single accuracy number. Measure each layer independently: ASR accuracy, NLU accuracy, response accuracy, TTS accuracy. Use deterministic evaluation where possible and LLM-based evaluation where necessary.
Automate everything. A single manual review per week tells you nothing about the 10,000 conversations you didn't review.
Bluejay measures accuracy across every layer of the voice agent stack, automatically, on every conversation. See where your accuracy gaps are hiding and fix them before they reach your callers.
