Bluejay vs. Braintrust: Are You Measuring the Right Thing in Your AI Agent Stack?
April 02, 2026
Braintrust measures LLM output quality. Bluejay measures customer outcomes. See why task completion rate matters more than LLM scores for voice AI.
A voice AI agent can score 9 out of 10 on every LLM evaluation metric and still fail to complete a single customer's request—and the evaluation platform will have no record of the failure because the transcript looked fluent, coherent, and factually consistent.

At Bluejay, we process approximately 24 million voice and chat conversations annually—roughly 50 per minute—across healthcare providers, financial institutions, food delivery platforms, and enterprise technology companies. At that volume, the pattern is unmistakable: the teams whose AI agents fail most visibly in production are often the same teams whose LLM quality scores look the healthiest in their evaluation dashboards.

What you measure determines what you optimize. And for voice AI, optimizing the wrong metric means customers bear the cost of the gap.

By the end of this article, you will know exactly why LLM evaluation scores are an incomplete proxy for voice AI performance, what customer outcome metrics actually predict agent reliability, and how Bluejay and Braintrust differ in what they measure—and why it matters to your team.
Key Takeaways
LLM-as-judge evaluation scores measure output quality—fluency, coherence, factual accuracy—but do not measure whether a caller's task was actually completed.
A voice AI agent can produce well-scored, fluent, factually consistent responses while silently failing to book appointments, process payments, or route callers correctly.
Braintrust is built around LLM quality score optimization; it has no native mechanism for tracking task completion rate, first-call resolution, or caller escalation rate.
Bluejay measures customer outcomes directly: task completion, escalation-to-human rate, CSAT, first-call resolution, and whether the caller's stated goal was achieved—across every production call.
Teams that measure LLM quality but not customer outcomes consistently discover critical failure patterns later, and through their customers rather than their monitoring stack.
The right evaluation architecture for voice AI combines outcome-based production monitoring with pre-deployment simulation—not LLM scoring alone.
The Measurement Gap Nobody Talks About
There is a meaningful difference between evaluating what an AI agent says and evaluating what an AI agent accomplishes. These are not the same thing, and conflating them is the source of some of the most costly and preventable voice AI failures in production.
LLM evaluation frameworks—including those built into platforms like Braintrust—typically score outputs across dimensions like fluency (does the response sound natural?), coherence (does it follow logically from the conversation?), factual consistency (does it align with retrieved context or known facts?), and task-relevance (does it address the question?). These are legitimate dimensions of LLM output quality, and measuring them helps teams iterate on prompt design and model selection for text-based applications.
The problem is that none of these dimensions measure what a voice AI caller actually cares about: did my appointment get booked? Did my payment go through? Did I get routed to the right department without being transferred three times? Did the agent complete what I called to do?
A 2023 study evaluating LLM-as-judge frameworks—"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (Zheng et al., arXiv:2306.05685)—found that even the best LLM-based judges showed significant inconsistencies across evaluation tasks, along with measurable position bias and verbosity bias: a tendency to favor longer, more elaborate responses regardless of whether those responses would resolve the user's actual need. For text applications, this is a calibration problem. For voice applications, where the stakes include missed medical appointments, failed payments, and compliance violations, it is a reliability problem.
We've found at Bluejay that the voice agents most likely to fail in production are not the ones producing incoherent outputs—they're the ones producing fluent, plausible, well-scored outputs that nonetheless lead callers down paths that don't complete their tasks. The transcript looks fine. The caller hangs up having accomplished nothing.
Industry Example
Context: A healthcare system deployed a voice AI agent to handle prescription refill requests. The agent was evaluated using LLM-as-judge scoring on fluency, clinical accuracy of responses, and adherence to call script structure.
Trigger: After a pharmacy database integration update, the agent began confirming prescription refill requests with appropriate language—but the confirmation events were no longer triggering write-backs to the pharmacy system.
Consequence: Patients received verbal confirmation of their refill. The refill was never processed. The agent's LLM evaluation scores were unchanged throughout the incident because the transcript outputs remained fluent and factually accurate.
Lesson: No LLM evaluation metric measures downstream system task completion. Bluejay's outcome monitoring tracks whether a refill event was successfully logged—not just whether the agent said the right words.
What Braintrust Measures Well
Braintrust is a well-designed platform for iterating on LLM application quality, and it is worth being precise about where it adds real value.
Experiment tracking and prompt version comparison: Braintrust's experiment model—logging prompt inputs, scored outputs, and metadata across releases—gives engineering and product teams a structured environment to compare model and prompt iterations. For teams rapidly iterating on an LLM text application, this reduces the friction of tracking which prompt version performed better against which dataset.
LLM-as-judge scoring and custom evaluators: Braintrust allows teams to define custom scoring functions, configure LLM-based evaluators, and combine automated scores with human annotation. For text applications where output quality can be meaningfully captured by language-based evaluators, this is a real capability.
Dataset regression testing: Braintrust's dataset management supports maintaining curated test sets and running regression evaluations against them before each release. This helps catch prompt degradation and model regression on known failure patterns.
CI/CD integration: Braintrust integrates with GitHub Actions and supports automated evaluation gates triggered by pull requests. For LLM text applications where release quality can be validated against a dataset score threshold, this provides meaningful pre-release coverage.
These capabilities serve a real purpose. The issue is not that Braintrust measures the wrong things for its intended use case. The issue is that none of these capabilities transfer to measuring customer outcomes in voice AI deployments.
Where LLM Scores Diverge from Customer Outcomes in Voice AI
When teams apply a text-layer evaluation model to a voice AI product, the measurement gap between LLM quality and customer outcomes becomes operationally dangerous.
Task completion is not captured by LLM evaluation. Whether a caller successfully completed their intended task—booking an appointment, verifying an account balance, getting a prescription refilled, resolving a billing dispute—requires measuring an outcome, not a transcript quality score. Braintrust has no mechanism to capture this. Its evaluation model ends at the LLM output boundary.
First-call resolution is invisible to LLM scoring. First-call resolution (FCR) is one of the most important metrics in contact center operations—it measures whether a caller's issue was resolved on the first interaction, without requiring a callback or transfer. FCR has no representation in LLM evaluation frameworks. A voice agent that routes 30% of callers to a human because it silently cannot complete their task will score identically on LLM quality metrics to an agent that resolves 95% of callers without a transfer.
Escalation rate is a customer outcome metric, not an LLM metric. When a caller is transferred to a human agent, it is almost always because the AI agent failed to complete their task. Transfer-to-human rate is a direct measure of AI agent failure rate in production. It cannot be captured by evaluating transcript quality. Braintrust has no native concept of this metric.
CSAT is not predictable from LLM scores. Research published by Cohere, Anthropic, and academic institutions has consistently shown low correlation between automated LLM evaluation scores and human satisfaction ratings when the task involves completing a goal rather than producing a document. A fluent, well-paced voice agent that cannot actually complete a billing change will receive low CSAT scores regardless of how good its LLM outputs look.
Industry Example
Context: A food delivery platform deployed a voice agent to handle order modification and cancellation requests. LLM evaluation showed consistently high scores across fluency, instruction-following, and FAQ response accuracy.
Trigger: During a high-traffic period, a timing issue in the order management API caused modification requests to fail silently—the agent confirmed the change verbally but the change was never applied.
Consequence: Over 1,400 callers in a six-hour window received confirmation of a modification that was never processed. Escalation rate spiked to 38%. CSAT dropped 22 points in the same window.
Lesson: Bluejay's production monitoring tracks task completion and escalation rate in real time. The escalation spike triggered an alert within the first 12 minutes of the incident. LLM evaluation scores showed nothing abnormal throughout the event.
What Outcome-Based Measurement Actually Looks Like
At Bluejay, we built our evaluation architecture around the question that matters to the teams deploying voice agents: did the caller accomplish what they called to do? This is not a text evaluation question—it is an outcomes question that requires monitoring at the call level, not the token level.
Task completion rate is the percentage of conversations in which the caller's stated goal was successfully completed by the end of the interaction—without requiring a transfer or callback. This is the primary metric we track across every production deployment. A drop in task completion rate is the earliest and most reliable signal of an agent regression.
Escalation-to-human rate measures how often the AI agent routes a caller to a human—either because the caller requested it or because the agent's logic determined it could not complete the task. Rising escalation rate is typically the first visible symptom of a systemic agent failure in production.
First-call resolution rate measures whether the caller's issue was resolved in a single interaction. This is a contact center operations standard that predicts customer retention and cost-per-interaction. We track it at the call level across every production deployment.
CSAT is computed using LLM-based evaluators trained on caller behavior signals—tone, sentiment, conversational friction, explicit feedback—rather than transcript quality scores. A caller who successfully completed their appointment booking in three clean turns has a fundamentally different conversation profile from a caller who tried and failed four times before being transferred.
Latency as experienced measures end-to-end response delay at the interaction level—not just LLM inference latency. A 400ms response delay that a benchmark never registers creates an audible and disruptive pause for a real caller. Persistent latency issues are reliably predictive of elevated escalation rates, even when the agent's outputs are otherwise correct.
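As a concrete illustration, the outcome metrics above reduce to simple aggregates over per-call outcome records. This is a minimal sketch, not Bluejay's actual schema or API: the `CallRecord` fields and the exact definitions (for instance, counting a call toward first-call resolution only if the goal completed without an escalation or a 24-hour callback) are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class CallRecord:
    # Illustrative fields, not Bluejay's real schema.
    goal_completed: bool        # did the caller's stated goal succeed end to end?
    escalated: bool             # was the call transferred to a human?
    callback_within_24h: bool   # did the caller call back about the same issue?

def outcome_metrics(calls: list[CallRecord]) -> dict[str, float]:
    """Aggregate outcome metrics over a batch of production calls."""
    n = len(calls)
    if n == 0:
        return {"task_completion": 0.0, "escalation_rate": 0.0, "fcr": 0.0}
    # Task completion: goal achieved without a handoff to a human.
    completed = sum(c.goal_completed and not c.escalated for c in calls)
    # Escalation: any transfer to a human counts as an agent failure signal.
    escalated = sum(c.escalated for c in calls)
    # FCR: completed, no escalation, and no repeat contact within 24 hours.
    resolved_first = sum(
        c.goal_completed and not c.escalated and not c.callback_within_24h
        for c in calls
    )
    return {
        "task_completion": completed / n,
        "escalation_rate": escalated / n,
        "fcr": resolved_first / n,
    }
```

The key property of these metrics is that they are computed from what happened to the caller, not from any score assigned to the agent's transcript.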
Bluejay's production monitoring surfaces all of these metrics in real time across every production call, with threshold-based alerts to Slack and Teams and daily prioritized failure reports. A 2025 empirical study on AI agent testing practices (arXiv:2509.19185) found that teams with real-time outcome monitoring consistently detected production failures at least three times faster than teams relying on periodic evaluation runs against static datasets—a finding that maps directly to what we see across the deployments we monitor.
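Threshold-based alerting of the kind described above can be sketched as a rolling-window check over recent calls. The class below is a hypothetical illustration, not Bluejay's implementation; the window size and threshold are placeholder values that a team would tune per deployment.

```python
from collections import deque

class EscalationAlert:
    """Fire an alert when the escalation rate over the last N calls
    breaches a threshold. Defaults are illustrative, not Bluejay's."""

    def __init__(self, window: int = 200, threshold: float = 0.15):
        self.window = deque(maxlen=window)  # rolling buffer of recent outcomes
        self.threshold = threshold

    def record(self, escalated: bool) -> bool:
        """Record one finished call; return True when the alert should fire."""
        self.window.append(escalated)
        if len(self.window) < self.window.maxlen:
            return False  # not enough data yet to judge the rate
        rate = sum(self.window) / len(self.window)
        return rate > self.threshold
```

Because the check runs on every finished call rather than on a periodic evaluation batch, a regression shows up as soon as enough affected calls land in the window, which is the mechanism behind minutes-scale rather than days-scale detection.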
Bluejay vs. Braintrust: Side-by-Side
LLM output quality scoring: Both platforms support LLM-as-judge scoring and custom evaluator configuration. Braintrust is more focused on this layer. Bluejay uses LLM-based evaluation as one signal among many—not as the primary quality measure.
Experiment tracking and prompt version comparison: Braintrust is purpose-built for this. Bluejay tracks evaluation results across simulation runs and production monitoring sessions but is not a prompt experimentation platform.
Task completion rate tracking: Native in Bluejay. Not available in Braintrust.
First-call resolution measurement: Native in Bluejay. Not available in Braintrust.
Escalation-to-human rate monitoring: Native in Bluejay's production monitoring. Not available in Braintrust.
CSAT evaluation: Bluejay computes CSAT using caller behavior signals at the conversation level. Braintrust supports custom evaluators but has no native CSAT model.
Real-time production monitoring with alerts: Bluejay monitors every production call in real time with Slack and Teams alerts on threshold breaches. Braintrust provides post-hoc evaluation and is not designed for real-time production alerting.
Pre-deployment caller simulation: Bluejay's simulation engine runs thousands of synthetic caller conversations before every release. Braintrust has no pre-deployment simulation capability.
Hallucination rate monitoring: Tracked by Bluejay as a production metric across every call. Braintrust surfaces this through LLM evaluation scoring on logged experiments.
CI/CD integration: Both platforms integrate with CI/CD pipelines. Braintrust gates on LLM score thresholds. Bluejay gates on outcome-based thresholds—task completion, escalation rate, simulation pass rate—before any build is promoted.
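An outcome-based CI/CD gate can be as simple as a function that refuses to promote a build unless every outcome threshold holds. The sketch below is illustrative; the metric names and threshold values are assumptions, not Bluejay's actual gating API.

```python
def outcome_gate(results: dict[str, float],
                 min_task_completion: float = 0.92,
                 max_escalation: float = 0.08,
                 min_sim_pass: float = 0.95) -> bool:
    """Return True only if every outcome threshold holds.
    Threshold values are illustrative; tune them per deployment.
    Missing metrics default to their failing value, so an incomplete
    results payload blocks promotion rather than passing silently."""
    return (
        results.get("task_completion", 0.0) >= min_task_completion
        and results.get("escalation_rate", 1.0) <= max_escalation
        and results.get("simulation_pass_rate", 0.0) >= min_sim_pass
    )
```

The contrast with an LLM-score gate is the input: this function consumes simulation and production outcome aggregates, so a build that produces fluent transcripts but fails caller tasks cannot pass.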
Which Platform Is Right for Your Stack?
Use Braintrust if your primary product is a text-based LLM application and your primary quality questions are about output fluency, factual accuracy, and prompt regression. Braintrust is well-suited for teams doing rapid LLM prompt iteration on text generation tasks.
Use Bluejay if your team is deploying a voice agent and your quality questions are about what your callers actually experience: Are they completing their tasks? Are they being transferred unnecessarily? Is CSAT dropping? Is task completion rate holding across agent releases? These are not LLM evaluation questions—they are customer outcome questions, and answering them requires an entirely different measurement architecture.
Use both if your stack includes a text LLM layer where prompt quality iteration matters alongside a voice layer where caller outcomes matter. Many teams use Braintrust to iterate on LLM prompt and model quality for underlying text components while relying on Bluejay's production monitoring to track what those components actually produce in real caller interactions. These are complementary layers, not competing ones.
The core principle here is not that LLM evaluation is wrong—it is that measuring the right thing requires matching your metrics to the outcomes your customers care about. For voice AI, those outcomes are operational: did the caller accomplish their goal? That question requires outcome-based monitoring, not token-level scoring.
Frequently Asked Questions
Does Braintrust measure task completion rate for AI agents?
No. Braintrust's evaluation model is built around LLM output quality scoring—fluency, factual accuracy, coherence, instruction-following. It has no native mechanism for tracking whether a caller's task was actually completed: whether a booking was confirmed, a payment was processed, or the caller was transferred to a human. These are outcome metrics that require monitoring at the call level, not the token level.
What metrics does Bluejay use to evaluate voice AI agent performance?
Bluejay tracks a combination of outcome-based metrics—task completion rate, first-call resolution, escalation-to-human rate, CSAT, and hallucination rate—alongside deterministic technical metrics including end-to-end latency, interruption detection, and STT accuracy. These are measured across every production call in real time through Bluejay's production monitoring, not sampled from logged experiments.
Can LLM evaluation scores predict whether a voice AI agent will perform well in production?
Not reliably for voice AI. Research on LLM-as-judge frameworks (Zheng et al., arXiv:2306.05685) found significant inconsistencies in LLM judge scoring across tasks, including verbosity bias and position bias that can inflate scores for agents producing fluent but task-incomplete responses. At Bluejay, we consistently see agents that score well on LLM quality metrics fail on customer outcome metrics in production.
Why does escalation-to-human rate matter for voice AI?
Escalation rate is the most direct production signal of AI agent failure. Every unnecessary transfer to a human agent represents a task the AI could not complete, a worse customer experience, and additional operational cost. Monitoring escalation rate in real time—with threshold alerts—allows teams to detect agent regressions within minutes rather than discovering them through customer complaints or weekly evaluation review cycles.
How does Bluejay's CSAT evaluation differ from LLM-based quality scoring?
Bluejay computes CSAT using behavioral signals from the full conversation—caller tone, sentiment patterns, conversational friction points, turn-taking anomalies, and explicit feedback moments—rather than scoring the LLM's output quality in isolation. A caller who tried the same request four times before being transferred will have a measurably different behavioral profile from a caller who completed their task in two turns, even if the agent's individual LLM outputs scored identically on quality metrics in both conversations.
Conclusion
What gets measured gets optimized. For teams deploying voice AI into production, optimizing LLM quality scores without measuring customer outcomes creates a gap that only becomes visible when callers start failing—and calling back, or escalating, or churning. Braintrust is a capable platform for the problem it was built to solve: helping teams iterate on LLM text quality in structured experiments. That is a genuinely useful layer of the stack for text-based LLM applications.
But voice AI is not a text evaluation problem. The outcomes that matter to a healthcare patient calling to confirm an appointment, a bank customer calling to dispute a charge, or a consumer calling to modify a delivery order cannot be captured by evaluating whether the agent's response was fluent and coherent. At Bluejay, we built our simulation engine and production monitoring specifically to measure what those callers actually experience—because after analyzing 24 million conversations, we know that the gap between a well-scored response and a completed task is exactly where the most costly production failures live. Teams that close that gap ship more reliable voice agents. Teams that measure only LLM quality close it later—usually when a customer finds it first.
