AI agent observability: a step-by-step setup guide
You can't fix what you can't see.
That's the core problem with most voice agent deployments. Teams build the agent, test it, ship it, and then hope for the best. When something breaks in production, they find out from customer complaints, not from their tooling.
AI agent observability changes that. It's the practice of instrumenting your agents so you have full visibility into what's happening, why it's happening, and what's going wrong, in real time.
This isn't the same as monitoring. Monitoring tells you something broke. Observability tells you why.
This guide walks through four steps to set up observability for your voice or chat agents, from zero to production-ready. By the end, you'll have traces across every layer, dashboards that surface real problems, alerts that fire before customers notice, and a feedback loop that makes your agent better every week.
What is AI agent observability?
Observability vs monitoring vs logging
These three terms get used interchangeably. They shouldn't be.
Logging is raw data. Every request, response, error, and timestamp dumped into a file or database.
It's the foundation, but it's not useful on its own. Nobody reads 10,000 log lines to find one bug.
Monitoring is dashboards and alerts built on top of that data. It tells you what happened: latency spiked, error rate increased, task success dropped. It answers "is something wrong?"
Observability goes further. It gives you the ability to understand internal system behavior from external outputs.
When latency spikes, observability lets you trace a single conversation through ASR, LLM, and TTS to pinpoint exactly where the delay occurred. It answers "why is this wrong?"
OpenTelemetry's 2025 GenAI semantic conventions standardize how this telemetry data should look across agent frameworks, whether you're using LangGraph, CrewAI, or a custom pipeline.
Why voice agents need specialized observability
Generic APM tools (Datadog, New Relic) work great for web apps. They fall short for voice agents.
Voice agents have real-time requirements that web apps don't. A 500ms delay in a web response is invisible. A 500ms delay in a voice response creates an awkward pause that callers notice immediately.
The multi-layer stack (ASR, LLM, TTS, tool calls) means a single conversation generates traces across 3-5 different systems. Generic tools can capture individual spans but struggle to stitch them into a coherent conversation-level view.
Non-deterministic outputs make debugging harder. The same input produces different responses on different calls.
You need evaluation-aware observability that can score response quality, not just track whether a response was returned.
And voice has a time dimension that web apps don't. A 500ms delay between ASR completion and LLM start might be fine, but 500ms between LLM completion and TTS start creates a noticeable pause that callers interpret as the agent being confused.
You need millisecond-level timing traces across every component to find these gaps.
I've seen teams with perfectly functional agents that callers hated. The agent answered correctly every time, but the 1.5-second gap between each answer made it feel broken.
Standard APM tools showed green across the board. Only voice-specific timing analysis revealed the gap between LLM completion and TTS start was the problem.
Arize's agent observability platform and similar specialized tools trace the full decision path: what the agent heard, what it decided, what tools it called, and what it said back.
Step 1: Instrument your agent pipeline
Instrumentation is the foundation. Without it, everything else is guesswork.
Tracing across ASR, LLM, and TTS
Every conversation should generate a distributed trace that spans the entire pipeline.
At the ASR layer, capture the raw audio duration, transcription latency, the transcribed text, and a confidence score. If your ASR provider returns word-level confidence, capture that too.
Low-confidence words are where misunderstandings start. A word with 0.4 confidence on a critical entity like a date or account number is a ticking time bomb.
At the LLM layer, capture the prompt (or a hash of it for privacy), time-to-first-token, total generation time, token count, and the full response. Tag each trace with the detected intent and any entities extracted.
At the TTS layer, capture the text input, audio generation latency, and audio duration. If you're using streaming TTS, capture time-to-first-audio-byte separately.
Also capture the gap between layers. The time between ASR finishing and LLM starting, and between LLM finishing and TTS starting. These inter-layer gaps are invisible in component-level metrics but directly affect the caller's experience.
Use OpenTelemetry's GenAI semantic conventions for consistent attribute naming. This makes your traces compatible with any observability backend and avoids vendor lock-in.
The conventions define standard attribute names like gen_ai.request.model, gen_ai.usage.input_tokens, and gen_ai.response.finish_reasons. Using these instead of custom names means you can switch observability backends without rewriting instrumentation.
Connect all three layers with a single trace ID per conversation turn. This lets you follow a single utterance from audio input to audio output in one view.
Without a unified trace ID, debugging feels like detective work. You see a latency spike in your LLM metrics and a timeout error in your tool call logs, but you can't tell if they're from the same conversation. Trace IDs eliminate that guesswork.
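In production you'd emit these spans through the OpenTelemetry SDK and your backend of choice, but the core idea fits in a short stdlib-only sketch: one trace ID per turn, one span per layer, and the inter-layer gaps computed from span boundaries. The `TurnTrace` class and the asr.* attribute names here are illustrative, not part of any standard.

```python
# Minimal per-turn tracing sketch using only the stdlib. In production you
# would emit real spans through the OpenTelemetry SDK; this just shows the
# shape of the data: one trace ID per turn, one span per layer, plus gaps.
import time
import uuid
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    trace_id: str
    start: float
    end: float
    attributes: dict = field(default_factory=dict)

class TurnTrace:
    """ASR, LLM, and TTS spans for one conversation turn share a trace ID."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    def record(self, name, fn, attributes=None):
        """Run one pipeline stage and capture its span."""
        start = time.monotonic()
        result = fn()
        self.spans.append(Span(name, self.trace_id, start,
                               time.monotonic(), attributes or {}))
        return result

    def inter_layer_gaps(self):
        """Time between one span ending and the next starting, e.g. the
        LLM-completion-to-TTS-start gap that callers actually hear."""
        return {f"{a.name}->{b.name}": b.start - a.end
                for a, b in zip(self.spans, self.spans[1:])}

# A hypothetical turn. The gen_ai.* attribute follows the GenAI semantic
# conventions; the asr.* name is illustrative, not standardized.
turn = TurnTrace()
text = turn.record("asr", lambda: "book a table for two",
                   {"asr.confidence": 0.92})
reply = turn.record("llm", lambda: "What time works for you?",
                    {"gen_ai.request.model": "gpt-4o"})
audio = turn.record("tts", lambda: b"\x00\x01")
```

Swapping this for real OpenTelemetry spans is mostly mechanical: each `record` call becomes a `start_as_current_span` context, and the parent turn span carries the shared trace ID for you.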
Capturing tool calls and external actions
If your agent books appointments, looks up accounts, or transfers calls, every external action needs instrumentation.
Log the tool name, input parameters, response body, latency, and HTTP status for every call. Tag failures with error codes and error messages.
This is where most production debugging happens. The agent heard the caller correctly, understood the intent correctly, but passed the wrong date format to the booking API. Without tool call traces, you'd never know where the failure occurred.
Also capture retries. If a tool call fails and the agent retries, log both attempts. Retry storms are a common source of latency spikes under load.
One pattern I see repeatedly: teams instrument the happy path beautifully but skip error paths. When a tool call fails, the agent's recovery behavior is often the least-tested and most-critical code path.
Instrument error handling as carefully as you instrument success.
Track tool call latency percentiles separately from overall agent latency. A booking API that averages 200ms but occasionally spikes to 3 seconds will wreck your P99 latency without affecting your P50.
You can't fix what you can't isolate.
One more thing: log the sequence of tool calls, not just individual calls. If your agent calls a lookup API, then a booking API, then a confirmation API, the full sequence matters.
A failure on step three means the first two calls were wasted. Sequence-level tracing reveals these cascading patterns that individual call metrics hide.
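One way to make sure retries and error paths get logged as faithfully as the happy path is to wrap every tool in a decorator that records each attempt. This is a sketch; the decorator name, log schema, and the booking tool are all hypothetical, and a real system would emit spans rather than append to a list.

```python
# Sketch: log every tool-call attempt, including retries and failures.
import time
from functools import wraps

TOOL_CALL_LOG = []  # each entry records one attempt, success or failure

def traced_tool(name, max_retries=1):
    def decorator(fn):
        @wraps(fn)
        def wrapper(**params):
            for attempt in range(max_retries + 1):
                start = time.monotonic()
                try:
                    result = fn(**params)
                    TOOL_CALL_LOG.append({"tool": name, "attempt": attempt,
                                          "params": params, "error": None,
                                          "latency_s": time.monotonic() - start})
                    return result
                except Exception as exc:
                    TOOL_CALL_LOG.append({"tool": name, "attempt": attempt,
                                          "params": params, "error": str(exc),
                                          "latency_s": time.monotonic() - start})
                    if attempt == max_retries:
                        raise  # out of retries: surface the failure
        return wrapper
    return decorator

# Hypothetical booking API that fails once, then succeeds on the retry.
_state = {"calls": 0}

@traced_tool("book_appointment", max_retries=1)
def book_appointment(date):
    _state["calls"] += 1
    if _state["calls"] == 1:
        raise TimeoutError("upstream timeout")
    return {"status": "booked", "date": date}

result = book_appointment(date="2025-06-01")
```

After that call, the log holds two entries: the failed first attempt with its error message, then the successful retry. That's exactly the retry-storm signal you want visible under load.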
Step 2: Build your core dashboard
Raw traces are useless without visualization. Your dashboard is where observability becomes actionable.
Essential dashboard panels
Build these panels first. They cover 80% of production debugging needs.
Latency distribution: a histogram of end-to-end response times with P50, P95, and P99 marked. Include separate timelines for ASR, LLM, and TTS latency so you can see which component is contributing to delays.
Error rate: percentage of conversations with at least one error, broken down by error type (ASR failure, LLM timeout, tool call failure, TTS error). Trend this over time to catch gradual degradation.
Task success rate: the percentage of conversations where the agent accomplished the caller's goal. This requires evaluation logic, not just error tracking. A conversation can complete without errors and still fail the caller.
Active conversations: real-time count of concurrent sessions. Useful for correlating performance degradation with traffic spikes.
Escalation rate: percentage of calls transferred to human agents. Rising escalation usually means something is degrading.
Cost per conversation: track token usage, API costs, and compute for each conversation. If average cost suddenly doubles, the agent is probably stuck in a loop or generating unnecessarily long responses.
I always build the cost panel first when setting up a new dashboard. Cost anomalies catch bugs faster than error rates do. An agent with zero errors but three times the normal token usage is almost always doing something wrong.
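The percentile lines on that latency panel are cheap to compute yourself. Here's a nearest-rank sketch; note there are several percentile definitions in use, so your platform's numbers may differ slightly from this one.

```python
# Nearest-rank percentile: good enough for a dashboard-panel sketch.
import math

def percentile(values, p):
    """Return the p-th percentile of values using the nearest-rank method."""
    ranked = sorted(values)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

# 100 synthetic end-to-end latency samples, 100ms to 1090ms.
latencies_ms = list(range(100, 1100, 10))
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))
```

The spread between P50 and P99 is usually the interesting number: a healthy median with a bad tail points at retries, cold starts, or one slow downstream dependency.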
Segmentation by dimension
Aggregate numbers hide problems. Segment every panel by at least these dimensions.
By intent type: your appointment scheduling flow might be perfect while your billing inquiry flow is broken. Aggregates hide this.
By language or accent group: a 90% task success rate might mean 95% for English speakers and 70% for Spanish speakers.
By time of day: performance often degrades during peak hours when infrastructure is under load.
By customer segment: enterprise callers might have different needs and expectations than consumer callers.
By agent version: if you run A/B tests or gradual rollouts, track metrics per version. A 2% overall drop might be a 10% drop in the new version masked by the stable old version.
Without segmentation, you're flying blind. I've watched teams celebrate a 90% task success rate for weeks while their Spanish-language callers experienced 60%.
The aggregate looked fine. The reality was not.
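Segmentation is just a group-by over conversation records. This sketch uses illustrative field names and synthetic data that reproduces the trap above: a 90% aggregate hiding a 70% Spanish-language success rate.

```python
# Task success rate per segment. Field names are illustrative; adapt them
# to however your platform tags conversations.
from collections import defaultdict

def success_rate_by(conversations, dimension):
    totals = defaultdict(lambda: [0, 0])  # segment -> [successes, count]
    for convo in conversations:
        bucket = totals[convo[dimension]]
        bucket[0] += int(convo["success"])
        bucket[1] += 1
    return {segment: s / n for segment, (s, n) in totals.items()}

# Synthetic data: 80 English calls at 95% success, 20 Spanish at 70%.
convos = ([{"language": "en", "success": True}] * 76 +
          [{"language": "en", "success": False}] * 4 +
          [{"language": "es", "success": True}] * 14 +
          [{"language": "es", "success": False}] * 6)

aggregate = sum(c["success"] for c in convos) / len(convos)   # looks fine
by_language = success_rate_by(convos, "language")             # it is not
```

Run the same function over intent, time of day, and agent version and you have most of the segmentation layer described above.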
Step 3: Set up alerting
Dashboards are for investigation. Alerts are for detection. You need both.
Threshold-based alerts
Start with static thresholds on your most critical metrics.
Latency: alert if P95 end-to-end latency exceeds 3 seconds for more than 5 minutes. This catches infrastructure issues and load problems.
Error rate: alert if error rate exceeds 5% over a 10-minute window. Short windows catch sudden breaks. Longer windows (1 hour) catch gradual degradation.
Task success rate: alert if TSR drops below 80% over a 30-minute window. This is your most important alert because it directly measures caller experience.
Compliance violations: alert immediately on any violation. No rolling window. No threshold.
One is too many.
Route critical alerts to PagerDuty or your on-call system. Route informational alerts to a Slack channel.
Don't mix them. Alert fatigue from noisy channels causes teams to ignore the important ones.
Here's a rule of thumb I use: if an alert fires more than twice a week without requiring action, the threshold is too sensitive. Adjust it until every alert means "someone should look at this right now."
Also set up auto-escalation. If a warning alert is not acknowledged within 30 minutes, it escalates to critical. This catches the cases where warnings are real problems that nobody noticed in the Slack channel.
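The "exceeds X for more than N minutes" logic from the latency alert above reduces to a small windowed check. This is a sketch of the idea; any real alerting backend implements its own version of it.

```python
# Sustained-breach check: fire only if every sample in the window breaches
# the threshold, so a single spike doesn't page anyone.
def sustained_breach(samples, threshold, window_s, now):
    """samples: list of (timestamp_s, value) pairs."""
    window = [v for t, v in samples if now - t <= window_s]
    return bool(window) and all(v > threshold for v in window)

# P95 end-to-end latency in seconds, sampled once per minute.
samples = [(0, 2.1), (60, 3.4), (120, 3.6), (180, 3.2), (240, 3.9), (300, 3.5)]

fires = sustained_breach(samples, threshold=3.0, window_s=240, now=300)
too_early = sustained_breach(samples, threshold=3.0, window_s=300, now=300)
```

With a 4-minute window the alert fires (every recent sample is over 3 seconds); with a 5-minute window it doesn't yet, because the breach hasn't been sustained that long. That's the behavior you want from a threshold alert.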
Anomaly detection
Threshold-based alerts miss gradual drift. If your task success rate drops half a point per week, you won't hit a threshold for months, but after 3 months you've lost 6 points.
Statistical anomaly detection compares current metrics against historical baselines and flags deviations. Most observability platforms (Braintrust, Maxim AI) support this out of the box.
Configure anomaly detection for task success rate, escalation rate, and sentiment scores. These are the metrics most likely to drift gradually.
Review anomaly alerts weekly even if they seem minor. Early drift detection is how you prevent slow-moving failures from becoming crises.
Set a reminder on your calendar. Every Friday at 3pm, spend 15 minutes reviewing anomaly flags from the week.
Most weeks nothing will stand out. But the week something does, you'll catch it before it snowballs.
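Under the hood, the simplest form of anomaly detection is a z-score against a historical baseline. Real platforms use more sophisticated models (seasonality, trend decomposition), but this sketch captures the idea.

```python
# Flag a metric as anomalous if it deviates more than z standard deviations
# from its historical baseline. A deliberately simple sketch.
import statistics

def is_anomalous(history, current, z=3.0):
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > z

# Eight weeks of stable task success rate, then one week of quiet drift.
baseline = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.92, 0.91]

flagged = is_anomalous(baseline, 0.85)   # well outside normal variation
normal = is_anomalous(baseline, 0.90)    # within normal variation
```

The key design choice is the baseline window: too short and normal noise looks anomalous, too long and slow drift gets absorbed into the baseline itself.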
One useful technique: set up a weekly automated comparison email. Compare this week's P50 latency, task success rate, and escalation rate against last week and last month.
No dashboards to open, no tools to log into. Just a plain email that says "latency is up 3% from last week, task success is down 1.5%."
That email takes 30 seconds to read and catches problems months before they become critical.
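Generating the numbers for that email is a one-liner per metric. This sketch computes week-over-week percent changes; the metric names and values are made up for illustration.

```python
# Week-over-week percent change per metric, for a plain-text summary email.
def weekly_deltas(this_week, last_week):
    """Positive means the metric went up vs. last week."""
    return {metric: round((this_week[metric] - last_week[metric])
                          / last_week[metric] * 100, 1)
            for metric in this_week}

deltas = weekly_deltas(
    this_week={"p50_latency_ms": 824, "task_success": 0.8865, "escalation": 0.07},
    last_week={"p50_latency_ms": 800, "task_success": 0.9000, "escalation": 0.07},
)
```

Feed the result into whatever sends your team email and you have the "latency is up 3%, task success is down 1.5%" summary described above.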
Step 4: Close the feedback loop
Observability without action is just expensive logging. The real value comes from connecting production insights back to your testing and development process.
Production failures to test suite
Every failed conversation should automatically become a test scenario.
When a caller escalates or the agent fails a task, flag that conversation. Extract the relevant details: the caller's utterances, the agent's responses, the tool calls, and the failure point.
Import these as new test cases in your regression suite. Before the next deploy, your test suite automatically includes every production failure you've seen. This is how your test coverage grows organically from real-world issues.
Over time, this loop makes your agent progressively harder to break. Every production failure strengthens the test suite that prevents the next one.
After 6 months of running this loop, one team I worked with had grown their test suite from 200 scenarios to 1,400. Most of those new scenarios covered failure modes they never would have imagined writing manually. Their production failure rate dropped from 12% to 4%.
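The conversion from flagged conversation to test scenario can be a small, mechanical function. Every field name here is illustrative; the point is that the regression case carries the caller's utterances, the expected intent, and the observed failure point.

```python
# Turn a flagged production conversation trace into a regression scenario.
# All field names are illustrative; adapt them to your test harness.
def to_regression_case(trace):
    return {
        "id": f"prod-{trace['conversation_id']}",
        "caller_utterances": [t["text"] for t in trace["turns"]
                              if t["speaker"] == "caller"],
        "expected_intent": trace["intent"],
        "tool_calls": trace["tool_calls"],
        "failure_point": trace["failure_point"],
    }

# A hypothetical failed conversation pulled from production traces.
failed = {
    "conversation_id": "c81f2",
    "intent": "book_appointment",
    "failure_point": "tool:booking_api",
    "tool_calls": [{"tool": "booking_api", "error": "400 bad date format"}],
    "turns": [
        {"speaker": "caller", "text": "I need an appointment next Tuesday"},
        {"speaker": "agent", "text": "Sorry, something went wrong."},
    ],
}

case = to_regression_case(failed)
```

Run this over every escalated or failed conversation in a nightly job, dedupe, and append to your suite, and the feedback loop runs itself.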
Trend analysis for proactive improvement
Don't wait for alerts. Review your observability data weekly.
Look for slow-moving trends: is average latency creeping up? Is escalation rate rising for a specific intent? Are callers in a particular region experiencing worse task success rates?
These trends point to improvement opportunities before they become problems. A 2% drop in task success for billing inquiries might not trigger an alert, but it tells you exactly where to focus your next iteration.
Build a weekly review cadence. 30 minutes every Monday, reviewing dashboards, checking anomaly flags, and prioritizing the top 3 improvement items for the week.
Keep a running document of what you found and what you changed. After a few months, this becomes your operational playbook. New engineers can read the history and understand what typically breaks, how the team detected it, and how they fixed it.
Track your mean time to detection (MTTD) for every incident. How long did it take from when the problem started to when your team noticed?
That number should shrink month over month as your observability matures. If it's not shrinking, your alerts need tuning.
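MTTD is worth computing explicitly rather than eyeballing. A minimal sketch, assuming you record when each incident started and when someone noticed:

```python
# Mean time to detection across incidents.
def mean_time_to_detection(incidents):
    """incidents: (started_at, detected_at) pairs, timestamps in seconds."""
    return sum(detected - started for started, detected in incidents) / len(incidents)

# Three hypothetical incidents: detected after 30 min, 10 min, and 2 min.
mttd_s = mean_time_to_detection([(0, 1800), (5000, 5600), (9000, 9120)])
```

Plot this per month; a flat or rising line is a direct signal that your alert thresholds need tuning.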
Frequently asked questions
Can I use general-purpose observability tools (Datadog, Grafana)?
For infrastructure metrics, yes. CPU, memory, network, and container health should go through your existing infrastructure monitoring.
But voice-specific evaluation requires specialized tools. General APM tools don't understand conversation quality, intent accuracy, or hallucination detection.
Use infrastructure tools for infrastructure. Use AI observability tools for agent behavior.
How much data should I retain?
At minimum, 30 days of full traces. This gives you enough history for anomaly detection baselines and trend analysis.
For compliance-regulated industries (healthcare, finance), retain full traces for 12-24 months depending on your regulatory requirements. Storage is cheap. Regulatory fines are not.
What's the cost of observability?
Far less than the cost of undetected failures. Budget 5-10% of your agent infrastructure cost for observability tooling and storage.
A single undetected production failure that runs for a week can cost more in customer churn and escalation costs than a full year of observability tooling.
See everything, fix everything
The four steps (instrument, dashboard, alert, feedback loop) take a production voice agent from "hope it works" to "know it works."
Instrumentation gives you the data. Dashboards make it visible.
Alerts make it actionable. The feedback loop makes your agent better every week.
Bluejay provides built-in observability for voice agents: distributed tracing, real-time dashboards, intelligent alerting, and automatic test case generation from production failures. No DIY setup required.
