
12 Ways to Reduce Voice Agent Latency

Jan 19, 2026

Author: Faraz Siddiqi, Co-founder & CTO @ Bluejay


How to Reduce Latency in Your Voice Agent Pipeline

Latency is one of the strongest signals customers use to decide whether a voice agent sounds "human". Nailing your voice agent's latency is one of the most important tasks your engineering team will take on, and often the one that requires the most research.

Here at Bluejay, we've helped our customers optimize their voice agent's latency time and time again. In an effort to standardize our learnings and benefit the voice AI community, here are 12 tried-and-tested ways to reduce latency in your voice agent pipeline.

This post is a practical, end-to-end guide to reducing latency in a modern voice agent pipeline. The focus is not “model speed” in isolation, but the metric users actually perceive: how long it takes to begin speaking after the user finishes a turn.

The key metric: end of user speech → start of agent audio

Many teams over-index on “LLM latency.” It matters, but it is rarely the whole story.

The user experiences one composite metric:

End of user speech → start of agent audio

I’ll refer to this as TTFA (time-to-first-audio). TTFA is a better north star than single-stage measurements because it captures the entire conversational system: turn detection, transcription finalization, reasoning, tool calls, synthesis, and playback.

Two implications follow:

  1. You should optimize the earliest audible response, not just total completion time.

  2. You should optimize p95/p99 as seriously as p50; voice UX is disproportionately shaped by tail latency.
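To make TTFA concrete, here is a minimal sketch of computing its p50/p95/p99 from per-turn timestamps. It assumes you log two timestamps per turn (end of user speech and first played audio frame); the field names are illustrative.

```python
import statistics

def ttfa_percentiles(turns):
    """Compute TTFA percentiles from per-turn timestamps.

    Each turn is assumed to carry two timestamps (in seconds):
    when the user stopped speaking and when the first agent
    audio frame was played back.
    """
    samples = sorted(
        t["first_audio_ts"] - t["end_of_speech_ts"] for t in turns
    )
    quantiles = statistics.quantiles(samples, n=100)
    return {
        "p50": statistics.median(samples),
        "p95": quantiles[94],   # 95th percentile
        "p99": quantiles[98],   # 99th percentile
    }

turns = [
    {"end_of_speech_ts": 10.00, "first_audio_ts": 10.62},
    {"end_of_speech_ts": 25.10, "first_audio_ts": 25.71},
    {"end_of_speech_ts": 41.50, "first_audio_ts": 43.90},  # slow tail turn
]
print(ttfa_percentiles(turns))
```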

Where the time goes in one turn

A typical turn looks like this:

| Category | Component | Why it matters |
| --- | --- | --- |
| Ingress | Audio Capture → Transport | The physical and network delay of getting voice data to the server. |
| Perception | STT Partials → Endpointing → STT Final | Determining what was said and when the user stopped talking. |
| Reasoning | LLM Start → TTFT | The "brain" processing time. This is where Time to First Token (TTFT) is measured. |
| Expansion | Tool Calls (Optional) | External API hits that can significantly increase latency (p99). |
| Egress | TTS Start → First Audio → Playback | Converting text back to sound and streaming it to the user. |

A useful way to reason about this is core latency vs. optional latency:

  • Core latency: happens on every turn (endpointing, TTFT, TTS first audio, transport)

  • Optional latency: happens only sometimes (tool calls, retrieval, extra hops)

The goal is to make core latency consistently low, and to ensure optional latency cannot dominate p95/p99.

Where Bluejay fits

Reducing latency is easiest when you can observe and evaluate the full pipeline, not just a single stage.

Bluejay supports that loop in two ways:

  1. Traces: end-to-end visibility into execution flow, with latency breakdowns across the system. Traces are designed to show where time is spent per turn and help correlate regressions to specific stages. Get started with Traces today: https://docs.getbluejay.ai/core-concepts/traces

  2. Simulations + evaluations: run synthetic calls at scale, measure latency consistently, and set regression gates on p95/p99 so you can ship improvements with confidence.


The practical playbook: 12 concrete changes

Below are twelve changes that, in practice, produce the largest and most reliable reductions in TTFA. The exact gains depend on your stack, but these are broadly applicable.

1) Use Thinking Phrases

Thinking phrases are hard-coded utterances that your agent says while it makes a high-latency tool call or performs some other slow action. Here's an example. Let's say you are building voice agents to staff the front desk of healthcare clinics across America. In 20% of your calls, customers ask about the status of their prescription order, impatiently mentioning that they've been waiting a week for their shipment. For your agent to check the status of the order, it must query a legacy prescription database, an operation that can take upwards of 8 seconds.

To offset the latency of your agent making the tool call, you can force your agent to say a hard-coded "Thanks for letting me know, let me go ahead and look that up for you…" utterance. This way, you've filled the silence with a phrase that sounds like your agent has immediately responded to the customer, and you have also bought yourself time to compute the tool call.

To take this one step further, you could randomly select from a list of hardcoded thinking phrases to reduce repetition, or even use a small "nano" model to come up with dynamic thinking phrases while your larger model handles the tool calls.

What matters here is reducing the user's perceived latency, which is often easier to reduce than the actual tool call latency.
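Here is a minimal sketch of the pattern, assuming an async Python agent with a `speak()` helper that streams text through your TTS and a slow `check_prescription_status()` tool; both names are hypothetical stand-ins for your own stack.

```python
import asyncio
import random

THINKING_PHRASES = [
    "Thanks for letting me know, let me go ahead and look that up for you...",
    "Sure thing, give me just a moment to pull that up...",
    "Of course, one second while I check on that order...",
]

async def answer_prescription_question(speak, check_prescription_status, patient_id):
    # Kick off the slow tool call immediately, in the background.
    lookup = asyncio.create_task(check_prescription_status(patient_id))

    # Fill the silence with a randomly chosen thinking phrase so the
    # caller hears an immediate, natural-sounding response.
    await speak(random.choice(THINKING_PHRASES))

    # By the time the phrase finishes playing, the lookup is often done;
    # otherwise we simply wait for the remainder.
    status = await lookup
    await speak(f"It looks like your order is currently: {status}.")
```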

2) Context Pre-loading

Wherever you can, pre-load your API calls / context.

  • If you know it takes time to pull customer data, pull all of it in the background as soon as you receive an incoming call from their phone number.

  • If you know each call requires your agent to make five static tool calls, make the calls in the background, and save the results so your agent can retrieve them in-memory at the right time.

  • As a general principle, avoid real-time network requests whenever possible. Too much can go wrong: retries, delays, dropped requests, and more.

In-memory retrieval is orders of magnitude faster than making a network request and waiting on a slow database.
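A minimal sketch of the pattern in asyncio, with stand-in fetchers simulating slow upstream systems: every fetch starts the moment the call arrives, and the agent later awaits results that are usually already in memory.

```python
import asyncio

async def fetch_customer_record(phone):      # stand-in for a slow CRM call
    await asyncio.sleep(1.5)
    return {"name": "Jane Doe", "phone": phone}

async def fetch_recent_orders(phone):        # stand-in for a slow DB query
    await asyncio.sleep(2.0)
    return [{"order_id": "rx-1042", "status": "shipped"}]

class CallContext:
    """Kicks off every fetch the moment the call arrives, so the agent
    later reads results from memory instead of blocking on the network."""

    def __init__(self, phone, fetchers):
        self._tasks = {
            name: asyncio.create_task(fetch(phone))
            for name, fetch in fetchers.items()
        }

    async def get(self, name):
        return await self._tasks[name]  # instant if already finished

async def on_incoming_call(phone):
    ctx = CallContext(phone, {
        "customer": fetch_customer_record,
        "orders": fetch_recent_orders,
    })
    await asyncio.sleep(3)            # conversation happens here
    orders = await ctx.get("orders")  # returns from memory, no wait
    print(orders)

asyncio.run(on_incoming_call("+15551234567"))
```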

3) Correlate everything by call_id, turn_id, and trace_id

Make correlation identifiers part of the design:

  • call_id

  • turn_id

  • trace_id

With consistent correlation, you can select a single slow turn and inspect the exact stage that caused the delay. This is made easy with Bluejay's Call Traces, which allow you to break down the STT, LLM, TTS, Tool Call, and other components in your voice agent pipeline that contribute to your overall latency. Learn more at https://docs.getbluejay.ai/core-concepts/traces.
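If you are rolling your own instrumentation in the meantime, the core idea is one structured log record per pipeline stage, always carrying the three identifiers. A sketch, with illustrative field names:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("voice-agent")

def log_stage(call_id, turn_id, trace_id, stage, latency_ms):
    """Emit one structured record per pipeline stage so a slow turn
    can be filtered down to the exact stage that caused the delay."""
    log.info(json.dumps({
        "ts": time.time(),
        "call_id": call_id,
        "turn_id": turn_id,
        "trace_id": trace_id,
        "stage": stage,          # e.g. "stt_final", "llm_ttft", "tts_first_audio"
        "latency_ms": latency_ms,
    }))

call_id, trace_id = str(uuid.uuid4()), str(uuid.uuid4())
log_stage(call_id, turn_id=3, trace_id=trace_id, stage="llm_ttft", latency_ms=412)
```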

4) Improve end-of-turn detection (endpointing)

Endpointing is frequently the largest contributor to "dead air" because it sits directly between user speech and system response.

Practical steps:

  • Review the silence threshold and tune it based on real conversation data.

  • Evaluate “smart” endpointing (model-assisted) if it improves correctness at lower thresholds.

  • Watch for a common failure pattern: p99 spikes that align closely with your silence timeout.
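For intuition, here is a toy silence-based endpointer. Real endpointers are driven by your VAD/STT provider, but the tunable threshold, and the way every turn pays for it in dead air, look the same:

```python
class SilenceEndpointer:
    """Minimal silence-based endpointer: declares end-of-turn once the
    VAD reports `silence_ms` of continuous non-speech. The threshold is
    the main knob to tune against real conversation data."""

    def __init__(self, silence_ms=600, frame_ms=20):
        self.required_frames = silence_ms // frame_ms
        self.silent_frames = 0

    def on_frame(self, is_speech: bool) -> bool:
        """Feed one VAD decision per audio frame; returns True when the
        turn should be considered finished."""
        if is_speech:
            self.silent_frames = 0
            return False
        self.silent_frames += 1
        return self.silent_frames >= self.required_frames

# A 600 ms threshold means ~0.6 s of dead air on every turn, which is
# why p99 spikes that sit right at this value point to the endpointer.
ep = SilenceEndpointer(silence_ms=600)
frames = [True] * 50 + [False] * 40      # speech, then 800 ms of silence
for i, frame in enumerate(frames):
    if ep.on_frame(frame):
        print(f"end of turn detected at frame {i}")
        break
```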

5) Use STT partials

Streaming STT is valuable only if downstream components can take advantage of partial results.

Implementation guidelines:

  • Stream audio in small chunks; avoid long buffers.

  • Consume partial hypotheses quickly.

  • Run lightweight intent/routing logic on partials.

  • Commit to actions only when stability/confidence crosses a threshold.

This enables earlier parallelism while preserving correctness.
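One simple way to approximate stability without provider confidence scores is to wait until the same hypothesis repeats across consecutive partial events. A sketch:

```python
def stable_prefix(partials, min_repeats=3):
    """Treat a partial transcript as 'stable' once the same hypothesis
    has repeated across several consecutive STT partial events.
    A real system might use provider confidence scores instead."""
    if len(partials) < min_repeats:
        return None
    tail = partials[-min_repeats:]
    return tail[-1] if len(set(tail)) == 1 else None

# Partial hypotheses as they might stream in from the STT provider.
partials = []
for hyp in ["check my", "check my pre", "check my prescription",
            "check my prescription", "check my prescription"]:
    partials.append(hyp)
    if text := stable_prefix(partials):
        print(f"stable enough to start routing on: {text!r}")
        break
```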

6) Start retrieval (RAG) on stabilized STT partials

If you use retrieval, begin as soon as the partial transcript stabilizes:

  • Start retrieval early, but keep it inexpensive (tight filters, small top-k).

  • Cancel/restart retrieval if the user pivots.

The goal is to prepare relevant context without waiting for final transcription.
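A sketch of the cancel/restart behavior, with a stand-in `retrieve()` in place of a real vector search: retrieval starts on the first stable partial and is thrown away only if a newer partial no longer extends the old one (i.e., the user pivoted).

```python
import asyncio

async def retrieve(query):                  # stand-in for a vector search
    await asyncio.sleep(0.4)
    return [f"doc matching {query!r}"]

class EarlyRetriever:
    """Starts retrieval as soon as a partial stabilizes; cancels and
    restarts if a newer partial no longer extends the old one."""

    def __init__(self):
        self._query = None
        self._task = None

    def on_stable_partial(self, text):
        if self._query and text.startswith(self._query):
            return                          # same topic; keep going
        if self._task and not self._task.done():
            self._task.cancel()             # user pivoted; discard work
        self._query = text
        self._task = asyncio.create_task(retrieve(text))

    async def results(self):
        return await self._task if self._task else []

async def main():
    r = EarlyRetriever()
    r.on_stable_partial("refund policy")
    r.on_stable_partial("refund policy for subscriptions")  # extends, no restart
    print(await r.results())

asyncio.run(main())
```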

7) Optimize the LLM for TTFT

In voice, time-to-first-token is often more important than total generation time.

Practical levers:

  • Reduce prompt length (especially repeated boilerplate).

  • Cache stable system context and tool schemas.

  • Keep retrieved context small and relevant.

  • Reuse sessions and keep connections warm when possible.
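Because TTFT is the number to watch, it helps to measure it at the client. This sketch wraps any streaming token iterator (whatever interface your LLM client exposes) and reports TTFT without changing downstream behavior:

```python
import time

def measure_ttft(token_stream):
    """Wrap a token iterator and report time-to-first-token.
    `token_stream` is assumed to be an iterator of text chunks."""
    start = time.perf_counter()
    first = None
    for token in token_stream:
        if first is None:
            first = time.perf_counter() - start
            print(f"TTFT: {first * 1000:.0f} ms")
        yield token

# Usage with a fake stream; swap in your provider's streaming call.
def fake_stream():
    time.sleep(0.35)          # simulated time-to-first-token
    yield "Hello"
    yield ", how can I help?"

reply = "".join(measure_ttft(fake_stream()))
```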

8) Reduce tool-call overhead

Tool calls typically introduce variance. Optimize for both speed and predictability:

  • Batch APIs where possible.

  • Parallelize independent calls.

  • Co-locate services in the same region as media and inference.

  • Set clear timeouts and well-defined fallbacks.
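A sketch of parallel, timeout-bounded tool calls in asyncio, with stand-in tools: each call carries its own timeout and fallback, so one slow dependency cannot stall the whole turn.

```python
import asyncio

async def call_tool(name, delay, result):
    await asyncio.sleep(delay)              # stand-in for an external API
    return result

async def gather_tools():
    """Run independent tool calls concurrently, each with its own
    timeout and fallback, so one slow dependency cannot dominate p99."""
    async def with_fallback(coro, timeout, fallback):
        try:
            return await asyncio.wait_for(coro, timeout)
        except asyncio.TimeoutError:
            return fallback

    inventory, pricing = await asyncio.gather(
        with_fallback(call_tool("inventory", 0.2, "in stock"), 1.0, "unknown"),
        with_fallback(call_tool("pricing", 2.5, "$40"), 1.0, "check site"),
    )
    print(inventory, pricing)               # -> "in stock check site"

asyncio.run(gather_tools())
```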

9) Stream LLM output into TTS

Do not wait for a full response before beginning synthesis.

A robust pattern is:

  • Stream tokens from the LLM.

  • Segment into phrases/sentences.

  • Begin TTS on the first segment.

  • Continue feeding subsequent segments.

This reduces perceived latency substantially by overlapping stages.
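A minimal sketch of the segmentation step: accumulate streamed tokens, emit a segment at each sentence boundary, and hand each segment to streaming TTS immediately. The boundary regex is deliberately simple; production segmenters also handle abbreviations, numbers, and so on.

```python
import re

SENTENCE_END = re.compile(r"([.!?…]+)\s")

def segments_from_tokens(tokens):
    """Accumulate streamed LLM tokens and yield a segment as soon as a
    sentence boundary appears, so TTS can start on the first sentence
    while the model is still generating the rest."""
    buffer = ""
    for token in tokens:
        buffer += token
        while (match := SENTENCE_END.search(buffer)):
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()

tokens = ["Sure", ", I can help. ", "Your order ships ", "tomorrow. ", "Anything else?"]
for segment in segments_from_tokens(tokens):
    print("-> TTS:", segment)   # hand each segment to streaming TTS here
```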

10) Optimize for TTS first audio

Treat “first audio” as the TTS performance goal.

Recommendations:

  • Use streaming TTS.

  • Choose voices/models with strong first-audio performance.

  • Cache common short phrases (confirmations, transitions) where appropriate.
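Phrase caching can be as simple as keying synthesized audio by voice and text; `synthesize()` below is a stand-in for your TTS provider's API.

```python
import hashlib

class PhraseCache:
    """Caches synthesized audio for short, frequently used phrases
    (confirmations, transitions) so they play back with zero TTS delay."""

    def __init__(self, synthesize):
        self._synthesize = synthesize
        self._audio = {}

    def get(self, text, voice="default"):
        key = hashlib.sha256(f"{voice}:{text}".encode()).hexdigest()
        if key not in self._audio:
            self._audio[key] = self._synthesize(text, voice)
        return self._audio[key]

def synthesize(text, voice):
    return f"<pcm audio for {text!r} in {voice}>"   # stand-in for real TTS

cache = PhraseCache(synthesize)
cache.get("One moment, please.")   # synthesized once...
cache.get("One moment, please.")   # ...then served from memory
```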

11) Use short, honest scaffolds only when needed

When a response will be materially delayed (e.g., long tool calls), a brief acknowledgement can prevent users from repeating themselves.

The constraints:

  • It should be truthful and context-aware.

  • It should be brief.

  • It should not block the real answer from starting as soon as it is available.

12) Engineer for tail latency (p99)

Production experience is shaped by p95 and p99.

Common sources of tail latency:

  • cold starts

  • queueing

  • retries

  • saturation

  • provider jitter

Mitigations:

  • warm pools for hot paths

  • admission control and backpressure

  • autoscaling tied to queue depth and latency

  • regression gates that block releases when TTFA p95/p99 regress

Use simulations that include noise, accents, interruptions, and dependency slowness to uncover tail behavior before it reaches users.
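A regression gate can be a small script in CI that fails the build when tail TTFA from a batch of synthetic calls crosses a threshold; the numbers below are placeholders, not recommendations.

```python
import statistics
import sys

# Thresholds the release is not allowed to cross (seconds of TTFA).
GATES = {"p95": 1.2, "p99": 2.0}

def check_gate(ttfa_samples):
    """Exit nonzero (failing CI) if TTFA tail latency from a batch of
    synthetic calls regresses past the configured thresholds."""
    quantiles = statistics.quantiles(sorted(ttfa_samples), n=100)
    observed = {"p95": quantiles[94], "p99": quantiles[98]}
    failures = [k for k, limit in GATES.items() if observed[k] > limit]
    print(f"observed: {observed}")
    if failures:
        print(f"regression gate failed on: {failures}")
        sys.exit(1)

check_gate([0.6, 0.7, 0.8, 0.9, 1.1, 1.3, 0.75, 0.65])
```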


Launch checklist

  • Thinking Phrases enabled

  • TTFA (end-of-user → first audio) tracked p50/p95/p99

  • Endpointing latency tracked p50/p95/p99

  • TTFT tracked p50/p95/p99

  • TTS first-audio tracked p50/p95/p99

  • Tool calls have timeouts + fallbacks

  • Tool calls are batched / parallelized where safe

  • Pipeline uses partials (STT → LLM and LLM → TTS)

  • Traces show per-turn waterfall (call_id + turn_id + trace_id)

  • Dashboards + alerts on p95/p99 regressions

  • Synthetic calls in CI (noise, accents, interruptions)

  • Regression gate: TTFA p95/p99 cannot worsen past threshold


The four highest-leverage improvements

If you focus on only a few changes first, these tend to provide the most value:

  1. Thinking Phrases

  2. Context Pre-loading

  3. Careful pipelining on partials

  4. Instrumentation with p95/p99 as first-class metrics

When you are ready to tackle your voice agent's latency, Bluejay can help uncover exactly what needs to be fixed: Traces give you the per-turn latency waterfall, and simulation-based evaluations let you verify improvements at scale.

Faraz Siddiqi
CTO, Bluejay
