Voice AI Agent Architecture Patterns: How to Design Agents That Scale
Building a voice AI agent feels simple: record speech, transcribe it, run a language model, speak the response.
In reality, that's only step one. Once you move past your first prototype, you face architectural decisions that ripple across your entire system. Pipeline vs. end-to-end.
Real-time streaming vs. turn-based. Stateless vs. stateful. Get them wrong, and you'll hit latency walls, concurrency limits, or state management nightmares.
This guide breaks down the architectural patterns that work for voice agents at scale—from design to testing to deployment.
The Two Paradigms: Pipeline vs. End-to-End
Your first decision: do you build a cascading pipeline or go all-in on a speech LLM?
The Cascading Pipeline: STT → LLM → TTS
Most production voice agents today use a cascading pipeline. Here's the flow:
Speech-to-Text (STT) transcribes audio into text.
Language Model (LLM) reads the text and generates a response.
Text-to-Speech (TTS) converts the response back to audio.
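A minimal sketch of one turn through this pipeline, with placeholder functions standing in for the real STT, LLM, and TTS provider calls:

```python
# Minimal cascading pipeline sketch. The three stage functions are
# stand-ins for real provider SDK calls (STT, LLM, TTS APIs).

def transcribe(audio: bytes) -> str:
    # Placeholder: a real implementation would call an STT provider.
    return audio.decode("utf-8")

def generate_reply(transcript: str) -> str:
    # Placeholder: a real implementation would call an LLM API.
    return f"You said: {transcript}"

def synthesize(text: str) -> bytes:
    # Placeholder: a real implementation would call a TTS provider.
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    """One conversational turn: STT -> LLM -> TTS."""
    transcript = transcribe(audio_in)
    reply = generate_reply(transcript)
    return synthesize(reply)
```

Because each stage is a plain function boundary, swapping one provider for another only touches that stage.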
Each component is modular. You can swap Deepgram for Azure Speech Services without touching your LLM logic. You can upgrade your TTS engine independently.
This flexibility is why 90% of production agents use pipelines today.
The downside: each handoff adds latency. Deepgram STT can return results in around 150ms and ElevenLabs TTS in around 75ms, yet most agents still land between 800ms and 2 seconds because delays compound across the stack. Network round trips add another 150–450ms if services live in different regions.
The End-to-End Approach: Speech-to-Speech Models
A newer paradigm skips the text bottleneck entirely. Models like Moshi and GPT-4o Realtime take speech in and produce speech out, with no intermediate text layer.
Moshi is a 7B-parameter foundation model trained to handle full-duplex conversations. It processes input and output audio streams jointly, allowing interruptions and overlapping speech—something a pipeline can't do natively. Theoretical latency: 160ms.
In practice: 200ms.
GPT-4o's voice mode is cloud-based, leveraging larger models for higher quality but requiring always-on connectivity.
The trade-off: End-to-end models are early. They excel at natural, fast exchanges but lack the debuggability and component flexibility of pipelines. If you need to swap a tool-calling layer or add custom logic, a speech LLM makes that harder.
In practice: Many teams run a hybrid: end-to-end for simple, low-latency exchanges, with a fallback to the pipeline for complex reasoning or external tool calls.
Orchestration Layer Design
Your orchestration layer manages conversation flow. It routes audio to STT, decides when to call the LLM, handles tool invocations, and streams TTS back to the caller.
Streaming Over Turn-Based
The biggest latency killer is waiting for final transcripts. Traditional ASR waits until the user stops talking, then returns a complete transcript. By then, 500–800ms has passed.
Modern streaming architectures process audio in chunks. STT emits partial transcripts as the user speaks. The LLM starts generating a response before the user finishes.
TTS begins synthesizing audio before the LLM completes, so the user hears speech while the model is still reasoning.
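A toy asyncio sketch of this overlap, with simulated streams in place of real provider connections. The two-word trigger is an arbitrary stand-in for whatever stability heuristic decides a partial transcript is worth acting on:

```python
import asyncio

# Sketch of streaming overlap: the STT stage emits partial transcripts
# as audio chunks arrive, and the LLM stage is kicked off before the
# final transcript exists. Generators simulate provider streams.

async def stt_partials(chunks):
    text = ""
    for chunk in chunks:
        await asyncio.sleep(0)  # yield control, as a real network read would
        text += chunk
        yield text              # partial transcript so far

async def run_turn(chunks):
    events = []
    llm_started = False
    partial = ""
    async for partial in stt_partials(chunks):
        events.append(("partial", partial))
        if not llm_started and len(partial.split()) >= 2:
            llm_started = True
            events.append(("llm_start", partial))  # start generating early
    events.append(("final", partial))
    return events
```

The key property: "llm_start" fires while partials are still arriving, not after the final transcript.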
This requires full-duplex connections. WebRTC is the standard. It cuts buffering latency from 100–200ms (WebSocket) to 20–50ms.
Managing Conversation State
Your orchestration layer holds three critical pieces of state:
Transcript buffer: Raw and processed user input.
Turn state: The current turn's context—what tool was called, what parameters were provided.
Session state: The entire conversation history, user preferences, and decisions made so far.
As conversations grow, your context window fills up. The LLM can only see the last 8,000 or 128,000 tokens (depending on the model). You must decide what to keep and what to summarize.
Tool Calling and Function Invocation
Your LLM won't solve everything. It needs to check your database, call an API, send an email.
Pipeline architectures handle this cleanly: the LLM outputs a structured tool request, your orchestration layer parses it, calls the tool, and feeds the result back to the LLM as the next turn's context.
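A sketch of that round trip, assuming a hypothetical JSON tool-request format and an in-process tool registry (real providers each have their own function-calling schema):

```python
import json

# Sketch of the tool-call round trip in a pipeline: the LLM emits a
# structured request, the orchestrator dispatches it, and the result
# becomes context for the next LLM turn. The registry and the JSON
# shape are assumptions, not a specific provider's format.

TOOLS = {
    "lookup_order": lambda args: {"status": "shipped", "order_id": args["order_id"]},
}

def handle_tool_request(llm_output: str) -> str:
    """Parse a JSON tool request, invoke the tool, return the result as context."""
    request = json.loads(llm_output)
    tool = TOOLS[request["tool"]]
    result = tool(request.get("arguments", {}))
    # Feed the result back to the LLM as the next turn's context.
    return json.dumps({"tool": request["tool"], "result": result})
```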
End-to-end speech LLMs can handle tool calling, but debugging is harder because there's no text representation of the tool request.
State Management for Multi-Turn Conversations
A voice call with three turns feels natural. By turn ten, most voice agents lose the thread.
The Context Window Problem
Your LLM has a fixed context window. GPT-4 Turbo: 128K tokens. Claude 3.5 Sonnet: 200K tokens.
A typical minute of transcribed speech is ~150 tokens. A 30-minute call eats 4,500 tokens, plus your system prompt, plus tool results.
You can't just append every turn to the context. You'll hit the window limit, and long stretches of stale context degrade reasoning well before you do.
Hierarchical Summarization
Smart agents use summarization layers:
Recent turns stay verbatim in the context window.
Mid-conversation history gets summarized: "User discussed pricing, asked about setup time, expressed concern about integration complexity."
Older turns are archived and retrieved only if relevant.
This preserves key decisions while keeping context size manageable.
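A sketch of this tiering, with a stubbed `summarize()` where production code would call an LLM:

```python
# Sketch of the three-tier context builder: recent turns verbatim, a
# rolling summary for mid-history, older turns dropped from the prompt
# (archived elsewhere for retrieval). summarize() is a stub.

def summarize(turns: list[str]) -> str:
    # Stub: a real system would ask an LLM for a compressed summary.
    return f"[summary of {len(turns)} earlier turns]"

def build_context(turns: list[str], keep_verbatim: int = 4) -> list[str]:
    if len(turns) <= keep_verbatim:
        return list(turns)
    older = turns[:-keep_verbatim]
    return [summarize(older)] + turns[-keep_verbatim:]
```

The `keep_verbatim` cutoff is a tuning knob: larger values preserve nuance, smaller values save tokens.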
Session Persistence
Your session state lives somewhere:
In-memory: Fast, but dies on server restart.
Redis or similar: Fast, persistent across one deployment.
Durable database: Survives infrastructure changes but slower on each read.
For voice calls, latency matters more than durability during the call. Store session state in Redis while the call is active. After the call ends, write to durable storage.
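A sketch of that two-tier pattern; plain dicts stand in here for Redis and the durable database:

```python
import time

# Sketch of two-tier persistence: hot session state in a fast store
# during the call, flushed to durable storage at hangup. Dicts stand
# in for Redis and a database; production would use real clients.

class SessionStore:
    def __init__(self):
        self.hot = {}        # stand-in for Redis
        self.durable = {}    # stand-in for a durable database

    def update(self, call_id: str, state: dict) -> None:
        # Cheap write on every turn while the call is live.
        self.hot[call_id] = dict(state, updated_at=time.time())

    def end_call(self, call_id: str) -> None:
        # Call over: move state out of the hot tier into durable storage.
        self.durable[call_id] = self.hot.pop(call_id)
```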
Memory Types
Two memory mechanisms matter:
Conversation memory: The entire history of what was discussed—decisions made, preferences expressed, questions answered.
Turn memory: Just this exchange—what the user said, what you responded with, what tool was called.
Turn memory is cheap. Conversation memory requires careful management as calls grow.
Real-Time Audio Processing Architecture
WebRTC, WebSocket, full-duplex—these aren't just buzzwords. They're the foundation of sub-500ms voice agents.
WebRTC Advantages Over WebSocket
WebSocket is a general-purpose messaging protocol that can carry audio as binary frames. It works. But it runs over TCP, which brings buffering and head-of-line blocking overhead.
WebRTC is a real-time protocol. It was built for video calls. It minimizes buffering, handles packet loss gracefully, and provides bitrate adaptation for changing network conditions.
Latency comparison:
WebSocket: 100–200ms additional buffering.
WebRTC: 20–50ms additional buffering. That 150ms difference compounds when you sum all components.
Full-Duplex Audio Flow
A full-duplex connection lets you send and receive audio simultaneously. The user can hear the agent speak while they're speaking—just like a human phone call.
In a half-duplex system, the agent must wait for the user to finish before responding. This feels broken.
Full-duplex requires:
A real-time connection (WebRTC or similar).
Voice Activity Detection (VAD) that doesn't block: VAD must run on incoming audio without delaying output.
Non-blocking TTS: Don't wait for the entire response to synthesize before playing.
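As an illustration of a per-frame, non-blocking decision, here is a toy energy-threshold VAD; production systems use model-based detectors, and the threshold here is arbitrary:

```python
# A minimal energy-threshold VAD: each incoming frame gets an immediate
# speech/non-speech decision, so it never blocks the outgoing audio path.
# The threshold is arbitrary; real deployments use model-based VAD.

def frame_energy(samples: list[int]) -> float:
    """Mean squared amplitude of one audio frame."""
    return sum(s * s for s in samples) / max(len(samples), 1)

def is_speech(samples: list[int], threshold: float = 500.0) -> bool:
    return frame_energy(samples) > threshold
```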
Latency Budget Breakdown
Your end-to-end latency is the sum of many small delays:
Mic capture: 10–50ms (hardware dependent).
Audio transmission: 20–50ms (WebRTC) to 100–200ms (WebSocket).
STT processing: 150–300ms (varies by provider and chunk size).
Network round trip to LLM: 50–150ms per API call.
LLM inference: 200–800ms (depends on model size and query complexity).
TTS synthesis: 50–200ms (parallelizable with LLM inference).
Audio playback buffering: 50–100ms.
The bottleneck is usually LLM inference. It's 60–70% of total latency. Parallel execution of STT and TTS helps, but you can't parallelize LLM reasoning with its input.
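Summing the budget above gives a feel for best and worst cases; TTS is left out of the serial path here since it overlaps LLM inference:

```python
# Summing the latency budget. (low_ms, high_ms) per stage; TTS is
# treated as fully overlapped with LLM inference, so it adds nothing
# to the serial path in this simplified model.

BUDGET = {
    "mic_capture": (10, 50),
    "transport": (20, 200),       # WebRTC best case to WebSocket worst case
    "stt": (150, 300),
    "llm_round_trip": (50, 150),
    "llm_inference": (200, 800),
    "playback_buffer": (50, 100),
}

def total_latency(budget: dict) -> tuple[int, int]:
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high
```

The best case lands under 500ms; the worst case blows well past an 800ms target, which is why every stage needs shaving.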
Network Optimization
Geographic proximity matters more for voice than for most applications.
Azure Speech Services running in the same region as your user cuts 50–100ms. If your LLM API and TTS live in different regions, add another 50ms per round trip.
Deploying edge nodes in major population centers (US East, EU West, Asia Pacific) reduces latency by routing calls to the nearest region.
Scaling Patterns: From 10 to 10,000 Concurrent Calls
You built a voice agent. It works great for 10 concurrent calls. Now you need 10,000.
Horizontal Scaling and Stateless Design
The key to scale: make your services stateless.
Each STT, LLM, and TTS instance should be identical. A load balancer distributes incoming calls across them. If one instance dies, traffic shifts to another.
If you need more capacity, spin up new instances.
This only works if your service doesn't hold state. If your STT instance remembers the last user it processed, rebalancing breaks it.
The corollary: state lives elsewhere. Session data goes to Redis or a database. Call context lives in a message queue.
The compute layer stays ephemeral.
Kubernetes and Auto-Scaling
Modern deployments use Kubernetes. Each service (STT, LLM, TTS) is a Deployment with auto-scaling rules.
Define: "Scale up if CPU > 70%. Scale down if CPU < 20%." Kubernetes watches these metrics and adjusts replica counts. You can go from 2 instances to 100 in minutes.
But auto-scaling has lag. If you get a sudden spike of 10,000 new calls, Kubernetes might not react fast enough. Plan for baseline capacity that handles your typical peak, with auto-scaling for spikes.
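The core of the HPA calculation is a simple ratio of observed to target utilization; this sketch omits stabilization windows and pod-readiness details:

```python
import math

# Simplified form of the Kubernetes HPA replica calculation:
# desired = ceil(current_replicas * current_metric / target_metric),
# clamped to configured bounds. Stabilization windows are omitted.

def desired_replicas(current: int, cpu_utilization: float,
                     target: float = 0.70,
                     min_r: int = 2, max_r: int = 100) -> int:
    desired = math.ceil(current * cpu_utilization / target)
    return max(min_r, min(max_r, desired))
```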
Load Balancing Strategies
Simple round-robin load balancing works if all instances are identical and calls are short. For voice, consider:
Least connections: Route to the instance with the fewest active calls. This works well for voice because calls are long (minutes) and you want even distribution.
Sticky sessions: Once a call lands on an instance, keep it there. This helps if the instance maintains some local state (not recommended for true horizontal scale, but common in practice).
Regional Deployment and Global Load Balancing
A single data center limits your reach. If all your infrastructure is in us-east-1 and your users are in Europe, you'll add 100–150ms of latency just in network RTT.
Deploy to multiple regions:
US East (primary for North America).
EU West (Europe).
Asia Pacific (Singapore, Tokyo, Sydney).
Global load balancers route calls to the nearest region. This cuts user latency dramatically.
Database and Queue Scaling
Your database becomes a bottleneck before your compute does.
Each call might write to the database 5–20 times over its lifetime (session updates, transcripts, tool results, call summaries). At 10,000 concurrent calls, that's 50,000–200,000 writes across the active call pool, sustaining write rates in the thousands per second.
Most traditional databases can't handle that. Use:
Redis for session state: Sub-millisecond reads/writes.
Message queues (RabbitMQ, Kafka) for decoupled processing.
Time-series databases (ClickHouse, TimescaleDB) for analytics and call metadata.
GPU Clusters for LLM Inference
If you're running your own LLM inference (not using an API), you need GPUs.
Scaling from 100 to 10,000 concurrent calls typically requires 10–20 GPU nodes (NVIDIA A100 or H100). Each GPU can serve 10–100 concurrent LLM requests depending on model size and latency requirements.
Use vLLM or similar inference engines for batching and efficient memory management.
Testing Implications of Each Architecture
Your architecture determines how you test.
Pipeline Architecture Testing
Test each component independently:
STT testing: Record audio clips (clean, noisy, different accents). Measure word error rate (WER). Run regression tests when you switch models.
LLM testing: Unit tests for prompt logic. Regression tests for quality (does the model still understand intent?). Latency tests.
TTS testing: Test different voices and genders. Measure synthesis latency. Check SSML handling.
Then test the pipeline end-to-end: record full calls, measure P95 latency, check for errors that only appear in the full flow.
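The WER metric from the STT step is a word-level edit distance:

```python
# Word error rate: (substitutions + insertions + deletions) divided by
# reference length, computed via Levenshtein distance over words.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)
```

Run it over your recorded clips whenever you switch STT models and treat any WER regression as a blocker.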
End-to-End Architecture Testing
Testing a speech LLM is holistic but harder to debug. You can't test STT independently because there's no intermediate text.
Focus on:
Full-call quality: Does the agent understand user intent? Are interruptions handled?
Latency distribution: P50, P95, P99 latencies.
Failure modes: What happens when the user doesn't speak for 10 seconds? When they interrupt in the middle of a response?
Without component-level visibility, root-causing latency issues is harder.
Latency Testing in Both
Regardless of architecture, latency testing matters.
Synthetic tests: Inject known audio clips, measure time to response.
Real-world tests: Run live calls in production and log latency metrics.
Distributed tracing: Use OpenTelemetry to tag each call with a trace ID. Log timestamps at every stage (STT start, LLM call, TTS return, playback). Build waterfall diagrams showing where time is spent.
Your goal: P95 latency ≤ 700–800ms after user stops speaking. Latency beyond 1.5 seconds feels unnatural. Beyond 3.5 seconds, users bail.
Load and Concurrency Testing
Before launching at scale, run load tests:
Baseline: 10 concurrent calls. Measure latency and error rate.
Ramp up: 100, 500, 1,000, 5,000, 10,000 concurrent calls. Watch for degradation.
Stress test: 2x your planned capacity. See where it breaks.
Soak test: Run at 50% capacity for 24 hours. Catch memory leaks and connection pool exhaustion.
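A toy ramp harness showing the shape of such a test; a real load test would drive actual SIP or WebRTC sessions against a staging deployment and record latency and errors per step:

```python
import asyncio

# Toy concurrency ramp: launch N simulated calls per step and record
# how many complete. Stands in for a real load generator.

async def simulated_call(duration: float = 0.001) -> bool:
    await asyncio.sleep(duration)  # pretend the call lasts this long
    return True

async def ramp(steps: list[int]) -> dict[int, int]:
    results = {}
    for n in steps:
        done = await asyncio.gather(*(simulated_call() for _ in range(n)))
        results[n] = sum(done)  # completed calls at this concurrency level
    return results
```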
Architecture Decision Checklist
When designing your voice agent, ask:
1. Accuracy vs. Speed?
Pipeline gives you accuracy through modularity.
End-to-end gives you speed.
2. Simplicity vs. Flexibility?
Pipeline is more flexible (swap components easily).
End-to-end is simpler (fewer moving parts).
3. Your Latency Target?
Under 300ms? You probably need end-to-end or a heavily optimized pipeline.
300–700ms? A well-tuned pipeline with streaming and parallel execution works.
Over 700ms? You have room for a straightforward cascading pipeline.
4. How Complex Is Your Logic?
Simple Q&A? End-to-end works great.
Multi-step reasoning, tool calling, or complex state? Pipeline is cleaner.
5. How Many Concurrent Calls?
Under 100? Monolithic service is fine.
100–1,000? Microservices with basic load balancing.
1,000–10,000+? Kubernetes, regional deployment, GPU clusters.
6. How Long Are Your Calls?
Under 5 minutes? Simple state management in memory.
5–30 minutes? Use Redis for session state.
30+ minutes? Summarization and hierarchical memory required.
FAQ
Q: Is WebRTC required?
A: Not strictly, but it's the best choice for sub-500ms latency. WebSocket works if you're willing to accept 100–200ms of additional buffering.
Q: Can I start with a simple pipeline and migrate to end-to-end later?
A: Yes. The interfaces between components are standard (audio format, LLM API responses). You can swap implementations.
Migrating the entire orchestration layer is harder, so design for flexibility.
Q: How do I know if my agent is CPU-bound or I/O-bound?
A: Monitor your infrastructure. If CPU is pinned at 100% but network utilization is low, you're CPU-bound (upgrade your model or add more instances). If network latency is high but CPU is 20%, you're I/O-bound (reduce API calls or deploy closer to your services).
Q: What's the minimum latency I can expect?
A: About 200–300ms for a well-tuned pipeline with streaming and full-duplex audio. That floor comes from network transmission time plus irreducible processing overhead; you can't do much better without moving compute closer to the user.
Q: Do I need my own GPU cluster?
A: Only if you're running your own LLM inference and have 1,000+ concurrent calls. For most teams, using an API (OpenAI, Anthropic, Groq) is simpler and cheaper.
Q: How do I test latency before launching?
A: Use synthetic tests (recorded audio clips) to measure component latency. Run live calls in a staging environment. Use distributed tracing to pinpoint bottlenecks.
Measure P50, P95, and P99 latencies.
Architecture-Specific Testing with Bluejay
Testing voice agents is hard because they touch so many layers: audio quality, speech recognition, language reasoning, speech synthesis, and real-time streaming.
Whether you choose a pipeline or end-to-end architecture, you need to test:
Pre-deployment: Validate that your agent handles edge cases and latency targets before it hits production.
Post-deployment: Monitor live calls to catch quality degradation, latency creep, and failure modes.
Bluejay lets you simulate your entire voice agent architecture in pre-deployment testing. With 500+ variables (different STT accuracy rates, network latencies, LLM response times), you can stress-test your design before it touches a real user. Compress a month of production scenarios into minutes.
For monitoring, Bluejay watches production calls and alerts when latency, accuracy, or concurrency spikes.
The cost of architectural mistakes—cascading failures, latency walls, state corruption—is high. Bluejay catches them in testing, not production.