Speech LLMs explained: the technology powering voice AI agents

What if your AI could respond as naturally as a human on a phone call—understanding tone, emotion, and when to interrupt?

That's what speech LLMs promise. They're not just text models hooked up to a speaker. Speech LLMs are a completely different approach to voice AI. Instead of stitching together three separate models, they do everything in one.

I want to show you how this technology actually works, why companies are racing to build it, and what it means for you if you're testing or deploying voice AI.

What is a speech LLM?

A speech LLM is a language model that takes audio input and produces audio output, all within a single neural network. No conversion to text in the middle. No waiting for text-to-speech to kick in.

You talk → the model listens → the model thinks → the model talks back.

The traditional way is different. For years, voice AI worked like this: speech-to-text (STT) → language model (LLM) → text-to-speech (TTS). Three models. Three steps. Three chances for things to go wrong.

Speech LLMs eliminate that pipeline. They process audio tokens directly, similar to how regular LLMs process text tokens. The model learns what voices sound like, what emotions sound like, and how to produce responses that sound natural. All at once.
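Here's a toy sketch of the two data flows. Every function in it is a stand-in: real systems call actual STT/LLM/TTS services or a trained speech model. The point is the shape of the flow, not the internals.

```python
# Toy contrast of the two architectures. All functions are stand-ins.

def transcribe(audio: bytes) -> str:
    # STT stub. This is where tone, rhythm, and emotion get discarded.
    return audio.decode(errors="ignore")

def generate_text(text: str) -> str:
    # LLM stub.
    return f"reply to: {text}"

def synthesize(text: str) -> bytes:
    # TTS stub.
    return text.encode()

def traditional_pipeline(audio_in: bytes) -> bytes:
    # Three hops, each adding latency; only text survives the middle step.
    return synthesize(generate_text(transcribe(audio_in)))

def speech_llm(audio_in: bytes) -> bytes:
    # One model: audio tokens in, audio tokens out. No text bottleneck.
    audio_tokens = list(audio_in)                         # "tokenize"
    reply_tokens = [(t + 1) % 256 for t in audio_tokens]  # model stand-in
    return bytes(reply_tokens)                            # "detokenize"
```

Notice that in the traditional flow, everything the STT step drops is gone for good; in the single-model flow, there's no point where the audio gets flattened to text.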

How speech LLMs differ from traditional voice AI

The traditional pipeline feels slower because it is. Moving audio to text, then text back to audio, adds latency at every step.

Here's what the numbers show. A traditional three-step pipeline runs at 500ms to 1000ms of end-to-end latency. A speech LLM does it in 200ms to 300ms. That's within the range of a human conversational response.

But speed isn't the only difference.

When you convert speech to text, you lose things that matter. You lose the tone of someone's voice. You lose the rhythm. You lose whether they sound angry, curious, or bored.

Linguists call these paralinguistic cues. They live in how you say something, not just what you say.

A speech LLM preserves all of that. It hears your emotion and can respond to it. It hears your tone and can match it. It hears you hesitate and can pause, or push forward depending on context.

There is a trade-off, though. With three separate models, you can swap out each component if you want. You can pick the best STT, the best LLM, the best TTS. With a speech LLM, you get one integrated system. You have less control over the individual pieces, and the cost per token might be higher.

Moshi: the open-source speech LLM changing the game

If you want to understand where speech LLM technology is heading, look at Moshi, built by Kyutai research lab.

Moshi is a full-duplex speech-text foundation model. Full-duplex means it can listen and respond at the same time—like a real phone conversation where people interrupt each other and talk over one another. It achieved 160ms theoretical latency and 200ms practical latency. That's fast enough to feel natural.

Here's how it works inside. The audio comes in. Moshi tokenizes that audio—breaks it into discrete chunks the model can understand. Then it feeds those audio tokens through a transformer, the same architecture that powers text LLMs. The transformer produces output tokens, which get decoded back into audio.

Moshi uses what researchers call the "Inner Monologue" method. The model generates a text transcript internally while producing audio. This helps it stay coherent and plan what it's going to say before it says it. You get the speed and emotion of audio processing with the reasoning power of language.
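The loop above can be sketched in a few lines. All three functions here are toy stand-ins, not Moshi's actual implementation (which uses a neural audio codec and a transformer backbone), but they show the shape: tokenize, run the model, decode, with a text stream generated alongside as the inner monologue.

```python
# Toy version of the loop: audio tokens in, audio tokens out, with an
# internal text transcript generated alongside ("Inner Monologue").

def tokenize_audio(pcm: bytes) -> list[int]:
    # Real codecs map waveforms to discrete codebook ids at a low frame rate.
    return [b % 64 for b in pcm]

def model_step(audio_tokens: list[int]) -> tuple[str, list[int]]:
    # The inner monologue: text emitted just ahead of the audio tokens,
    # which helps the model plan coherent speech before producing it.
    monologue = f"<{len(audio_tokens)} frames heard>"
    out_tokens = [(t * 7 + 3) % 64 for t in audio_tokens]
    return monologue, out_tokens

def detokenize_audio(tokens: list[int]) -> bytes:
    return bytes(t % 256 for t in tokens)

plan, out = model_step(tokenize_audio(b"\x01\x02\x03"))
audio_out = detokenize_audio(out)
```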

Kyutai released Moshi under a Creative Commons license. You can run it yourself. Implementations exist in PyTorch for research, MLX for Apple devices, and Rust for production deployment. That matters if you're building something and you want to own the code.

GPT-4o and the multimodal AI race

OpenAI's approach with the GPT-4o Realtime API is different. It's not open-source, but audio is native: built directly into the model rather than bolted on.

GPT-4o handles audio natively as part of its multimodal training. You can connect via WebSocket, WebRTC, or SIP protocols. The model streams responses back, handles interruptions mid-sentence, and even supports function calling while you're talking to it.

That last part is important. You can say "Check my calendar for tomorrow" and the model can trigger an API call to your calendar service without asking you to wait. It's executing code while responding to you in real time.

You're paying per token with GPT-4o Realtime. OpenAI has been aggressive with pricing and access. If you need high-volume voice AI at scale, the math matters.

Gemini 2.5 and emotional AI from Google

Google's Gemini 2.5 covers much of the same ground as Moshi and GPT-4o, but with a different emphasis: it responds to the tone of your voice, not just your words.

If you sound frustrated, Gemini picks up on it. If you sound calm, it adjusts. If there's background noise—a door slamming, someone talking in the next room—Gemini can tell what's foreground speech and what it should ignore.

Gemini supports 8 voices out of the box and can handle translation into 70+ languages in real time. But the steerable TTS is the interesting part. You don't configure voices with sliders and settings anymore. You describe them in natural language.

"Make this voice sound friendlier" or "This should sound more professional." The model takes that instruction and adjusts.

Hume EVI 3: the speech LLM that understands emotion

Hume has built something called EVI 3, a speech-language model with emotional intelligence baked in.

EVI doesn't just detect what you're saying. It measures vocal modulations: the tune of your voice, the rhythm, the timbre.

Think of it like this. If I say "That's great," the words are the same whether I'm excited or sarcastic. My voice is different though. EVI hears that difference.

Hume trained EVI on thousands of real conversations. The result is state-of-the-art end-of-turn detection: the model knows when you're done talking and it's time to respond. It reads that from your tone, not just from silence. No awkward delays while the model waits to see if you're still talking.

EVI supports 100,000 custom voices. You can build a voice that sounds exactly like you want it to sound.

The model understands when to speak and when not to speak. That matters more than it might sound.

A voice that never stops talking is annoying. A voice that knows when to listen is intelligent.

Full-duplex conversation with NVIDIA PersonaPlex

NVIDIA built PersonaPlex, a full-duplex model that listens and speaks simultaneously. It's not just responsive. It learns behavior.

The model learns when to pause, when to interrupt naturally (not rudely), and when to backchannel—those small sounds and words people use to show they're listening. "Mm-hmm," "I see," "Yeah." Those aren't random. They happen at specific moments in conversation, and PersonaPlex learned when.

This matters because real human conversation isn't turn-based. We overlap. We interrupt each other. We make small acknowledgments while listening. Bots trained on turn-based data feel robotic because they are.

Speech LLMs vs. traditional pipelines: which should you choose?

For enterprise, the answer today is still often "traditional pipeline." Companies have built systems around STT → LLM → TTS. Those systems integrate with their tools. They know how to measure accuracy. They've solved the hard problems of deployment.

Speech LLMs lead in consumer applications and startups. The latency wins matter for user experience. The emotional understanding matters for satisfaction. The cost per transaction might be lower even if the cost per token is higher.

The future is probably hybrid. You'll see systems that use a speech LLM for initial conversation, then swap to a traditional pipeline for tool integration when needed. You'll see speech LLMs handling the conversation layer and deterministic systems handling the action layer.

What matters is testing. You need to measure whether your voice AI actually sounds natural. You need to know your latency. You need to understand whether the emotional tone is working. That's different from testing text-based AI.

Why latency matters more than you think

I mentioned the latency numbers earlier—200ms to 300ms for speech LLMs versus 500ms to 1000ms for traditional pipelines. That's not just a tech spec. That's the difference between conversation feeling natural or feeling like you're talking to a machine.

Human response time in conversation is around 200ms on average. Below 500ms feels normal. Above 1000ms, people start to feel like they're talking to something that's thinking hard.

With speech LLMs hitting 200ms to 300ms, you're in human territory. The other person doesn't notice the latency. The conversation flows.

If you're deploying a voice AI, measure this. Don't just assume your system is fast enough. Test it end-to-end, including network, processing, and audio output. I've seen companies launch with theoretical latency that looked good, then discovered real-world latency was two or three times higher because of infrastructure they didn't account for.
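A minimal way to run that measurement: time from the end of user audio to the first response audio, collected per turn, then summarized as percentiles rather than an average, since tail latency is what users actually notice. The two callables are placeholders for your actual client.

```python
# Measure end-to-end turn latency and summarize the distribution.
# send_last_chunk / wait_for_first_audio are placeholders for your client.

import time
import statistics

def measure_turn(send_last_chunk, wait_for_first_audio) -> float:
    t0 = time.monotonic()
    send_last_chunk()          # last user audio chunk leaves the device
    wait_for_first_audio()     # block until first response audio arrives
    return (time.monotonic() - t0) * 1000.0   # milliseconds

def summarize(latencies_ms: list[float]) -> dict:
    # Report p95 and max, not just the median: one 900ms turn in a
    # conversation full of 250ms turns is the one the user remembers.
    ordered = sorted(latencies_ms)
    return {
        "p50": statistics.median(ordered),
        "p95": ordered[int(0.95 * (len(ordered) - 1))],
        "max": ordered[-1],
    }

stats = summarize([240.0, 260.0, 310.0, 280.0, 900.0, 250.0])
```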

The challenge of testing speech LLMs

Testing a speech LLM is harder than testing a text LLM. You can't just check whether the words are correct.

You need to evaluate audio quality. Does the voice sound natural, or does it have artifacts? Does it have that robot voice quality, or does it sound human? These are subjective, but they matter for user experience.

You need to measure latency properly. End-to-end latency from audio input to audio output, not just model inference time.

You need new metrics for emotion and tone accuracy. A text LLM either got the right answer or it didn't. A speech LLM needs to respond appropriately to tone. That's harder to measure automatically.

You might want to measure end-of-turn detection accuracy. Is the model interrupting users? Is it waiting too long? Is it understanding conversational flow?
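One way to put a number on those end-of-turn questions offline: compare the model's detection time against human-labeled turn ends, counting early firings (interruptions) and late ones (awkward delays). The 500ms lateness threshold below is an illustrative choice, not a standard.

```python
# Score end-of-turn detection against human-labeled turn ends.
# The late_ms threshold is illustrative; tune it to your product.

def score_end_of_turn(pairs: list[tuple[float, float]],
                      late_ms: float = 500.0) -> dict:
    # pairs: (labeled_turn_end_ms, model_detection_ms) for each turn
    interruptions = sum(1 for label, pred in pairs if pred < label)
    too_slow = sum(1 for label, pred in pairs if pred - label > late_ms)
    n = len(pairs)
    return {
        "interruption_rate": interruptions / n,
        "slow_response_rate": too_slow / n,
    }

result = score_end_of_turn([
    (1000.0, 1100.0),   # fine: fired 100ms after the true end
    (2000.0, 1800.0),   # interruption: fired before the user finished
    (3000.0, 3700.0),   # too slow: 700ms after the true end
])
```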

This is why observability matters. You need visibility into what the model is hearing, what it's understanding, and what it's producing. You need to see the audio, see the transcription, see the emotion detection, see the latency at each stage.

Speech LLMs and the future of voice AI

Speech LLMs aren't replacing traditional pipelines tomorrow. But they're winning in new applications because they're better at something specific: feeling natural.

They preserve emotion. They reduce latency. They understand paralinguistic cues. They can handle interruption. They can learn conversational behavior.

The technology is still new enough that companies haven't figured out the best way to combine them with traditional tools and services. That's a software problem, not a model problem, and software problems get solved.

FAQ: common questions about speech LLMs

What's the difference between a speech LLM and a traditional voice AI?

Traditional voice AI is three models working in sequence: one converts speech to text, one generates a response, and one converts text to speech. A speech LLM does all three jobs in a single model. It's faster and preserves emotion and tone.

Can speech LLMs understand multiple languages?

Yes. Gemini 2.5 handles 70+ languages with real-time translation. Most modern speech LLMs are multilingual, though some are tuned for specific languages.

Are speech LLMs more expensive than traditional pipelines?

Not necessarily. The cost per token might be higher, but the latency savings and reduced infrastructure complexity can offset that. It depends on your use case and scale.

Can I run a speech LLM myself, or do I have to use an API?

It depends on the model. Moshi is open-source and you can run it yourself. GPT-4o Realtime and Gemini 2.5 are closed and available only as APIs. Some companies are building their own implementations.

How do speech LLMs handle background noise?

Modern speech LLMs like Gemini 2.5 can distinguish foreground speech from background noise. They're trained on real conversations where noise exists. That said, very loud or unexpected noise can still cause problems.

What's full-duplex and why does it matter?

Full-duplex means the model can listen and speak at the same time, like a real phone conversation. Traditional systems are half-duplex—you finish talking, then the model responds. Full-duplex feels more natural.

How Bluejay tests speech LLMs in production

Speech LLMs introduce new failure modes that traditional testing misses. Tone mismatches. Emotional misfires. Latency spikes under real network conditions. Interruptions that confuse the model.

Bluejay's Mimic platform generates realistic caller simulations with 500-plus real-world variables: accents, background noise, emotional delivery, and vocal characteristics. It stress-tests your speech LLM the way your customers will use it.

Skywatch monitors production calls in real time. You see latency, hallucinations, sentiment accuracy, and interruption handling across every live conversation. When something breaks, you know before your customers complain.

Whether you're running a speech LLM, a traditional pipeline, or a hybrid setup, the testing question is the same: does your voice AI sound natural to real people in real conditions?

Start testing your voice AI properly

If you're building or testing voice AI, here's what I want you to do right now.

Measure your actual end-to-end latency in production. Not theoretical latency. Real latency from audio input to audio output, including network and infrastructure.

If you're considering a speech LLM, test it against your traditional pipeline. Put it in front of real users and measure both the latency and the user satisfaction.

Set up observability. You need to see what the model is hearing, what it's understanding, and what it's producing. That visibility will catch problems that latency and transcription accuracy won't catch.

Voice AI is good enough now to ship. But it's not good enough to deploy blindly. Bluejay gives you the testing and monitoring infrastructure to ship with confidence. Test. Measure. Observe. Then deploy something that actually works.
