From PHP Books to Neural Networks: AssemblyAI CEO Dylan Fox on the Evolution of Voice AI

His son was screaming "I want pasta" from the backseat. The voice agent thought he was the primary speaker. Then he hung up.

That failure mode shows exactly where voice AI still breaks down today.

That's what Dylan Fox, Founder and CEO of AssemblyAI, walked me through.

Today I took the wheel back from Faraz and got Dylan to tell the whole story from the passenger seat for Bluejay's Skywatch podcast.

Before AssemblyAI, he was teaching himself machine learning from textbooks at night.

Doing NLP work at Cisco by day.

And every time he tried to build something real with voice, he hit a wall.

Nuance made him sign an enterprise evaluation agreement just to try their API.

Then mailed him a CD-ROM.

That frustration became a thesis.

Voice infrastructure should feel like Stripe or Twilio. Idea to prototype, instantly. No CDs required.

A few highlights from this conversation I haven't stopped thinking about:

Speech-to-text (STT) is not transcription. It is an intelligent listening layer. One moment of background noise and the whole thing falls apart.
Nobody using voice AI cares about the architecture. Speech-to-speech, cascaded, whatever. They want it to work. Dylan's whole framework starts with the user and works backwards.
As AI literacy grows, sounding robotic might signal trust, not failure. The Waymo voice is already pointing there.

If you are building in voice, real-time infra, or developer tooling, you don't want to miss this one.

Spotify: https://open.spotify.com/episode/7kAoHItpUtzDjUWEjdpKE5?si=1zARui2bRjmx_0jDT7OpLQ

YouTube: