VoTurn-135M: State-of-the-Art Turn Detection for Voice Agents
Vignesh Varadarajan and Jagath Vytheeswaran
Understanding when to speak
What makes a conversation feel natural? Beyond the words we choose, it's the rhythm—knowing when to speak and when to listen. These split-second judgments happen effortlessly in human dialogue: we sense the difference between a thoughtful pause and a completed thought, between "I'm flying from San Francisco..." trailing off for more, and "I'm flying from San Francisco" as a final answer to a different question. This intuition is so fundamental to communication that we rarely notice it, yet it's one of the hardest problems for voice AI systems to solve.
The Turn Detection Problem
At the heart of every voice assistant is a deceptively simple question: is the person done speaking? Get it wrong, and the consequences cascade. Interrupt too early, and you create frustration—the user must repeat themselves, breaking the conversational flow. Wait too long, and the interaction feels sluggish and unresponsive, undermining the promise of natural dialogue.
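To see why the tradeoff is hard, consider the baseline most pipelines start from: declare the turn over after a fixed stretch of silence. Here is a minimal sketch (the timeout value and the frame-level speech flags are hypothetical stand-ins, not our implementation):

```python
# Naive endpointing: commit to "turn over" after a fixed silence timeout.
# A short timeout interrupts anyone who pauses to think; a long one adds
# that full delay to every single response. No fixed value wins both ways.

SILENCE_TIMEOUT_S = 0.7  # hypothetical threshold


def detect_turn_end(speech_flags, frame_duration_s=0.02):
    """speech_flags: iterable of booleans, True if the frame contains speech."""
    silence_s = 0.0
    for i, has_speech in enumerate(speech_flags):
        if has_speech:
            silence_s = 0.0
        else:
            silence_s += frame_duration_s
            if silence_s >= SILENCE_TIMEOUT_S:
                return i  # frame index where we cut the user off
    return None  # never became confident the user finished
```

Every smarter approach is, in effect, an attempt to replace that fixed timeout with a judgment.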
Current approaches to turn detection fall into two camps, each with fundamental limitations:
Audio-only systems analyze vocal patterns—pitch, pace, and intonation—to predict turn completion. When you ask "What are your departure and arrival airports?" and the user responds "I'm departing from San Francisco," the audio signal alone cannot tell whether the pause that follows is a breath before continuing ("...and arriving in New York") or the end of the turn. The intonation might be identical in both cases, but the semantic context makes all the difference: the question asked for two airports, and only one has been given.
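For concreteness, here is roughly the kind of cue an audio-only system extracts: a sketch, using librosa, that measures the pitch slope over the last few hundred milliseconds of an utterance. The window length and the falling-pitch-means-done heuristic are illustrative assumptions, not a production detector.

```python
import numpy as np
import librosa


def final_pitch_slope(wav_path, tail_s=0.3):
    """Crude prosodic cue: pitch slope over the final `tail_s` seconds.
    A falling contour often, though not always, marks a finished turn."""
    y, sr = librosa.load(wav_path, sr=16000)
    f0, _, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    hop_s = 512 / sr  # librosa's default pyin hop length, in seconds
    tail = f0[-int(tail_s / hop_s):]
    tail = tail[~np.isnan(tail)]  # keep voiced frames only
    if len(tail) < 2:
        return 0.0  # not enough voiced audio to estimate a contour
    return np.polyfit(np.arange(len(tail)), tail, 1)[0]  # Hz per frame

# slope < 0 suggests falling intonation ("done"); slope > 0 suggests a rise
# ("more to come"). But as the airport example shows, the same pause can
# precede a continuation or end the turn, and pitch alone cannot say which.
```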
Text-only systems take the opposite approach, analyzing transcribed words and punctuation patterns. But strip away the audio, and you lose critical information. Consider the question "What is your patient ID number?" A transcript reading "123 456" looks the same whether the caller has given their full ID or is about to continue with "789," yet the speaker's intonation makes the intent unmistakable: a rise at the end signals more to come, while a fall carries finality. Text-only systems also tend to overfit to punctuation marks, making them brittle across speech-to-text providers with varying punctuation styles.
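A text-only endpointer, meanwhile, often reduces to something like the hypothetical heuristic below; every rule and threshold here is invented for illustration, but it exhibits both failure modes at once: dependence on punctuation and blindness to prosody.

```python
def text_looks_complete(transcript: str) -> bool:
    """Hypothetical text-only heuristic, shown to illustrate the brittleness.
    Note how much weight falls on punctuation the STT provider
    may or may not emit."""
    t = transcript.strip()
    if not t:
        return False
    if t.endswith((".", "!", "?")):
        return True   # holds only if this STT engine punctuates this way
    if t.lower().endswith((" and", " from", " to", ",")):
        return False  # a trailing connective suggests more is coming
    return True       # no textual signal either way; guess "complete"

# Both the finished ID "123 456" and the unfinished "123 456" (with "789"
# still to come) reach the final branch and are judged complete. The rising
# pitch that signals "I'm not done" is invisible in the transcript.
```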
A Multimodal Solution
Human conversation is multimodal by nature—we simultaneously process acoustic cues and semantic meaning. A robust turn detection system must do the same. We need models that can hear the rising question in "123 456" while also understanding that "I'm departing from San Francisco" is a contextually incomplete response to a question about two airports.
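To make that concrete, the sketch below shows one common way to combine the signals: late fusion of an acoustic embedding and a text embedding into a single end-of-turn probability. This is a schematic in PyTorch with made-up dimensions, not a description of the VoTurn-135M architecture.

```python
import torch
import torch.nn as nn


class MultimodalTurnDetector(nn.Module):
    """Schematic late-fusion classifier: concatenate an acoustic embedding
    (prosody) with a text embedding (semantics plus dialogue context) and
    predict the probability that the user has finished speaking. All
    dimensions here are illustrative assumptions."""

    def __init__(self, audio_dim=256, text_dim=512, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, audio_emb, text_emb):
        # audio_emb: (batch, audio_dim) from an acoustic encoder
        # text_emb:  (batch, text_dim) from a text encoder run over the
        #            transcript together with the agent's preceding question
        fused = torch.cat([audio_emb, text_emb], dim=-1)
        return torch.sigmoid(self.fuse(fused))  # P(turn is complete)
```

The appeal of fusing the streams is that either signal can veto the other: a confident falling contour is overridden when the transcript, in context, is only half an answer, and a rising contour holds the floor even when the text alone would pass a punctuation check.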
This is why we built our turn detection model from the ground up to combine both audio and text signals, creating a system that understands not just how you're speaking, but what you're saying and what the context demands. The result is a model that achieves 94.1% accuracy while running fast enough for real-time voice applications—because natural conversation cannot wait.