If you're building a voice AI agent today, you have more options than ever. Managed platforms get you to a demo in hours. Open-source models like NVIDIA Nemotron and Qwen3-ASR match proprietary quality. Orchestration frameworks like LiveKit handle the real-time plumbing.
But every team that scales voice AI hits the same wall. The managed platform gets expensive. The proprietary STT doesn't support your language. The latency isn't good enough. And suddenly you're building infrastructure instead of product.
I built through every stage. Here's the journey, with real code and real numbers.
Stage 1: Use a Managed Platform
Sign up for Retell, Vapi, or Bland. Configure an agent. Get a phone number. Done in hours, zero code. STT (usually Deepgram), LLM, TTS, phone provisioning — all bundled.
This is great for validating product-market fit. But when voice becomes core to your product: cost scales linearly with volume, you can't swap the STT model, you can't tune endpointing, and you can't self-host for compliance.
Stage 2: Use an Orchestrator + Swap Providers
Switch to LiveKit or Pipecat. Keep using managed STT/LLM/TTS APIs, but choose which ones. Mostly configuration, minimal code:

    from livekit.agents import AgentSession
    from livekit.plugins import deepgram, elevenlabs, openai, silero

    session = AgentSession(
        stt=deepgram.STT(),  # or assemblyai, groq, etc.
        vad=silero.VAD.load(),
        llm=openai.LLM(model="gpt-4o"),  # or groq, google, etc.
        tts=elevenlabs.TTS(),  # or cartesia, groq, etc.
    )

Most voice AI startups live here. It works well up to moderate scale. But you're still paying per minute for STT, still locked into proprietary models, and can't customize inference.
Stage 3: Bring Your Own Inference
This is where it gets interesting. Run open-source speech models on your own GPUs and plug them into LiveKit as a custom STT provider. The payoff: 10-20x cost reduction at scale, full model choice, total control.
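The cost argument is easier to reason about as arithmetic. A rough sketch, in which every price and volume is a hypothetical placeholder rather than a quote from any provider: managed STT bills per audio minute, while self-hosting bills per GPU-hour no matter how many streams share the GPU.

```python
import math

HOURS_PER_MONTH = 730  # average month; an assumption for illustration


def managed_monthly_cost(audio_minutes: float, price_per_minute: float) -> float:
    """Per-minute billing: cost grows linearly with audio volume."""
    return audio_minutes * price_per_minute


def self_hosted_monthly_cost(
    peak_streams: int, streams_per_gpu: int, gpu_price_per_hour: float
) -> float:
    """Always-on GPUs sized for peak concurrency: cost grows in GPU-sized steps."""
    gpus = max(1, math.ceil(peak_streams / streams_per_gpu))
    return gpus * gpu_price_per_hour * HOURS_PER_MONTH


# Hypothetical inputs; substitute your own prices and traffic.
managed = managed_monthly_cost(audio_minutes=1_000_000, price_per_minute=0.005)
hosted = self_hosted_monthly_cost(
    peak_streams=100, streams_per_gpu=50, gpu_price_per_hour=1.0
)
print(f"managed: ${managed:,.0f}/month, self-hosted: ${hosted:,.0f}/month")
```

The ratio depends heavily on utilization and on how many streams one GPU can actually serve, so the inputs matter more than the formula.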
The problem: going from “I have a model” to “I have a production streaming STT endpoint” is a cliff.
3.1 — Deploy a Model
I deployed NVIDIA's nemotron-speech-streaming-en-0.6b on Baseten, an inference platform with GPU hosting and native WebSocket support. Two files:

    # config.yaml
    model_name: nemotron-streaming-asr
    resources:
      accelerator: L4
      use_gpu: true
    runtime:
      transport:
        kind: websocket

    # model.py: load model, receive audio over WebSocket, transcribe
    import nemo.collections.asr as nemo_asr
    import numpy as np


    class Model:
        def load(self):
            self._model = nemo_asr.models.ASRModel.from_pretrained(
                "nvidia/nemotron-speech-streaming-en-0.6b"
            )

        async def websocket(self, websocket):
            audio_buffer = bytearray()
            while True:
                message = await websocket.receive()
                # A frame carries either "text" or "bytes"; .get() avoids a KeyError
                if message.get("text") == "END":
                    # 16-bit PCM -> float32 in [-1, 1)
                    pcm = np.frombuffer(bytes(audio_buffer), dtype=np.int16)
                    pcm = pcm.astype(np.float32) / 32768.0
                    result = self._model.transcribe([pcm], batch_size=1)
                    await websocket.send_json({"text": result[0].text})
                    break
                audio_buffer.extend(message.get("bytes") or b"")

Deploy: truss push. Then wire it into LiveKit with a custom STT plugin (~200 lines) that forwards audio frames from WebRTC to our Baseten WebSocket.
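The wire protocol that server expects is small: binary messages carrying raw PCM, followed by a text message `END`. Here is a minimal client-side sketch of that framing, assuming 16 kHz mono 16-bit audio and a 20 ms frame size. Both values are assumptions, and `frame_messages` is a hypothetical helper, not part of LiveKit or Truss.

```python
from typing import Iterator, Union

SAMPLE_RATE = 16000      # assumed sample rate in Hz
BYTES_PER_SAMPLE = 2     # int16 PCM, matching the server's np.int16 decode


def frame_messages(pcm: bytes, frame_ms: int = 20) -> Iterator[Union[bytes, str]]:
    """Yield the messages a client would send: PCM frames, then "END"."""
    frame_bytes = SAMPLE_RATE * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]  # sent as a binary message
    yield "END"                               # sent as a text message


one_second_of_silence = bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)
messages = list(frame_messages(one_second_of_silence))
print(len(messages))  # 51: fifty 20 ms frames plus the END marker
```

In a real plugin, each yielded item would be passed to the WebSocket client's send call, and the JSON reply would be read for its `text` field.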
3.1 Results: Batch Inference in a Voice Agent
Same voice agent, same LiveKit pipeline, two different STT backends:
| Metric | Groq STT (Whisper) | Baseten STT (Nemotron batch) |
|---|---|---|
| Time to first transcript | TODO | TODO |
| Avg inference latency | TODO | TODO |
| Transcript accuracy | TODO | TODO |
| Partial results | No | No |
| Conversation feel | TODO | TODO |
Both are batch inference — LiveKit's VAD collects the full utterance, sends it to STT, waits for the result. The latency is similar. The accuracy difference comes from our Baseten model getting isolated chunks vs Groq's Whisper getting full utterances. Neither provides partial results while the user speaks.
The model supports streaming; it's literally called “nemotron-speech-streaming.” But model.transcribe() doesn't use any streaming capabilities. It's like buying a sports car and driving it in first gear.
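Why partial results matter can be put as rough arithmetic. In a batch pipeline no text exists until the user stops talking, the VAD decides the utterance is over, and inference finishes. In a streaming pipeline the first text can appear, at the earliest, about one chunk after speech starts. A sketch with hypothetical timings (none of these numbers are measurements from this project):

```python
def batch_first_text_s(
    utterance_s: float, endpoint_delay_s: float, inference_s: float
) -> float:
    """Seconds from speech start until any transcript text exists (batch)."""
    return utterance_s + endpoint_delay_s + inference_s


def streaming_first_text_s(chunk_s: float, per_chunk_inference_s: float) -> float:
    """Earliest time a first partial could exist (streaming)."""
    return chunk_s + per_chunk_inference_s


# Hypothetical values for illustration only.
print(batch_first_text_s(utterance_s=3.0, endpoint_delay_s=0.5, inference_s=0.25))
print(streaming_first_text_s(chunk_s=0.08, per_chunk_inference_s=0.02))
```

The batch figure grows with the length of the utterance; the streaming figure does not.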
3.2 — Cache-Aware Streaming Pipeline
The Nemotron encoder is a 17-layer FastConformer with cache-aware attention. It's designed to process 80ms audio chunks incrementally, carrying forward activation caches between chunks.
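The idea of carrying a cache between chunks can be shown with a toy stand-in. The sketch below is not the FastConformer encoder. It is a causal sliding-window sum that keeps the last `left` samples as its cache, which is enough to show the property that matters: processing audio chunk by chunk with the cache gives the same output as processing it all at once. All names here are hypothetical.

```python
from typing import List, Tuple


def causal_window_sum(
    chunk: List[int], cache: List[int], left: int
) -> Tuple[List[int], List[int]]:
    """For each sample, sum it with up to `left` samples before it.

    `cache` holds the tail of earlier chunks so the window can reach back
    across the chunk boundary.
    """
    context = cache + chunk
    offset = len(cache)
    out = []
    for i in range(len(chunk)):
        j = offset + i
        out.append(sum(context[max(0, j - left): j + 1]))
    new_cache = context[-left:] if left > 0 else []
    return out, new_cache


def run_streaming(samples: List[int], chunk_size: int, left: int) -> List[int]:
    """Feed `samples` chunk by chunk, carrying the cache forward."""
    cache: List[int] = []
    out: List[int] = []
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        chunk_out, cache = causal_window_sum(chunk, cache, left)
        out.extend(chunk_out)
    return out


samples = list(range(10))
full, _ = causal_window_sum(samples, [], left=3)
assert run_streaming(samples, chunk_size=4, left=3) == full
```

The real encoder caches per-layer activations rather than raw samples, but the contract is the same.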
To use this, I needed ~2800 lines of pipeline code:

    from dataclasses import dataclass, field

    import torch

    # ~150 lines of config dataclasses
    @dataclass
    class StreamingConfig:
        sample_rate: int = 16000
        att_context_size: list[int] = field(default_factory=lambda: [70, 1])
        num_slots: int = 256  # max concurrent streams

    # Build the pipeline, pre-cast to bfloat16
    pipeline = StreamingPipelineBuilder.build_pipeline(config)
    pipeline.asr_model.asr_model.to(torch.bfloat16)

    # Per-chunk inference: 80ms audio → encoder with cache
    stream_id = pipeline.open_streaming_session(options)
    pipeline.append_streaming_audio(stream_id, audio_chunk)  # 80ms
    ready = pipeline.get_ready_frames_and_features({stream_id})
    pipeline.encode_batch(frames, features, rpad)  # carries cache forward

3.3 — Voice Activity Detection
Silero VAD integrated into the inference loop. Runs batched across all concurrent streams — 10 streams × 2 windows = 1 GPU call, not 10 sequential calls. ~200 lines.
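The batching idea, stacking windows from every stream into one call and routing the scores back, can be sketched without a GPU. Here `model_fn` is a hypothetical stand-in for the Silero call, and the energy-based fake model exists only to make the example runnable.

```python
from typing import Callable, Dict, List, Sequence

Window = Sequence[float]


def batched_vad(
    pending: Dict[str, List[Window]],
    model_fn: Callable[[List[Window]], List[float]],
) -> Dict[str, List[float]]:
    """Score every pending window with a single model call."""
    batch: List[Window] = []
    owners: List[str] = []
    for stream_id, windows in pending.items():
        for window in windows:
            batch.append(window)
            owners.append(stream_id)

    probs = model_fn(batch) if batch else []  # one call for all streams

    routed: Dict[str, List[float]] = {stream_id: [] for stream_id in pending}
    for stream_id, prob in zip(owners, probs):
        routed[stream_id].append(prob)
    return routed


batch_sizes: List[int] = []  # records the size of each model call


def fake_model(batch: List[Window]) -> List[float]:
    """Stand-in scorer: 1.0 if the window has any energy, else 0.0."""
    batch_sizes.append(len(batch))
    return [1.0 if any(abs(s) > 0.0 for s in w) else 0.0 for w in batch]


pending = {f"stream-{i}": [[0.0] * 4, [0.1] * 4] for i in range(10)}
scores = batched_vad(pending, fake_model)
print(batch_sizes)  # [20]: ten streams, two windows each, one call
```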

    class BatchVADProcessor:
        def drain_and_write(self, feed_audio_fn):
            # Drain staging queues from all client recv_loops
            # Stack windows across streams → batch tensor
            # GPU: probs = silero_model(batch_tensor, sr)
            # Per-stream: speech → feed to ASR, silence → flush encoder
            ...

3.4 — Beam Decoder with Partial Results
RNN-T beam search tracking multiple hypotheses per stream. Emits partial transcripts as prefixes stabilize. ~2200 lines across two implementations (streaming and tensor-batched).
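One simple way to decide what is safe to emit, and not necessarily the rule this decoder uses, is to emit only the words that every surviving beam hypothesis agrees on. A minimal sketch (function names are hypothetical):

```python
from typing import List


def stable_prefix(hypotheses: List[List[str]]) -> List[str]:
    """Longest word prefix shared by all beam hypotheses."""
    if not hypotheses:
        return []
    prefix = []
    for words in zip(*hypotheses):
        if len(set(words)) != 1:
            break
        prefix.append(words[0])
    return prefix


def new_partial(hypotheses: List[List[str]], already_emitted: int) -> List[str]:
    """Words that became stable since the last emission."""
    return stable_prefix(hypotheses)[already_emitted:]


beams = [
    "I want to schedule a meeting".split(),
    "I want to schedule a meaning".split(),
    "I want to schedule the meeting".split(),
]
print(" ".join(stable_prefix(beams)))  # I want to schedule
```

Words past the shared prefix stay internal to the beam until the hypotheses converge, so emitted text never has to be retracted under this rule.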

    class StreamingBeamDecoder:
        def step_batch(self, encoder_output, stream_ids):
            # Expand beam hypotheses via joint network
            # Prune to top-K per stream
            # Emit stable prefix as partial transcript
            # "I want" → "I want to schedule" → "I want to schedule a meeting"
            ...

3.5 — Concurrent Batching & Production Server
The full production WebSocket server: one inference loop serving all concurrent streams, batched encoder calls, per-stream result routing. Auth, metrics, health monitoring. ~1100 lines.

    async def _inference_loop(self):
        while True:
            # 1. VAD: batched across all streams
            self.vad.drain_and_write(self.model.feed_audio)
            # 2. Batch encode: one GPU call for ALL streams
            pipeline.encode_batch(all_frames, all_features, rpad)
            # 3. Beam decode per stream
            beam_decoder.step_batch(encoder_output, stream_ids)
            # 4. Route results to correct WebSocket connections

That took the server from 1 stream per GPU to 70-80 concurrent streams on a single L4.
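The shape of that loop, one batched call per tick with results routed back to each connection, can be sketched with asyncio queues standing in for WebSocket connections. Everything here is a hypothetical simplification: `encode_batch` is a placeholder callable, and the real loop also runs VAD and beam decoding.

```python
import asyncio
from typing import Callable, Dict, List


async def inference_tick(
    inboxes: Dict[str, asyncio.Queue],
    outboxes: Dict[str, asyncio.Queue],
    encode_batch: Callable[[List[bytes]], List[str]],
) -> int:
    """Run one loop iteration; return how many streams were served."""
    stream_ids: List[str] = []
    chunks: List[bytes] = []
    for stream_id, inbox in inboxes.items():
        if not inbox.empty():
            stream_ids.append(stream_id)
            chunks.append(inbox.get_nowait())
    if not chunks:
        return 0

    results = encode_batch(chunks)  # one call covers every ready stream

    for stream_id, result in zip(stream_ids, results):
        await outboxes[stream_id].put(result)
    return len(chunks)


async def demo() -> List[str]:
    inboxes = {sid: asyncio.Queue() for sid in ("a", "b", "c")}
    outboxes = {sid: asyncio.Queue() for sid in inboxes}
    inboxes["a"].put_nowait(b"\x01\x02")
    inboxes["c"].put_nowait(b"\x03")
    await inference_tick(
        inboxes, outboxes, lambda batch: [f"{len(c)} bytes" for c in batch]
    )
    return [outboxes["a"].get_nowait(), outboxes["c"].get_nowait()]


print(asyncio.run(demo()))  # ['2 bytes', '1 bytes']
```

Stream "b" had nothing queued, so it is skipped for that tick and costs nothing.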
3.6 — GPU Optimizations
Custom Triton kernel fusing log-softmax + top-K in the joint network. Vectorized feature buffer updates. ~300 lines.

    # Before: 6 separate CUDA kernels, read/write full vocab tensor each time
    log_probs = torch.log_softmax(joint_output, dim=-1)
    topk_vals, topk_ids = torch.topk(log_probs, k)

    # After: 1 fused Triton kernel, single read, single write
    topk_vals, topk_ids = fused_joint_topk(joint_output, k)
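Part of why these two operations fuse well is algebraic: log-softmax subtracts the same constant (the log-sum-exp) from every logit, so the top-K indices can be taken from the raw logits and only K values need the subtraction. The sketch below checks that identity in plain Python. It illustrates the math only and makes no claim about how `fused_joint_topk` is implemented; both function names are hypothetical.

```python
import math
from typing import List, Tuple


def log_sum_exp(logits: List[float]) -> float:
    """Numerically stable log(sum(exp(x)))."""
    m = max(logits)
    return m + math.log(sum(math.exp(x - m) for x in logits))


def topk_after_log_softmax(logits: List[float], k: int) -> Tuple[List[float], List[int]]:
    """Two passes: materialize the full log-prob vector, then select top-K."""
    lse = log_sum_exp(logits)
    log_probs = [x - lse for x in logits]
    ids = sorted(range(len(logits)), key=lambda i: log_probs[i], reverse=True)[:k]
    return [log_probs[i] for i in ids], ids


def topk_then_normalize(logits: List[float], k: int) -> Tuple[List[float], List[int]]:
    """Select top-K on raw logits, then normalize only the K winners."""
    lse = log_sum_exp(logits)
    ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [logits[i] - lse for i in ids], ids


logits = [2.0, -1.0, 0.5, 3.0]
assert topk_after_log_softmax(logits, 2) == topk_then_normalize(logits, 2)
```

The second form never writes the full vocabulary-sized log-prob vector, which is the memory traffic a fused kernel avoids.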
Stage 3 Results: After Building the Streaming Runtime
| Metric | Batch (3.1) | Streaming (Stage 3) |
|---|---|---|
| Chunk size | Full utterance | 80ms |
| Per-chunk latency | TODO | ~15ms |
| Partial results | No | Yes (streaming) |
| Concurrent streams (1x L4) | 1 | 70-80 |
| Lines of code | ~115 | ~7000 |
The Total Cost
| Stage | Lines | Time | What You Get |
|---|---|---|---|
| Deploy model | ~115 | Hours | Batch inference |
| LiveKit integration | ~260 | 1-2 days | Voice agent with batch STT |
| Streaming pipeline | ~2800 | 2-3 days | 80ms chunk processing |
| VAD | ~200 | 1-2 days | Speech detection |
| Beam decoder | ~2200 | 2-3 days | Partial results, accuracy |
| Server + batching | ~1100 | 3-5 days | 70+ concurrent streams |
| Optimization | ~300 | 3-5 days | Custom Triton kernels |
| Total | ~7000 | 4-6 weeks | Production streaming ASR |
And that's for one model (FastConformer + RNN-T). Adding Whisper (attention decoder), Qwen3-ASR (LLM-based), or future architectures means redoing the decoder integration.
Where This Is Going
Every voice AI team that reaches Stage 3 builds the same thing. The streaming pipeline, the VAD, the beam decoder, the batching — it's the same engineering, repeated independently, for every model.
The LLM ecosystem converged: one architecture, standardized serving (vLLM), config-only deployment. Baseten deploys an LLM with zero code. Speech hasn't converged — multiple architectures, no standard runtime, no config-only path.
The gap between “deploy a model” and “production streaming inference” is 7000 lines and 4-6 weeks. Every team. Every model. Speech doesn't have its vLLM yet.
That's what we're building at Monosemantic.