April 2026

Building a Voice AI Agent From Scratch

What it actually takes to go from an open-source speech model to a production voice agent — stage by stage, with real code and real numbers.

If you're building a voice AI agent today, you have more options than ever. Managed platforms get you to a demo in hours. Open-source models like NVIDIA Nemotron and Qwen3-ASR match proprietary quality. Orchestration frameworks like LiveKit handle the real-time plumbing.

But every team that scales voice AI hits the same wall. The managed platform gets expensive. The proprietary STT doesn't support your language. The latency isn't good enough. And suddenly you're building infrastructure instead of product.

I built through every stage. Here's the journey, with real code and real numbers.

Stage 1: Use a Managed Platform

Sign up for Retell, Vapi, or Bland. Configure an agent. Get a phone number. Done in hours, zero code. STT (usually Deepgram), LLM, TTS, phone provisioning — all bundled.

This is great for validating product-market fit. But when voice becomes core to your product: cost scales linearly with volume, you can't swap the STT model, you can't tune endpointing, and you can't self-host for compliance.

Stage 2: Use an Orchestrator + Swap Providers

Switch to LiveKit or Pipecat. Keep using managed STT/LLM/TTS APIs, but choose which ones. Mostly configuration, minimal code:

session = AgentSession(
    stt=deepgram.STT(),              # or assemblyai, groq, etc.
    vad=silero.VAD.load(),
    llm=openai.LLM(model="gpt-4o"),  # or groq, google, etc.
    tts=elevenlabs.TTS(),             # or cartesia, groq, etc.
)

Most voice AI startups live here. It works well up to moderate scale. But you're still paying per-minute for STT, still locked into proprietary models, and can't customize inference.

Stage 3: Bring Your Own Inference

This is where it gets interesting. Run open-source speech models on your own GPUs, plug them into LiveKit as a custom STT provider. 10-20x cost reduction at scale, full model choice, total control.

The problem: going from “I have a model” to “I have a production streaming STT endpoint” is a cliff.

3.1 — Deploy a Model

I deployed NVIDIA's nemotron-speech-streaming-en-0.6b on Baseten — an inference platform with GPU hosting and native WebSocket support. Two files:

# config.yaml
model_name: nemotron-streaming-asr
resources:
  accelerator: L4
  use_gpu: true
runtime:
  transport:
    kind: websocket

# model.py: load model, receive audio over WebSocket, transcribe
import numpy as np
import nemo.collections.asr as nemo_asr


class Model:
    def load(self):
        self._model = nemo_asr.models.ASRModel.from_pretrained(
            "nvidia/nemotron-speech-streaming-en-0.6b"
        )

    async def websocket(self, websocket):
        audio_buffer = bytearray()
        while True:
            message = await websocket.receive()
            # A text frame "END" closes the utterance; binary frames carry PCM.
            if message.get("text") == "END":
                # 16-bit PCM -> float32 in [-1, 1)
                pcm = (
                    np.frombuffer(bytes(audio_buffer), dtype=np.int16)
                    .astype(np.float32) / 32768.0
                )
                result = self._model.transcribe([pcm], batch_size=1)
                await websocket.send_json({"text": result[0].text})
                break
            audio_buffer.extend(message.get("bytes") or b"")

Deploy: truss push. Then wire it into LiveKit with a custom STT plugin (~200 lines) that forwards audio frames from WebRTC to our Baseten WebSocket.
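
The client side of that wire protocol is simple: binary PCM frames, followed by a text "END" marker. Here is a minimal sketch of the framing only, not the actual LiveKit plugin. The function name `frame_messages`, the 20ms frame size, and the 16kHz sample rate are assumptions made for illustration.

```python
from typing import Iterator, Union

SAMPLE_RATE = 16_000     # Hz; assumed input rate
BYTES_PER_SAMPLE = 2     # 16-bit PCM, matching np.int16 on the server
FRAME_MS = 20            # hypothetical frame size, chosen for illustration


def frame_messages(pcm: bytes, frame_ms: int = FRAME_MS) -> Iterator[Union[bytes, str]]:
    """Yield the WebSocket messages for one utterance: PCM frames, then "END"."""
    frame_bytes = SAMPLE_RATE * frame_ms // 1000 * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
    yield "END"


# One second of silence becomes 50 binary frames of 640 bytes plus the END marker.
messages = list(frame_messages(bytes(SAMPLE_RATE * BYTES_PER_SAMPLE)))
print(len(messages), len(messages[0]), messages[-1])
```

A real plugin would send each yielded item over the socket as it arrives from WebRTC instead of building a list first.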

Stage 1+2 Results: Batch Inference in a Voice Agent

Same voice agent, same LiveKit pipeline, two different STT backends:

Metric | Groq STT (Whisper) | Baseten STT (Nemotron batch)
Time to first transcript | TODO | TODO
Avg inference latency | TODO | TODO
Transcript accuracy | TODO | TODO
Partial results | No | No
Conversation feel | TODO | TODO

Both are batch inference — LiveKit's VAD collects the full utterance, sends it to STT, waits for the result. The latency is similar. The accuracy difference comes from our Baseten model getting isolated chunks vs Groq's Whisper getting full utterances. Neither provides partial results while the user speaks.

The model supports streaming; it's literally called “nemotron-speech-streaming.” But model.transcribe() doesn't use any of its streaming capabilities. It's like buying a sports car and driving it in first gear.

3.2 — Cache-Aware Streaming Pipeline

The Nemotron encoder is a 17-layer FastConformer with cache-aware attention. It's designed to process 80ms audio chunks incrementally, carrying forward activation caches between chunks.

To use this, I needed ~2800 lines of pipeline code:

from dataclasses import dataclass, field

import torch

# ~150 lines of config dataclasses
@dataclass
class StreamingConfig:
    sample_rate: int = 16000
    att_context_size: list[int] = field(default_factory=lambda: [70, 1])
    num_slots: int = 256  # max concurrent streams

# Build the pipeline, pre-cast to bfloat16
pipeline = StreamingPipelineBuilder.build_pipeline(config)
pipeline.asr_model.asr_model.to(torch.bfloat16)

# Per-chunk inference: 80ms audio → encoder with cache
stream_id = pipeline.open_streaming_session(options)
pipeline.append_streaming_audio(stream_id, audio_chunk)  # 80ms
ready = pipeline.get_ready_frames_and_features({stream_id})
pipeline.encode_batch(frames, features, rpad)  # carries cache forward
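
The idea of carrying a cache between chunks can be shown with a toy. The sketch below is not the FastConformer encoder: `causal_mean` is a hypothetical stand-in layer that averages over a fixed left context, and `LEFT_CONTEXT = 64` is an arbitrary value unrelated to `att_context_size`. It only demonstrates that processing 80ms chunks with a carried cache gives the same output as one pass over the full signal.

```python
from typing import List, Tuple

SAMPLE_RATE = 16_000
CHUNK_MS = 80
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000   # 1280 samples per 80ms chunk
LEFT_CONTEXT = 64                                # arbitrary, for illustration only


def causal_mean(samples: List[float], cache: List[float],
                context: int) -> Tuple[List[float], List[float]]:
    """Toy cache-aware layer: mean over the last `context` samples.

    `cache` holds the trailing samples of earlier chunks, so each chunk sees
    the same left context it would see in a full-utterance pass.
    """
    padded = cache + samples
    out = []
    for i in range(len(cache), len(padded)):
        window = padded[max(0, i - context + 1): i + 1]
        out.append(sum(window) / len(window))
    new_cache = padded[-(context - 1):] if context > 1 else []
    return out, new_cache


signal = [float(i % 7) for i in range(3 * CHUNK_SAMPLES)]
full, _ = causal_mean(signal, [], LEFT_CONTEXT)

streamed, cache = [], []
for start in range(0, len(signal), CHUNK_SAMPLES):
    out, cache = causal_mean(signal[start:start + CHUNK_SAMPLES], cache, LEFT_CONTEXT)
    streamed.extend(out)

print(streamed == full)
```

The real encoder carries per-layer activation caches instead of raw samples, but the contract is the same: each call takes the previous cache and returns the next one.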

3.3 — Voice Activity Detection

Silero VAD integrated into the inference loop. Runs batched across all concurrent streams — 10 streams × 2 windows = 1 GPU call, not 10 sequential calls. ~200 lines.

class BatchVADProcessor:
    def drain_and_write(self, feed_audio_fn):
        # Drain staging queues from all client recv_loops
        # Stack windows across streams → batch tensor
        # GPU: probs = silero_model(batch_tensor, sr)
        # Per-stream: speech → feed to ASR, silence → flush encoder
        ...  # implementation elided
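
The batching step can be sketched in plain Python. This is a simplified illustration, not the actual processor: `batched_vad` and `fake_model` are hypothetical names, and the fake model scores "speech probability" as mean absolute amplitude, standing in for Silero.

```python
from typing import Callable, Dict, List, Sequence

Window = Sequence[float]


def batched_vad(
    pending: Dict[str, List[Window]],
    model: Callable[[List[Window]], List[float]],
    threshold: float = 0.5,
) -> Dict[str, List[bool]]:
    """Score every pending window from every stream with a single model call."""
    batch: List[Window] = []
    owners: List[str] = []
    for stream_id, windows in pending.items():
        for window in windows:
            batch.append(window)
            owners.append(stream_id)

    probs = model(batch) if batch else []   # one call for the whole batch

    decisions: Dict[str, List[bool]] = {stream_id: [] for stream_id in pending}
    for stream_id, prob in zip(owners, probs):
        decisions[stream_id].append(prob >= threshold)
    return decisions


calls = []


def fake_model(batch):
    # Hypothetical stand-in for the VAD model; records the batch size per call.
    calls.append(len(batch))
    return [sum(abs(x) for x in w) / len(w) for w in batch]


# 10 streams x 2 windows each -> one model call with a batch of 20.
pending = {f"stream-{i}": [[0.9] * 4, [0.0] * 4] for i in range(10)}
decisions = batched_vad(pending, fake_model)
print(calls, decisions["stream-3"])
```

Keeping the `owners` list alongside the batch is what lets the results be split back out per stream after the single call.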

3.4 — Beam Decoder with Partial Results

RNN-T beam search tracking multiple hypotheses per stream. Emits partial transcripts as prefixes stabilize. ~2200 lines across two implementations (streaming and tensor-batched).

class StreamingBeamDecoder:
    def step_batch(self, encoder_output, stream_ids):
        # Expand beam hypotheses via joint network
        # Prune to top-K per stream
        # Emit stable prefix as partial transcript
        # "I want" → "I want to schedule" → "I want to schedule a meeting"
        ...  # implementation elided
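
One simple way to define a "stable prefix" is the longest run of leading words that every hypothesis in the beam agrees on. The sketch below assumes that rule; the function name `stable_prefix` is hypothetical, and a production decoder may apply stricter criteria, such as requiring agreement across several decoding steps.

```python
from typing import List


def stable_prefix(hypotheses: List[str]) -> str:
    """Longest word-level prefix shared by every hypothesis in the beam."""
    if not hypotheses:
        return ""
    token_lists = [h.split() for h in hypotheses]
    prefix = []
    for words in zip(*token_lists):      # stops at the shortest hypothesis
        if len(set(words)) != 1:
            break
        prefix.append(words[0])
    return " ".join(prefix)


beam = [
    "I want to schedule a meeting",
    "I want to schedule a meaning",
    "I want to schedule the meeting",
]
print(stable_prefix(beam))  # I want to schedule
```

Only the agreed prefix is emitted as a partial, so the words the beam still disagrees on never reach the user and never have to be retracted.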

3.5 — Concurrent Batching & Production Server

The full production WebSocket server: one inference loop serving all concurrent streams, batched encoder calls, per-stream result routing. Auth, metrics, health monitoring. ~1100 lines.

async def _inference_loop(self):
    while True:
        # 1. VAD: batched across all streams
        self.vad.drain_and_write(self.model.feed_audio)
        # 2. Batch encode — one GPU call for ALL streams
        pipeline.encode_batch(all_frames, all_features, rpad)
        # 3. Beam decode per stream
        beam_decoder.step_batch(encoder_output, stream_ids)
        # 4. Route results to correct WebSocket connections

Went from 1 stream per GPU to 70-80 concurrent streams on a single L4.
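
The drain, batch, and route pattern can be sketched with asyncio queues. This is a toy under stated assumptions, not the production server: the queue layout and the name `inference_loop` are hypothetical, and uppercasing a string stands in for the batched GPU call.

```python
import asyncio
from typing import Dict, List


async def inference_loop(inbox: asyncio.Queue,
                         outboxes: Dict[str, asyncio.Queue],
                         steps: int) -> List[int]:
    """Each step: drain pending chunks, run one batched call, route by stream id."""
    batch_sizes = []
    for _ in range(steps):
        batch = [await inbox.get()]          # wait for at least one chunk
        while not inbox.empty():             # then take everything else queued
            batch.append(inbox.get_nowait())
        # Stand-in for a single batched model call over all streams.
        results = [chunk.upper() for _, chunk in batch]
        for (stream_id, _), result in zip(batch, results):
            outboxes[stream_id].put_nowait(result)
        batch_sizes.append(len(batch))
    return batch_sizes


async def main():
    inbox = asyncio.Queue()
    outboxes = {sid: asyncio.Queue() for sid in ("a", "b", "c")}
    for sid in outboxes:
        inbox.put_nowait((sid, f"chunk from {sid}"))
    sizes = await inference_loop(inbox, outboxes, steps=1)
    return sizes, {sid: q.get_nowait() for sid, q in outboxes.items()}


sizes, routed = asyncio.run(main())
print(sizes, routed["b"])
```

In a real server each WebSocket handler would push into the shared inbox and read from its own outbox, so the GPU sees one batch per step regardless of how many clients are connected.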

3.6 — GPU Optimizations

Custom Triton kernel fusing log-softmax + top-K in the joint network. Vectorized feature buffer updates. ~300 lines.

# Before: 6 separate CUDA kernels, read/write full vocab tensor each time
log_probs = torch.log_softmax(joint_output, dim=-1)
topk_vals, topk_ids = torch.topk(log_probs, k)

# After: 1 fused Triton kernel, single read, single write
topk_vals, topk_ids = fused_joint_topk(joint_output, k)
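
One reason these two steps can be fused: log-softmax subtracts the same constant (the log-sum-exp) from every logit, so the top-K indices of the log-probs are the top-K indices of the raw logits. The pure-Python sketch below checks that identity. It shows only the math, not the Triton kernel, and whether the kernel is structured this way is an assumption; both function names are hypothetical.

```python
import math
from typing import List, Tuple


def log_sum_exp(logits: List[float]) -> float:
    m = max(logits)                      # subtract the max for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in logits))


def naive_topk(logits: List[float], k: int) -> List[Tuple[float, int]]:
    """Materialize every log-prob, then select the top k."""
    lse = log_sum_exp(logits)
    log_probs = [x - lse for x in logits]
    return sorted(((lp, i) for i, lp in enumerate(log_probs)), reverse=True)[:k]


def fused_topk(logits: List[float], k: int) -> List[Tuple[float, int]]:
    """Select the top k raw logits, then normalize only those k values."""
    lse = log_sum_exp(logits)
    top = sorted(((x, i) for i, x in enumerate(logits)), reverse=True)[:k]
    return [(x - lse, i) for x, i in top]


logits = [2.0, -1.0, 0.5, 3.0, 0.0]
naive = naive_topk(logits, k=2)
fused = fused_topk(logits, k=2)
print([i for _, i in naive], [i for _, i in fused])
```

The fused form never writes the full vocabulary-sized log-prob tensor, which is where the memory traffic savings would come from on a GPU.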

Stage 3 Results: After Building the Streaming Runtime

Metric | Batch (Stage 1) | Streaming (Stage 3)
Chunk size | Full utterance | 80ms
Per-chunk latency | TODO | ~15ms
Partial results | No | Yes (streaming)
Concurrent streams (1x L4) | 1 | 70-80
Lines of code | ~115 | ~7000

The Total Cost

Stage | Lines | Time | What You Get
Deploy model | ~115 | Hours | Batch inference
LiveKit integration | ~260 | 1-2 days | Voice agent with batch STT
Streaming pipeline | ~2800 | 2-3 days | 80ms chunk processing
VAD | ~200 | 1-2 days | Speech detection
Beam decoder | ~2200 | 2-3 days | Partial results, accuracy
Server + batching | ~1100 | 3-5 days | 70+ concurrent streams
Optimization | ~300 | 3-5 days | Custom Triton kernels
Total | ~7000 | 4-6 weeks | Production streaming ASR

And that's for one model (FastConformer + RNN-T). Adding Whisper (attention decoder), Qwen3-ASR (LLM-based), or future architectures means redoing the decoder integration.

Where This Is Going

Every voice AI team that reaches Stage 3 builds the same thing. The streaming pipeline, the VAD, the beam decoder, the batching — it's the same engineering, repeated independently, for every model.

The LLM ecosystem converged: one architecture, standardized serving (vLLM), config-only deployment. Baseten deploys an LLM with zero code. Speech hasn't converged — multiple architectures, no standard runtime, no config-only path.

The gap between “deploy a model” and “production streaming inference” is 7000 lines and 4-6 weeks. Every team. Every model. Speech doesn't have its vLLM yet.

That's what we're building at Monosemantic.

Code: github.com/monosemantic/voice-ai-demo