April 2026

Open Source Speech Models Need a Specialist

Open-source ASR models have gotten remarkably good. NVIDIA's FastConformer and Nemotron, Meta's MMS, OpenAI's Whisper — the accuracy gap with commercial APIs like Deepgram and Google Speech has nearly closed. But there's a catch: getting these models into production is still brutally hard.

The problem isn't the model. It's everything around it. You need streaming support (most models are batch-only out of the box), you need to handle WebSocket connections at scale, you need GPU memory management for concurrent streams, and you need sub-200ms latency for real-time voice agents. None of this ships with the model checkpoint.
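To make the batch-to-streaming gap concrete, here is a minimal sketch of the buffering layer a streaming server needs in front of a batch-only model: network packets arrive in arbitrary sizes, but the model wants fixed-length chunks. The class name and sizes are illustrative, not from any particular framework.

```python
class StreamBuffer:
    """Accumulate arbitrary-sized audio packets into fixed, model-sized chunks.

    Batch-only ASR models expect fixed-length input, but WebSocket packets
    arrive at whatever granularity the client's audio stack produces.
    """

    def __init__(self, chunk_samples: int):
        self.chunk_samples = chunk_samples
        self._buf: list[float] = []

    def feed(self, samples: list[float]) -> list[list[float]]:
        """Append incoming samples; return zero or more complete chunks."""
        self._buf.extend(samples)
        chunks = []
        while len(self._buf) >= self.chunk_samples:
            chunks.append(self._buf[: self.chunk_samples])
            self._buf = self._buf[self.chunk_samples :]
        return chunks


# At 16 kHz, a 100 ms chunk is 1600 samples.
buf = StreamBuffer(chunk_samples=1600)
```

A production version also has to flush partial chunks on end-of-utterance and batch chunks from many concurrent connections onto the GPU, which is where most of the real complexity lives.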

We ran benchmarks comparing Nemotron, Whisper, and Deepgram Nova-3 on LibriSpeech test-clean. The WER numbers were within a point of each other. But Deepgram's P95 latency was roughly 3x lower out of the box — not because their model is faster, but because they've spent years optimizing the inference stack around it.
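For readers unfamiliar with the metric: WER is word-level edit distance divided by reference length. A minimal reference implementation (ours, not any benchmark suite's) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words (rolling DP row).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution (or match)
            ))
        prev = cur
    return prev[-1] / len(ref)
```

"Within a point" of WER on test-clean means roughly one extra word error per hundred reference words — small enough that, for most voice-agent workloads, latency dominates the buying decision.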

That's the gap. Open-source speech models need a dedicated inference layer — one that handles streaming, batching, GPU scheduling, and protocol translation (WebSocket, gRPC, Pipecat, LiveKit). Existing serving frameworks like vLLM and TensorRT-LLM are built for LLMs. Speech has different constraints: fixed-length audio chunks, CTC/RNN-T decoding, real-time factor targets, and audio-specific preprocessing.
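The real-time factor (RTF) constraint mentioned above interacts directly with the sub-200ms target. A deliberately simplified model (our own back-of-the-envelope, ignoring network and queueing overhead): per-chunk latency is roughly the chunk duration (time spent buffering audio) plus chunk duration times RTF (time spent computing on it).

```python
def chunk_latency_ms(chunk_ms: float, rtf: float) -> float:
    """Approximate per-chunk latency: buffering time + compute time.

    rtf is the real-time factor: seconds of compute per second of audio.
    Streaming is only sustainable at all when rtf < 1.0.
    """
    return chunk_ms + chunk_ms * rtf


def meets_budget(chunk_ms: float, rtf: float, budget_ms: float = 200.0) -> bool:
    return chunk_latency_ms(chunk_ms, rtf) <= budget_ms
```

Under this model, a 100 ms chunk at RTF 0.5 lands at ~150 ms and fits a 200 ms budget, while a 200 ms chunk at the same RTF does not — which is why chunk-size tuning and GPU scheduling matter as much as raw model speed.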

We wrote custom Triton kernels (a fused log-softmax + top-K for joint network decoding) and deployed them into our streaming pipeline. The kernel itself delivered a modest speedup. But it taught us something important: the wins in speech inference come from fusing operations across the entire pipeline, not from optimizing individual model layers.
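To show what the fused kernel computes (not the Triton code itself — this is a plain-Python reference of the math it fuses into one GPU pass), here is numerically stable log-softmax followed by top-K over the joint network's logits:

```python
import math


def log_softmax_topk(logits: list[float], k: int) -> list[tuple[int, float]]:
    """Reference for the fused op: log-softmax, then the k best tokens.

    On GPU, fusing these avoids materializing the full log-prob vector
    in global memory between the two steps.
    """
    # Stable log-sum-exp: subtract the max before exponentiating.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - lse for x in logits]
    # Top-K token indices by log-probability.
    top = sorted(range(len(logits)), key=lambda i: log_probs[i], reverse=True)[:k]
    return [(i, log_probs[i]) for i in top]
```

In RNN-T decoding this runs once per (time step, label step) pair, so even a small per-call saving compounds — but the larger lesson stands: the per-kernel win was dwarfed by pipeline-level fusion.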

The companies spending $1M+/year on Deepgram aren't paying for a better model. They're paying for a production-grade inference stack. Open source can match the model. It needs a specialist to match the stack.

That's what we're building at Monosemantic.