Two Philosophies of Speech Recognition
OpenAI Whisper and Deepgram represent two distinct approaches to building a speech recognition system. Whisper was designed as a universal, multilingual model trained on a vast corpus of internet audio. Deepgram was built as a commercial API-first product, optimized for speed and developer integration. Both are excellent. Neither is universally better.
Understanding which suits a particular use case requires looking at the architecture, benchmarks, pricing model, and practical implications for different workloads.
Architecture
Whisper
Whisper is an encoder-decoder transformer model trained by OpenAI on 680,000 hours of multilingual audio scraped from the web. The architecture processes audio as log-mel spectrogram features, passes them through a convolutional encoder, and decodes to text using a language model decoder.
The model is available in multiple sizes: tiny, base, small, medium, large-v2, and large-v3. The large-v3 model used by Telvr is the most accurate but also the heaviest — running locally requires a capable GPU or significant CPU time.
A key characteristic: Whisper was trained on diverse, noisy audio from the internet. This gives it remarkable robustness to accents, background noise, and informal speech. The tradeoff is that it is not the fastest model and does not offer the streaming/real-time architecture that some use cases require.
Deepgram
Deepgram built its own end-to-end deep learning architecture optimized for real-time streaming transcription. Their Nova-3 model is trained specifically for spoken English (with strong multilingual support added over time) and is architecturally designed to produce low-latency outputs token-by-token.
Deepgram's model is not publicly available as open-source. It runs only via Deepgram's API or on self-hosted Deepgram enterprise deployments. The training data, while extensive, is more curated than Whisper's internet-scale corpus.
Accuracy Benchmarks
Accuracy comparisons are notoriously context-dependent. Both models perform well; the differences emerge in specific conditions.
Word Error Rate (WER) on standard benchmarks:
- Whisper large-v3 and Deepgram Nova-3 are competitive on standard English benchmarks, both achieving WER below 5% on clean audio.
- Whisper large-v3 outperforms Nova-3 on heavily accented speech and mixed-language input.
- Nova-3 outperforms Whisper on streaming use cases where partial results are needed before the utterance is complete.
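WER itself is a simple metric: the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal reference implementation using word-level Levenshtein distance (illustrative, not the exact scoring pipeline any vendor uses):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the quick brown fox", "the quick brown socks"))  # 0.25
```

Note that published WER figures depend heavily on text normalization (casing, punctuation, number formatting), which is one reason cross-vendor comparisons are hard to make apples-to-apples.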
Real-world conditions where Whisper excels:
- Mixed-language speech (code-switching)
- Non-native English with strong accents
- Technical vocabulary without custom training
- Background noise from varied sources (streets, cafes)
Real-world conditions where Deepgram excels:
- Call center audio with known speaker profiles
- Real-time streaming where first-token latency matters
- American English in clean or semi-clean environments
- Speaker diarization (identifying who said what)
Speed and Latency
Whisper (via Groq API, as used by Telvr): Under 1 second for the transcription step alone. Groq's inference hardware is purpose-built for transformer models, enabling Whisper large-v3 to run far faster than local GPU inference.
Whisper (local, Apple M3): 3-6 seconds for a 30-second audio clip. Smaller models run faster.
Deepgram Nova-3 (streaming): 300-500ms for first word appearance in streaming mode. For batch transcription of a complete audio file, total latency is similar to Whisper via API.
The streaming capability is Deepgram's standout advantage for real-time applications. For push-to-talk workflows (record, stop, get result), the latency difference between Whisper via Groq and Deepgram is minimal in practice.
Language Support
Whisper large-v3: Supports 99 languages. Performance degrades gracefully for lower-resource languages rather than failing completely. Automatic language detection is built in.
Deepgram Nova-3: Strong English support, with additional languages added over time. As of 2026, around 35 languages with varying quality levels. English accuracy is excellent; many other languages are still below Whisper's level.
For multilingual workflows, Whisper is the clear choice. For English-primary applications where speed and streaming matter, Deepgram is competitive.
Pricing
Whisper (OpenAI API): $0.006 per minute. No streaming option.
Whisper (via Groq API): Varies by tier. Fast inference, competitive pricing for developer workloads.
Deepgram Nova-3: Starting at $0.0043 per minute for pay-as-you-go. Volume discounts available. Streaming is billed at the same rate.
Telvr's usage cost: EUR 0.03 per minute, which reflects the combined cost of transcription plus AI enrichment processing. Raw Deepgram or Whisper API is cheaper per minute, but those are raw APIs without the application layer.
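At these rates the absolute cost difference only matters at volume, which a quick calculation makes clear (rates are the per-minute prices quoted above; the 10,000-minute figure is an arbitrary example):

```python
# Monthly transcription cost at the per-minute rates quoted above.
RATES_USD_PER_MIN = {
    "whisper_openai": 0.006,
    "deepgram_nova3": 0.0043,
}

def monthly_cost(minutes_per_month: float, rate_per_min: float) -> float:
    """Raw API cost in USD, rounded to cents."""
    return round(minutes_per_month * rate_per_min, 2)

# e.g. 10,000 minutes of audio per month:
for name, rate in RATES_USD_PER_MIN.items():
    print(name, monthly_cost(10_000, rate))
# whisper_openai 60.0
# deepgram_nova3 43.0
```

At hobbyist volumes the gap is a rounding error; at call-center volumes, the per-minute rate (and negotiated discounts) starts to dominate the decision.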
Developer Experience
Whisper (OpenAI API):
- Simple REST endpoint, standard audio file upload
- No streaming
- Audio file size limits (25MB free, 100MB paid)
- Response time suitable for push-to-talk workflows, not real-time captioning
Deepgram:
- WebSocket API for real-time streaming
- REST API for batch files
- More features: speaker diarization, keyword boosting, custom vocabulary
- Better developer docs for real-time use cases
Self-hosted Whisper:
- Fully open-source, Docker-deployable
- No API costs
- Requires GPU infrastructure
- Maximum flexibility for custom pipelines
Which to Use for Which Use Case
Push-to-talk desktop apps: Whisper large-v3 via a fast inference API. The accuracy and language support make it the better choice, and latency is comparable to Deepgram once you factor in the full pipeline.
Real-time captioning / live transcription: Deepgram streaming API. The sub-500ms first-token latency is necessary for readable live captions.
Call center / phone audio: Deepgram with custom vocabulary and speaker diarization features.
Multilingual applications: Whisper. No alternative matches its 99-language coverage with automatic detection.
Privacy-sensitive, local deployment: Self-hosted Whisper. Deepgram's self-hosted option exists but is enterprise-only.
Cost-sensitive, high-volume English transcription: Deepgram Nova-3 at $0.0043/min edges out OpenAI's $0.006/min.
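The guidance above can be condensed into a small decision helper. The rules are the ones this section lays out; the function itself is an illustrative sketch, and real deployments weigh more factors (cost tiers, audio profile, compliance):

```python
def recommend_engine(*, streaming: bool, multilingual: bool,
                     local_only: bool, diarization: bool = False) -> str:
    """Condense the use-case guidance above into one decision function.
    Illustrative only -- precedence of the rules is a judgment call."""
    if local_only:
        return "self-hosted Whisper"   # Deepgram self-hosting is enterprise-only
    if multilingual:
        return "Whisper large-v3"      # 99-language coverage with auto-detection
    if streaming or diarization:
        return "Deepgram Nova-3"       # sub-500ms first token, built-in diarization
    return "Whisper large-v3 via fast inference API"  # push-to-talk default

print(recommend_engine(streaming=False, multilingual=True, local_only=False))
# Whisper large-v3
```

The ordering encodes the section's hard constraints first (local deployment, language coverage) before the softer latency and feature preferences.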
What Telvr Uses
Telvr uses Whisper large-v3 via Groq's inference API. The choice was deliberate: large-v3 provides the highest accuracy across languages, Groq's hardware brings latency down to under one second for the transcription step, and the automatic language detection means users do not need to configure anything when switching languages.
The enrichment layer that follows — AI post-processing to clean output, format emails, structure notes — is not part of either Whisper or Deepgram. It is a separate LLM step that transforms raw transcription into formatted, usable text.
Conclusion
Whisper and Deepgram are not direct competitors so much as different tools for different jobs. Whisper large-v3 is the accuracy leader for multilingual, noisy, real-world audio. Deepgram Nova-3 is the speed and streaming leader for English-primary, real-time applications.
For a desktop productivity tool where quality matters over real-time streaming, Whisper large-v3 via a fast inference API is the better foundation. For applications where you need words to appear as the user speaks, Deepgram's streaming architecture is purpose-built for that use case.