Two Philosophies of Speech Recognition
OpenAI Whisper and Deepgram represent two distinct approaches to building a speech recognition system. Whisper was designed as a universal, multilingual model trained on a vast corpus of internet audio. Deepgram was built as a commercial API-first product, optimized for speed and developer integration. Both are excellent. Neither is universally better.
Understanding which suits a particular use case requires looking at the architecture, benchmarks, pricing model, and practical implications for different workloads.
Architecture
Whisper
Whisper is an encoder-decoder transformer model trained by OpenAI on 680,000 hours of multilingual audio scraped from the web. The architecture processes audio as log-mel spectrogram features, passes them through a convolutional encoder, and decodes to text using a language model decoder.
The model is available in multiple sizes: tiny, base, small, medium, large-v2, and large-v3. The large-v3 model used by Telvr is the most accurate but also the heaviest — running locally requires a capable GPU or significant CPU time.
A key characteristic: Whisper was trained on diverse, noisy audio from the internet. This gives it remarkable robustness to accents, background noise, and informal speech. The tradeoff is that it is not the fastest model and does not offer the streaming/real-time architecture that some use cases require.
Deepgram
Deepgram built its own end-to-end deep learning architecture optimized for real-time streaming transcription. Their Nova-3 model is trained specifically for spoken English (with strong multilingual support added over time) and is architecturally designed to produce low-latency outputs token-by-token.
Deepgram's model is not publicly available as open-source. It runs only via Deepgram's API or on self-hosted Deepgram enterprise deployments. The training data, while extensive, is more curated than Whisper's internet-scale corpus.
Accuracy Benchmarks
Accuracy comparisons are notoriously context-dependent. Both models perform well; the differences emerge in specific conditions.
Word Error Rate (WER) on standard benchmarks:
- Whisper large-v3 and Deepgram Nova-3 are competitive on standard English benchmarks, both achieving WER below 5% on clean audio.
- Whisper large-v3 outperforms Nova-3 on heavily accented speech and mixed-language input.
- Nova-3 outperforms Whisper on streaming use cases where partial results are needed before the utterance is complete.
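WER itself is a simple metric: the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. A minimal reference implementation using word-level Levenshtein distance (illustrative, not the exact scoring pipeline any vendor uses):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the quick brown fox", "the quick brown socks"))  # 0.25
```

Note that published WER figures depend heavily on text normalization (casing, punctuation, number formatting), which is one reason cross-vendor comparisons are hard to make apples-to-apples.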
Real-world conditions where Whisper excels:
- Mixed-language speech (code-switching)
- Non-native English with strong accents
- Technical vocabulary without custom training
- Background noise from varied sources (streets, cafes)
Real-world conditions where Deepgram excels:
- Call center audio with known speaker profiles
- Real-time streaming where first-token latency matters
- American English in clean or semi-clean environments
- Speaker diarization (identifying who said what)
Speed and Latency
Whisper (via Groq API, as used by Telvr): Under 1 second for the transcription step alone. Groq's inference hardware is purpose-built for transformer models, enabling Whisper large-v3 to run far faster than local GPU inference.
Whisper (local, Apple M3): 3-6 seconds for a 30-second audio clip. Smaller models run faster.
Deepgram Nova-3 (streaming): 300-500ms for first word appearance in streaming mode. For batch transcription of a complete audio file, total latency is similar to Whisper via API.
The streaming capability is Deepgram's standout advantage for real-time applications. For push-to-talk workflows (record, stop, get result), the latency difference between Whisper via Groq and Deepgram is minimal in practice.
Language Support
Whisper large-v3: Supports 99 languages. Performance degrades gracefully for lower-resource languages rather than failing completely. Automatic language detection is built in.
Deepgram Nova-3: Strong English support, with additional languages added over time. As of 2026, around 35 languages with varying quality levels. English accuracy is excellent; many other languages are still below Whisper's level.
For multilingual workflows, Whisper is the clear choice. For English-primary applications where speed and streaming matter, Deepgram is competitive.
Pricing
Whisper (OpenAI API): $0.006 per minute. No streaming option.
Whisper (via Groq API): Varies by tier. Fast inference, competitive pricing for developer workloads.
Deepgram Nova-3: Starting at $0.0043 per minute for pay-as-you-go. Volume discounts available. Streaming is billed at the same rate.
Telvr's usage cost: EUR 0.03 per minute, which reflects the combined cost of transcription plus AI enrichment processing. Raw Deepgram or Whisper API is cheaper per minute, but those are raw APIs without the application layer.
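At these rates the absolute cost difference only matters at volume, which a quick calculation makes clear (rates are the per-minute prices quoted above; the 10,000-minute figure is an arbitrary example):

```python
# Monthly transcription cost at the per-minute rates quoted above.
RATES_USD_PER_MIN = {
    "whisper_openai": 0.006,
    "deepgram_nova3": 0.0043,
}

def monthly_cost(minutes_per_month: float, rate_per_min: float) -> float:
    """Raw API cost in USD, rounded to cents."""
    return round(minutes_per_month * rate_per_min, 2)

# e.g. 10,000 minutes of audio per month:
for name, rate in RATES_USD_PER_MIN.items():
    print(name, monthly_cost(10_000, rate))
# whisper_openai 60.0
# deepgram_nova3 43.0
```

At hobbyist volumes the gap is a rounding error; at call-center volumes, the per-minute rate (and negotiated discounts) starts to dominate the decision.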
Developer Experience
Whisper (OpenAI API):
- Simple REST endpoint, standard audio file upload
- No streaming
- Audio file size limits (25MB free, 100MB paid)
- Response time suitable for push-to-talk workflows, not real-time captioning
Deepgram:
- WebSocket API for real-time streaming
- REST API for batch files
- More features: speaker diarization, keyword boosting, custom vocabulary
- Better developer docs for real-time use cases
Self-hosted Whisper:
- Fully open-source, Docker-deployable
- No API costs
- Requires GPU infrastructure
- Maximum flexibility for custom pipelines
Which to Use for Which Use Case
Push-to-talk desktop apps: Whisper large-v3 via a fast inference API. The accuracy and language support make it the better choice, and latency is comparable to Deepgram once you factor in the full pipeline.
Real-time captioning / live transcription: Deepgram streaming API. The sub-500ms first-token latency is necessary for readable live captions.
Call center / phone audio: Deepgram with custom vocabulary and speaker diarization features.
Multilingual applications: Whisper. No alternative matches its 99-language coverage with automatic detection.
Privacy-sensitive, local deployment: Self-hosted Whisper. Deepgram's self-hosted option exists but is enterprise-only.
Cost-sensitive, high-volume English transcription: Deepgram Nova-3 at $0.0043/min edges out OpenAI's $0.006/min.
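The guidance above can be condensed into a small decision helper. The rules are the ones this section lays out; the function itself is an illustrative sketch, and real deployments weigh more factors (cost tiers, audio profile, compliance):

```python
def recommend_engine(*, streaming: bool, multilingual: bool,
                     local_only: bool, diarization: bool = False) -> str:
    """Condense the use-case guidance above into one decision function.
    Illustrative only -- precedence of the rules is a judgment call."""
    if local_only:
        return "self-hosted Whisper"   # Deepgram self-hosting is enterprise-only
    if multilingual:
        return "Whisper large-v3"      # 99-language coverage with auto-detection
    if streaming or diarization:
        return "Deepgram Nova-3"       # sub-500ms first token, built-in diarization
    return "Whisper large-v3 via fast inference API"  # push-to-talk default

print(recommend_engine(streaming=False, multilingual=True, local_only=False))
# Whisper large-v3
```

The ordering encodes the section's hard constraints first (local deployment, language coverage) before the softer latency and feature preferences.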
What Telvr Uses
Telvr uses Whisper large-v3 via Groq's inference API. The choice was deliberate: large-v3 provides the highest accuracy across languages, Groq's hardware brings latency down to under one second for the transcription step, and the automatic language detection means users do not need to configure anything when switching languages.
The enrichment layer that follows — AI post-processing to clean output, format emails, structure notes — is not part of either Whisper or Deepgram. It is a separate LLM step that transforms raw transcription into formatted, usable text.
Conclusion
Whisper and Deepgram are not direct competitors so much as different tools for different jobs. Whisper large-v3 is the accuracy leader for multilingual, noisy, real-world audio. Deepgram Nova-3 is the speed and streaming leader for English-primary, real-time applications.
For a desktop productivity tool where quality matters over real-time streaming, Whisper large-v3 via a fast inference API is the better foundation. For applications where you need words to appear as the user speaks, Deepgram's streaming architecture is purpose-built for that use case.