Why Raw Transcription Is Not Enough
Imagine speaking a thought out loud and having every "um," "uh," "you know," and false start captured verbatim. That is raw speech transcription. The Whisper model — among the most accurate available — faithfully records what you say, including everything you would rather it ignore.
The edited version of that thought, as you would write it in an email or document, looks completely different. Better punctuation. Removed fillers. Appropriate structure. Professional register.
The gap between those two versions is what AI text enrichment bridges.
What Happens Between Your Voice and the Text
A speech-to-text pipeline with AI enrichment has two distinct stages:
Stage 1: Transcription. Your audio is processed by a speech recognition model — in Telvr's case, Whisper large-v3. This converts audio waveforms to text with high accuracy. The output is a raw transcript: what you said, including all the natural imperfections of spoken language.
Stage 2: Enrichment. The raw transcript is passed to a language model with a specific prompt describing what to do with it. The language model transforms the transcript into formatted output — removing fillers, restructuring sentences, applying formatting rules, and adapting the register to the target context.
The enrichment step is not a simple find-and-replace for "um" and "uh." It applies genuine language understanding to produce output that reads as if a thoughtful person wrote it.
The Six Enrichment Modes Explained
Raw Transcription
The simplest mode: minimal post-processing, output close to what Whisper produces. Useful when you need the literal transcript — quoting something, recording exact words, or capturing a specific phrasing you want to preserve.
Input: "the uh the main issue here is that we're seeing, uh, three times normal error rates on the payment endpoint starting from around two pm yesterday"
Output: "The main issue here is that we're seeing three times normal error rates on the payment endpoint starting from around 2 PM yesterday."
Even in Raw mode, basic cleanup (capitalization, numbers formatted consistently) applies.
Clean and Correct
The everyday workhorse mode. Removes disfluencies (um, uh, like, you know), fixes grammar, adds proper punctuation, and produces clean prose that reads as if it were typed carefully. This is the right default for most general-purpose text: Slack messages, casual emails, notes, documentation.
Input: "so basically what I'm trying to say is that the, um, current approach to, like, handling authentication is gonna cause problems when we scale up because we're using, you know, the session table and that's gonna, uh, become a bottleneck"
Output: "The current approach to handling authentication will cause problems at scale. We are using the session table for auth state, which will become a bottleneck."
Notice the enrichment went beyond just removing fillers — it restructured the thought into two clear sentences.
Professional Email
Transforms spoken content into a complete email with subject line, greeting, body, and closing. The tone shifts to professional but natural. Appropriate for business communication where you want to dictate the substance and have the AI handle the format.
Input: "hey can you tell Jake that we need the server migration plan by end of week and also remind him about the security audit that's scheduled for the 15th, ask him if his team is ready"
Output:
Subject: Server Migration Plan and Security Audit Update
Hi Jake,
Could you please send over the server migration plan by end of this week? I also wanted to check in about the security audit scheduled for the 15th — is your team prepared?
Thank you.
Meeting Notes
Structures spoken content into a standard meeting notes format: decisions, action items, context, and next steps. Designed for post-meeting summaries spoken immediately after the meeting while memory is fresh.
Input: "product meeting today, decided to push the mobile launch to Q3, Tom is updating the roadmap, we also need a legal review of the new terms before launch, Sarah owns that, we'll check in next week"
Output:
Decisions
- Mobile launch rescheduled to Q3
Action Items
- Tom: Update product roadmap — ongoing
- Sarah: Initiate legal review of new terms before launch
Next Check-in
- Next week
2-3 Sentence Summary
Condenses longer speech into a tight two to three sentence summary. Useful for executive summaries, status updates, and situations where you need to distill a lot of content into a brief statement.
Input: [A 3-minute spoken explanation of a project status, covering background, current blockers, planned solutions, and expected timeline]
Output: "The project is currently blocked on API access from the third-party provider, delaying the integration by approximately two weeks. The team is implementing a mock API layer in the interim to continue development. Full integration is now expected by end of month."
Dev Task
Structures spoken content as a development task with context, requirements, and acceptance criteria. The output matches common formats for GitHub issues, Linear tasks, and Jira tickets.
Input: "we need to add rate limiting to the API, right now there's no protection so someone could hammer the endpoints, should limit to 100 requests per minute per user, return 429 if exceeded, use Redis for the counter"
Output:
Task: Add API Rate Limiting
Context The API currently has no rate limiting, leaving endpoints vulnerable to abuse.
Requirements
- Limit each authenticated user to 100 requests per minute
- Return HTTP 429 (Too Many Requests) when limit is exceeded
- Use Redis for the rate counter implementation
Acceptance Criteria
- [ ] Rate limit enforced per user at 100 req/min
- [ ] 429 response returned with appropriate headers
- [ ] Redis counter resets correctly after one minute
How the Enrichment Is Implemented
The enrichment step uses a large language model with a carefully designed system prompt for each mode. The prompt defines the role ("You are a professional text editor"), the task ("Transform the following raw speech transcription into a professional email"), the rules ("Remove filler words, fix grammar, add subject line and greeting"), and the expected output format.
The raw Whisper transcript is then appended as the user message. The LLM produces the formatted output in a single inference pass.
This architecture is why enrichment adds only about one second to the total latency — a well-prompted LLM inference on an efficient model is fast.
Choosing the Right Mode
The right mode depends on the context you are writing for:
- Any general text, Slack, notes: Clean mode
- Email in a professional context: Email mode
- Post-meeting documentation: Meeting Notes mode
- Status updates, TLDRs, abstracts: Summary mode
- GitHub issues, Linear, Jira tasks: Dev Task mode
- Custom workflow: Custom mode with your own system prompt
Switching modes in Telvr takes one click on the mode selector. For users who have a consistent primary use case, the last selected mode persists between sessions so you do not need to reselect it.
Enrichment vs Simple Cleanup
The distinction between "enrichment" and "cleanup" matters. Simple cleanup tools remove filler words and fix capitalization — a relatively mechanical operation that any text processing script could approximate.
Genuine enrichment applies language understanding. It restructures sentences for clarity, not just correctness. It identifies action items in a stream of speech and formats them with owners and deadlines. It takes "I'm writing to ask about the..." and converts it to "I would like to inquire about..." in Email mode.
The difference is visible in the output: mechanically cleaned text reads like speech with the ums removed. Enriched text reads like something a person wrote.