Telvr 如何工作？

按下快捷键，自然说话，Telvr 实时转录。文本直接插入光标所在位置。

支持哪些语言？

通过 Whisper large-v3 支持 50 多种语言。

不需要。按量计费：每月 EUR 3 + 每分钟 EUR 0.03。

支持离线使用吗？

目前基于云端。社区版本支持本地部署在规划中。

支持哪些应用？

系统范围，适用任何应用。

TLS 加密，无永久存储，与 Groq 签署数据处理协议。

← 博客2026-02-16

AI文本增强：从原始语音到完美文本

Q: 支持哪些应用？

系统范围，适用任何应用。

Why Raw Transcription Is Not Enough

Imagine speaking a thought out loud and having every "um," "uh," "you know," and false start captured verbatim. That is raw speech transcription. The Whisper model — among the most accurate available — faithfully records what you say, including everything you would rather it ignore.

The edited version of that thought, as you would write it in an email or document, looks completely different. Better punctuation. Removed fillers. Appropriate structure. Professional register.

The gap between those two versions is what AI text enrichment bridges.

What Happens Between Your Voice and the Text

A speech-to-text pipeline with AI enrichment has two distinct stages:

Stage 1: Transcription. Your audio is processed by a speech recognition model — in Telvr's case, Whisper large-v3. This converts audio waveforms to text with high accuracy. The output is a raw transcript: what you said, including all the natural imperfections of spoken language.

Stage 2: Enrichment. The raw transcript is passed to a language model with a specific prompt describing what to do with it. The language model transforms the transcript into formatted output — removing fillers, restructuring sentences, applying formatting rules, and adapting the register to the target context.

The enrichment step is not a simple find-and-replace for "um" and "uh." It applies genuine language understanding to produce output that reads as if a thoughtful person wrote it.

The Six Enrichment Modes Explained

Raw Transcription

The simplest mode: minimal post-processing, output close to what Whisper produces. Useful when you need the literal transcript — quoting something, recording exact words, or capturing a specific phrasing you want to preserve.

Input: "the uh the main issue here is that we're seeing, uh, three times normal error rates on the payment endpoint starting from around two pm yesterday"

Output: "The main issue here is that we're seeing three times normal error rates on the payment endpoint starting from around 2 PM yesterday."

Even in Raw mode, basic cleanup (capitalization, numbers formatted consistently) applies.

Clean and Correct

The everyday workhorse mode. Removes disfluencies (um, uh, like, you know), fixes grammar, adds proper punctuation, and produces clean prose that reads as if it were typed carefully. This is the right default for most general-purpose text: Slack messages, casual emails, notes, documentation.

Input: "so basically what I'm trying to say is that the, um, current approach to, like, handling authentication is gonna cause problems when we scale up because we're using, you know, the session table and that's gonna, uh, become a bottleneck"

Output: "The current approach to handling authentication will cause problems at scale. We are using the session table for auth state, which will become a bottleneck."

Notice the enrichment went beyond just removing fillers — it restructured the thought into two clear sentences.

Professional Email

Transforms spoken content into a complete email with subject line, greeting, body, and closing. The tone shifts to professional but natural. Appropriate for business communication where you want to dictate the substance and have the AI handle the format.

Input: "hey can you tell Jake that we need the server migration plan by end of week and also remind him about the security audit that's scheduled for the 15th, ask him if his team is ready"

Output:

Subject: Server Migration Plan and Security Audit Update

Hi Jake,

Could you please send over the server migration plan by end of this week? I also wanted to check in about the security audit scheduled for the 15th — is your team prepared?

Thank you.

Meeting Notes

Structures spoken content into a standard meeting notes format: decisions, action items, context, and next steps. Designed for post-meeting summaries spoken immediately after the meeting while memory is fresh.

Input: "product meeting today, decided to push the mobile launch to Q3, Tom is updating the roadmap, we also need a legal review of the new terms before launch, Sarah owns that, we'll check in next week"

Output:

Decisions

Mobile launch rescheduled to Q3

Action Items

Tom: Update product roadmap — ongoing

Sarah: Initiate legal review of new terms before launch

Next Check-in

Next week

2-3 Sentence Summary

Condenses longer speech into a tight two to three sentence summary. Useful for executive summaries, status updates, and situations where you need to distill a lot of content into a brief statement.

Input: [A 3-minute spoken explanation of a project status, covering background, current blockers, planned solutions, and expected timeline]

Output: "The project is currently blocked on API access from the third-party provider, delaying the integration by approximately two weeks. The team is implementing a mock API layer in the interim to continue development. Full integration is now expected by end of month."

Dev Task

Structures spoken content as a development task with context, requirements, and acceptance criteria. The output matches common formats for GitHub issues, Linear tasks, and Jira tickets.

Input: "we need to add rate limiting to the API, right now there's no protection so someone could hammer the endpoints, should limit to 100 requests per minute per user, return 429 if exceeded, use Redis for the counter"

Output:

Task: Add API Rate Limiting

Context The API currently has no rate limiting, leaving endpoints vulnerable to abuse.

Requirements

Limit each authenticated user to 100 requests per minute

Return HTTP 429 (Too Many Requests) when limit is exceeded

Use Redis for the rate counter implementation

Acceptance Criteria

[ ] Rate limit enforced per user at 100 req/min

[ ] 429 response returned with appropriate headers

[ ] Redis counter resets correctly after one minute

How the Enrichment Is Implemented

The enrichment step uses a large language model with a carefully designed system prompt for each mode. The prompt defines the role ("You are a professional text editor"), the task ("Transform the following raw speech transcription into a professional email"), the rules ("Remove filler words, fix grammar, add subject line and greeting"), and the expected output format.

The raw Whisper transcript is then appended as the user message. The LLM produces the formatted output in a single inference pass.

This architecture is why enrichment adds only about one second to the total latency — a well-prompted LLM inference on an efficient model is fast.

Choosing the Right Mode

The right mode depends on the context you are writing for:

Any general text, Slack, notes: Clean mode
Email in a professional context: Email mode
Post-meeting documentation: Meeting Notes mode
Status updates, TLDRs, abstracts: Summary mode
GitHub issues, Linear, Jira tasks: Dev Task mode
Custom workflow: Custom mode with your own system prompt

Switching modes in Telvr takes one click on the mode selector. For users who have a consistent primary use case, the last selected mode persists between sessions so you do not need to reselect it.

Enrichment vs Simple Cleanup

The distinction between "enrichment" and "cleanup" matters. Simple cleanup tools remove filler words and fix capitalization — a relatively mechanical operation that any text processing script could approximate.

Genuine enrichment applies language understanding. It restructures sentences for clarity, not just correctness. It identifies action items in a stream of speech and formats them with owners and deadlines. It takes "I'm writing to ask about the..." and converts it to "I would like to inquire about..." in Email mode.

The difference is visible in the output: mechanically cleaned text reads like speech with the ums removed. Enriched text reads like something a person wrote.