openquack

SPEC-007 — LLM transcript polish

Status: draft (M2.5, promoted as next-priority chunk after v0 launch)
Owner: OpenQuackKit/Polish/ (extends the existing TextPolisher)
Last updated: 2026-04-27

Goal

After Whisper produces a raw transcript, an optional local LLM step cleans it up — fixes punctuation, removes verbal tics and false starts, organises multi-idea utterances into bullets, normalises stop-word noise, and preserves proper nouns / technical terms exactly. Stronger than the regex-based TextPolisher; off by default; explicitly local.

The pipeline becomes:

audio → Whisper → raw transcript
                  │
                  ▼
                  (optional) LLM polish ◀── this spec
                  │
                  ▼
                  Regex polish (TextPolisher)
                  │
                  ▼
                  paste at cursor

Why this is M2.5 priority

The user asked for it explicitly on 2026-04-27. The regex polish that landed in M2 catches the easy stuff (capitalisation, end-punctuation, fillers), but real dictation often produces multi-clause runs that need restructuring, which a local LLM with 1–3 B parameters can do in well under a second on Apple Silicon. This is also the foundational LLM infra that SPEC-006 (agent dispatch) will reuse, so doing polish first proves the local-LLM stack against a narrower problem before agents land on top.

Non-goals

Public surface (sketch)

public protocol TextPolishEngine: AnyObject {
    static var engineName: String { get }
    var requiresNetwork: Bool { get }   // surfaced in the recording overlay
    var modelLabel: String { get }       // for status row

    /// Returns cleaned-up text. Should be idempotent on already-clean input.
    /// On any failure (network, OOM, model not loaded), throw — the caller
    /// falls back to the regex pipeline.
    func polish(_ raw: String, context: PolishContext) async throws -> String
}

public struct PolishContext: Sendable {
    public let language: String?      // engine hint, e.g. "en"
    public let foregroundApp: String? // best-effort, may be nil
    public let timestamp: Date
}

public enum PolishEngineKind: String, CaseIterable, Sendable {
    case off          // skip the LLM step entirely
    case ollama       // local Ollama HTTP
    case mlxLM        // in-process via mlx-swift-lm
}

Engines (implementation order)

1. OllamaPolishEngine — fastest path to a working feature

2. MLXLMPolishEngine — best privacy story

3. WhisperKitLLMPolishEngine (later)

If argmax-oss-swift’s text-models stack matures, mirror its API. Skip for now.
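For the first engine in the list above, the HTTP round trip could look roughly like the sketch below. It assumes Ollama's non-streaming `POST /api/generate` endpoint on its default local port (11434). The type and function names (`OllamaGenerateRequest`, `makeOllamaRequest`, `parseOllamaResponse`) are hypothetical and not part of OpenQuackKit, and the model name is left to the caller.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on Linux
#endif

/// Request body for Ollama's non-streaming generate endpoint.
struct OllamaGenerateRequest: Codable {
    let model: String
    let system: String   // the polish prompt template
    let prompt: String   // the raw transcript
    let stream: Bool
}

/// Only the field this sketch reads; Ollama returns additional metadata.
struct OllamaGenerateResponse: Codable {
    let response: String
}

/// Builds the HTTP request. The default base URL is an assumption (standard local install).
func makeOllamaRequest(raw: String,
                       systemPrompt: String,
                       model: String,
                       baseURL: URL = URL(string: "http://localhost:11434")!) throws -> URLRequest {
    var request = URLRequest(url: baseURL.appendingPathComponent("api/generate"))
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONEncoder().encode(
        OllamaGenerateRequest(model: model, system: systemPrompt, prompt: raw, stream: false))
    return request
}

/// Extracts the polished text from a response body and trims surrounding whitespace.
func parseOllamaResponse(_ data: Data) throws -> String {
    try JSONDecoder().decode(OllamaGenerateResponse.self, from: data)
        .response
        .trimmingCharacters(in: .whitespacesAndNewlines)
}
```

An engine's `polish(_:context:)` would send this request with `URLSession` and let any transport or decoding error propagate, so the caller's fallback path takes over.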

Prompt template (port from v0.1 thinker.py)

You reorganise raw voice transcriptions into clean, structured text.

You MUST:
- Respond in the SAME language as the input.
- Add correct punctuation (。 and , for Chinese; periods and commas for English; etc.).
- Remove filler words, verbal tics, false starts, and repetitions.
- Remove garbled or nonsensical text (transcription errors / artefacts).
- Organise multiple ideas into bullet points (use • or -).
- Keep it concise — shorter than the input.
- Preserve all technical terms, proper nouns, and names exactly as spoken.
- Output ONLY the reorganised text — no commentary, labels, or markdown fences.

Per-call options:

Pipeline integration

In AppDelegate.stopAndTranscribe:

let raw = try await transcriber.transcribe(...)

// Optional LLM pass. On any failure, continue with the raw transcript.
var text = raw
if polishEngineKind != .off, let engine = polishEngine {
    do {
        text = try await engine.polish(raw, context: ctx)
    } catch {
        text = raw
    }
}

// Regex polish always runs last and handles any leftovers.
let polished = TextPolisher.polish(text)

// Paste / clipboard ...

Order matters: LLM polish runs first (it produces better-structured output), then regex polish handles any leftovers (trailing whitespace, stray casing). If LLM is off, regex polish is the only step.
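To exercise the ordering and fallback behaviour in isolation, a test harness along these lines may help. Everything here is a stand-in: `PolishStep` is a reduced version of `TextPolishEngine`, `regexPolishStandIn` is a placeholder for `TextPolisher.polish`, and the two stub engines are hypothetical.

```swift
import Foundation

// Stand-in for TextPolishEngine, reduced to the one call this sketch needs.
protocol PolishStep: Sendable {
    func polish(_ raw: String) async throws -> String
}

// Placeholder for TextPolisher.polish: trims whitespace, capitalises the first letter.
func regexPolishStandIn(_ text: String) -> String {
    let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
    return trimmed.prefix(1).uppercased() + String(trimmed.dropFirst())
}

/// LLM step first (when an engine is configured), regex step always last.
/// An engine error leaves the raw transcript in place for the regex step.
func runPolish(raw: String, engine: PolishStep?) async -> String {
    var text = raw
    if let engine = engine {
        do {
            text = try await engine.polish(raw)
        } catch {
            text = raw
        }
    }
    return regexPolishStandIn(text)
}

// Stub engines for exercising both paths.
struct FixedEngine: PolishStep {
    let output: String
    func polish(_ raw: String) async throws -> String { output }
}

struct FailingEngine: PolishStep {
    struct Unavailable: Error {}
    func polish(_ raw: String) async throws -> String { throw Unavailable() }
}
```

Passing `FailingEngine()` should give the same result as passing `nil`, which is the property the fallback contract relies on.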

Settings

New tab or section:

Quality gates

Bench-able. A new SPM target openquack-polish-bench drives engine × model × corpus → bench/out/polish/<host-tag>/. WER is not the right metric — polish intentionally rewrites — so we measure across three dimensions plus latency and RAM.

Three quality dimensions (scored separately, not averaged)

  1. Transcription error correction — does the model fix homophone / proper-noun errors common in Whisper output? Examples: “cloud code” → “Claude Code”, “income tax” → “in-context”, “Gemma three” → “Gemma 3”. Scored via must_contain / must_not_contain substring checks against the corpus case.
  2. Rephrase, organise, format — fillers stripped, false starts removed, multi-idea runs split into bullets, sentence-end punctuation, length not bloated. Scored via auto micro-metrics:
    • filler-token removal rate (um|uh|like|you know|... regex pre/post)
    • punctuation completeness (% sentences ending in .!? / 。!?)
    • length ratio (output / input — should be ≤ 1.0 per the prompt)
    • idempotency on already-clean input (polish(clean) ≈ clean; edit distance ≤ 5%).
  3. In-context rewrite — same raw, different app_context slot (Slack / Mail / VS Code / Pages) → distinct, context-appropriate outputs. Detail: see SPEC-008.
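The auto micro-metrics for dimension 2 could be computed along these lines. This is a sketch under stated assumptions: the filler pattern is a short illustrative subset, punctuation completeness treats each non-empty line as one sentence, and idempotency uses character-level Levenshtein distance. All function names are hypothetical.

```swift
import Foundation

/// Counts filler tokens with a small illustrative pattern (the real list would be longer).
func fillerCount(_ text: String) -> Int {
    let pattern = #"\b(um|uh|like|you know)\b"#
    let regex = try! NSRegularExpression(pattern: pattern, options: [.caseInsensitive])
    return regex.numberOfMatches(in: text, range: NSRange(text.startIndex..., in: text))
}

/// Fraction of the fillers in `raw` that are gone from `polished` (1.0 when raw has none).
func fillerRemovalRate(raw: String, polished: String) -> Double {
    let before = fillerCount(raw)
    guard before > 0 else { return 1.0 }
    let after = min(fillerCount(polished), before)
    return Double(before - after) / Double(before)
}

/// Output length divided by input length, in characters.
func lengthRatio(raw: String, polished: String) -> Double {
    guard !raw.isEmpty else { return polished.isEmpty ? 1.0 : .infinity }
    return Double(polished.count) / Double(raw.count)
}

/// Share of non-empty lines ending in sentence-final punctuation (ASCII or full-width CJK).
func punctuationCompleteness(_ text: String) -> Double {
    let terminators: Set<Character> = [".", "!", "?", "\u{3002}", "\u{FF01}", "\u{FF1F}"]
    let lines = text.split(separator: "\n")
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .filter { !$0.isEmpty }
    guard !lines.isEmpty else { return 1.0 }
    let complete = lines.filter { line in
        guard let last = line.last else { return false }
        return terminators.contains(last)
    }
    return Double(complete.count) / Double(lines.count)
}

/// Levenshtein distance over characters, used for the idempotency check.
func editDistance(_ a: String, _ b: String) -> Int {
    let s = Array(a), t = Array(b)
    if s.isEmpty { return t.count }
    if t.isEmpty { return s.count }
    var prev = Array(0...t.count)
    for i in 1...s.count {
        var cur = [i] + Array(repeating: 0, count: t.count)
        for j in 1...t.count {
            let cost = s[i - 1] == t[j - 1] ? 0 : 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        }
        prev = cur
    }
    return prev[t.count]
}

/// True when re-polishing clean text changes at most 5% of its characters.
func isIdempotent(clean: String, repolished: String) -> Bool {
    guard !clean.isEmpty else { return repolished.isEmpty }
    return Double(editDistance(clean, repolished)) / Double(clean.count) <= 0.05
}
```

Each function returns a separate number, which fits the rule that the dimensions are scored separately and not averaged.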

LLM-as-judge (final ranking signal)

Auto micro-metrics screen out broken outputs but don’t discriminate between models that all strip fillers correctly. The ranking signal is a judge LLM that scores each candidate output 1–5 against the multi-reference list, conditioned on the dimension being scored.

This is bench-only — judge calls happen on the developer’s machine when producing a report, never in the runtime hot path. The privacy contract covers what ships in the app, not what generates the README table.

Latency / memory pressure

The optimisation target is headroom, not fit. A model that just fits at idle will thrash under real user load (Whisper warm, polish warm, browser + Slack open) and the cost shows up as P95/P99 latency spikes — the user feels stutter even when the mean is fine.
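A small sketch of why the bench should report tail percentiles next to the mean. It uses the nearest-rank percentile definition, which is one common choice and an assumption here; `percentile`, `summarise`, and `LatencySummary` are hypothetical names.

```swift
import Foundation

/// Nearest-rank percentile: the smallest sample such that at least `p` percent
/// of all samples are less than or equal to it. `p` must be in (0, 100].
func percentile(_ samples: [Double], _ p: Double) -> Double? {
    guard !samples.isEmpty, p > 0, p <= 100 else { return nil }
    let sorted = samples.sorted()
    let rank = Int((p * Double(sorted.count) / 100).rounded(.up))
    return sorted[max(rank, 1) - 1]
}

struct LatencySummary {
    let mean: Double
    let p95: Double
    let p99: Double
}

/// Summarises per-call latencies (milliseconds) for one engine × model run.
func summarise(_ latenciesMs: [Double]) -> LatencySummary? {
    guard let p95 = percentile(latenciesMs, 95),
          let p99 = percentile(latenciesMs, 99) else { return nil }
    let mean = latenciesMs.reduce(0, +) / Double(latenciesMs.count)
    return LatencySummary(mean: mean, p95: p95, p99: p99)
}
```

With 98 calls at 100 ms and 2 calls at 1000 ms, the mean stays at 118 ms while P99 lands on 1000 ms, which is the stutter the paragraph above describes.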

Recommendation hierarchy

Smaller-that-clears-quality > faster > more accurate. If the smallest candidate flunks only one quality dimension (e.g. category-1 proper-noun correction) and is otherwise good, prefer it + an out-of-band fallback (regex / domain-term dictionary) over scaling up the model.
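One way to read the hierarchy is as a lexicographic ordering over candidates that clear the quality gates. The sketch below assumes the bench produces one record per candidate. The `Candidate` fields and `recommend` are hypothetical, and whether a candidate with an out-of-band fallback counts as clearing the gates is left to whoever sets `clearsQualityGates`.

```swift
/// One row of bench output for a candidate model (illustrative fields only).
struct Candidate {
    let name: String
    let residentGB: Double        // memory footprint while warm
    let p95LatencyMs: Double
    let judgeScore: Double        // 1...5 from the judge LLM
    let clearsQualityGates: Bool  // may already account for a regex / dictionary fallback
}

/// Picks the smallest candidate that clears quality; ties break on lower P95 latency,
/// then on higher judge score.
func recommend(_ candidates: [Candidate]) -> Candidate? {
    candidates
        .filter { $0.clearsQualityGates }
        .min { a, b in
            (a.residentGB, a.p95LatencyMs, -a.judgeScore)
                < (b.residentGB, b.p95LatencyMs, -b.judgeScore)
        }
}
```

A larger model with a better judge score loses to a smaller one as long as the smaller one clears the gates.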

Corpus shape

bench/polish_corpus/cases.jsonl — one case per line:

{
  "id": "en_trans_001",
  "category": "transcription_errors",
  "language": "en",
  "raw": "we should use cloud code to open a PR for this branch",
  "app_context": null,
  "references": [
    "Use Claude Code to open a PR for this branch.",
    "We should use Claude Code to open a PR for this branch."
  ],
  "must_contain": ["Claude Code"],
  "must_not_contain": ["cloud code"],
  "notes": "Whisper hears 'Claude' as 'cloud'"
}
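A corpus case of this shape could be decoded and scored with plain `Codable`, as sketched below. `PolishCase`, `loadCases`, and `passesSubstringChecks` are hypothetical names. Matching `must_contain` case-sensitively and `must_not_contain` case-insensitively is an assumption of this sketch, not something the spec fixes.

```swift
import Foundation

/// One line of cases.jsonl, with snake_case keys mapped to Swift names.
struct PolishCase: Codable {
    let id: String
    let category: String
    let language: String
    let raw: String
    let appContext: String?
    let references: [String]
    let mustContain: [String]
    let mustNotContain: [String]
    let notes: String?

    enum CodingKeys: String, CodingKey {
        case id, category, language, raw, references, notes
        case appContext = "app_context"
        case mustContain = "must_contain"
        case mustNotContain = "must_not_contain"
    }
}

/// Parses JSONL text: one JSON object per non-blank line.
func loadCases(jsonl: String) throws -> [PolishCase] {
    let decoder = JSONDecoder()
    return try jsonl.split(separator: "\n")
        .filter { !$0.trimmingCharacters(in: .whitespaces).isEmpty }
        .map { try decoder.decode(PolishCase.self, from: Data($0.utf8)) }
}

/// Dimension-1 check: required strings present exactly, forbidden strings absent in any casing.
func passesSubstringChecks(_ output: String, for testCase: PolishCase) -> Bool {
    let lowered = output.lowercased()
    let hasAll = testCase.mustContain.allSatisfy { output.contains($0) }
    let hasNone = !testCase.mustNotContain.contains { lowered.contains($0.lowercased()) }
    return hasAll && hasNone
}
```

For the case above, an output containing "Claude Code" passes, while "Cloud Code" fails both checks.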

Target ~80 cases at first launch:

Bucket                          Count  Notes
transcription_errors (en)       20     Real Whisper mishearings, proper nouns
rephrase_organize (en)          20     Fillers, false starts, multi-clause runs
in_context paired sets          30     10 raws × 3 contexts = 30 (SPEC-008)
Multilingual (zh/ja/es/fr/de)   10     2 per language across both en buckets

Cases drawn from: actual WhisperKit-medium output on the existing 177-clip corpus (bench/out/M4-16GB/report.csv) where the polish step has work to do, plus hand-crafted cases for failure modes Whisper doesn’t reach.

Open questions

References