Status: draft (M2.5 — promoted as next-priority chunk after v0 launch)
Owner: OpenQuackKit/Polish/ (extends the existing TextPolisher)
Last updated: 2026-04-27
After Whisper produces a raw transcript, an optional local LLM step
cleans it up — fixes punctuation, removes verbal tics and false starts,
organises multi-idea utterances into bullets, normalises stop-word noise,
and preserves proper nouns / technical terms exactly. Stronger than the
regex-based TextPolisher; off by default; explicitly local.
The pipeline becomes:
    audio → Whisper → raw transcript
                           │
                           ├── (optional) LLM polish  ◀── this spec
                           │
                           └── Regex polish (TextPolisher)
                                    │
                               paste at cursor
User asked for it explicitly on 2026-04-27. The regex polish that landed in M2 catches the easy stuff (capitalisation, end-punctuation, fillers), but real dictation often produces multi-clause runs that need restructuring, which a 1–3B-parameter local LLM can do in well under a second on Apple Silicon. This is also the foundational LLM infra that SPEC-006 (agent dispatch) will reuse, so doing polish first proves the local-LLM stack against a narrower problem before agents land on top.
    import Foundation

    public protocol TextPolishEngine: AnyObject {
        static var engineName: String { get }
        var requiresNetwork: Bool { get }   // surfaced in the recording overlay
        var modelLabel: String { get }      // for status row

        /// Returns cleaned-up text. Should be idempotent on already-clean input.
        /// On any failure (network, OOM, model not loaded), throw; the caller
        /// falls back to the regex pipeline.
        func polish(_ raw: String, context: PolishContext) async throws -> String
    }

    public struct PolishContext: Sendable {
        public let language: String?        // engine hint, e.g. "en"
        public let foregroundApp: String?   // best-effort, may be nil
        public let timestamp: Date

        // The synthesized memberwise initializer is internal, so expose one.
        public init(language: String?, foregroundApp: String?, timestamp: Date) {
            self.language = language
            self.foregroundApp = foregroundApp
            self.timestamp = timestamp
        }
    }

    public enum PolishEngineKind: String, CaseIterable, Sendable {
        case off      // skip the LLM step entirely
        case ollama   // local Ollama HTTP
        case mlxLM    // in-process via mlx-swift-lm
    }
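To make the protocol contract concrete, here is a minimal sketch of a conforming engine. WhitespacePolishEngine is a hypothetical stand-in, not one of the engines this spec plans to ship: it only collapses runs of whitespace. The protocol and context are repeated in trimmed form so the block compiles on its own. It is meant to illustrate the idempotence expectation, since polishing already-clean text returns it unchanged.

```swift
import Foundation

// Trimmed copies of the spec's types so this sketch is self-contained.
protocol TextPolishEngine: AnyObject {
    static var engineName: String { get }
    var requiresNetwork: Bool { get }
    var modelLabel: String { get }
    func polish(_ raw: String, context: PolishContext) async throws -> String
}

struct PolishContext: Sendable {
    let language: String?
    let foregroundApp: String?
    let timestamp: Date
}

/// Hypothetical stub engine (an assumption of this sketch, not part of the spec).
/// It collapses whitespace runs and trims the ends, nothing more.
final class WhitespacePolishEngine: TextPolishEngine {
    static let engineName = "whitespace-stub"
    let requiresNetwork = false   // nothing leaves the process
    let modelLabel = "none"

    func polish(_ raw: String, context: PolishContext) async throws -> String {
        // Splitting on whitespace drops empty pieces, so joining with a single
        // space yields the same string when the input is already clean.
        raw.split(whereSeparator: { $0.isWhitespace }).joined(separator: " ")
    }
}
```

A stub like this could also serve as a test double for the app wiring before a real engine is available.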
OllamaPolishEngine: fastest path to a working feature

- http://localhost:11434/api/chat (configurable URL).
- gemma3:1b or qwen2.5:3b-instruct (small, fast, multilingual, low RAM). Final default settled by a SPEC-007 bench run per docs/BENCHMARKS.md.
- keep_alive: -1 so the model stays in GPU memory across calls; we pay the cold-start once.
- think: false for thinking-capable models (Gemma 3, etc.), since voice cleanup doesn’t need chain-of-thought and the thinking budget eats output tokens (we hit this in v0.1).
- requiresNetwork = false, because loopback isn’t network for the privacy indicator.

MLXLMPolishEngine: best privacy story

- In-process via mlx-swift-lm. No subprocess, no Ollama install required.
- Qwen2.5-1.5B-Instruct-4bit (~1 GB) or similar.
- Same ~/Library/Application Support/OpenQuack/MLX/ cache pattern we already use for Whisper.

WhisperKitLLMPolishEngine (later)

If argmax-oss-swift’s text-models stack matures, mirror its API. Skip for now.
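As a rough illustration of the OllamaPolishEngine settings above, the sketch below builds the JSON body for a POST to the /api/chat endpoint. The field names (model, messages, stream, think, keep_alive, options) follow Ollama's chat API as we understand it. The OllamaChatRequest type, the makePolishRequestBody helper, and the choice of stream: false (one complete response instead of chunks) are assumptions of this sketch, not settled design.

```swift
import Foundation

/// Hypothetical request type mirroring the fields this spec relies on.
struct OllamaChatRequest: Encodable {
    struct Message: Encodable {
        let role: String
        let content: String
    }
    struct Options: Encodable {
        let temperature: Double
        let numPredict: Int
        enum CodingKeys: String, CodingKey {
            case temperature
            case numPredict = "num_predict"
        }
    }

    let model: String
    let messages: [Message]
    let stream: Bool      // assumption: request a single non-streamed reply
    let think: Bool       // false: skip thinking mode
    let keepAlive: Int    // -1: keep the model loaded across calls
    let options: Options

    enum CodingKeys: String, CodingKey {
        case model, messages, stream, think, options
        case keepAlive = "keep_alive"
    }
}

/// Encodes the chat request body; the caller supplies the sampling options.
func makePolishRequestBody(model: String,
                           systemPrompt: String,
                           raw: String,
                           temperature: Double,
                           numPredict: Int) throws -> Data {
    let request = OllamaChatRequest(
        model: model,
        messages: [
            .init(role: "system", content: systemPrompt),
            .init(role: "user", content: raw),
        ],
        stream: false,
        think: false,
        keepAlive: -1,
        options: .init(temperature: temperature, numPredict: numPredict)
    )
    return try JSONEncoder().encode(request)
}
```

The resulting Data would be sent as the body of a POST with Content-Type: application/json; any transport failure should surface as a thrown error so the caller can fall back to the regex pipeline.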
System prompt (from v0.1’s thinker.py):
You reorganise raw voice transcriptions into clean, structured text.
You MUST:
- Respond in the SAME language as the input.
- Add correct punctuation (。, for Chinese; periods, commas for English; etc.).
- Remove filler words, verbal tics, false starts, and repetitions.
- Remove garbled or nonsensical text (transcription errors / artefacts).
- Organise multiple ideas into bullet points (use • or -).
- Keep it concise — shorter than the input.
- Preserve all technical terms, proper nouns, and names exactly as spoken.
- Output ONLY the reorganised text — no commentary, labels, or markdown fences.
Per-call options:
- temperature: 0.3 for short input (≤ 50 words), 0.5 for longer.
- num_predict: min(max(wordCount * 2, 80), 1024).
- think: false (Ollama), which prevents thinking-mode models eating budget.

In AppDelegate.stopAndTranscribe:
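    let raw = try await transcriber.transcribe(...)

    // Optional LLM pass. On any failure, keep the raw transcript so the
    // regex pass below is the only polish applied.
    var text = raw
    if polishEngineKind != .off, let engine = polishEngine {
        do {
            text = try await engine.polish(raw, context: ctx)
        } catch {
            text = raw  // fall back to regex-only polish
        }
    }

    // Regex polish always runs last and handles leftovers
    // (trailing whitespace, stray casing).
    let polished = TextPolisher.polish(text)
    // Paste / clipboard ...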
Order matters: LLM polish runs first (it produces better-structured output), then regex polish handles any leftovers (trailing whitespace, stray casing). If LLM is off, regex polish is the only step.
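The per-call options listed with the system prompt reduce to a small pure function, sketched here. PolishCallOptions, roughWordCount, and polishCallOptions are hypothetical names. The word count is a plain whitespace split, which is an assumption of this sketch: it undercounts CJK text, and the CJK word-counting heuristic from v0.1 is not reproduced here.

```swift
/// Hypothetical container for the two sampling options that vary per call.
struct PolishCallOptions: Equatable {
    let temperature: Double
    let numPredict: Int
}

/// Whitespace-delimited word count. Assumption: good enough for languages
/// that separate words with spaces; CJK input needs a different heuristic.
func roughWordCount(_ text: String) -> Int {
    text.split(whereSeparator: { $0.isWhitespace }).count
}

/// temperature: 0.3 for short input (<= 50 words), 0.5 for longer.
/// num_predict: min(max(wordCount * 2, 80), 1024).
func polishCallOptions(for raw: String) -> PolishCallOptions {
    let words = roughWordCount(raw)
    let temperature = words <= 50 ? 0.3 : 0.5
    let numPredict = min(max(words * 2, 80), 1024)
    return PolishCallOptions(temperature: temperature, numPredict: numPredict)
}
```

For a three-word utterance the floor applies (num_predict is 80); for a 60-word utterance the function yields temperature 0.5 and num_predict 120; very long inputs are capped at 1024.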
New tab or section:
- ollama list models
- “When polish fails” → fall back to regex only (default) | show error in popover
requiresNetwork is shown here; the mlxLM and ollama engines both report false (loopback / in-process), so the Privacy pane stays green.

Bench-able. A new SPM target openquack-polish-bench drives engine × model × corpus → bench/out/polish/<host-tag>/. WER is not the right metric, because polish intentionally rewrites, so we measure across three dimensions plus latency and RAM.
- must_contain / must_not_contain substring checks against the corpus case.
- Fillers (um|uh|like|you know|... regex pre/post).
- Sentence-final punctuation (.!? / 。!?).
- Same raw, different app_context slot (Slack / Mail / VS Code / Pages) → distinct, context-appropriate outputs. Detail: see SPEC-008.

Auto micro-metrics screen out broken outputs but don’t discriminate between models that all strip fillers correctly. The ranking signal is a judge LLM that scores each candidate output 1–5 against the multi-reference list, conditioned on the dimension being scored.

- claude-haiku-4-5-20251001 (cheap, fast).
- claude-sonnet-4-6.
- claude-opus-4-7.

This is bench-only: judge calls happen on the developer’s machine when producing a report, never in the runtime hot path. The privacy contract covers what ships in the app, not what generates the README table.
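The automatic micro-metrics mentioned above could look roughly like the following sketch. MicroMetrics, countFillers, and microMetrics are hypothetical names. Two choices here are assumptions, not decisions from this spec: the filler pattern covers only the four fillers spelled out in the text (the spec's list is elided), and must_not_contain is matched case-insensitively so that "Cloud code" at the start of a sentence is still flagged, while must_contain stays case-sensitive because proper nouns must be preserved exactly.

```swift
import Foundation

/// Hypothetical result of the cheap, automatic checks for one corpus case.
struct MicroMetrics: Equatable {
    let missingRequired: [String]    // must_contain entries absent from the output
    let forbiddenPresent: [String]   // must_not_contain entries found in the output
    let fillersBefore: Int
    let fillersAfter: Int

    var passesSubstringChecks: Bool {
        missingRequired.isEmpty && forbiddenPresent.isEmpty
    }
}

/// Counts whole-word filler matches. The pattern is an assumed subset.
func countFillers(in text: String) -> Int {
    let pattern = #"\b(um|uh|like|you know)\b"#
    guard let regex = try? NSRegularExpression(pattern: pattern,
                                               options: [.caseInsensitive]) else {
        return 0
    }
    return regex.numberOfMatches(in: text, range: NSRange(text.startIndex..., in: text))
}

func microMetrics(raw: String,
                  output: String,
                  mustContain: [String],
                  mustNotContain: [String]) -> MicroMetrics {
    MicroMetrics(
        // Case-sensitive: "Claude Code" must appear exactly as written.
        missingRequired: mustContain.filter { !output.contains($0) },
        // Case-insensitive (assumption): catches "Cloud code" as well.
        forbiddenPresent: mustNotContain.filter {
            output.range(of: $0, options: .caseInsensitive) != nil
        },
        fillersBefore: countFillers(in: raw),
        fillersAfter: countFillers(in: output)
    )
}
```

Note that the filler pattern also matches legitimate uses of "like", so the pre/post counts are better read as a relative signal than as an exact count.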
The optimisation target is headroom, not fit. A model that just fits at idle will thrash under real user load (Whisper warm, polish warm, browser + Slack open) and the cost shows up as P95/P99 latency spikes — the user feels stutter even when the mean is fine.
- vm_stat deltas: pages-compressed, pageins, pageouts.
- host_statistics64(HOST_VM_INFO64).
- memorystatus API.

RSS alone undercounts CoreML / Metal / ANE working sets; it’s kept for continuity but is not the primary metric.

Smaller-that-clears-quality > faster > more accurate. If the smallest candidate flunks only one quality dimension (e.g. category-1 proper-noun correction) and is otherwise good, prefer it + an out-of-band fallback (regex / domain-term dictionary) over scaling up the model.
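For the memory-headroom counters discussed above, a bench harness could sample host_statistics64(HOST_VM_INFO64) before and after a run and report the deltas. The sketch below is macOS-only. VMCounters, readVMCounters, and vmDelta are hypothetical names, and mapping the spec's "pages-compressed" to the compressions field of vm_statistics64 is an assumption.

```swift
import Darwin

/// Hypothetical subset of the system-wide VM counters the bench would track.
struct VMCounters {
    let pageins: UInt64
    let pageouts: UInt64
    let compressions: UInt64
}

/// Reads the current system-wide counters, or nil if the Mach call fails.
func readVMCounters() -> VMCounters? {
    var stats = vm_statistics64()
    // The API expects the buffer size as a count of integer_t words.
    var count = mach_msg_type_number_t(
        MemoryLayout<vm_statistics64>.size / MemoryLayout<integer_t>.size)
    let result = withUnsafeMutablePointer(to: &stats) { pointer in
        pointer.withMemoryRebound(to: integer_t.self, capacity: Int(count)) { words in
            host_statistics64(mach_host_self(), HOST_VM_INFO64, words, &count)
        }
    }
    guard result == KERN_SUCCESS else { return nil }
    return VMCounters(pageins: stats.pageins,
                      pageouts: stats.pageouts,
                      compressions: stats.compressions)
}

/// Difference between two samples; the counters only grow, so wrapping
/// subtraction is used purely as a guard against unexpected resets.
func vmDelta(from before: VMCounters, to after: VMCounters) -> VMCounters {
    VMCounters(pageins: after.pageins &- before.pageins,
               pageouts: after.pageouts &- before.pageouts,
               compressions: after.compressions &- before.compressions)
}
```

Because these counters are system-wide, the deltas include activity from every other process, which matches the goal of measuring behaviour under realistic load but makes individual runs noisy.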
bench/polish_corpus/cases.jsonl — one case per line:
    {
      "id": "en_trans_001",
      "category": "transcription_errors",
      "language": "en",
      "raw": "we should use cloud code to open a PR for this branch",
      "app_context": null,
      "references": [
        "Use Claude Code to open a PR for this branch.",
        "We should use Claude Code to open a PR for this branch."
      ],
      "must_contain": ["Claude Code"],
      "must_not_contain": ["cloud code"],
      "notes": "Whisper hears 'Claude' as 'cloud'"
    }
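A bench target could load this file with a small Decodable type, as in the sketch below. PolishCase and loadCases are hypothetical names; the fields mirror the example case above, and treating notes as optional is an assumption. The loader expects true JSONL, meaning each case sits on a single line, unlike the pretty-printed example.

```swift
import Foundation

/// Hypothetical model for one line of cases.jsonl.
struct PolishCase: Decodable, Equatable {
    let id: String
    let category: String
    let language: String
    let raw: String
    let appContext: String?        // "app_context", null outside in-context cases
    let references: [String]
    let mustContain: [String]      // "must_contain"
    let mustNotContain: [String]   // "must_not_contain"
    let notes: String?             // assumption: may be omitted
}

/// Decodes one case per non-empty line.
func loadCases(fromJSONL text: String) throws -> [PolishCase] {
    let decoder = JSONDecoder()
    // Maps snake_case keys such as must_not_contain to mustNotContain.
    decoder.keyDecodingStrategy = .convertFromSnakeCase
    return try text
        .split(whereSeparator: \.isNewline)          // drops empty lines
        .filter { !$0.allSatisfy(\.isWhitespace) }   // drops whitespace-only lines
        .map { try decoder.decode(PolishCase.self, from: Data($0.utf8)) }
}
```

A malformed line makes the whole load throw, which seems preferable for a benchmark corpus to silently skipping cases.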
Target ~80 cases at first launch:
| Bucket | Count | Notes |
|---|---|---|
| transcription_errors (en) | 20 | Real Whisper mishearings, proper nouns |
| rephrase_organize (en) | 20 | Fillers, false starts, multi-clause runs |
| in_context paired sets | 30 | 10 raws × 3 contexts = 30 (SPEC-008) |
| Multilingual (zh/ja/es/fr/de) | 10 | 2 per language across both en buckets |
Cases drawn from: actual WhisperKit-medium output on the existing 177-clip
corpus (bench/out/M4-16GB/report.csv) where the polish step has work
to do, plus hand-crafted cases for failure modes Whisper doesn’t reach.
Candidate models: gemma3:1b, gemma3:4b-it-qat, qwen2.5:1.5b-instruct, qwen2.5:3b-instruct, llama3.2:1b, llama3.2:3b. The v0.1 incumbent gemma3n:e2b (~5.6 GB; aliased gemma4:e2b in v0.1’s config) is included as a reference data point even though it busts the ≤ 2 GB cap.

MLXLMPolishEngine on mlx-swift-lm cannot ship Gemma 4 today, for three reasons:
1. mlx-swift / mlx-swift-lm don’t register the gemma4 architecture: loading fails with “Model type gemma4 not supported” (mlx-swift#389, still open; Python mlx_lm/mlx-vlm got it on launch day).
2. ionoxffionoxff…). Only “PLE-safe” quants work and they keep PLE + encoders in bf16 → ~7.6 GB at 4-bit (multimodal), not the 3.14 GB text-only GGUF.
3. llama.cpp, not MLX.

SPEC-007a’s “TurboQuant ~4× less memory” claim does not hold for Gemma 4 (PLE breaks it). For this model GGUF is currently smaller and working while MLX is bigger / broken / Swift-unsupported, so the cleanest in-process path (no Ollama daemon dependency, the same privacy story MLX promised) is to embed llama.cpp via a Swift binding (e.g. mattt/llama.swift or LocalLLMClient, which wraps both llama.cpp and MLX behind one API). Runtime roadmap:
1. OllamaPolishEngine (HTTP to a local Ollama daemon); validates the prompt + pipeline with zero new deps.
2. In-process llama.cpp engine implementing the same TextPolishEngine protocol; no Ollama install required.
3. MLXLMPolishEngine, gated on mlx-swift#389 landing and a small PLE-safe text-only Gemma 4 build appearing.

All three implement TextPolishEngine, so swapping engines leaves PolishPipeline, settings, and app wiring untouched.

- v0.1’s thinker.py: concepts only; the Swift port is fresh code.
- The think: False lesson and CJK word-counting heuristic from v0.1 must be ported; they’re real bug fixes, not stylistic choices.