Status: parked (M2.5 — superseded by SPEC-007 v3 prompt-only approach)
Owner: OpenQuackKit/Polish/
Last updated: 2026-05-04
TL;DR — This spec is parked. We tried three rounds of LoRA-distilling a small student model (Gemma 3 1B) from various teachers (Gemma 4 4.6B, Claude Opus). Real-use feedback rated all three rounds at 3/10 because the distilled student inherits whatever bias the synthetic dataset encoded — including “drop information aggressively.” Distillation only makes sense once we have real captured (raw, what-you-actually-wanted) pairs from real use — not synthetic ones.
The current shipping behavior is the off-the-shelf 4.6B Gemma 4 (
gemma4-textonly:Q4_K_M) running with a tight formatting-only prompt. Documented in SPEC-007 §”Runtime prompt v3” andbench/distill_corpus/EXPERIMENTS.md.This document remains as a record of what was tried and why we paused.
The original SPEC-007a tier matrix found gemma4-textonly:Q4_K_M (4.6 B,
3.14 GB resident) as the Standard-tier polish default. The hypothesis was:
distill that 4.6 B teacher into a 1 B student so the polish step fits
8 GB Macs comfortably and runs faster.
| Step | Tool | Output |
|---|---|---|
| Generate raw inputs | bench/distill_corpus/generate_raws.py (and v3 variants) |
Synthetic Whisper-like dictation seeds across 6 languages |
| Run teacher | bench/distill_corpus/teacher_polish.py (or v3_run_teacher.py for Opus) |
(raw, polished) pairs in MLX-LM chat format |
| LoRA fine-tune | mlx_lm.lora against gemma-3-1b-it-bf16 |
Adapter safetensors |
| Fuse | mlx_lm.fuse |
Single fused safetensors |
| Convert to GGUF | convert_hf_to_gguf.py (with tokenizer fix-up) |
Q4_K_M GGUF |
| Import to Ollama | Modelfile with Gemma chat template | openquack-polish:vN |
End-to-end: ~30 minutes per iteration on M4 / 16 GB.
gemma4-textonly:Q4_K_M (the Standard-tier model itself)ctx_002_* deploy cases hallucinated “Let me know
if you have a second.” for “the deploy is done” inputs. 3 of 34
cases.fr_reorg_001 translated to English where v1 kept French).claude -p CLI, parallel batch).
Different prompt for teacher (conservative passthrough-default) than
for runtime — a subtle mismatch we identified later as wrong.hey_just
produced truncated output "let'")Whisper-medium strips fillers (um/uh), strips most stutters, adds basic punctuation, capitalizes sentence starts. The original SPEC-007 prompt told the LLM to “Remove filler words, verbal tics, false starts, and repetitions” — but those weren’t in the input by the time the LLM saw it. The instruction primed the model to find them anyway, and when it couldn’t, it removed content instead.
Direct instruction to drop information. Caused most of the v1/v2/v3 “drops useful info” failures. Removed in the v3 runtime prompt.
The student inherits the teacher’s distribution. If the dataset teaches aggressive concision, the student will be aggressive — and no amount of inference-time prompting fully reverses it. Get the dataset right before distilling, not after.
v3 scored composite 3.35 (better than v1’s 3.18) and felt like 3/10 in real use because the bench corpus didn’t exercise the long-tail patterns where the model damaged content. Bench → ship is risky without a real-use loop.
This was the user’s thesis and the bench confirmed it. The off-the-shelf
4.6 B Gemma 4 (3.14 GB resident, no fine-tuning) scored 18/18 on the
canonical case set with the right prompt, while the v3 distilled 1 B
(1.0 GB resident, 30 min training, careful Opus-generated dataset)
fails on hey_just and damages real dictations.
The implication for OpenQuack’s roadmap: prefer the latest off-the- shelf model + careful prompt over a distilled smaller model on synthetic data. Distillation moves to the queue only when:
The v1/v2/v3 datasets were generated by inserting fillers into clean seeds. But Whisper would have stripped those fillers before they reached the LLM in production. The training distribution didn’t match the production distribution. The student learned a task that doesn’t exist in real use.
We measured Gemma 4 E2B at:
All preserve the model’s quality reasonably well. Distillation to 1 GB was supposed to beat all of these but didn’t, because distillation inherits dataset bias. Quantization is much safer for size reduction.
The unexplored win: actual MLX TurboQuant DWQ (mlx_lm.dwq), which
claims ~4× memory reduction at parity quality. Not yet benched.
| Mode | What runs | Latency added on top of Whisper transcribe |
|---|---|---|
| Off (regex only) | TextPolisher regex pipeline |
~0 ms |
| Standard (current default off → on) | gemma4-textonly:Q4_K_M | ~0.65 s mean, ~1.2 s P95 |
| (Distilled student v3, no longer default) | openquack-polish:v3 | ~0.74 s mean |
Polish only fires when the user has the toggle on. End-to-end perceived latency for a 5-second dictation:
Recording 5.0 s (user-controlled)
Whisper transcribe 0.5–1.5 s
Polish (if on) 0.65 s mean
Paste ~0 s
Total 5.5–7.0 s
The polish step adds noticeable but tolerable latency. The hardware-tier gate (SPEC-007a) restricts this to 16 GB+ Macs by default.
If anyone wants to revive distillation:
bench/distill_corpus/generate_raws.py — v1 raw generator (synthetic, filler-heavy)bench/distill_corpus/generate_informational_raws.py — v2 expansion (informational seeds)bench/distill_corpus/v3_generate_raws.py — v3 generator (more diverse, 14 categories × 6 languages)bench/distill_corpus/teacher_polish.py — local-model teacher runner (uses Ollama)bench/distill_corpus/v3_run_teacher.py — Opus-as-teacher runner (uses claude -p)bench/distill_corpus/bench_student.py — runs a fused mlx model over the polish corpusbench/distill_corpus/v3_dataset/{train,valid,test}.jsonl — the v3 datasetbench/distill_corpus/runtime_cases.jsonl — canonical 18-case test corpus (used to validate any model+prompt combo)bench/distill_corpus/test_runtime_prompt.py — runs the test corpus, prints scorecardbench/distill_corpus/EXPERIMENTS.md — running experiment logIn Ollama (kept around for A/B):
openquack-polish:v1 (1.0 GB)openquack-polish:v2 (1.0 GB)openquack-polish:v3 (1.0 GB) — was default brieflyopenquack-polish:v3a (1.0 GB, same as v3)gemma4-textonly:Q4_K_M (3.1 GB) — current default--mask-prompt in mlx_lm.lora so the student learns response-only
behavior, not prompt memorization).If those conditions are met, revive this spec and re-run.