openquack

SPEC-016 — Distilled polish model (PARKED)

Status: parked (M2.5 — superseded by SPEC-007 v3 prompt-only approach) Owner: OpenQuackKit/Polish/ Last updated: 2026-05-04

TL;DR — This spec is parked. We tried three rounds of LoRA-distilling a small student model (Gemma 3 1B) from various teachers (Gemma 4 4.6B, Claude Opus). Real-use feedback rated all three rounds at 3/10 because the distilled student inherits whatever bias the synthetic dataset encoded — including “drop information aggressively.” Distillation only makes sense once we have real captured (raw, what-you-actually-wanted) pairs from real use — not synthetic ones.

The current shipping behavior is the off-the-shelf 4.6B Gemma 4 (gemma4-textonly:Q4_K_M) running with a tight formatting-only prompt. Documented in SPEC-007 §”Runtime prompt v3” and bench/distill_corpus/EXPERIMENTS.md.

This document remains as a record of what was tried and why we paused.

Why we tried distillation

The original SPEC-007a tier matrix found gemma4-textonly:Q4_K_M (4.6 B, 3.14 GB resident) as the Standard-tier polish default. The hypothesis was: distill that 4.6 B teacher into a 1 B student so the polish step fits 8 GB Macs comfortably and runs faster.

What we built (the v1/v2/v3 distillation pipeline)

Step Tool Output
Generate raw inputs bench/distill_corpus/generate_raws.py (and v3 variants) Synthetic Whisper-like dictation seeds across 6 languages
Run teacher bench/distill_corpus/teacher_polish.py (or v3_run_teacher.py for Opus) (raw, polished) pairs in MLX-LM chat format
LoRA fine-tune mlx_lm.lora against gemma-3-1b-it-bf16 Adapter safetensors
Fuse mlx_lm.fuse Single fused safetensors
Convert to GGUF convert_hf_to_gguf.py (with tokenizer fix-up) Q4_K_M GGUF
Import to Ollama Modelfile with Gemma chat template openquack-polish:vN

End-to-end: ~30 minutes per iteration on M4 / 16 GB.

Three rounds of training, three rounds of disappointment

v1 (2026-05-03)

v2 (2026-05-03)

v3 (2026-05-04)

What we discovered (the genuinely useful insights)

1. Whisper already does most of what the prompt was asking for

Whisper-medium strips fillers (um/uh), strips most stutters, adds basic punctuation, capitalizes sentence starts. The original SPEC-007 prompt told the LLM to “Remove filler words, verbal tics, false starts, and repetitions” — but those weren’t in the input by the time the LLM saw it. The instruction primed the model to find them anyway, and when it couldn’t, it removed content instead.

2. “Keep it concise — shorter than the input” is the most damaging line of prompt we wrote

Direct instruction to drop information. Caused most of the v1/v2/v3 “drops useful info” failures. Removed in the v3 runtime prompt.

3. Distilling baked-in bad behavior is permanent

The student inherits the teacher’s distribution. If the dataset teaches aggressive concision, the student will be aggressive — and no amount of inference-time prompting fully reverses it. Get the dataset right before distilling, not after.

4. Composite quality scores can lie about real use

v3 scored composite 3.35 (better than v1’s 3.18) and felt like 3/10 in real use because the bench corpus didn’t exercise the long-tail patterns where the model damaged content. Bench → ship is risky without a real-use loop.

5. The latest model usually beats a fine-tuned smaller model

This was the user’s thesis and the bench confirmed it. The off-the-shelf 4.6 B Gemma 4 (3.14 GB resident, no fine-tuning) scored 18/18 on the canonical case set with the right prompt, while the v3 distilled 1 B (1.0 GB resident, 30 min training, careful Opus-generated dataset) fails on hey_just and damages real dictations.

The implication for OpenQuack’s roadmap: prefer the latest off-the- shelf model + careful prompt over a distilled smaller model on synthetic data. Distillation moves to the queue only when:

6. Synthetic raw inputs ≠ what production sees

The v1/v2/v3 datasets were generated by inserting fillers into clean seeds. But Whisper would have stripped those fillers before they reached the LLM in production. The training distribution didn’t match the production distribution. The student learned a task that doesn’t exist in real use.

7. Quantization vs distillation — for size, prefer quantization

We measured Gemma 4 E2B at:

All preserve the model’s quality reasonably well. Distillation to 1 GB was supposed to beat all of these but didn’t, because distillation inherits dataset bias. Quantization is much safer for size reduction.

The unexplored win: actual MLX TurboQuant DWQ (mlx_lm.dwq), which claims ~4× memory reduction at parity quality. Not yet benched.

Latency impact (current shipping behavior)

Mode What runs Latency added on top of Whisper transcribe
Off (regex only) TextPolisher regex pipeline ~0 ms
Standard (current default off → on) gemma4-textonly:Q4_K_M ~0.65 s mean, ~1.2 s P95
(Distilled student v3, no longer default) openquack-polish:v3 ~0.74 s mean

Polish only fires when the user has the toggle on. End-to-end perceived latency for a 5-second dictation:

Recording          5.0 s   (user-controlled)
Whisper transcribe 0.5–1.5 s
Polish (if on)     0.65 s mean
Paste              ~0 s
Total              5.5–7.0 s

The polish step adds noticeable but tolerable latency. The hardware-tier gate (SPEC-007a) restricts this to 16 GB+ Macs by default.

Real artifacts on disk

If anyone wants to revive distillation:

In Ollama (kept around for A/B):

What would unpark this spec

  1. A capture mechanism in the app that logs (raw, what-you-pasted) pairs locally for users who opt in. Real preferences from real use.
  2. At least 100 captured pairs before the next distillation attempt.
  3. Match training prompt to runtime prompt exactly (and use --mask-prompt in mlx_lm.lora so the student learns response-only behavior, not prompt memorization).
  4. A clear product reason to want a smaller model — currently the 3 GB Gemma 4 fits 16 GB Macs comfortably and there’s no urgent pressure to ship at 1 GB.

If those conditions are met, revive this spec and re-run.

References