openquack

SPEC-016 — Distilled polish model (PARKED)

Status: parked (M2.5 — superseded by SPEC-007 v3 prompt-only approach) Owner: OpenQuackKit/Polish/ Last updated: 2026-05-04

TL;DR — This spec is parked. We tried three rounds of LoRA-distilling a small student model (Gemma 3 1B) from various teachers (Gemma 4 4.6B, Claude Opus). Real-use feedback rated all three rounds at 3/10 because the distilled student inherits whatever bias the synthetic dataset encoded — including “drop information aggressively.” Distillation only makes sense once we have real captured (raw, what-you-actually-wanted) pairs from real use — not synthetic ones.

The current shipping behavior is the off-the-shelf 4.6B Gemma 4 (gemma4-textonly:Q4_K_M) running with a tight formatting-only prompt. Documented in SPEC-007 §”Runtime prompt v3” and bench/distill_corpus/EXPERIMENTS.md.

This document remains as a record of what was tried and why we paused.

Why we tried distillation

The original SPEC-007a tier matrix found gemma4-textonly:Q4_K_M (4.6 B, 3.14 GB resident) as the Standard-tier polish default. The hypothesis was: distill that 4.6 B teacher into a 1 B student so the polish step fits 8 GB Macs comfortably and runs faster.

What we built (the v1/v2/v3 distillation pipeline)

Step	Tool	Output
Generate raw inputs	`bench/distill_corpus/generate_raws.py` (and v3 variants)	Synthetic Whisper-like dictation seeds across 6 languages
Run teacher	`bench/distill_corpus/teacher_polish.py` (or `v3_run_teacher.py` for Opus)	(raw, polished) pairs in MLX-LM chat format
LoRA fine-tune	`mlx_lm.lora` against gemma-3-1b-it-bf16	Adapter safetensors
Fuse	`mlx_lm.fuse`	Single fused safetensors
Convert to GGUF	`convert_hf_to_gguf.py` (with tokenizer fix-up)	Q4_K_M GGUF
Import to Ollama	Modelfile with Gemma chat template	`openquack-polish:vN`

End-to-end: ~30 minutes per iteration on M4 / 16 GB.

Three rounds of training, three rounds of disappointment

v1 (2026-05-03)

Teacher: gemma4-textonly:Q4_K_M (the Standard-tier model itself)
Dataset: 299 train + 35 valid + 17 test, synthetic — generated by inserting fillers into clean seeds, then running the teacher
Hyperparameters: LoRA, 16 layers, lr 1e-4, batch 2, 300 iters, ~15 min training
Result: Composite 3.18 (Claude-as-judge, 34-case bench), beat every non-Gemma-4 candidate (gemma3:1b base 2.67), approached the teacher (3.55) at ~28% of the parameters.
Failure mode: ctx_002_* deploy cases hallucinated “Let me know if you have a second.” for “the deploy is done” inputs. 3 of 34 cases.

v2 (2026-05-03)

Teacher: same
Dataset: 390 train (added 91 informational examples designed to fix the v1 ctx_002_* failure)
Hyperparameters: same as v1, 350 iters
Result: Composite 3.35 (+0.17). D4 in-context jumped 2.75 → 3.88 (the ctx_002_* cases are fixed).
Failure mode: introduced a French-translation regression (fr_reorg_001 translated to English where v1 kept French).

v3 (2026-05-04)

Teacher: claude-opus-4-7 (via claude -p CLI, parallel batch). Different prompt for teacher (conservative passthrough-default) than for runtime — a subtle mismatch we identified later as wrong.
Dataset: 351 curated pairs across 14 categories × 6 languages
Hyperparameters: same
Result by composite: v3 (iter-200) ≈ on-par with v2; some specific wins (multilingual preserved), some specific losses (hey_just produced truncated output "let'")
Result by user feedback: 3/10 in real use — the model dropped useful information across many real dictations. Composite score didn’t predict real-use quality.

What we discovered (the genuinely useful insights)

1. Whisper already does most of what the prompt was asking for

Whisper-medium strips fillers (um/uh), strips most stutters, adds basic punctuation, capitalizes sentence starts. The original SPEC-007 prompt told the LLM to “Remove filler words, verbal tics, false starts, and repetitions” — but those weren’t in the input by the time the LLM saw it. The instruction primed the model to find them anyway, and when it couldn’t, it removed content instead.

2. “Keep it concise — shorter than the input” is the most damaging line of prompt we wrote

Direct instruction to drop information. Caused most of the v1/v2/v3 “drops useful info” failures. Removed in the v3 runtime prompt.

3. Distilling baked-in bad behavior is permanent

The student inherits the teacher’s distribution. If the dataset teaches aggressive concision, the student will be aggressive — and no amount of inference-time prompting fully reverses it. Get the dataset right before distilling, not after.

4. Composite quality scores can lie about real use

v3 scored composite 3.35 (better than v1’s 3.18) and felt like 3/10 in real use because the bench corpus didn’t exercise the long-tail patterns where the model damaged content. Bench → ship is risky without a real-use loop.

5. The latest model usually beats a fine-tuned smaller model

This was the user’s thesis and the bench confirmed it. The off-the-shelf 4.6 B Gemma 4 (3.14 GB resident, no fine-tuning) scored 18/18 on the canonical case set with the right prompt, while the v3 distilled 1 B (1.0 GB resident, 30 min training, careful Opus-generated dataset) fails on hey_just and damages real dictations.

The implication for OpenQuack’s roadmap: prefer the latest off-the- shelf model + careful prompt over a distilled smaller model on synthetic data. Distillation moves to the queue only when:

We have ≥100 real captured pairs from production use, AND
The latest off-the-shelf option doesn’t fit the hardware tier

6. Synthetic raw inputs ≠ what production sees

The v1/v2/v3 datasets were generated by inserting fillers into clean seeds. But Whisper would have stripped those fillers before they reached the LLM in production. The training distribution didn’t match the production distribution. The student learned a task that doesn’t exist in real use.

7. Quantization vs distillation — for size, prefer quantization

We measured Gemma 4 E2B at:

Q4_K_M (Ollama default): 3.14 GB resident
Text-only Q4_K_M (unsloth GGUF stripped of multimodal): 3.14 GB
UD-Q3_K_XL: 2.91 GB
UD-IQ2_M: 2.31 GB
MLX 4-bit: ~3.6 GB

All preserve the model’s quality reasonably well. Distillation to 1 GB was supposed to beat all of these but didn’t, because distillation inherits dataset bias. Quantization is much safer for size reduction.

The unexplored win: actual MLX TurboQuant DWQ (mlx_lm.dwq), which claims ~4× memory reduction at parity quality. Not yet benched.

Latency impact (current shipping behavior)

Mode	What runs	Latency added on top of Whisper transcribe
Off (regex only)	`TextPolisher` regex pipeline	~0 ms
Standard (current default off → on)	gemma4-textonly:Q4_K_M	~0.65 s mean, ~1.2 s P95
(Distilled student v3, no longer default)	openquack-polish:v3	~0.74 s mean

Polish only fires when the user has the toggle on. End-to-end perceived latency for a 5-second dictation:

Recording          5.0 s   (user-controlled)
Whisper transcribe 0.5–1.5 s
Polish (if on)     0.65 s mean
Paste              ~0 s
Total              5.5–7.0 s

The polish step adds noticeable but tolerable latency. The hardware-tier gate (SPEC-007a) restricts this to 16 GB+ Macs by default.

Real artifacts on disk

If anyone wants to revive distillation:

bench/distill_corpus/generate_raws.py — v1 raw generator (synthetic, filler-heavy)
bench/distill_corpus/generate_informational_raws.py — v2 expansion (informational seeds)
bench/distill_corpus/v3_generate_raws.py — v3 generator (more diverse, 14 categories × 6 languages)
bench/distill_corpus/teacher_polish.py — local-model teacher runner (uses Ollama)
bench/distill_corpus/v3_run_teacher.py — Opus-as-teacher runner (uses claude -p)
bench/distill_corpus/bench_student.py — runs a fused mlx model over the polish corpus
bench/distill_corpus/v3_dataset/{train,valid,test}.jsonl — the v3 dataset
bench/distill_corpus/runtime_cases.jsonl — canonical 18-case test corpus (used to validate any model+prompt combo)
bench/distill_corpus/test_runtime_prompt.py — runs the test corpus, prints scorecard
bench/distill_corpus/EXPERIMENTS.md — running experiment log

In Ollama (kept around for A/B):

openquack-polish:v1 (1.0 GB)
openquack-polish:v2 (1.0 GB)
openquack-polish:v3 (1.0 GB) — was default briefly
openquack-polish:v3a (1.0 GB, same as v3)
gemma4-textonly:Q4_K_M (3.1 GB) — current default

What would unpark this spec

A capture mechanism in the app that logs (raw, what-you-pasted) pairs locally for users who opt in. Real preferences from real use.
At least 100 captured pairs before the next distillation attempt.
Match training prompt to runtime prompt exactly (and use --mask-prompt in mlx_lm.lora so the student learns response-only behavior, not prompt memorization).
A clear product reason to want a smaller model — currently the 3 GB Gemma 4 fits 16 GB Macs comfortably and there’s no urgent pressure to ship at 1 GB.

If those conditions are met, revive this spec and re-run.

References

SPEC-007 — parent polish spec
SPEC-007a — model + quantization tier matrix
SPEC-007b — runtime UX design
bench/distill_corpus/EXPERIMENTS.md — running experiment log
bench/distill_corpus/runtime_cases.jsonl — canonical test corpus
bench/out/polish/M4-16GB-distilled-q4/ — v1 raw bench data (gitignored)
bench/out/polish/M4-16GB-distilled-q4-v2/ — v2 raw bench data (gitignored)

This site is open source. Improve this page.