Status: draft (M2.5 — Wave 2 candidate, target ship 2026-05-18)
Owner: OpenQuackKit/Polish/ — extends SPEC-007, no new harness
Last updated: 2026-05-03
Settle SPEC-007’s open question — “Default model — settle once we benchmark the candidate set on the three-dimension corpus above” — with Gemma 4 added as a first-class candidate, benched at multiple quantisation tiers across the M-series Mac hardware spectrum.
The deliverable is two artefacts:
docs/BENCHMARKS-polish.md — that can ship
alongside a Wave-2 release one day before the Google I/O 2026 keynote
(2026-05-19, 10:00 PT).

This is an addendum, not a new spec. SPEC-007 already defines:
the harness (openquack-polish-bench), the three quality dimensions,
the LLM-as-judge scoring (Haiku 4.5 primary, Sonnet 4.6 adversarial,
Opus 4.7 tiebreak), the latency / memory-pressure protocol, the corpus
shape. None of that changes. This addendum only:
Three signals align:
The original SPEC-007 candidate list (gemma3:1b, gemma3:4b-it-qat,
qwen2.5:*, llama3.2:*, gemma3n:e2b reference) predates Gemma 4 by
weeks. Refreshing it now is the cheap part; the marketing-window
alignment is the asymmetric upside.
openquack-polish-bench runs unchanged;
this addendum extends its candidate matrix.

Four models × three quant tiers × three Mac hardware tiers gives 36 nominal cells. Many are NA (the model at that quant exceeds the memory left after Whisper loads); the realistic cell count is ~20.
| Model | Native | Q4 (TurboQuant / GGUF) | Q5_K_M (GGUF) | Why included |
|---|---|---|---|---|
| Gemma 4 E2B | ~10 GB | ~3.2 GB | ~4.0 GB | Marketing-aligned hero candidate |
| Gemma 4 E4B | ~16 GB | ~5.0 GB | ~6.2 GB | Quality ceiling within reach |
| Phi-4-mini (3.8 B) | ~7.6 GB | ~2.4 GB | ~3.0 GB | Reasoning-tuned competitor |
| Qwen 3 (3 B) | ~6 GB | ~1.9 GB | ~2.4 GB | Multilingual competitor |
The original SPEC-007 list (gemma3:1b, gemma3:4b-it-qat,
qwen2.5:1.5b-instruct, qwen2.5:3b-instruct, llama3.2:1b,
llama3.2:3b, gemma3n:e2b reference) is kept as historical baseline
in bench/out/polish/historical/ but does not inform the M2.5 default
choice. Newer weights, newer instruction-tuning, newer judge calibration
make a fresh run the honest comparison.
Q3 / Q2 not benched in this round. Polish is structurally a constrained task (filler strip + light rewrite), so Q3 may hold up — but adding it expands the matrix beyond what’s testable in the May 6–17 window. Defer to a follow-up bench post.
| Tier | Spec | Headroom for polish | Default candidate ceiling |
|---|---|---|---|
| Compatibility | M-series 8 GB | ~5 GB free w/ Whisper medium loaded | Q4 ≤ 2.5 GB |
| Standard | M-series 16 GB | ~12 GB free | Q4 ≤ 5 GB |
| Premium | M-series 24 GB+ | ~20 GB free | Q5 ≤ 7 GB; tier opens up E4B Q5 |
Auto-detect at first launch; override available in Settings → Polish.
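The tier table and the auto-detect rule can be sketched as a small lookup. This is a minimal illustration, not the shipped detection code: the `Tier` type, `detect_tier`, and the `override` parameter are hypothetical names, and the handling of RAM sizes that fall between the listed tiers (for example 18 GB mapping to Standard) is an assumption the table does not settle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    min_ram_gb: int       # smallest physical RAM that qualifies for the tier
    headroom_gb: float    # approximate free memory for polish (tier table)
    ceiling_quant: str    # quant level named in "Default candidate ceiling"
    ceiling_gb: float     # size ceiling for the default candidate


# Ordered largest first so the first match wins.
TIERS = (
    Tier("Premium", 24, 20.0, "Q5", 7.0),
    Tier("Standard", 16, 12.0, "Q4", 5.0),
    Tier("Compatibility", 8, 5.0, "Q4", 2.5),
)


def detect_tier(ram_gb: float, override: Optional[str] = None) -> Tier:
    """Pick a tier from physical RAM, or honour a user override by name."""
    if override is not None:
        for tier in TIERS:
            if tier.name == override:
                return tier
        raise ValueError(f"unknown tier: {override}")
    for tier in TIERS:
        if ram_gb >= tier.min_ram_gb:
            return tier
    # Assumption: anything below 8 GB is treated as Compatibility.
    return TIERS[-1]
```

The `override` argument stands in for the Settings → Polish choice; how that setting is stored is outside this sketch.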
SPEC-007 §“Recommendation hierarchy” reads *“Smaller-that-clears-quality
> faster > more accurate”* with an implicit ≤ 2 GB cap that a v0.1 heuristic produced. Gemma 4 E2B Q4 at ~3.2 GB exceeds this cap on the Compatibility tier but fits comfortably on Standard / Premium.
Resolution: the ≤ 2 GB ceiling applies to the Compatibility tier only. Standard and Premium tiers default to whatever the bench identifies as the quality-per-MB Pareto winner within their headroom budget. SPEC-007’s hierarchy ordering still holds within each tier.
If the Compatibility tier has no candidate that clears the quality bar under 2.5 GB (i.e. every candidate needs more than 2.5 GB to score acceptably), the tier ships with polish off by default and a Settings toggle that lets the user opt in to degraded-quality polish. We do not silently downgrade quality on hardware that can’t host it.
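The per-tier resolution, including the polish-off fallback, could look roughly like the sketch below. It assumes the hierarchy is read as "smaller, then faster, then more accurate" among candidates that clear the quality bar and fit the tier ceiling. `BenchResult`, `pick_default`, the 0 to 1 quality scale, and the quality bar value are all hypothetical; the numbers in the usage note are placeholders, not measurements.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class BenchResult:
    model: str
    size_gb: float          # on-disk / in-memory size at the benched quant
    quality: float          # hypothetical aggregate judge score in [0, 1]
    mean_latency_s: float   # mean polish latency from the bench


def pick_default(
    results: Sequence[BenchResult],
    ceiling_gb: float,
    quality_bar: float,
) -> Optional[BenchResult]:
    """Return the tier default, or None when polish should ship off."""
    eligible = [
        r for r in results
        if r.size_gb <= ceiling_gb and r.quality >= quality_bar
    ]
    if not eligible:
        # No candidate both fits and clears the bar: polish off by default.
        return None
    # Assumed ordering: smaller first, then faster, then more accurate.
    return min(
        eligible,
        key=lambda r: (r.size_gb, r.mean_latency_s, -r.quality),
    )
```

With placeholder results of 1.9 GB / 0.70, 2.4 GB / 0.75, and 3.2 GB / 0.85 and a bar of 0.80, a 2.5 GB ceiling returns `None` (polish off), while a 5 GB ceiling returns the 3.2 GB candidate.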
The single-most-important test, copied from SPEC-007 verbatim:
Hold WhisperKit medium + polish model warm, idle 60 s with a synthetic background allocation (~2 GB) to simulate a user’s other apps, then measure polish latency on a fresh utterance.
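As a rough illustration of that protocol (not the openquack-polish-bench implementation), the measurement can be sketched as: allocate the synthetic ballast, idle, then time one polish call. The `polish` callable, `allocate_ballast`, and `measure_polish_latency` are hypothetical names, and the sketch assumes the caller has already loaded WhisperKit medium and the polish model so both are warm.

```python
import time
from typing import Callable

PAGE_BYTES = 4096  # stride used to touch the allocation


def allocate_ballast(n_bytes: int) -> bytearray:
    """Allocate n_bytes and write into it at every stride so it is backed."""
    ballast = bytearray(n_bytes)
    for offset in range(0, n_bytes, PAGE_BYTES):
        ballast[offset] = 1
    return ballast


def measure_polish_latency(
    polish: Callable[[str], str],
    utterance: str,
    ballast_bytes: int = 2 * 1024**3,  # ~2 GB synthetic background load
    idle_s: float = 60.0,              # idle period before measuring
) -> float:
    """Return wall-clock seconds for one polish call under memory pressure."""
    ballast = allocate_ballast(ballast_bytes)
    time.sleep(idle_s)
    start = time.perf_counter()
    polish(utterance)
    elapsed = time.perf_counter() - start
    del ballast  # keep the ballast alive until the measurement is done
    return elapsed
```

Each call yields one latency sample on a fresh utterance; repeated calls with different utterances would build the sample set that the latency thresholds are applied to.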
For Gemma 4 E2B Q4 on the Standard tier (M-series 16 GB) with Whisper medium warm, a 2 GB synthetic load, and Cursor / Mail open, the test target is < 1.2 s mean polish latency on a 50-word utterance. If P95 exceeds 2.5 s, the candidate is disqualified for that tier.
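The two thresholds can be applied to a set of latency samples as in the sketch below. The function names are hypothetical, and the nearest-rank definition of P95 is an assumption; SPEC-007 may define the percentile differently.

```python
from statistics import mean
from typing import Dict, Sequence, Union

MEAN_TARGET_S = 1.2   # target: mean polish latency below this
P95_LIMIT_S = 2.5     # disqualify the candidate if P95 exceeds this


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile (assumed definition)."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil of 0.95 * n
    return ordered[rank - 1]


def latency_gate(samples: Sequence[float]) -> Dict[str, Union[float, bool]]:
    """Summarise samples against the mean target and the P95 limit."""
    if not samples:
        raise ValueError("need at least one latency sample")
    mean_s = mean(samples)
    p95_s = p95(samples)
    return {
        "mean_s": mean_s,
        "p95_s": p95_s,
        "meets_target": mean_s < MEAN_TARGET_S,
        "disqualified": p95_s > P95_LIMIT_S,
    }
```

Keeping the two checks separate reflects the text: missing the mean target is a miss on the goal, whereas exceeding the P95 limit removes the candidate from that tier.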
docs/BENCHMARKS-polish.md

Mirror the structure of docs/BENCHMARKS.md. Required sections:
This bench is the technical substrate for OpenQuack’s Wave 2 launch. Pacing:
| Date | Action |
|---|---|
| 2026-05-06 | SPEC-007a merged; bench harness expanded with new candidates |
| 2026-05-06 → 09 | Bench runs across host tags; Gemma + Phi + Qwen |
| 2026-05-10 | Quality scores compiled; tier defaults selected |
| 2026-05-11 → 14 | Integration into MLXLMPolishEngine; Settings UI |
| 2026-05-15 → 17 | Dogfood, polish post drafted, viral assets generated |
| 2026-05-18 | Ship v0.2.0-polish with default model per tier |
| 2026-05-19 | Wave 2 launch — bench post + integration tweet, Google I/O day |
The bench post is publishable regardless of which model wins. If Gemma wins, Wave 2 leads with Google amplification mechanics (@osanseviero, @awnihannun, Gemmaverse submission). If Gemma loses, Wave 2 leads with the comparison data itself — “we benched four 2-4 B models across three quant formats on three Mac tiers, here’s the Pareto frontier” — and Gemma ships as a supported alternative rather than the default. Either path produces a durable artefact.
| Risk | Mitigation |
|---|---|
| Gemma loses to Phi/Qwen on polish quality | Ship the winner; bench post still publishes; Gemma stays as a Settings option |
| Compatibility tier (8 GB) has no viable Q4 candidate | Polish ships off-by-default at that tier; Settings toggle exists |
| TurboQuant builds for Gemma 4 unstable on M-series at bench time | Fall back to Q4_K_M GGUF + GGUF runner; document the gap; revisit |
| Apache-2.0 model files redistributed by us trigger bundle-size concerns | Models download on first polish use, not bundled. Same pattern as Whisper. |
| Bench harness can’t load all four models in the May 6–10 window | Drop Llama 4 (already not in candidate set); keep Gemma + Phi + Qwen |
| I/O timing slips (delayed announcement, etc.) | Wave 2 detaches from I/O calendar; ships when ready, loses the timing bonus only |
Should mlx-vlm or mlx-lm be the runtime? SPEC-007 picked
mlx-swift-lm. Gemma 4 is multimodal so mlx-vlm is the canonical
loader, but for text-only polish mlx-lm may suffice and avoid a
fatter dependency. Settle during integration.

bench/polish_corpus/cases.jsonl is currently in-tree. Worth
promoting to a separate openquack/polish-corpus repo for
community contribution? Defer to a v0.3+ decision.