Update (2026-05-31): Finding 1 has a resolution, and it’s a good lesson in not trusting your own conclusion. The “WhisperKit hallucinates on non-English audio” result was a config bug on my end: DecodingOptions.detectLanguage defaults to false, so my engine never ran detection and the decoder translated non-English speech to English. One line (detectLanguage = true) took WhisperKit medium from 253% to 16.7% WER, matching Lightning. The bench did its job (it made the bug impossible to ignore), but my read of it (“auto-detect is fundamentally fragile”) was wrong. Original text left intact below; SPEC-021 / #63 has the details.
I needed to pick a default Whisper size for a Mac dictation app I built, and I didn’t trust the published numbers. Most Whisper benchmarks publish one WER on one corpus on one machine. The choice of model size, the choice of engine implementation, and the kinds of audio someone actually feeds a dictation app — none of that gets crossed in a single matrix.
So I built one. Five Whisper sizes (tiny, base, small, medium, distil-large-v3), two engines (WhisperKit and Lightning-Whisper-MLX), and 177 clips spanning real human speech, multi-accent English TTS, six non-English languages, and noise-augmented variants at three SNRs. Single host, M4 / 16 GB / macOS 15. The numbers and the raw CSVs are all in the repo.
This post is what I learned. There were three findings I didn’t expect.
The corpus is 177 WAV files, 16 kHz mono PCM, partitioned into five buckets:
dev-clean (CC BY 4.0, openslr.org/12). The closest thing in the set to what dictation feels like in practice.

Engines: WhisperKit (argmaxinc/argmax-oss-swift 0.18+) and Lightning (lightning-whisper-mlx 0.0.10 driven from a long-running Python subprocess). Models: the five sizes above. Metrics: WER, CER, RTF (wall / audio), cold start (engine init to first transcribe), and peak RSS sampled at 100 ms via mach_task_basic_info.
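For readers who want the WER and CER definitions spelled out, here is a minimal sketch. It is not the bench's scoring code: the text normalisation (lowercase, whitespace split) is an assumption made for illustration, and the function names are hypothetical. Both metrics are edit distance divided by reference length, so neither is capped at 100%.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences; every edit costs 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: a reference token is missing
                curr[j - 1] + 1,     # insertion: an extra hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def tokens(text):
    # Assumed normalisation: lowercase and split on whitespace.
    return text.lower().split()


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref = tokens(reference)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, tokens(hypothesis)) / len(ref)


def cer(reference, hypothesis):
    """Character error rate: character-level edits over reference characters."""
    ref = " ".join(tokens(reference))
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, " ".join(tokens(hypothesis))) / len(ref)
```

A four-word reference scored against an unrelated ten-word hypothesis needs ten edits (four substitutions plus six insertions), which gives a WER of 2.5, or 250%.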
What this bench doesn’t cover: real conversational speech with overlap, very long clips under sustained system load, anything but one host. Single-host data has an obvious ceiling. The bench script smoke-passes in CI; if you have a non-M4 Mac, PRs to bench/out/ are the most useful contribution this project can take right now.
WhisperKit medium, on M4 / 16 GB, on real human speech:
| Bucket | WER | CER | RTF | Peak RSS |
|---|---|---|---|---|
| LibriSpeech (real human) | 2.6 % | 1.3 % | 0.22x | 197 MB |
| Voices (multi-accent TTS) | 1.3 % | 0.3 % | 0.31x | 197 MB |
| Noisy (white + pink, 5-20 dB SNR) | 6.3 % | 2.7 % | 0.31x | 197 MB |
That’s the row I’d quote. RTF 0.22x means transcription runs roughly five times faster than the audio plays, on a baseline laptop chip. Cold start is 27 seconds the first time, then never again — model caches to disk after the download.
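The RTF arithmetic is simple enough to sketch. The helper names below are hypothetical and the ten-second clip is an illustrative value, not a measurement from the bench.

```python
def rtf(wall_seconds, audio_seconds):
    """Real-time factor: wall-clock transcription time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


def speed_factor(rtf_value):
    """How many times faster than playback the transcription runs."""
    if rtf_value <= 0:
        raise ValueError("rtf_value must be positive")
    return 1.0 / rtf_value


# Illustrative only: at RTF 0.22, a 10-second clip takes about 2.2 s of wall time.
example_wall = 0.22 * 10.0
example_speed = speed_factor(rtf(example_wall, 10.0))  # about 4.5x playback speed
```

An RTF below 1.0 means the engine keeps up with live speech; 1 / 0.22 comes out at about 4.5, which is where "roughly five times faster" comes from.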
The full per-engine, per-model, per-bucket matrix is in docs/BENCHMARKS.md. What follows is what I’d missed before I ran it.
This is the one that changed the product.
The multilingual bucket has 12 short clips (averaging ~3 seconds) in six languages, with auto-detect on. Lightning’s medium got 16.7% WER across the set: degraded but usable. WhisperKit’s medium got 253.2% WER. WhisperKit’s base got 284.3%. Same Whisper weights, very different output.
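A WER above 100% is only possible when the output contains far more words than the reference. WER counts substitutions, deletions, and insertions against the reference length, and substitutions plus deletions can never exceed that length, so at 253% the inserted words alone outnumber the reference words by roughly 1.5 to 1. The model isn’t failing to recognise; it’s hallucinating. Spot-checking the per-clip CSVs, the most common failure mode is: a 2-second Spanish clip produces a paragraph of fluent English that has nothing to do with the audio. The same audio fed to Lightning at the same model size produces “el camión está aquí” or close to it.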
Same weights, very different multilingual robustness. That makes this an implementation difference, not a fundamental Whisper limitation. The decoder parameters are the prime suspect: suppress_tokens, no_speech_threshold, language-token forcing, sample length. I haven’t traced it down yet. If anyone has shipped a LangID pre-pass in production or has a working WhisperKit decoder config for short multilingual audio, I’d like to compare notes (issue or PR welcome).
Resolved (see the update up top): the cause was simpler than any tuning knob. detectLanguage was off by default, so detection never ran. One line fixed it; WhisperKit medium went 253% → 16.7%.
The pragmatic fix shipped before the explanation did: an explicit Language picker in Settings, still the right call for anyone who dictates in one language. But with detection actually enabled, auto-detect is a sound default — which is what alpha.17 ships.
distil-large-v3 is faster than medium and the marketing around it suggests “same quality, better latency.” On real human speech it’s close: 3.2% WER vs 2.6% for medium, RTF 0.12x vs 0.22x. Faster, with a small accuracy hit.
On multi-accent English TTS the gap widens: 5.3% vs 1.3%. Four times the WER, for the same audio family. The Voices bucket is synthetic, but it’s also the cleanest signal we have for accent robustness — you control everything except the voice.
The take I came away with: distil is the right pick when latency dominates and the speech is clean. medium is the right pick when accuracy matters and the domain has unfamiliar terms (proper nouns, project names, transcription of speech that isn’t from a podcast host). For a dictation app that runs in the background of normal work, medium won. For a real-time captioning scenario, distil might. They aren’t interchangeable.
tiny and base collapse on noise

The clean-speech numbers for tiny and base look workable: 7-9% WER on LibriSpeech, 12-15% on Voices. You could read those and think “fine for an 8 GB Mac if accuracy isn’t critical.”
Then add 10 dB SNR of pink noise:
| Model | LibriSpeech WER | Noisy WER |
|---|---|---|
| medium | 2.6% | 6.3% |
| small | 4.1% | 11.4% |
| base | 5.9% | 24.0% |
| tiny | 7.2% | 27.3% |
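For anyone wanting to build similar noisy variants, here is a minimal sketch of mixing noise into a clip at a target SNR. It is an assumption about how such augmentation is typically done, not the bench's own script: the function names are hypothetical, the noise here is white Gaussian (generating pink noise is not shown), and clipping is not handled.

```python
import math
import random


def power(samples):
    """Mean squared amplitude of a list of float samples."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that 10*log10(P_speech / P_noise) equals snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech, p_noise = power(speech), power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both be non-silent")
    target_noise_power = p_speech / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / p_noise)
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled


# Illustrative input: one second of a 220 Hz tone at 16 kHz standing in for speech.
rng = random.Random(0)
speech = [0.5 * math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
mixed, scaled_noise = mix_at_snr(speech, noise, snr_db=10)
```

Measuring power over the whole clip is the simplest choice. A clip with long silences would arguably want power measured over the voiced regions only, which this sketch does not attempt.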
tiny and base go from “marginal” to “unusable” the moment you introduce real-world noise. People dictate near A/C, in cafés, with HVAC running. Quoting only the clean-speech number for these sizes would have been misleading. On 8 GB Macs, small is the conservative starting point: it’s the smallest size that stays under 12% noisy WER. The gap versus medium (6.3% vs 11.4% on noise) is significant, and memory isn’t the bottleneck: medium peaks at 197 MB RSS, roughly 2% of 8 GB. The open question is RTF: whether WhisperKit medium runs as fast on M1/M2 8 GB chips as it does on M4. That data doesn’t exist yet. See the contribute section if you’re on older hardware.
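The "smallest size under a noisy-WER cap" rule can be written down directly. This is a sketch using the noisy WER figures from the table above; the function name and the 12% cap as a parameter are illustrative choices, not something the bench script exposes.

```python
# (model, noisy WER in percent), ordered from smallest model to largest.
NOISY_WER = [
    ("tiny", 27.3),
    ("base", 24.0),
    ("small", 11.4),
    ("medium", 6.3),
]


def smallest_model_under(cap_percent, results=NOISY_WER):
    """Return the first (smallest) model whose noisy WER is below the cap, or None."""
    for name, noisy_wer in results:
        if noisy_wer < cap_percent:
            return name
    return None


conservative_default = smallest_model_under(12.0)  # "small"
```

Tightening the cap to 10% moves the answer to medium, which is why the RTF question on older chips matters more than the memory one.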
Three concrete decisions came out of the bench:
- medium is the default. It’s the clearest winner in the bench. At 197 MB peak RSS it fits easily on any current Mac; the open question for 8 GB hardware is RTF on older chips, not memory headroom. M1/M2 bench results would settle this.
- medium was the default, the end-of-dictation wait got bad on long clips. A 5-minute clip finishes offline in 34.4 seconds wall-clock; without streaming the user experience was 30 seconds of staring at a “transcribing…” indicator.

That last one is its own measurement. Streaming the audio in chunks while you speak, then finalising the tail when you stop, gave:
| Length | Offline | Post-stop | Speedup |
|---|---|---|---|
| 30s | 4.36s | 1.55s | 2.8x |
| 1-min | 7.03s | 1.65s | 4.3x |
| 2-min | 13.76s | 2.41s | 5.7x |
| 5-min | 34.44s | 2.77s | 12.4x |
The post-stop wait stops growing with length. The 12x speedup at 5 minutes is what’s interesting; at 30 seconds the speedup is modest (you can’t beat a fast clip by much), but as audio gets long, streaming dominates more and more. WER delta vs offline is within +/- 0.5 pp across all length buckets.
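The Speedup column is just offline wall time divided by post-stop wait. A short sketch that recomputes it from the table's own numbers (the variable and function names are illustrative):

```python
# (clip length, offline seconds, post-stop seconds) from the table above.
ROWS = [
    ("30s", 4.36, 1.55),
    ("1-min", 7.03, 1.65),
    ("2-min", 13.76, 2.41),
    ("5-min", 34.44, 2.77),
]


def speedups(rows):
    """Map each clip length to offline / post-stop, rounded to one decimal."""
    return {label: round(offline / post_stop, 1) for label, offline, post_stop in rows}


table_speedups = speedups(ROWS)
# Growth from the shortest to the longest clip, for each mode.
offline_growth = ROWS[-1][1] / ROWS[0][1]
post_stop_growth = ROWS[-1][2] / ROWS[0][2]
```

Between 30 seconds and 5 minutes of audio the offline time grows about 7.9x while the post-stop wait grows about 1.8x, which is the flattening described above.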
The local-LLM polish step — taking a raw transcript and tightening it into something you’d press send on, without a cloud round-trip — is on the bench list. Same methodology: engines, models, corpus, matrix. I’ll post when the numbers land.
- docs/BENCHMARKS.md
- bench/out/M4-16GB/
- bench/out/stream/M4-16GB-paced/report.md
- bench/corpus/

If you have a non-M4 Mac (M1, M2, M3, Intel, 8 GB, 24 GB+), the bench script smoke-passes in CI and the corpus is checked in. PRs to bench/out/ are the most useful contribution this project can take. The matrix is one host wide right now; it should be many.
The most wanted data point right now: medium RTF on M1/M2 8 GB. Memory isn’t the constraint (197 MB fits easily); the question is whether older Apple Silicon runs medium at comparable speed. If you have an M1 or M2 Mac, that’s the run we’re missing.