Living document. We add a row to the host matrix every time someone runs the bench and submits the result. Numbers below are real, reproducible from bench/out/<host-tag>/, and supersede earlier preliminary runs.
WhisperKit `medium` is the recommended v2 default for English on 16 GB+ Macs. 2.6 % WER on LibriSpeech, 1.3 % on multi-voice English TTS, 6.3 % on noise-augmented speech. RTF 0.22–0.31× (≥3× realtime). Peak RSS 197 MB. Cold start 27 s (cache miss only, once per Mac).
For 8 GB Macs, whisperkit small is the practical choice (165 MB RSS, 4.1 % LibriSpeech WER, but degrades to 11 % on noise).
Non-English auto-detect works. WhisperKit `medium` transcribes the multilingual bucket at 16.7 % WER / 3.7 % CER on auto-detect, which is on par with Lightning and no longer a problem for non-English users. (Builds before alpha.17 shipped a config bug that left detection off and translated non-English audio to English; fixed in SPEC-021.)
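The RTF figures in this report can be sanity-checked with a few lines of arithmetic. The sketch below assumes the usual definition of real-time factor (processing time divided by audio duration), which matches the "RTF 0.22–0.31× (≥3× realtime)" reading above. The function names and the 9.0 s clip length are illustrative and not taken from the bench code.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio. Lower is faster."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def realtime_speedup(rtf: float) -> float:
    """Seconds of audio processed per second of compute (the inverse of RTF)."""
    return 1.0 / rtf


# Illustrative numbers only: a hypothetical 9.0 s clip transcribed in 1.98 s.
rtf = real_time_factor(1.98, 9.0)
print(f"RTF {rtf:.2f}x, {realtime_speedup(rtf):.1f}x realtime")

# The slow end of the quoted range still clears 3x realtime.
print(f"RTF 0.31x is {realtime_speedup(0.31):.1f}x realtime")
```

An RTF below 1.0 means transcription finishes faster than the audio plays.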
| Host tag | Chip | GPU | Memory | macOS | Date |
|---|---|---|---|---|---|
| M4-16GB | Apple M4 | 8-core | 16 GB | 15.6 (24G84) | 2026-04-26 |
- Engines: WhisperKit (argmaxinc/argmax-oss-swift 0.18+, primary) and Lightning (lightning-whisper-mlx 0.0.10 via long-running Python subprocess, comparison-only).
- librispeech/: 20 real human read-speech clips from LibriSpeech dev-clean (CC BY 4.0, openslr.org/12). The closest thing in this set to real-world dictation accuracy.
- voices/: 20 multi-accent English TTS clips (Samantha US, Daniel UK, Karen AU, Fred US-robotic, 5 sentences each).
- noisy/: 120 noise-augmented clips (the 20 voices × white + pink noise × 5 / 10 / 20 dB SNR).
- multilingual/: 12 native-language clips (zh, ja, ko, es, fr, de × 2).
- short/: 5 single-voice TTS sanity-check clips.
- Models: tiny, base, small, medium, distil-large-v3. (large-v3-turbo is unavailable in either engine under that name; tracked as an open question.)
- Peak RSS: read via mach_task_basic_info.

M4-16GB (2026-04-26)

librispeech/ (real human read speech)

| Engine | Model | WER | CER | RTF | Wall (avg) |
|---|---|---|---|---|---|
| whisperkit | medium | 2.6 % | 1.3 % | 0.22× | 1.98 s |
| lightning | medium | 2.8 % | 1.3 % | 0.34× | 2.47 s |
| whisperkit | distil-large-v3 | 3.2 % | 1.3 % | 0.12× | 0.89 s |
| lightning | distil-large-v3 | 3.2 % | 1.3 % | 0.48× | 3.27 s |
| whisperkit | small | 4.1 % | 1.7 % | 0.05× | 0.50 s |
| lightning | small | 4.6 % | 2.3 % | 0.10× | 0.74 s |
| lightning | base | 5.6 % | 2.8 % | 0.04× | 0.29 s |
| whisperkit | base | 5.9 % | 2.3 % | 0.02× | 0.20 s |
| whisperkit | tiny | 7.2 % | 3.0 % | 0.01× | 0.13 s |
| lightning | tiny | 8.7 % | 4.4 % | 0.02× | 0.18 s |
Read: 2.6 % WER on real human read speech is solid — close to what dictation feels like in practice. medium outperforms distil-large-v3 here; small is acceptable but visibly less accurate.
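For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate like the ones above is conventionally computed: word-level Levenshtein distance divided by the number of reference words. The normalization (lowercasing, whitespace splitting) is an assumption for illustration; the bench may normalize text differently, and the sentences are made up.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance over tokens (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a free match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Hypothetical example: one substitution (jumps -> jumped) and one deletion (the).
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"
print(f"WER {wer(reference, hypothesis):.1%}")  # 2 edits / 9 words -> 22.2%
```

CER follows the same recipe with characters in place of words.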
voices/ (multi-voice English TTS)

| Engine | Model | WER | CER | RTF |
|---|---|---|---|---|
| whisperkit | medium | 1.3 % | 0.3 % | 0.31× |
| lightning | medium | 2.0 % | 0.4 % | 0.63× |
| whisperkit | small | 2.7 % | 1.2 % | 0.09× |
| lightning | small | 3.1 % | 1.6 % | 0.19× |
| whisperkit | distil-large-v3 | 5.3 % | 0.7 % | 0.23× |
| lightning | distil-large-v3 | 5.3 % | 0.7 % | 0.97× |
| whisperkit | base | 5.6 % | 2.1 % | 0.04× |
| lightning | base | 9.4 % | 3.7 % | 0.06× |
| lightning | tiny | 12.9 % | 5.1 % | 0.04× |
| whisperkit | tiny | 15.0 % | 5.8 % | 0.02× |
noisy/ (noise-augmented voices)

| Engine | Model | WER | CER | RTF |
|---|---|---|---|---|
| whisperkit | medium | 6.3 % | 2.7 % | 0.31× |
| lightning | medium | 6.3 % | 2.9 % | 0.63× |
| whisperkit | distil-large-v3 | 9.4 % | 3.5 % | 0.23× |
| lightning | distil-large-v3 | 10.0 % | 3.7 % | 0.98× |
| whisperkit | small | 11.4 % | 5.5 % | 0.10× |
| lightning | small | 11.8 % | 5.7 % | 0.19× |
| lightning | base | 22.8 % | 12.5 % | 0.06× |
| whisperkit | base | 24.0 % | 13.2 % | 0.04× |
| whisperkit | tiny | 27.3 % | 15.8 % | 0.03× |
| lightning | tiny | 28.7 % | 16.5 % | 0.04× |
Read: Once noise is added, medium pulls clearly ahead of small (6.3 % vs 11.4–11.8 % WER). tiny and base collapse to 23–29 % WER; they’re not viable in any non-pristine environment.
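To make the SNR levels concrete, the sketch below mixes white noise into a signal at a target SNR using only the standard library. It is not the implementation in bench/corpus/mix_noise.py, which is not shown here. The helper names are hypothetical, a sine tone stands in for speech, and pink noise is left out for brevity.

```python
import math
import random


def rms(samples):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(signal, noise, snr_db):
    """Add noise to signal, scaled so the mix has the requested SNR in dB."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    # SNR_dB = 20 * log10(rms(signal) / rms(noise)), solved for the noise level.
    target_noise_rms = rms(signal) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    return [s + gain * n for s, n in zip(signal, noise)]


def achieved_snr_db(signal, mixed):
    """Measure the SNR of a mix by recovering the noise that was added."""
    residual = [m - s for s, m in zip(signal, mixed)]
    return 20 * math.log10(rms(signal) / rms(residual))


# One second of a 440 Hz tone at 16 kHz stands in for a speech clip.
sample_rate = 16_000
tone = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
rng = random.Random(0)  # fixed seed so the run is repeatable
white = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

for snr in (5, 10, 20):
    mixed = mix_at_snr(tone, white, snr)
    print(f"target {snr} dB, achieved {achieved_snr_db(tone, mixed):.2f} dB")
```

At 5 dB the noise amplitude is more than half the signal amplitude, which helps explain why the smaller models degrade so sharply on this bucket.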
multilingual/ (auto-detect)

| Engine | Model | WER | CER |
|---|---|---|---|
| whisperkit | medium | 16.7 % | 3.7 % |
| lightning | medium | 16.7 % | 3.7 % |
| lightning | small | 29.2 % | 5.0 % |
| lightning | tiny | 23.7 % | 8.1 % |
| lightning | base | 52.8 % | 8.1 % |
Read: WhisperKit medium on auto-detect now matches Lightning — every clip detects its true language (de/es/fr/ja/ko/zh) and transcribes it, where it once mislabelled everything en and translated. zh_001 went 350 % → 0.0 % WER; a 30.8 s Mandarin clip lands at 12.9 % CER (mostly number formatting, 三点 → 3点, not misrecognition). The cause was a config default, not the model: WhisperKit’s DecodingOptions.detectLanguage is false unless you set it, so detection never ran. The other WhisperKit sizes still carry their pre-fix numbers (small 198.6 %, base 284.3 %, tiny 303.3 %, distil-large-v3 231.9 %) — they’ll improve the same way once re-benched; PRs welcome.
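Two things in this read can look odd: WER values far above 100 %, and CER penalizing number formatting. The sketch below illustrates both on a made-up Mandarin sentence. It assumes whitespace word tokenization and plain character comparison; the bench may segment and normalize differently, so the exact figures will not match.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens (an assumption)."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring whitespace."""
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    return edit_distance(ref, hyp) / len(ref)


reference = "三点开会"  # hypothetical clip text, "meeting at three o'clock"

# Translation instead of transcription: the reference is one whitespace token,
# so one substitution plus three insertions gives 4 edits / 1 word.
print(f"translated WER: {wer(reference, 'meeting at three o clock'[:24]):.1%}")

# Correct recognition with a digit instead of a numeral: 1 edit / 4 characters.
print(f"formatting CER: {cer(reference, '3点开会'):.1%}")
```

Because insertions count against a fixed reference length, a translated hypothesis can score several hundred percent WER even though the model heard the audio correctly.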
| Engine | Model | Cold start | Peak RSS |
|---|---|---|---|
| lightning | tiny | 1.92 s | 53 MB |
| lightning | small | 1.10 s | 37 MB |
| lightning | medium | 1.17 s | 27 MB |
| lightning | distil-large-v3 | 1.23 s | 23 MB |
| whisperkit | tiny | 6.06 s | 103 MB |
| whisperkit | small | 22.41 s | 165 MB |
| whisperkit | medium | 27.44 s | 197 MB |
| whisperkit | distil-large-v3 | 23.12 s | 111 MB |
Cold-start values include first-run model download. Warm-cache loads are 1–10 s typically.
| Memory tier | Default model | Why |
|---|---|---|
| 8 GB | whisperkit small | 4.1 % WER on real speech, 11 % on noisy (acceptable), 165 MB peak; leaves 6+ GB for the agent backend |
| 16 GB (this Mac) | whisperkit medium | 2.6 % LibriSpeech WER, 1.3 % multi-voice, 6.3 % noisy. RTF 0.22–0.31× (≥3× realtime), 197 MB. Best balance. |
| 24+ GB | whisperkit medium (same) | No 24 GB host benched yet; revisit when one ships data |
Engine default: WhisperKit. Native Swift, zero Python sidecar, faster on Apple Silicon. Lightning stays as the bench baseline.
Multilingual users: auto-detect handles non-English well. Pinning a language in Settings is optional — worth it only to skip the detection pass when you always dictate in one language.
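The tier table can be expressed as a small lookup. This is only a sketch of the recommendation, not code from the app: the function name is hypothetical, and treating everything below 16 GB as the 8 GB tier is an assumption, since only the 8 GB, 16 GB, and 24+ GB tiers are listed.

```python
# (minimum memory in GB, default WhisperKit model), checked from largest to smallest.
# Assumption: any machine below 16 GB is treated like the 8 GB tier.
TIERS = [
    (16, "medium"),  # 16 GB and up, including the not-yet-benched 24+ GB tier
    (0, "small"),    # everything below 16 GB
]


def default_model(memory_gb: float) -> str:
    """Return the recommended default model for a given amount of memory."""
    for min_gb, model in TIERS:
        if memory_gb >= min_gb:
            return model
    raise ValueError("memory_gb must be non-negative")


for gb in (8, 16, 24):
    print(f"{gb} GB -> whisperkit {default_model(gb)}")
```

Keeping the tiers in a list makes it easy to add a row once a 24 GB host contributes data.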
- large-v3-turbo: neither engine accepts the obvious model name. Discover the right one via WhisperKit.fetchAvailableModels() and rerun. Turbo’s promise is “approaches large-v3 quality at small-ish cost”; if real, it could displace medium as the default.
- Multilingual re-bench: fixed for WhisperKit medium (253 % → 16.7 %). The other sizes (tiny/base/small/distil) still show pre-fix numbers above and should be re-run with detection enabled.
- Additional hosts: contribute runs per bench/CONTRIBUTING.md.

These feed into M2/M3 specs:

- initial_prompt injection from a user dictionary.
- transcribeWithResults supports a callback. M3 spec should expose progressive transcripts to the overlay so longer utterances feel snappier.

From the repo root, on the v2 branch:
```bash
# Synthetic corpus (multi-voice TTS + multilingual sentences).
bash bench/corpus/fetch.sh

# Real human speech (~337 MB one-time download, openslr.org).
N=20 bash bench/corpus/fetch_librispeech.sh

# Noise-augmented variants (white + pink × 5/10/20 dB SNR).
.venv/bin/python bench/corpus/mix_noise.py --source bench/corpus/voices

# Run the matrix.
swift run openquack-bench \
  --engines whisperkit,lightning \
  --models tiny,base,small,medium,distil-large-v3 \
  --corpus bench/corpus \
  --verbose
```
Output: bench/out/<host-tag>/{report.md,report.csv,host.json}. Submit yours via PR per bench/CONTRIBUTING.md.
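Before opening a PR, it can help to confirm that the run produced all three files. The check below is a convenience sketch, not part of the bench tooling: the function name is hypothetical, and the only things taken from this document are the three file names and the bench/out/<host-tag>/ layout.

```python
from pathlib import Path

# File names as listed in the output description above.
EXPECTED = ("report.md", "report.csv", "host.json")


def missing_outputs(out_dir: Path) -> list[str]:
    """Return the expected report files that are absent from out_dir."""
    return [name for name in EXPECTED if not (out_dir / name).is_file()]


# Example host tag from this report; substitute your own.
out_dir = Path("bench/out") / "M4-16GB"
missing = missing_outputs(out_dir)
if missing:
    print(f"{out_dir}: missing {', '.join(missing)}")
else:
    print(f"{out_dir}: all outputs present")
```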