Status: draft
Milestone: M1 (bench) / M2 (fix)
Reported: issue #17 — “Mandarin being translated as [SPEAKING CHINESE]”
When a user speaks Mandarin without setting a language preference, WhisperKit can produce one of three failure modes instead of a correct transcript:
[SPEAKING CHINESE] or [FOREIGN LANGUAGE]. The model recognised there was
non-English speech but generated a training-data annotation rather than the
transcript. This is the exact failure in issue #17.task = .transcribe (not .translate), typically when its language-detect
confidence is low and the decoder falls back to an English beam. The user gets
a rough English rendering of what they said, not their words in Chinese.The existing benchmark already demonstrates that auto-detect on short multilingual
clips is catastrophic (>200% WER, per docs/BENCHMARKS.md). But WER alone cannot
distinguish the three failure modes above — mode 1 may score similarly to mode 3
even though the user experience is completely different. We need categorical metrics
and a fix.
task = .transcribe is already set; this is not a translate-task bug.[SPEAKING CHINESE]-style outputs are training-data annotations that were not
filtered from Whisper’s pretraining corpus. The model learned to emit them when
it detects non-English audio without a language token.DecodingOptions.supressTokens is currently [] in our engine. WhisperKit
exposes this field; injecting the token IDs for [SPEAKING, [FOREIGN, etc.
would suppress that failure mode. This is a primary candidate fix.Verification before any fix lands: look up token IDs for the bracket strings and confirm they are in the Whisper vocabulary. If they are not word-piece tokens (i.e. they decompose into many IDs), suppression via this field is impractical and the fix moves to the output-text post-processing layer instead.
Add to ClipMetrics and Report:
| Metric | Type | Definition |
|---|---|---|
failureMode |
enum | .ok, .placeholder, .silentTranslation, .garbled |
outputScriptMatch |
Bool | true if dominant Unicode block of output matches reference |
hallucRateRaw |
Double | fraction of output chars matching \[.*?\] patterns |
Failure mode classification rules (applied per clip, in order):
\[(SPEAKING|FOREIGN|INAUDIBLE|MUSIC|NOISE|APPLAUSE)[^\]]*\] →
.placeholder.silentTranslation.garbled.okExtend bench/corpus/fetch.sh to generate 6 additional Mandarin clips using
macOS say -v Tingting:
| ID | Content | Why |
|---|---|---|
zh_003 |
我需要在明天下午三点开会。 | short, time/number context |
zh_004 |
请帮我发一封邮件给张总。 | instruction sentence |
zh_005 |
这个季度的销售额增长了百分之十五。 | longer, numeral density |
zh_006 |
苹果公司发布了新款 iPhone。 | proper noun (Apple, iPhone) |
zh_007 |
今天北京天气晴,最高气温三十度。 | city name + numeral |
zh_008 |
人工智能正在改变我们的工作方式。 | abstract / AI topic |
8 clips total (zh_001–zh_008) gives a statistically meaningful sample for a single-language bucket.
Failure mode breakdown column to the per-bucket aggregate table.--verbose mode.failure_mode,output_script_match,halluc_rate_raw columns.After landing the metrics + expanded corpus, rerun the bench with and without
--language zh on the new zh bucket and record the delta. This gives us:
Fixes are explored in a worktree. Land only the one(s) that show measurable improvement on the new categorical metrics without regressing English WER.
F1 — Token ID suppression of bracket hallucinations (one-line in WhisperKitEngine)
Resolve token IDs for [SPEAKING, [FOREIGN, [INAUDIBLE, [MUSIC] via
pipe.tokenizer?.convertTokenToId(...) at engine-init time. Add resolved IDs to
options.supressTokens. Verify token IDs exist; if not, fall through to F3.
Expected impact: eliminates failure mode 1 (placeholder) cleanly. No effect on modes 2 or 3.
F2 — Script-match retry with language hint
After a transcription returns a .silentTranslation or .garbled result, retry
the same audio with options.language = <auto-detected lang>. Detect the audio
language via a first-pass no-decode language-only run. WhisperKit supports this
through detectLanguage(audioArray:).
Expected impact: fixes mode 2 and 3 for short clips where the language-detection first pass is more reliable than the full decode. Adds latency on failure paths only (~50 ms). Does not help mode 1 if bracket tokens already consumed the decode budget.
F3 — Output-text post-processing: strip bracket hallucinations
Regex-strip \[(SPEAKING|FOREIGN|INAUDIBLE|MUSIC|NOISE|APPLAUSE)[^\]]*\] from the
final text. Lives in WhisperKitEngine.transcribe(...), after join/trim.
This is a safety net, not a primary fix — it doesn’t address modes 2 or 3, and it masks rather than fixes. Land only if F1 is impractical (tokens not in vocabulary).
F4 — Prompt the user for language before first non-English clip
This is already tracked as a roadmap item (“user language preference in Settings → General”). It is the complete fix for all three modes. SPEC-021 should not attempt to implement the full Settings UI — that belongs in M2/M2.5. If F1+F2 close the issue adequately, F4 remains a UX improvement for later.
failureMode == .ok rate on bench/corpus/multilingual/ rises from current
baseline to ≥75% across zh clips, without a language hint.bench/corpus/short and voices does not regress by more than
0.5 pp.task from .transcribe to anything else.| PR | Title | SPEC cite | CI gate |
|---|---|---|---|
| PR-A | bench: categorical failure-mode metrics + zh corpus expansion |
SPEC-021 | swift build + swift test |
| PR-B | fix(whisperkit): suppress bracket hallucinations + script-match retry |
SPEC-021 | swift build + swift test + bench delta |
PR-A must merge before PR-B (B depends on the new metrics to report its delta).