Status: ratified — implemented 2026-06-02 (shipping in v2.0.0-alpha.18)
Owner: Sources/OpenQuackKit/Polish/ChineseScript.swift,
OpenQuackPlatform/LanguageDecodePolicy.swift,
OpenQuackStreaming/StreamingTranscriber.swift, + OpenQuackApp
Last updated: 2026-06-02
Whisper’s zh model emits a mix of Traditional and Simplified hanzi —
often skewing Traditional even for a mainland speaker, and the script can
flip between streaming chunks of one utterance. Today the only fix is a
manual Settings picker (ChineseScript = auto / simplified / traditional)
that defaults to auto (no-op), so most users see inconsistent script and
never discover the control.
This spec makes script normalisation automatic and configuration-free: when the transcription language is Chinese, the output is normalised to the script implied by the user’s macOS language preferences (Traditional for a Traditional-Chinese system, Simplified otherwise — including non-Chinese systems). No picker; it just happens.
zh. Pinning a language in Settings
still works: the rule keys off the resolved language, whether pinned or
detected.

Convert the characters, don’t touch the transcription. The conversion
is an ICU character-level Hant↔Hans transform, applied to whatever Whisper
already emitted — we never alter the decode path to force a language. The
transform is a no-op on every non-Han character (Latin, digits, punctuation),
so it is safe to run on any output except Japanese (ja) and Korean
(ko): their kanji/hanja share code points with hanzi and would be corrupted.
The normaliser therefore skips only ja/ko; zh, en, nil, and mixed
output all pass through the transform, with non-Chinese characters untouched.
This is what handles mixed language: an utterance transcribed as en
that carries embedded Chinese (e.g. “I use 軟體 daily”) still gets its hanzi
normalised to the system script, while the English is left exactly as-is.
Verified (2026-05-31, medium, offline auto-detect, M4-16GB). An
English-dominant ~8 s clip with an embedded Mandarin phrase transcribed as
language=en with the Chinese emitted as hanzi inline (not translated):
…It goes like this. 你好世界,今天天气很好。 Isn't that…. Feeding that exact
string through the converter for a Traditional system flips 天气→天氣
while the English stays byte-for-byte identical; a Simplified system leaves
it. The old == "zh" gate skipped this en-labelled output entirely. (The
decoder does not always translate embedded Chinese away — so the
mixed-language win is real on the offline path. The streaming path needs the
companion change in Streaming below to decode the switch as Chinese in the
first place.)
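The skip rule and the character-level conversion described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `TargetScript` and `normalizeSketch` are hypothetical names, the target script is passed in directly rather than resolved from system preferences, and the sketch assumes Foundation accepts the ICU transform IDs `Hans-Hant` and `Hant-Hans` on the platform where it runs.

```swift
import Foundation

/// Hypothetical stand-in for the target script; not the project's real type.
enum TargetScript {
    case simplified
    case traditional
}

/// Sketch: converts Han characters in `text` to `target`, unless the resolved
/// language is Japanese or Korean, whose kanji/hanja must not be rewritten.
func normalizeSketch(_ text: String, language: String?, target: TargetScript) -> String {
    // ja/ko share code points with hanzi, so they are returned untouched.
    if let language = language, ["ja", "ko"].contains(language) {
        return text
    }
    // ICU transform IDs (assumed to be available through Foundation).
    let id = (target == .traditional) ? "Hans-Hant" : "Hant-Hans"
    // applyingTransform returns nil when the transform cannot be applied;
    // fall back to the original text in that case.
    return text.applyingTransform(StringTransform(rawValue: id), reverse: false) ?? text
}
```

Because the language check only excludes `ja` and `ko`, a call with `language: "en"` or `language: nil` still converts any embedded hanzi, which is the mixed-language behaviour the section describes.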
Locale.preferredLanguages, scanning for the first Chinese entry:
- Hant script subtag, or region ∈ {TW, HK, MO} → Traditional
- Hans subtag, region ∈ {CN, SG}, or bare zh → Simplified

The region table is the source of truth: Locale.Language(identifier:)
does not reliably populate .script on region-only forms like zh-HK
without likely-subtag maximisation, so we never depend on it.
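A rough sketch of that resolution, parsing the language tags as plain strings so it never relies on `Locale.Language` populating `.script`. `ScriptSketch` and `resolveScriptSketch` are hypothetical names, and letting an explicit script subtag win over the region (for a tag such as `zh-Hans-HK`) is an assumption the spec does not state.

```swift
/// Hypothetical result type; not the project's real ChineseScript.
enum ScriptSketch {
    case simplified
    case traditional
}

/// Sketch: the first Chinese entry in the preference list decides the script.
/// A list with no Chinese entry falls through to Simplified.
func resolveScriptSketch(preferredLanguages: [String]) -> ScriptSketch {
    let traditionalRegions: Set<String> = ["TW", "HK", "MO"]
    for tag in preferredLanguages {
        // Split "zh-Hant-TW" (or "zh_Hant_TW") into its subtags.
        let parts = tag.split(whereSeparator: { $0 == "-" || $0 == "_" }).map(String.init)
        guard let language = parts.first, language.lowercased() == "zh" else {
            continue // not a Chinese entry, keep scanning
        }
        let rest = parts.dropFirst()
        // Assumption: an explicit script subtag takes precedence over the region.
        if rest.contains(where: { $0.lowercased() == "hant" }) { return .traditional }
        if rest.contains(where: { $0.lowercased() == "hans" }) { return .simplified }
        // Region table: TW, HK, MO imply Traditional.
        if rest.contains(where: { traditionalRegions.contains($0.uppercased()) }) {
            return .traditional
        }
        // CN, SG, bare zh, or any other region: Simplified.
        return .simplified
    }
    return .simplified
}
```

In an app this would presumably be called with `Locale.preferredLanguages`; taking the list as a parameter keeps the function deterministic and testable on any machine.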
- 軟體
  becomes 软体, not the idiomatic mainland 软件. Word-level (OpenCC) is a
  future upgrade; out of scope here (no C++ dep).
- zh; no separate handling.
- ja/ko
  label (e.g. a Japanese name in an utterance detected en) would be converted.
  Accepted: that requires both a detection miss and embedded CJK, and the
  alternative (never convert outside zh) loses the mixed-language win.

The streaming path (≥30 s) used to detect the language on the first chunk and lock it for the whole recording (SPEC-021). That kept a monolingual utterance from flipping language mid-stream, but it also meant that an utterance that genuinely switched language (starting in English, finishing in Chinese) was forced to the first language: the Chinese tail came back as an English translation, with no hanzi to normalise.
Per-chunk re-detection is the fix, but the naive version (re-detect every chunk)
re-opens the failure the lock was added for: WhisperKit returns the "en"
fallback — not nil — when detection is weak, so a short, low-content chunk can
silently flip a monolingual recording’s tail to the wrong language.
Middle ground (keyed off chunk duration). The cutter never emits an interior
chunk shorter than targetChunkSeconds (20 s), which is plenty for reliable
language ID — so full chunks re-detect (that’s what lets the language switch
at a boundary), and only the trailing partial chunk, if it is below
minDetectSeconds (default 10 s), inherits the running language instead of
risking a misdetect. The first chunk has nothing to inherit, so a rare short
first chunk still detects. The whole decision is the pure, unit-tested
LanguageDecodePolicy.decideStreamingChunk; a pinned language skips detection
entirely as before.
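The decision rule above is small enough to sketch as a pure function. The names and signature below (`ChunkDecisionSketch`, `decideStreamingChunkSketch`, and its parameters) are hypothetical; the real `LanguageDecodePolicy.decideStreamingChunk` may be shaped differently. The sketch also assumes an empty pinned string means auto-detect.

```swift
/// Hypothetical outcome of the per-chunk decision.
enum ChunkDecisionSketch: Equatable {
    case force(String)    // a pinned language is used as-is
    case inherit(String)  // short chunk: reuse the running language
    case detect           // run language detection on this chunk
}

/// Sketch of the duration-keyed middle ground.
func decideStreamingChunkSketch(
    pinned: String?,
    running: String?,
    chunkSeconds: Double,
    minDetectSeconds: Double = 10
) -> ChunkDecisionSketch {
    // A pinned language always wins and skips detection entirely.
    // Assumption: "" (the auto-detect setting) counts as not pinned.
    if let pinned = pinned, !pinned.isEmpty {
        return .force(pinned)
    }
    // A short chunk inherits the running language when there is one.
    if chunkSeconds < minDetectSeconds, let running = running {
        return .inherit(running)
    }
    // Full chunks re-detect; so does a short chunk with nothing to inherit.
    return .detect
}
```

Keeping the rule free of I/O and model state is what lets it be exercised directly in unit tests, for example by checking that a 4 s trailing chunk with a running `"zh"` yields `.inherit("zh")`.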
Bench (2026-06-02, medium, streaming auto, smoke, M4-16GB). A/B via
openquack-stream-bench --min-detect 10 (new) vs 9999 (reproduces the lock),
same clips, same process:
| Clip | Old (lock) | New (per-chunk) |
|---|---|---|
| en_long 49 s monolingual EN | WER 3.7% | WER 3.7%, byte-identical output |
| zh_long 37 s monolingual ZH | hanzi | hanzi, byte-identical output |
| codeswitch 53 s EN→ZH | Chinese tail translated to English (no hanzi) | Chinese tail decoded as hanzi (…最近我还在尝试用它来朗读和整理我的读书笔记…) |
No regression on either monolingual case; the code-switch tail goes from an English mistranslation to correct Chinese, which the normaliser above then puts in the system script. (Smoke mode has no real-time pacing, so its post-stop wait is not a latency figure — SPEC-012 owns that.)
Known limitation. A switch confined to the final < minDetectSeconds (10 s)
of audio lands in the short trailing chunk, which inherits rather than detects,
so that brief tail is missed. Catching it would mean trusting detection on very
short chunks — the exact regression the lock guarded against — so the tradeoff is
deliberate. A switch that occupies a full chunk (the normal case for a real
language change) is caught.
For the feature to apply out of the box, the default transcription language
moves from pinned-en to auto-detect (""). This is really SPEC-021’s
domain (the detection mechanism), folded in here for one coherent change.
The legacy Settings caption (“Auto-detect can be unreliable on short
utterances…”) predates the SPEC-021 alpha.17 fix (caption 2026-04-27; fix
2026-05-31) and is stale. Gate the default flip on a bench: run
openquack-bench --models tiny --corpus bench/corpus/short with
--language en vs auto and compare EN WER. SPEC-021 validated the
non-English failure modes, not EN short-clip misdetection specifically, so
this is the gap worth measuring.
en default; normalisation still
applies whenever the user is on auto-detect or pins zh. Report the delta.

Bench result (2026-05-31, medium on bench/corpus/short, M4-16GB):
auto-detect is byte-for-byte identical to pinned-en — WER
{0, 0, 0, 0, 0.067} on both runs, and auto-detect labels every clip en.
No EN regression → default flipped to auto-detect ("") and the stale
caption replaced. (medium is the app’s default model, so this is the
representative comparison; tiny was not cached.)
- ChineseScript.resolve(preferredLanguages:)
  maps the full matrix deterministically (machine-independent):
  zh-Hant, zh-Hant-TW, zh-TW, zh-HK, zh-MO → .traditional;
  zh-Hans, zh-CN, zh-SG, bare zh, en-US, [] → .simplified;
  ["en-US","zh-Hant-TW"] → .traditional (first Chinese entry wins).
- normalize(_:language:preferredLanguages:):
  ja / ko return the input byte-for-byte unchanged even when it contains
  Han characters (kanji/hanja protection); zh, en, nil, and mixed
  Latin+Chinese strings convert the Chinese characters to the resolved script
  while leaving non-Han characters untouched.
- chineseScript UserDefault and its Settings
  picker are removed; swift build && swift test green.
- LanguageDecodePolicy.decideStreamingChunk:
  a full chunk (≥ minDetectSeconds) re-detects even when a language is already
  running; a short chunk inherits the running language; a short chunk with
  nothing to inherit detects; a pinned language always forces. The offline
  decide(pinned:locked:) path is unchanged (its tests stay green).
- --min-detect 10 vs 9999 on a monolingual EN
  clip, a monolingual ZH clip, and an EN→ZH code-switch clip: no monolingual
  regression, code-switch tail decodes as hanzi (see Streaming above).
- bench/corpus/short, en vs
  auto, documented in the PR before the flip lands.

- Sources/OpenQuackKit/Transcription/TranscriptionEngine.swift:
  EngineTranscription.detectedLanguage (the resolved-language signal).
- StringTransform Hant-Hans / Hans-Hant.