openquack

SPEC-012 — Streaming transcription (perf-only chunking)

Status: draft (M3)
Owner: OpenQuackKit/Streaming/
Last updated: 2026-04-30

Goal

A 90-second voice memo feels the same as a 10-second one — paste latency stays roughly flat after the user releases the hotkey, regardless of utterance length.

We get there by transcribing chunks of the audio while recording is still in flight, so by the time the user stops speaking, only the trailing tail (typically a few seconds, never more than one chunk) still needs to be processed. The final transcript is assembled from the chunked results plus the tail. This is a perf-oriented internal pipeline change. The user never sees partial transcripts: the overlay still shows recording → transcribing → done. The only difference is that transcribing resolves in roughly constant time instead of growing linearly with audio length.

Why this matters

WhisperKit’s runtime is bounded by defaultWindowSamples = 480_000 (30 s @ 16 kHz, see Models.swift:1543). For audio longer than one window, the offline transcribe(audioArray:) path already chunks internally (see “Primary-source notes”), but only after recording ends — every second of audio adds proportional wall time to the post-stop wait. On a baseline M4/16GB at WhisperKit-medium ~0.22× RTF, a 2-minute dictation costs ~26 s of post-stop wait. That breaks the “send without re-reading” quality bar from VISION.md: the user has already moved on.

Running the chunk transcriptions during recording reduces the post-stop wait to “tail-chunk transcribe + assembly”, which is bounded by the chunk size and independent of total length.
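To make the linear growth of the offline path concrete, here is a minimal arithmetic sketch. The function name is illustrative only, and the 0.22× RTF is the baseline figure quoted above, not a new measurement.

```swift
/// Post-stop wait on the offline path: all audio is still untranscribed
/// when recording stops, so the wait scales with recording length.
/// (Illustrative helper, not part of OpenQuackKit.)
func offlinePostStopWait(audioSeconds: Double, rtf: Double) -> Double {
    return audioSeconds * rtf
}

let baselineRTF = 0.22  // WhisperKit-medium on the baseline M4/16GB, per this spec

for seconds in [10.0, 90.0, 120.0] {
    let wait = offlinePostStopWait(audioSeconds: seconds, rtf: baselineRTF)
    print("\(Int(seconds)) s of audio: ~\(Int(wait.rounded())) s post-stop wait")
}
```

At this RTF the three cases come out at roughly 2 s, 20 s and 26 s, which is the gap between a 10-second and a 2-minute memo that the streamed path is meant to close.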

Non-goals

Primary-source notes (read before designing)

These are the WhisperKit surfaces this spec builds on. Conclusions in this spec must trace back here, not to memory.

Implication: the right shape is a new StreamingTranscriber actor in OpenQuackKit/Streaming/ that consumes float frames from a frames callback on AudioRecorder, owns a sliding [Float] buffer plus an EnergyVAD, dispatches sealed chunks to the existing WhisperKit pipe, and assembles results in order.

Public surface (sketch)

public actor StreamingTranscriber {
    /// Tunables. Defaults assume WhisperKit-medium on M-series 16 GB.
    public struct Config: Sendable {
        /// Minimum audio duration before streaming kicks in. Below this,
        /// the engine just transcribes the full buffer at stop and skips
        /// chunking entirely (offline path is faster end-to-end for short
        /// utterances).
        public var streamingThreshold: TimeInterval = 30
        /// Target chunk length. Cuts happen at the next silence past
        /// this mark, never before.
        public var targetChunkSeconds: TimeInterval = 20
        /// Maximum chunk length — if no silence found by this point,
        /// force-cut anyway (rare on natural speech, common on monologues).
        public var maxChunkSeconds: TimeInterval = 28
        /// VAD energy threshold for silence detection. EnergyVAD default.
        public var silenceEnergyThreshold: Float = 0.02
        /// Cap on in-flight chunk transcribes. Bounds memory/CPU.
        public var maxInFlightChunks: Int = 1
    }

    public init(engine: WhisperKitEngine, config: Config = .init())

    /// Begin a streaming session. Resets internal state. Caller wires
    /// `appendFrames` to AudioRecorder's frames callback (see SPEC-001
    /// extension below).
    public func begin(language: String?, customWords: String?) async

    /// Feed PCM frames captured at the recorder's native rate. The
    /// transcriber resamples internally to 16 kHz (mirrors what
    /// `WhisperKitEngine.transcribe(audioFile:)` does today via
    /// WhisperKit's own AudioProcessor).
    public func appendFrames(_ samples: [Float], sampleRate: Double)

    /// Block until pending chunks finish, transcribe the trailing tail,
    /// and return the assembled transcript. Equivalent in shape to
    /// `WhisperKitEngine.transcribe`'s return value.
    public func finish() async throws -> EngineTranscription

    /// Discard everything; cancel in-flight chunk transcribes. Safe at
    /// any time. After cancel, `begin` is required before reuse.
    public func cancel() async
}

And a minimal extension to SPEC-001 — additive, no behavioural change to existing callers:

extension AudioRecorder {
    /// Emitted on the audio thread for every captured tap buffer (~10–20 ms
    /// at typical input rates). Set before `start()`. Called with raw
    /// float32 samples in the input device's native rate; consumers must
    /// resample if they need 16 kHz.
    public var framesHandler: (([Float], Double) -> Void)? { get set }
}

framesHandler is opt-in; existing dictation-only callers leave it nil and pay nothing. The streaming app wires it to StreamingTranscriber.appendFrames.

Behaviour

Chunk boundary strategy

Default: silence-aware sliding window. As frames arrive, the buffer grows. Once buffer.count ≥ targetChunkSeconds * 16_000, run EnergyVAD.voiceActivity(in:) over the buffered audio between the targetChunkSeconds and maxChunkSeconds marks. If findLongestSilence(in:) returns a hit, cut at that silence’s midpoint (matching VADAudioChunker.splitOnMiddleOfLongestSilence’s behaviour). If no silence has been found by the time the buffer reaches maxChunkSeconds, force-cut at maxChunkSeconds. The window-internal seek logic in WhisperKit handles non-silent boundaries, at a slightly higher hallucination risk on the boundary word.

The cut produces a sealed Chunk { startSecondsInUtterance, samples: [Float] } which is appended to the in-flight queue.

Why VAD-midpoint and not fixed window: same reason WhisperKit’s offline chunker does it — silence-cuts almost never split words, and when paired with the VAD-aware decode they don’t suppress boundary content. Fixed-window cuts at e.g. 20.0 s have measurable WER cost on boundary words.
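The cut rule can be sketched without any WhisperKit types. The helper below is hypothetical: it uses a plain per-frame RMS test as a stand-in for EnergyVAD, and the 100 ms frame size and the RMS interpretation of silenceEnergyThreshold are assumptions rather than facts about WhisperKit.

```swift
/// Hypothetical stand-in for the EnergyVAD-based cut; not a WhisperKit API.
/// Returns nil while the buffer should keep growing, the midpoint of the
/// longest silent run between the target and max marks if there is one,
/// or the max mark as a forced cut once the buffer has reached it.
func chooseCutIndex(
    in samples: [Float],
    sampleRate: Int = 16_000,
    targetChunkSeconds: Double = 20,
    maxChunkSeconds: Double = 28,
    silenceEnergyThreshold: Float = 0.02,
    frameSeconds: Double = 0.1  // assumed analysis frame length
) -> Int? {
    let targetIndex = Int((targetChunkSeconds * Double(sampleRate)).rounded())
    let maxIndex = Int((maxChunkSeconds * Double(sampleRate)).rounded())
    guard samples.count >= targetIndex else { return nil }  // never cut before target

    let searchEnd = min(samples.count, maxIndex)
    let frameLength = max(1, Int((frameSeconds * Double(sampleRate)).rounded()))

    var best: (start: Int, end: Int)? = nil
    var runStart: Int? = nil
    var frameStart = targetIndex
    while frameStart + frameLength <= searchEnd {
        let frame = samples[frameStart..<(frameStart + frameLength)]
        let meanSquare = frame.reduce(Float(0)) { $0 + $1 * $1 } / Float(frameLength)
        if meanSquare.squareRoot() < silenceEnergyThreshold {
            // Extend the current silent run, or start a new one.
            let start = runStart ?? frameStart
            runStart = start
            let end = frameStart + frameLength
            if end - start > (best.map { $0.end - $0.start } ?? 0) {
                best = (start: start, end: end)
            }
        } else {
            runStart = nil
        }
        frameStart += frameLength
    }

    if let best = best { return (best.start + best.end) / 2 }  // silence midpoint
    return samples.count >= maxIndex ? maxIndex : nil           // forced cut
}
```

Samples before the returned index would form the sealed chunk and the remainder would stay in the buffer as the start of the next one.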

In-flight queue

A serial Task consumes the queue one chunk at a time. With maxInFlightChunks = 1, the design assumes the model is faster than realtime (RTF ≤ ~0.5×) for the chosen default model. With a baseline M4/16GB at WhisperKit-medium ~0.22× RTF, a 20-second chunk takes ~4.4 s wall time, so it finishes well before the next chunk arrives and the queue typically stays empty between cuts.

The cap is conservative: bumping it to 2 gives some headroom against RTF spikes (background load, thermal throttle) at the cost of double peak RAM. Treated as a per-machine knob; default 1 stays safe.
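The headroom argument reduces to one comparison, sketched below. Both helpers are illustrative, and the check assumes cuts arrive roughly every secondsBetweenCuts of recording, which simplifies the silence-dependent timing described above.

```swift
/// Wall time to transcribe one chunk at a given real-time factor.
func chunkWallSeconds(chunkSeconds: Double, rtf: Double) -> Double {
    return chunkSeconds * rtf
}

/// True when a chunk finishes transcribing before the next cut is expected,
/// so a serial queue with maxInFlightChunks = 1 is idle again by then.
/// Assumption: cuts arrive roughly every `secondsBetweenCuts` of recording.
func queueDrainsBetweenCuts(
    chunkSeconds: Double,
    secondsBetweenCuts: Double,
    rtf: Double
) -> Bool {
    return chunkWallSeconds(chunkSeconds: chunkSeconds, rtf: rtf) < secondsBetweenCuts
}

// Baseline from this spec: 20 s chunks at 0.22x RTF.
print(chunkWallSeconds(chunkSeconds: 20, rtf: 0.22))
print(queueDrainsBetweenCuts(chunkSeconds: 20, secondsBetweenCuts: 20, rtf: 0.22))
```

An RTF spike above 1.0 flips the check to false, which is the situation where raising maxInFlightChunks would help.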

Stop semantics (the load-bearing case)

When finish() is called:

  1. The current open buffer is the tail: everything after the last cut. Its length is at most maxChunkSeconds; in practice it is bounded by user behaviour (typically 5–15 s).
  2. Await any chunk that is still transcribing.
  3. Submit the tail as the final chunk and await it.
  4. Concatenate all chunk transcripts in order, trimming any single-token overlaps where the segment seeker re-emitted boundary tokens (rare but observed; TranscriptionUtilities already exposes the dedup helper used by the offline chunker).
  5. Compute final EngineTranscription (text, detectedLanguage from the first chunk, audioSeconds from total appended, wallSeconds from begin() to now, ttft as the timeToFirstToken of chunk 1).
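Step 4 can be illustrated with a small stand-in for the dedup helper. This sketch does not use TranscriptionUtilities; it treats a whitespace-separated word as the "token" and compares words case-insensitively with punctuation stripped, both of which are simplifying assumptions.

```swift
import Foundation

/// Lowercases a word and strips surrounding punctuation for comparison.
func normalizedWord(_ word: String) -> String {
    return word.lowercased().trimmingCharacters(in: .punctuationCharacters)
}

/// Hypothetical assembly step: joins chunk transcripts in order and drops
/// the first word of a chunk when it repeats the last word already emitted.
func assembleTranscript(_ chunkTexts: [String]) -> String {
    var words: [String] = []
    for text in chunkTexts {
        var chunkWords = text.split(separator: " ").map(String.init)
        if let last = words.last, let first = chunkWords.first,
           normalizedWord(last) == normalizedWord(first) {
            chunkWords.removeFirst()  // boundary word re-emitted by the next chunk
        }
        words.append(contentsOf: chunkWords)
    }
    return words.joined(separator: " ")
}

print(assembleTranscript(["we ship on Friday", "Friday morning at nine"]))
```

A rule this simple would also collapse a genuinely repeated word that happens to straddle a cut, so the real helper presumably needs timing or token-level information to tell the two cases apart.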

Post-stop wait under the design = tailTranscribeTime + maybeOneInFlightChunkRemaining + assemblyTime. With the tuning above, the two transcribe terms are each bounded by maxChunkSeconds * RTF, so the total is independent of utterance length.
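As a worked instance of that bound, the sketch below plugs in the default maxChunkSeconds and the baseline RTF. The function is illustrative, and assemblySeconds defaults to zero only because the spec gives no figure for it.

```swift
/// Worst-case post-stop wait: a full-length chunk still in flight plus a
/// full-length tail, each costing maxChunkSeconds * rtf, plus assembly.
/// Note that total utterance length is not an input.
func worstCasePostStopWait(
    maxChunkSeconds: Double,
    rtf: Double,
    assemblySeconds: Double = 0  // assumed; not quantified in this spec
) -> Double {
    let inFlightRemaining = maxChunkSeconds * rtf
    let tailTranscribe = maxChunkSeconds * rtf
    return inFlightRemaining + tailTranscribe + assemblySeconds
}

print(worstCasePostStopWait(maxChunkSeconds: 28, rtf: 0.22))
```

With the defaults this is about 12.3 s in the worst case. With an empty queue and a typical 5–15 s tail, the wait is closer to 1–3 s at the same RTF.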

Cancel mid-stream

cancel() must guarantee no leaks and no UI inconsistency:

  1. Stop accepting new frames (appendFrames becomes a no-op until begin).
  2. Cancel the in-flight chunk’s Task. WhisperKit.transcribe honours Swift task cancellation cooperatively; if it doesn’t drop on the next decoding step, treat the result as discarded on completion.
  3. Release the buffer.
  4. Caller (AppDelegate.cancelRecording) is responsible for showing the cancelled overlay state — no behaviour change vs. today.

The AudioRecorder buffer / WAV file are owned by AudioRecorder and follow its existing cancel path (delete WAV on cancel, see SPEC-001). StreamingTranscriber does not touch the WAV; the streaming pipeline operates on the in-memory float copy fed via framesHandler.
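One way to implement "treat the result as discarded on completion" is a session generation counter. The type below is a hypothetical sketch of that idea, not part of the proposed public surface. Inside the actor it would be plain stored state.

```swift
/// Hypothetical session gate: tracks whether frames are accepted and which
/// session a finished chunk transcribe belongs to.
struct SessionGate {
    private(set) var generation = 0
    private(set) var isAccepting = false

    /// Starts a new session and returns its generation token.
    @discardableResult
    mutating func begin() -> Int {
        generation += 1
        isAccepting = true
        return generation
    }

    /// Ends the session; frames and late results are ignored until `begin`.
    mutating func cancel() {
        generation += 1
        isAccepting = false
    }

    /// appendFrames would return early when this is false.
    func shouldAcceptFrames() -> Bool {
        return isAccepting
    }

    /// A chunk result is kept only if it was started in the live session.
    func shouldKeepResult(startedIn token: Int) -> Bool {
        return isAccepting && token == generation
    }
}
```

Each chunk transcribe would capture the token when it is dispatched and check shouldKeepResult when it completes, so a transcribe that ignores cancellation still cannot leak text into a later session.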

Error mid-stream

If a chunk transcribe throws:

  1. Continue, don’t fail the session. Log the chunk index, store an empty placeholder transcript for the failed chunk, keep accepting frames.
  2. On finish(), if any chunk failed, still return the assembled transcript over the surviving chunks. Surface a flag in EngineTranscription (extend with chunkFailures: Int) so the caller can render an error overlay banner after paste — losing a 20-s window mid-utterance is bad, but losing the whole 2-minute transcript because of it is worse.
  3. If every chunk fails (model crash, out-of-memory), fall through to the existing offline path: WhisperKitEngine.transcribe(audioFile:) on the WAV the recorder still has on disk. This is the “graceful degradation” branch — slower, but the user gets something. The overlay shows the normal transcribing state for the duration.
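The three rules above can be condensed into one decision function. The enum and function names below are hypothetical; only chunkFailures comes from this spec.

```swift
/// Outcome of one chunk transcribe (hypothetical type).
enum ChunkOutcome {
    case text(String)
    case failed
}

/// What finish() would do with the collected outcomes (hypothetical type).
enum SessionResult: Equatable {
    case assembled(text: String, chunkFailures: Int)
    case fallBackToOffline
}

/// Keeps surviving chunks in order, counts failures, and falls back to the
/// offline path only when every chunk failed.
func resolve(_ outcomes: [ChunkOutcome]) -> SessionResult {
    var pieces: [String] = []
    var failures = 0
    for outcome in outcomes {
        switch outcome {
        case .text(let text):
            pieces.append(text)
        case .failed:
            failures += 1  // empty placeholder: contributes no text
        }
    }
    if !outcomes.isEmpty && failures == outcomes.count {
        return .fallBackToOffline
    }
    return .assembled(text: pieces.joined(separator: " "), chunkFailures: failures)
}
```

A caller would render the error banner when chunkFailures is greater than zero, and would run WhisperKitEngine.transcribe(audioFile:) on the WAV when the result is fallBackToOffline.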

Polish interaction (SPEC-007)

Polish runs once, on the assembled transcript, after finish() returns. Never per-chunk. Three reasons:

  1. Chunk boundaries cut sentences mid-clause; per-chunk polish would reorganise fragments and produce stitched-together garbage.
  2. Polish prompts ask for filler removal, false-start cleanup, bullet-isation — all whole-utterance decisions. The LLM can only do them with full context.
  3. Polish is the cheapest stage when the LLM is hot (cache keep_alive: -1, see SPEC-007); running it once on the final text is comparable in wall time to one chunk transcribe.

The pipeline becomes:

mic frames ─┬─► AudioRecorder.write(WAV)        (SPEC-001, unchanged)
            └─► StreamingTranscriber             (this spec)
                  ├── EnergyVAD silence cuts
                  ├── chunk[i] ─► WhisperKit.transcribe(audioArray:)
                  └── on finish() ─► assemble ─► raw transcript
                                                   │
                                                   ▼
                                              TextPolisher (SPEC-007, batch)
                                                   │
                                                   ▼
                                              PasteService (SPEC-005)

For utterances below streamingThreshold, StreamingTranscriber is bypassed entirely: AppDelegate calls the existing WhisperKitEngine.transcribe(audioFile:) on the finalised WAV, same as today. The branch decision is made at stop time, based on AudioRecorder.elapsedSeconds.

Pipeline integration

In AppDelegate.startRecording:

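let recorder = AudioRecorder()
// Reset the streamer before the first frame can arrive; `begin` clears state.
await streamer.begin(language: settings.language, customWords: settings.customWords)
recorder.framesHandler = { [streamer] samples, rate in
    // Hop off the audio thread; the actor serialises buffer access.
    // Caveat: unstructured Tasks carry no ordering guarantee between frames.
    Task { await streamer.appendFrames(samples, sampleRate: rate) }
}
try recorder.start()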

In AppDelegate.stopAndTranscribe:

let elapsed = recorder.elapsedSeconds
let url = recorder.stop()

let raw: EngineTranscription
if elapsed >= streamer.config.streamingThreshold {
    raw = try await streamer.finish()                    // streamed path
} else {
    await streamer.cancel()                              // discard accumulated frames
    raw = try await engine.transcribe(audioFile: url!,   // offline path
                                       language: settings.language,
                                       customWords: settings.customWords)
}

let polished = try await polisher.polish(raw.text, ...)
try await PasteService.paste(polished)

AppDelegate keeps a long-lived StreamingTranscriber between sessions to avoid allocating the buffer on every hotkey press. The cost is ~1.4 MB per minute of streaming-buffer headroom; bounded by cancel() between sessions.

Quality gates

Bench-able. Add to openquack-bench:

BENCHMARKS.md is the source of truth; the M3 default may shift the chunk-size tuning if a smaller/faster default model lands.

Open questions

References