Parakeet (NVIDIA, CC-BY-4.0 FastConformer ASR family) ported to
ggml. Pure C++/ggml inference on CPU
and GPU (Metal / CUDA / Vulkan), with no runtime dependency on Python,
PyTorch, or onnxruntime. Ships CTC, TDT, EOU, and Sortformer engines
under one Engine umbrella; EOU (FastConformer-RNN-T 120M with native
<EOU> end-of-utterance token) is the most recently shipped engine.
Supported checkpoints:

| HF repo | Decoder | Mel | d_model × n_layers | Vocab | Params | GGUF size | RTF (Metal) | Languages |
|---|---|---|---|---|---|---|---|---|
| nvidia/parakeet-ctc-0.6b | CTC | 80 | 1024 × 24 | 1024 | 600 M | 697 MiB q8_0 / 1.3 GiB f16 | 0.014-0.046 | English only |
| nvidia/parakeet-ctc-1.1b | CTC | 80 | 1024 × 42 | 1024 | 1.1 B | 1217 MiB q8_0 | 0.026-0.074 | English only |
| nvidia/parakeet-tdt-0.6b-v3 | TDT | 128 | 1024 × 24 | 8192 | 600 M | 715 MiB q8_0 / 1.34 GiB f16 | 0.024-0.050 | ~25 languages + PnC |
| nvidia/parakeet-tdt-1.1b | TDT | 80 | 1024 × 42 | 1024 | 1.1 B | 1225 MiB q8_0 | 0.027-0.079 | English only, lowest WER (no PnC) |
| nvidia/diar_sortformer_4spk-v1 | Sortformer head (diarization) | 80 | enc 512 × 18 + tf 192 × 18 | n/a (4 speakers) | ~123 M | 263 MiB f16 / 141 MiB q8_0 / 75 MiB q4_0 | 0.017-0.097 | Speaker diarization (up to 4 speakers, offline) |
| nvidia/diar_streaming_sortformer_4spk-v2 | Sortformer head (diarization) | 128 | enc 512 × 17 + tf 192 × 18 | n/a (4 speakers) | ~117 M | 251 MiB f16 / 134 MiB q8_0 / 72 MiB q4_0 | similar to v1 in offline mode | Speaker diarization, streaming-trained (offline + Phase 11.11.1 sliding-history live streaming today; full NeMo-style spkcache streaming in Phase 11.11.2) |
| nvidia/parakeet_realtime_eou_120m-v1 | RNN-T (1L LSTM 640) + `<EOU>` token | 128 | 512 × 17 (chunked-limited att=[70,1] + causal subsampler + LN-in-conv) | 1027 (1024 BPE + `<EOU>` + `<EOB>` + blank) | 120 M | 246 MiB f16 / 132 MiB q8_0 | encoder out cosine 0.999997 vs NeMo offline; CPU-only today (GPU follow-up tracked) | English only, low-latency streaming ASR with native `<EOU>` end-of-utterance token detection (NeMo voice-agent target). NVIDIA Open Model License. Phase 12.5 ships offline + Mode 2 + rolling-encoder Mode 3 with offline-equivalent transcripts (Mode 2 byte-equal NeMo, Mode 3 within tolerance). Driving the streaming-trained weights through NeMo's chunked-limited cache_aware_stream_step was prototyped during the Phase 12.x exploration and rejected on quality grounds (~2× early-utterance WER, no `<EOU>` emitted) -- see PROGRESS.md §8.5 case (A). |

Same converter, same encoder graph (with conv_norm_type / causal_downsampling /
chunked_limited_attention / use_bias all toggled by GGUF metadata so the EOU
streaming-trained encoder reuses the same C++ graph as the offline
CTC / TDT encoders), same GGUF schema. Model identity lives entirely in
parakeet.model.type + the encoder hyperparameters.
All public entry points sit on a single qvac_parakeet::Engine
that auto-dispatches on parakeet.model.type:

- `Engine::transcribe()` -- one-shot wav -> text. CTC, TDT, or EOU.
- `Engine::transcribe_stream()` -- Mode 2, offline encoder + streamed segments. CTC, TDT, or EOU. EOU segments carry an extra `is_eou_boundary` flag that fires on the chunk where the model emits the `<EOU>` token.
- `Engine::stream_start()` -> `StreamSession` -- Mode 3, live duplex cache-aware push API. CTC, TDT, or EOU. TDT needs slightly more context at the same chunk size (transducer is more sensitive to missing right-lookahead; typical WER delta vs offline is +5-10 %). EOU's Mode 3 transcript is byte-equal to its offline path on shipping fixtures; the `<EOU>` boundary detection in Mode 3 is approximate by design (the chunked-limited streaming-inference alternative was evaluated and rejected on quality grounds; see PROGRESS.md §8.5 case (A)).
- `Engine::diarize()` -- one-shot wav -> [{speaker, start, end}]. Sortformer.
- `Engine::diarize_start()` -> `SortformerStreamSession` -- live diarization push API (sliding-history v1; Phase 11.11.2 spkcache pending). Sortformer.

Plus a free function transcribe_with_speakers(sortformer_engine, asr_engine, ...) for combined "who said what" attribution.
Both StreamSession and SortformerStreamSession also support a
small cross-engine event surface (Phase 13) via
StreamingOptions::on_event / SortformerStreamingOptions::on_event:

- `StreamEventType::EndOfTurn` fires when an EOU session detects the `<EOU>` token (Mode 2 + Mode 3); `eot_confidence = 1.0` when the model emitted the boundary.
- `StreamEventType::VadStateChanged` fires on Sortformer chunks whose any-speaker probability crosses `threshold` (with `speaker_id = argmax` on entering Speaking), and on CTC / TDT sessions when the opt-in energy-VAD fallback (`enable_energy_vad = true`) crosses its dB threshold with hangover.

Defaults to nullptr; consumers that ignore events keep the same
behaviour as before. Designed to be the same shape whisper.cpp will
emit so engine-agnostic event handling can be written once.
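
The energy-VAD fallback mentioned above can be pictured as a small state machine. The sketch below is illustrative only: `EnergyVad`, its threshold, and its hangover length are hypothetical names and values chosen for the example, not the library's implementation or defaults.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical energy VAD: a frame counts as speech when its RMS level
// (in dB relative to full scale) is at or above threshold_db. Once speaking,
// the state only drops back to silence after hangover_frames consecutive
// quiet frames, which avoids flapping on short pauses.
struct EnergyVad {
    float threshold_db    = -40.0f; // assumed value, not the library default
    int   hangover_frames = 3;      // assumed value
    bool  speaking        = false;
    int   quiet_run       = 0;

    // RMS level of one frame in dB relative to full scale (1.0).
    static float rms_db(const std::vector<float> & frame) {
        assert(!frame.empty());
        double acc = 0.0;
        for (float s : frame) acc += double(s) * double(s);
        const double rms = std::sqrt(acc / double(frame.size()));
        return 20.0f * std::log10(float(rms) + 1e-10f);
    }

    // Feeds one frame; returns true when the speaking state changed,
    // i.e. when a VadStateChanged-style event would fire.
    bool push(const std::vector<float> & frame) {
        const bool loud = rms_db(frame) >= threshold_db;
        if (loud) {
            quiet_run = 0;
            if (!speaking) { speaking = true; return true; }
            return false;
        }
        if (speaking && ++quiet_run >= hangover_frames) {
            speaking  = false;
            quiet_run = 0;
            return true;
        }
        return false;
    }
};
```

A consumer would call `push()` once per audio frame and raise its own event whenever it returns true; `speaking` then tells which direction the transition went.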

    wav -> log-mel (80/128) -> FastConformer encoder (sub 8x, 17-42 blocks,
                                                      optional LN-in-conv +
                                                      causal subsampler +
                                                      chunked-limited attn mask)
                                        |
          +----------------+------------+-----+-------------------+
          v                v                  v                   v
     CTC head +       TDT 2L LSTM +      EOU 1L LSTM +       Sortformer encoder_proj +
     greedy +         joint MLP +        joint MLP +         18L TF + sigmoid head +
     SP detok         transducer         transducer +        threshold segmentation
          |           greedy             <EOU> reset              |
          |           + duration head    + segment flush          |
          |                |                  |                   |
        text          text + PnC         text (\n on         {speaker, t0, t1}*
                                         <EOU> turn ends)

Each .gguf ships everything its decoder needs in a single file
(encoder weights, decoder weights, precomputed mel filterbank, and
the SentencePiece tokenizer where applicable). The C++ Engine
auto-detects the model type at load time and dispatches to the right
decoder, so the public API stays single-engine from the consumer's
perspective.
- C++17 compiler (clang or gcc)
- cmake >= 3.14
- Python 3.10+ with `torch`, `nemo_toolkit[asr]`, `gguf`, `numpy`, `librosa`, `soundfile`, `sentencepiece` -- needed once, at setup time only, to run the weight converter (which bakes the precomputed mel filterbank into the GGUF) and the reference-dump scripts. Once the GGUF exists, the C++ binary has zero runtime dependency on Python.

See scripts/ for one-shot helpers.
git clone <this-repo> qvac-parakeet.cpp
cd qvac-parakeet.cpp
# Clone ggml at the pinned commit. The same pin is used for every
# backend (CPU, Metal, CUDA, Vulkan); no engine- or backend-specific
# ggml patches are applied today.
./scripts/setup-ggml.sh
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(sysctl -n hw.ncpu 2>/dev/null || nproc)

For a GPU backend pick one of Metal (Apple Silicon, ~2.5x faster
than CPU), CUDA (NVIDIA), or Vulkan (everything else) at configure
time. The init order at runtime is CUDA -> Metal -> Vulkan -> CPU,
so a single binary built with multiple backends compiled in uses
the first available one. There is no runtime backend switch; the
expectation is one backend per build.

# Apple Silicon:
cmake -S . -B build-metal -DCMAKE_BUILD_TYPE=Release \
-DGGML_METAL=ON -DGGML_METAL_EMBED_LIBRARY=ON
# NVIDIA: -DGGML_CUDA=ON
# Generic: -DGGML_VULKAN=ON
cmake --build build-metal -j$(sysctl -n hw.ncpu)
# `--n-gpu-layers` is a yes/no toggle today: any value > 0 moves the
# whole encoder to the compiled-in GPU backend. The flag is named for
# compatibility with llama.cpp / whisper.cpp; partial-layer offload
# is not implemented (encoder is small enough to fit on one device).
./build-metal/qvac-parakeet \
--n-gpu-layers 1 \
--model models/parakeet-ctc-0.6b.gguf \
--wav test/samples/jfk.wav

(Use a quantised GGUF -- e.g. parakeet-ctc-0.6b.q8_0.gguf --
produced via --quant q8_0 in the converter snippet under §2 if you
want the smaller / faster file. The bare .gguf is f16 and works
the same.)
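
The backend choice described above boils down to a fixed priority list plus an on/off toggle. A minimal sketch follows, assuming that "available" simply means "compiled in" and that `--n-gpu-layers 0` keeps everything on the CPU; `CompiledBackends` and `pick_backend` are hypothetical names for this example, not part of the library.

```cpp
#include <cassert>
#include <string>

// Which GPU backends were compiled into this (hypothetical) build.
struct CompiledBackends {
    bool cuda   = false;
    bool metal  = false;
    bool vulkan = false;
};

// Follows the documented init order CUDA -> Metal -> Vulkan -> CPU.
// n_gpu_layers is a yes/no toggle: any value > 0 requests the GPU.
// Assumption: n_gpu_layers <= 0 stays on the CPU.
inline std::string pick_backend(const CompiledBackends & b, int n_gpu_layers) {
    if (n_gpu_layers <= 0) return "cpu";
    if (b.cuda)   return "cuda";
    if (b.metal)  return "metal";
    if (b.vulkan) return "vulkan";
    return "cpu"; // no GPU backend compiled in
}
```

Under these assumptions a build with both Metal and Vulkan enabled always lands on Metal, which is why one backend per build is the expected setup.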
This produces the main binary plus per-stage validation harnesses:

| Binary | What it does |
|---|---|
| build/qvac-parakeet | End-to-end CLI: wav / raw PCM -> text (CTC + TDT + EOU) or speaker segments (Sortformer). Auto-routes on GGUF metadata. Supports --stream (Mode 2/3 transcription, sliding-history Sortformer streaming), --diarization-model PATH (combined ASR + Sortformer attribution), --bench, --profile. EOU streaming JSON output includes an is_eou_boundary flag per segment. |
| build/live-mic | Live microphone session for either transcription (CTC/TDT/EOU) or diarization (Sortformer). Auto-detects from the GGUF. |
| build/live-mic-attributed | Live microphone with simultaneous ASR + Sortformer; tags each transcript segment with the speaker whose live diarization range overlaps it the most. --accumulate collapses output to one line per speaker. |
| build/test-mel | 16 kHz log-mel parity vs NeMo AudioToMelSpectrogramPreprocessor. |
| build/test-encoder | FastConformer encoder per-stage parity vs dump-ctc-reference.py. |
| build/test-ctc | CTC head + greedy decode + SentencePiece detokenize parity vs NeMo transcribe() (consumes logits.npy from dump-ctc-reference.py). |
| build/test-tdt-encoder-parity | TDT encoder per-stage parity vs dump-tdt-reference.py. |
| build/test-sortformer-parity | Sortformer mel + encoder + speaker-prob parity vs dump-sortformer-reference.py. |
| build/test-streaming | CTC/TDT Mode 2 byte-equality + timestamp coverage + Mode 3 WER tolerance across chunk sizes. |
| build/test-eou-streaming | EOU Mode 2 transcript byte-equality vs Engine::transcribe() reference + is_eou_boundary firing on the trailing `<EOU>` chunk + Mode 3 transcript-within-tolerance. |
| build/test-sortformer-streaming | SortformerStreamSession push API: random-burst feed, no-duplicate, single-is_final assertions. |

- `-DQVAC_PARAKEET_BUILD_TESTS=ON` (default ON in standalone): builds the `test-*` parity + streaming harnesses listed above.
- `-DQVAC_PARAKEET_BUILD_EXAMPLES=ON` (default `QVAC_PARAKEET_STANDALONE_DEFAULT`, i.e. ON for top-level `cmake -S . -B build` but OFF when consumed as a sub-project): builds `live-mic` + `live-mic-attributed`.
- `-DQVAC_PARAKEET_USE_SYSTEM_GGML=ON`: link against an installed ggml instead of the pinned clone in `ggml/`.
- `-DGGML_METAL=ON` / `-DGGML_CUDA=ON` / `-DGGML_VULKAN=ON`: pick exactly one GPU backend at configure time (see GPU note above).

The converter (scripts/convert-nemo-to-gguf.py -- name is
historical; it auto-detects CTC, TDT and Sortformer from the .nemo
config and writes the right GGUF in each case) takes a .nemo archive
and produces a single self-contained GGUF (encoder + decoder weights +
embedded tokenizer where applicable + precomputed mel filterbank).
python -m venv venv && . venv/bin/activate
pip install "nemo_toolkit[asr]" gguf numpy soundfile librosa sentencepiece
# Parakeet-CTC 0.6B / 1.1B (English, fast)
python scripts/convert-nemo-to-gguf.py \
--ckpt models/parakeet-ctc-0.6b.nemo \
--out models/parakeet-ctc-0.6b.gguf
python scripts/convert-nemo-to-gguf.py \
--ckpt models/parakeet-ctc-1.1b.nemo \
--out models/parakeet-ctc-1.1b.q8_0.gguf \
--quant q8_0
# Parakeet-TDT 0.6B-v3 (multilingual, punctuation, capitalisation) / 1.1B (English only, no PnC)
python scripts/convert-nemo-to-gguf.py \
--ckpt models/parakeet-tdt-0.6b-v3.nemo \
--hf-repo nvidia/parakeet-tdt-0.6b-v3 \
--out models/parakeet-tdt-0.6b-v3.q8_0.gguf \
--quant q8_0
python scripts/convert-nemo-to-gguf.py \
--ckpt models/parakeet-tdt-1.1b.nemo \
--hf-repo nvidia/parakeet-tdt-1.1b \
--out models/parakeet-tdt-1.1b.q8_0.gguf \
--quant q8_0
# Parakeet-EOU 120M (English, real-time streaming + native <EOU> end-of-utterance token)
python scripts/convert-nemo-to-gguf.py \
--ckpt models/parakeet_realtime_eou_120m-v1.nemo \
--hf-repo nvidia/parakeet_realtime_eou_120m-v1 \
--out models/parakeet-eou-120m-v1.q8_0.gguf \
--quant q8_0
# Sortformer 4-speaker diarization (offline v1, streaming-trained v2)
python scripts/convert-nemo-to-gguf.py \
--ckpt models/diar_sortformer_4spk-v1.nemo \
--hf-repo nvidia/diar_sortformer_4spk-v1 \
--out models/sortformer-4spk-v1.f16.gguf
python scripts/convert-nemo-to-gguf.py \
--ckpt models/diar_streaming_sortformer_4spk-v2.nemo \
--hf-repo nvidia/diar_streaming_sortformer_4spk-v2 \
--out models/sortformer-streaming-4spk-v2.f16.gguf

Footgun: the script's --hf-repo defaults to nvidia/parakeet-ctc-0.6b,
so when --ckpt points at a non-CTC path that does not exist locally
you must pass --hf-repo explicitly -- otherwise the script will
download the CTC checkpoint instead of the one named in --ckpt.
scripts/download-all-models.sh pre-fetches every supported .nemo
(plus the corresponding ONNX bundles for the Node binding) -- handy
when you're about to be on a flaky network.
--quant selects the storage format for the ~150 large 2D weight
matrices (FFN, attention q/k/v/out/pos/qkv, conv pointwise, subsampling
output, CTC head). Small tensors (biases, norms, fused BN, mel
filterbank, depthwise / small 2D convs) always stay at f32/f16. Tensors
whose shape[-1] % 32 != 0 also stay at f16 (the block-quantised
formats need a multiple-of-32 last dim); see PROGRESS.md §5.12 for
the full sweep + tier-by-tier accuracy.
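
To make the multiple-of-32 rule concrete, the sketch below estimates the on-disk size of one 2D weight matrix per tier. The block sizes are those of ggml's `q8_0`, `q5_0` and `q4_0` layouts (32 weights per block); the function names are hypothetical, and the converter's own "large tensor" selection logic is not modelled here.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bytes per 32-weight block in ggml's block-quantised formats:
// an f16 scale (2 bytes) plus the packed weights
// (q5_0 additionally stores 4 bytes of high bits).
inline std::int64_t block_bytes(const std::string & quant) {
    if (quant == "q8_0") return 34; // 2 + 32 * 1
    if (quant == "q5_0") return 22; // 2 + 4 + 16
    if (quant == "q4_0") return 18; // 2 + 16
    return -1;                      // anything else is treated as f16 below
}

// Storage size of one 2D weight matrix [rows, cols] under --quant.
// Falls back to f16 (2 bytes per weight) when cols is not a multiple
// of 32, matching the rule described above.
inline std::int64_t tensor_bytes(std::int64_t rows, std::int64_t cols,
                                 const std::string & quant) {
    assert(rows > 0 && cols > 0);
    const std::int64_t bb = block_bytes(quant);
    if (bb < 0 || cols % 32 != 0) return rows * cols * 2;
    return rows * (cols / 32) * bb;
}
```

For a 1024 × 1024 matrix this gives 2 MiB at f16, about 1.06 MiB at q8_0 (8.5 bits per weight) and about 0.56 MiB at q4_0 (4.5 bits per weight).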

| --quant | File size | enc best on 20 s clip | enc best on 11 s clip | Transcript parity |
|---|---|---|---|---|
| f32 | 2.4 GiB | n/a (debug only) | n/a | exact |
| f16 | 1.3 GiB | 1221 ms | ~680 ms | bit-equal |
| q8_0 | 697 MiB | 839 ms | 460 ms | bit-equal |
| q5_0 | 453 MiB | 1475 ms (slower) | ~650 ms | bit-equal |
| q4_0 | 372 MiB | 1080 ms | 595 ms | bit-equal |

Measurements on an Apple M4 Air, 10 ggml-cpu threads, OpenMP,
--bench-warmup 5 --bench-runs 15. Transcripts on both clips are
bit-equal to NeMo PyTorch reference at every tier tested, including
q4_0. To reproduce / compare across GGUF tiers and backends:
./build/qvac-parakeet --model models/parakeet-ctc-0.6b.q8_0.gguf \
--wav test/samples/sample-16k.wav \
--bench --bench-runs 15 --bench-warmup 5 \
--bench-json artifacts/bench/my-q8_0.json

For per-sub-stage encoder profiling (subsampling / CTC head / per-block
times across n_layers = {0, 1, N/2, N}):
./build/qvac-parakeet --model models/parakeet-ctc-0.6b.q8_0.gguf \
--wav test/samples/sample-16k.wav \
--profile --profile-runs 5 --profile-warmup 2

Defaults: q8_0 is the speed/accuracy sweet spot (about 2x smaller
than f16, bit-equal transcripts, 11-23 % faster than onnxruntime on
typical clips). q4_0 is smaller again (about 3.5x smaller than f16)
and also bit-equal on clean speech. q5_0 ships but its mul_mat kernel
is slower than both on Apple Silicon, so it's only useful if you want
the q5_0 size tier specifically.
CPU f16 vs f16 — same floating-point precision, different runtimes:

| Metric | onnxruntime-f16 | ggml-cpu-f16 |
|---|---|---|
| model size | 2.3 GiB | 1.3 GiB |
| load ms | 16 736 | 642 (26x faster) |
| inf best ms | 948 | 1117 (15 % slower) |
| inf median ms | 1 007 | 1132 (12 % slower) |
| inf stdev ms | 52 | 18 (3x tighter) |
| RTF best | 0.047 | 0.055 |
| RTF median | 0.050 | 0.056 |
| Transcripts | match | match |

CPU int8 vs int8 — same quantization level, different runtimes:

| Metric | onnxruntime-int8 | ggml-cpu-q8_0 |
|---|---|---|
| model size | 583.9 MiB | 697 MiB |
| load ms | 2 054 | 179 (11x faster) |
| inf best ms | 677 | 898 (25 % slower) |
| inf median ms | 721 | 928 (22 % slower) |
| inf stdev ms | 55 | 25 (2x tighter) |
| RTF best | 0.034 | 0.045 |
| RTF median | 0.036 | 0.046 |
| Transcripts | match | match |

GPU Metal — same GGUF, Metal backend (build with -DGGML_METAL=ON,
run with --n-gpu-layers 1):

| Metric | onnxruntime-int8 | ggml-metal-q8_0 |
|---|---|---|
| model size | 583.9 MiB | 697 MiB |
| load ms | 2 295 | 420 (5.5x faster) |
| inf best ms | 682 | 282 (2.4x faster) |
| inf median ms | 712 | 283 (2.5x faster) |
| inf stdev ms | 18 | 0.83 (21x tighter) |
| RTF best | 0.034 | 0.014 |
| RTF median | 0.035 | 0.014 |
| Transcripts | match | match |

Summary: on CPU, onnxruntime's AMX-accelerated kernels are 12-25 % faster than ggml-cpu. On Metal, ggml is 2.4-2.5x faster than onnxruntime int8 with 21x tighter variance, landing the 20 s clip's encoder at ~73x real-time; quant tier (f16 / Q8_0 / Q4_0) only affects file size, not throughput, because the Metal path is compute-bound on shader units.
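
The RTF rows above are plain arithmetic: inference time divided by audio duration, with the "x real-time" figure as its reciprocal. A minimal sketch, using hypothetical helper names:

```cpp
#include <cassert>

// Real-time factor: processing time divided by audio duration.
// Values below 1.0 mean faster than real time.
inline double rtf(double inference_ms, double audio_ms) {
    assert(audio_ms > 0.0);
    return inference_ms / audio_ms;
}

// How many seconds of audio are processed per second of compute.
inline double realtime_multiple(double rtf_value) {
    assert(rtf_value > 0.0);
    return 1.0 / rtf_value;
}
```

For example, `rtf(282.0, 20000.0)` evaluates to 0.0141, which is the 0.014 reported for the Metal q8_0 run if the clip is taken as exactly 20 s.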
./build/qvac-parakeet \
--model models/parakeet-ctc-0.6b.gguf \
--wav test/samples/jfk.wav

Auto-routing on the model type means the same command also works on
TDT GGUFs (you get cased + punctuated text), on EOU GGUFs (you get
the same lowercase no-PnC English text the EOU model was trained for,
with a trailing `<EOU>` token segmenting the output by utterance),
and on Sortformer GGUFs (you get [start-end] speaker_N lines
instead of text). See --help for the full flag set.
For headless pipelines (ffmpeg / sox upstream, or the QVAC bindings), the
CLI also accepts raw mono PCM via --pcm-in. The raw stream carries no
header, so pass --pcm-rate HZ to state the stream's sample rate. A rate
that does not match the model's expected rate fails fast (resampling is
not yet wired); omitting the flag falls back to the model's rate with a
warning:
# --pcm-format: s16le (default) or f32le.
# --pcm-rate: pass it explicitly to get the fail-fast rate check; if it is
# omitted the CLI warns and falls back to the model's rate.
./build/qvac-parakeet \
--model models/parakeet-ctc-0.6b.gguf \
--pcm-in recording.raw \
--pcm-format s16le \
--pcm-rate 16000

The engine exposes three transcription entry points that mirror the qvac
SDK's transcribe / transcribeStream API:

| Entry point | Caller provides | Caller receives | Status |
|---|---|---|---|
| Engine::transcribe() | full audio | full text | ships |
| Engine::transcribe_stream() | full audio + callback | segments via callback | ships (Mode 2) |
| Engine::stream_start() -> StreamSession | push PCM via feed_pcm_*() | segments via callback | ships (Mode 3, cache-aware inference) |

Mode 2 runs the offline encoder once, then walks the encoder frames in
chunk_ms-sized windows. For CTC GGUFs it runs ctc_greedy_decode_window
per window and the concatenated transcript is byte-equal to the
non-streaming path -- test-streaming asserts this across chunk sizes
{250, 500, 1000, 2000, 4000, 11000} ms on every run. For TDT GGUFs it
carries TdtDecodeState (LSTM hidden + last token) across windows; the
non-streaming WER is preserved within test-streaming's tolerance band
(40% at the most aggressive chunk=1000 left=2000 right=500 config,
~0% at typical settings) but byte-equality with the non-streaming path
is not guaranteed because the joint network's emission timing can
shift slightly when the encoder context window changes.
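
The byte-equality for CTC follows from how greedy CTC decoding works: the only state it needs is the previous frame's argmax id. The sketch below shows that property on plain integer ids. `CtcWindowState`, `greedy_ctc_window` and `decode_in_windows` are hypothetical names for this illustration; the signature of the real `ctc_greedy_decode_window` is not shown in this document.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// State carried between windows: the argmax id of the last frame seen.
struct CtcWindowState {
    int prev_id = -1; // -1 = nothing decoded yet
};

// Greedy CTC over frames [begin, end) of per-frame argmax ids:
// emit an id when it differs from the previous frame's id and is not blank.
inline std::vector<int> greedy_ctc_window(const std::vector<int> & frame_ids,
                                          std::size_t begin, std::size_t end,
                                          int blank_id, CtcWindowState & st) {
    assert(begin <= end && end <= frame_ids.size());
    std::vector<int> out;
    for (std::size_t t = begin; t < end; ++t) {
        const int id = frame_ids[t];
        if (id != st.prev_id && id != blank_id) out.push_back(id);
        st.prev_id = id;
    }
    return out;
}

// Decodes the whole sequence in windows of `window` frames and concatenates
// the per-window outputs, carrying the state across window boundaries.
inline std::vector<int> decode_in_windows(const std::vector<int> & frame_ids,
                                          std::size_t window, int blank_id) {
    assert(window > 0);
    CtcWindowState st;
    std::vector<int> all;
    for (std::size_t b = 0; b < frame_ids.size(); b += window) {
        const std::size_t e = std::min(b + window, frame_ids.size());
        const std::vector<int> part = greedy_ctc_window(frame_ids, b, e, blank_id, st);
        all.insert(all.end(), part.begin(), part.end());
    }
    return all;
}
```

Because `prev_id` survives the window boundary, a repeated token that straddles two windows is still collapsed, so every window size yields the same token sequence. A transducer decoder carries richer state (LSTM hidden state plus last token), which is why the TDT path is only held to a WER tolerance.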
From the CLI:
./build/qvac-parakeet \
--model models/parakeet-ctc-0.6b.gguf \
--pcm-in recording.raw --pcm-format s16le \
--stream --stream-chunk-ms 1000 \
--emit text # or jsonl

Flags:

- `--stream`: enable Mode 2.
- `--stream-chunk-ms N`: segment window stride (default 1000; snaps down to multiples of the encoder frame stride, which is 80 ms on every shipped GGUF; the implementation derives it from the model's mel hop length and subsampling factor).
- `--emit text`: one `[start-end] text` line per segment (default).
- `--emit jsonl`: one `{"chunk","start","end","is_final","is_eou_boundary","text"}` JSON object per line, for easy downstream consumption. The `is_eou_boundary` field is always present but only ever true on EOU GGUFs (CTC / TDT segments leave it false).

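
The chunk snapping can be sketched as integer arithmetic. The 160-sample hop at 16 kHz used below is an assumption (it is the value that yields the documented 80 ms stride with 8x subsampling), and both function names are hypothetical.

```cpp
#include <cassert>

// Encoder frame stride in ms, derived from the mel hop length (in samples),
// the sample rate, and the encoder's subsampling factor.
inline int frame_stride_ms(int hop_samples, int sample_rate, int subsampling) {
    assert(hop_samples > 0 && sample_rate > 0 && subsampling > 0);
    return hop_samples * subsampling * 1000 / sample_rate;
}

// Snaps a requested chunk size down to a whole number of encoder frames.
// Assumption: a request smaller than one frame is raised to one frame.
inline int snap_chunk_ms(int chunk_ms, int stride_ms) {
    assert(stride_ms > 0);
    const int snapped = (chunk_ms / stride_ms) * stride_ms;
    return snapped > 0 ? snapped : stride_ms;
}
```

In this sketch a request of 1000 ms becomes 960 ms (12 frames of 80 ms), while 2000 ms is already a multiple of the stride and stays unchanged.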
On a 5.5 minute speech clip (LastQuestion_long_EN.raw, 16 kHz
s16le) Mode 2 lands at RTF 0.046 (~22x real-time) on M4 Air with
Metal Q8_0. Segments are emitted at the --stream-chunk-ms cadence
once the offline encoder finishes -- Mode 2 is cosmetic streaming:
first segment lands after the full encoder pass.
Mode 3 feeds PCM into a StreamSession incrementally; each chunk runs
its own encoder pass over [left_context + chunk + right_lookahead]
audio, slices out the center frames, and emits a segment as soon as
that chunk is processed. First segment lands at
chunk_ms + right_lookahead_ms + encoder_time, not after the full
utterance.
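
The per-chunk window and the first-segment latency can be written down directly. This is a sketch under stated assumptions: the struct and function names are hypothetical, and clamping the window at the start and end of the stream is an assumption about edge handling.

```cpp
#include <algorithm>
#include <cassert>

struct Mode3Config {
    int chunk_ms;
    int left_context_ms;
    int right_lookahead_ms;
};

struct Window {
    int begin_ms;        // start of the audio fed to the encoder
    int end_ms;          // end of the audio fed to the encoder
    int center_begin_ms; // start of the span whose frames are kept
    int center_end_ms;   // end of the span whose frames are kept
};

// Audio span encoded for chunk k (0-based), clamped to [0, total_ms].
inline Window window_for_chunk(const Mode3Config & c, int k, int total_ms) {
    assert(c.chunk_ms > 0 && k >= 0);
    Window w;
    w.center_begin_ms = k * c.chunk_ms;
    w.center_end_ms   = std::min(w.center_begin_ms + c.chunk_ms, total_ms);
    w.begin_ms        = std::max(0, w.center_begin_ms - c.left_context_ms);
    w.end_ms          = std::min(total_ms, w.center_end_ms + c.right_lookahead_ms);
    return w;
}

// Earliest time the first segment can be emitted, measured from stream start.
inline int first_segment_latency_ms(const Mode3Config & c, int encoder_ms) {
    return c.chunk_ms + c.right_lookahead_ms + encoder_ms;
}
```

With a 1000 / 2000 / 500 ms configuration and an assumed 100 ms encoder pass, the first segment would land at 1600 ms.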
Key point: no new model needed. Mode 3 runs whichever
offline-trained Parakeet GGUF you have loaded (CTC or TDT) in
cache-aware inference mode. Accuracy is preserved within a few percent
of offline WER when the left_context_ms and right_lookahead_ms
budgets are reasonable
(the conv module uses symmetric kernel=9 padding, so denying future
context at chunk boundaries hurts more than denying past context).
From the CLI (simulates a live producer feeding the same wav in blocks):
./build/qvac-parakeet \
--model models/parakeet-ctc-0.6b.gguf \
--pcm-in recording.raw --pcm-format s16le \
--stream --stream-duplex \
--stream-chunk-ms 2000 \
--stream-left-context-ms 10000 \
--stream-right-lookahead-ms 2000 \
--emit text # or jsonl

--stream-left-context-ms is the audio context per chunk (10 s is
the default; diminishing returns past 5 s); --stream-right-lookahead-ms
is the most impactful accuracy knob (future audio appended before
emitting). Both have StreamingOptions mirrors on the C++ side.
Measured on Apple M4 Air, Q8_0, Metal backend:

| Audio | Config (chunk / left / right ms) | WER vs offline | Wall time | First-seg latency |
|---|---|---|---|---|
| jfk.wav (11 s) | 1000 / 2000 / 500 | 0.00% | ~1.8 s | ~1.6 s |
| jfk.wav (11 s) | 2000 / 2000 / 1000 | 0.00% | ~1.8 s | ~3.1 s |
| jfk.wav (11 s) | 2000 / 5000 / 2000 | 0.00% | ~1.9 s | ~4.1 s |
| LastQuestion_EN.raw (5.5 min) | 2000 / 10000 / 2000 | 4.13% | 35 s (RTF 0.11) | ~4 s |

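
"WER vs offline" treats the offline transcript as the reference and the streamed transcript as the hypothesis. A minimal sketch of the standard word-level edit-distance computation follows; it assumes whitespace tokenisation with no text normalisation, which may differ from what the test harness does, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Splits a transcript on whitespace.
inline std::vector<std::string> split_words(const std::string & s) {
    std::istringstream in(s);
    std::vector<std::string> words;
    for (std::string tok; in >> tok;) words.push_back(tok);
    return words;
}

// Word error rate of `hyp` against `ref`:
// (substitutions + deletions + insertions) / number of reference words,
// computed with a two-row Levenshtein table over words.
inline double wer(const std::string & ref, const std::string & hyp) {
    const std::vector<std::string> r = split_words(ref);
    const std::vector<std::string> h = split_words(hyp);
    assert(!r.empty());
    std::vector<std::size_t> prev(h.size() + 1), cur(h.size() + 1);
    for (std::size_t j = 0; j <= h.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= r.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= h.size(); ++j) {
            const std::size_t sub = prev[j - 1] + (r[i - 1] == h[j - 1] ? 0 : 1);
            cur[j] = std::min({sub, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return double(prev[h.size()]) / double(r.size());
}
```

Identical transcripts give 0.0; a four-word reference with one substitution and one deletion gives 0.5.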
Mode 3 is slower in total wall time than Mode 2 because each chunk
re-runs the encoder over the full (left_ctx + chunk + right_lookahead)
window. A KV-cache + conv-state optimisation (Phase 8.5) would cut the
per-chunk compute roughly 6x on long-form audio while preserving the
same accuracy; the StreamSession public API already supports it as a
drop-in swap.
The Node binding at qvac-lib-infer-parakeet
is the intended consumer for StreamSession; check its README for
the qvac-parakeet.cpp version it currently links against.
EOU GGUFs flow through the same Mode 1 / Mode 2 / Mode 3 entry points
as CTC / TDT, with two extras on each emitted StreamingSegment:
struct StreamingSegment {
// ... existing CTC/TDT/EOU fields: text, token_ids, start_s, end_s,
// chunk_index, is_final, encoder_ms, decode_ms ...
// EOU only: set true when this chunk's decoded portion contained
// the `<EOU>` token. CTC / TDT segments leave this false.
bool is_eou_boundary = false;
float eot_confidence = 0.0f; // reserved for Phase 13 OnEndOfTurn
};

The decoder threads its own LSTM h/c state across chunks; on `<EOU>`
it flushes the current segment to text, zeros h/c, and re-primes the
predictor with the blank embedding -- exactly matching the binding's
processEOU semantics from qvac-lib-infer-parakeet. The token is
not in the visible vocab piece list, so it doesn't appear in
segment.text; consumers see the is_eou_boundary flag instead.
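
The flush-and-reset behaviour can be sketched on a stream of already-emitted token ids. This is a simplification, not the engine's decoder: the LSTM update itself is not modelled, the sketch closes a segment at every `<EOU>` rather than once per chunk, and `EouSegment`, `PredictorState`, `split_on_eou` and the token ids used in the usage note are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct EouSegment {
    std::vector<int> token_ids;      // visible tokens only, <EOU> excluded
    bool is_eou_boundary = false;
};

struct PredictorState {
    std::vector<float> h;  // LSTM hidden state
    std::vector<float> c;  // LSTM cell state
    int last_token;        // token fed back into the predictor
};

// Zeros h/c and re-primes the predictor with the blank token.
inline void reset_state(PredictorState & st, int blank_id) {
    std::fill(st.h.begin(), st.h.end(), 0.0f);
    std::fill(st.c.begin(), st.c.end(), 0.0f);
    st.last_token = blank_id;
}

// Splits emitted tokens into segments, closing a segment (and resetting
// the predictor state) every time the <EOU> id shows up.
inline std::vector<EouSegment> split_on_eou(const std::vector<int> & emitted,
                                            int eou_id, int blank_id,
                                            PredictorState & st) {
    assert(eou_id != blank_id);
    std::vector<EouSegment> out;
    EouSegment cur;
    for (int tok : emitted) {
        if (tok == eou_id) {
            cur.is_eou_boundary = true;
            out.push_back(cur);
            cur = EouSegment{};
            reset_state(st, blank_id);
        } else {
            cur.token_ids.push_back(tok);
            st.last_token = tok;
        }
    }
    if (!cur.token_ids.empty()) out.push_back(cur);
    return out;
}
```

With placeholder ids (say 1024 for `<EOU>` and 1026 for blank), the stream `{10, 11, 1024, 12}` yields one flagged segment holding `{10, 11}` followed by an unflagged segment holding `{12}`; the `<EOU>` id itself never reaches `token_ids`.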
CLI examples on jfk.wav (the JFK quote ends naturally with one
<EOU> boundary at the very end):
# Offline transcription (matches NeMo offline reference bit-for-bit):
./build/qvac-parakeet \
--model models/parakeet-eou-120m-v1.q8_0.gguf \
--wav test/samples/jfk.wav
# -> "and so my fellow americans ask not what your country can do for
# you ask what you can do for your country"
# Mode 2 streaming with chunked emit + JSON output (last chunk gets
# is_eou_boundary=true because the model emits <EOU> at end-of-quote):
./build/qvac-parakeet \
--model models/parakeet-eou-120m-v1.q8_0.gguf \
--wav test/samples/jfk.wav \
--stream --stream-chunk-ms 1500 --emit jsonl
# -> {"chunk":0,...,"is_eou_boundary":false,"text":"and so my"}
# {"chunk":1,...,"is_eou_boundary":false,"text":" fellow americans"}
# ...
# {"chunk":7,...,"is_eou_boundary":true, "text":" country"}
# Mode 3 live duplex (push API; same audio -> same transcript):
./build/qvac-parakeet \
--model models/parakeet-eou-120m-v1.q8_0.gguf \
--wav test/samples/jfk.wav \
--stream --stream-duplex \
--stream-chunk-ms 1000 \
--stream-left-context-ms 5000 \
--stream-right-lookahead-ms 1000

Numerical parity vs NeMo PyTorch reference (dump-eou-reference.py)
on jfk.wav:
| Stage | rel error | cosine |
|---|---|---|
| log-mel | 8.17e-1 (tail-frame artifacts) | 0.999644 |
| post-subsampler | 1.00e-1 | 0.999688 |
| encoder out | 7.70e-3 | 0.999997 |
Bit-equal transcripts on jfk.wav and sample-16k.wav
(Alice-in-Wonderland 20 s clip) at both f16 and q8_0 quant tiers.
build/test-eou-streaming asserts these properties on every CI run.
Mode 3 caveat (chosen design, not a workaround): the streaming
session re-runs the offline encoder per chunk over a sliding
[left + chunk + right_lookahead] window without persistent KV /
conv-state cache across chunks. The transcript is byte-equal to
the offline path on shipping fixtures, but <EOU> boundary
detection in Mode 3 is approximate because the trailing chunk
doesn't carry the long-context encoder state the EOU head needs to
confidently fire <EOU> at end-of-utterance.
The obvious alternative -- driving the streaming-trained EOU
weights through NeMo's cache_aware_stream_step (per-layer K/V
cache + depthwise-conv state, chunked-limited streaming attention
mask, O(chunk) per-chunk encoder cost) -- was prototyped during
the Phase 12.x exploration and rejected on quality grounds. It
produces NeMo's streaming transcript, which is structurally
distinct from and meaningfully worse than NeMo's offline transcript
(~2× early-utterance WER on jfk.wav, with the trailing <EOU>
token disappearing entirely from the cache-aware output). This
isn't a C++ port issue: NeMo's own RNN-T over the cache-aware
streaming encoder output reproduces the same regression
bit-for-bit, and Phase 8.0 already documented the same quality cliff
two years earlier on the older streaming_multi checkpoint from the
same model family. See PROGRESS.md §8.5 for the full rationale and
a strict separation of (A) chunked-limited streaming inference
[rejected] from (B) the original Phase 8.5 KV-cache-on-offline-weights
scope [deferred indefinitely, but a different design shape].
Phase 11.11.1 ships a push-API SortformerStreamSession. The session
buffers audio internally and, every chunk_ms, runs
Engine::diarize() over the trailing history_ms of audio, emits
segments that overlap the new chunk via callback, and slides the chunk
pointer forward.
SortformerStreamingOptions opts;
opts.sample_rate = 16000;
opts.chunk_ms = 2000; // emit cadence
opts.history_ms = 30000; // sliding context window
opts.threshold = 0.5f;
opts.min_segment_ms = 200;
auto session = sortformer_engine.diarize_start(opts,
[](const StreamingDiarizationSegment & s) {
std::printf("[%.2f-%.2f] speaker_%d (chunk %d%s)\n",
s.start_s, s.end_s, s.speaker_id, s.chunk_index,
s.is_final ? ", final" : "");
});
session->feed_pcm_f32(samples, n);
// ...feed more...
session->finalize();

CLI:
./build/qvac-parakeet \
--model models/sortformer-4spk-v1.f16.gguf \
--pcm-in recording.raw --pcm-format s16le \
--stream \
--stream-chunk-ms 2000 --stream-history-ms 30000 \
--emit text # or jsonl

Trade-offs of the Phase 11.11.1 pragmatic implementation:

- Pro: works with both v1 and v2 Sortformer GGUFs out of the box, no encoder graph split, no spkcache state. ~RTF 0.25 on M4 Air CPU with `chunk_ms=2000 history_ms=30000` (each chunk re-runs the full encoder over the trailing 30 s).
- Pro: speaker IDs stabilise within a few chunks once the history window contains both speakers' audio.
- Con: speaker IDs are derived from each per-chunk `diarize()` independently and may shift on the very first chunks, before the history window is full enough to disambiguate speakers.

Phase 11.11.2 (planned) implements true NeMo-style streaming with
spkcache compression + encoder graph split for fully stable
cross-chunk speaker identity at lower per-chunk compute.
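
The `threshold` and `min_segment_ms` options used above suggest a simple post-processing step on the per-frame speaker probabilities. The sketch below shows one way to do it; `segment_speakers` is a hypothetical name, the 80 ms frame duration is taken from the encoder stride quoted elsewhere in this document, and the engine's actual segmentation may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SpeakerSegment {
    int    speaker_id;
    double start_s;
    double end_s;
};

// probs[t][s] = probability that speaker s is active in frame t.
// A speaker is active in a frame when its probability is >= threshold;
// consecutive active frames form a segment, and segments shorter than
// min_segment_s are dropped. Speakers are handled independently, so
// overlapping speech produces overlapping segments.
inline std::vector<SpeakerSegment> segment_speakers(
        const std::vector<std::vector<float>> & probs,
        float threshold, double frame_s, double min_segment_s) {
    assert(frame_s > 0.0);
    std::vector<SpeakerSegment> out;
    if (probs.empty()) return out;
    const std::size_t n_spk = probs[0].size();
    for (std::size_t s = 0; s < n_spk; ++s) {
        long start = -1; // first frame of the open run, -1 if none
        for (std::size_t t = 0; t <= probs.size(); ++t) {
            const bool active = t < probs.size() && probs[t][s] >= threshold;
            if (active && start < 0) start = long(t);
            if (!active && start >= 0) {
                const double t0 = double(start) * frame_s;
                const double t1 = double(t) * frame_s;
                if (t1 - t0 >= min_segment_s) out.push_back({int(s), t0, t1});
                start = -1;
            }
        }
    }
    return out;
}
```

The extra loop iteration at `t == probs.size()` closes a run that is still open when the audio ends.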
Two example binaries take audio from the system default mic via
miniaudio (single-header, MIT, vendored
under examples/miniaudio.h) and drive the streaming push API.
Terminal output only, no GUI. First run on macOS will prompt for
microphone access. Across both examples: capture happens on the
audio callback thread into a mutex-guarded queue; the main thread
drains and feeds the engine, so the encoder never blocks the
capture buffer. Ctrl-C stops the device, flushes the tail, and
finalize()s cleanly.
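
The capture-to-engine handoff described above is a classic producer/consumer queue. A minimal sketch follows, with the hypothetical class name `PcmQueue`; the examples' actual queue type is not shown in this document.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Mutex-guarded sample queue shared by the audio callback thread
// (producer) and the main thread (consumer).
class PcmQueue {
public:
    // Called from the audio callback: copy the samples and return quickly.
    void push(const float * samples, std::size_t n) {
        assert(samples != nullptr || n == 0);
        std::lock_guard<std::mutex> lock(mu_);
        buf_.insert(buf_.end(), samples, samples + n);
    }

    // Called from the main thread: takes everything queued so far,
    // leaving the queue empty for the next callback.
    std::vector<float> drain() {
        std::lock_guard<std::mutex> lock(mu_);
        std::vector<float> out(buf_.begin(), buf_.end());
        buf_.clear();
        return out;
    }

private:
    std::mutex        mu_;
    std::deque<float> buf_;
};
```

The main loop would repeatedly call `drain()` and hand the result to the session's `feed_pcm_f32()`, so a slow encoder pass delays only the consumer and never the capture callback.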
examples/live-mic.cpp auto-detects the GGUF: a CTC/TDT model drives
StreamSession and prints transcript segments; a Sortformer model
drives SortformerStreamSession and prints [start-end] speaker_N
lines.
# List capture devices:
./build-metal/live-mic --list-devices
# Live transcription (Ctrl-C to stop, Metal recommended):
./build-metal/live-mic \
--model models/parakeet-ctc-0.6b.q8_0.gguf \
--n-gpu-layers 1 \
--chunk-ms 1000 --left-context-ms 5000 --right-lookahead-ms 1000
# Same, but accumulate transcript on a single line and emit a newline
# after 1 s of silence (hands-free dictation feel):
./build-metal/live-mic \
--model models/parakeet-tdt-0.6b-v3.q8_0.gguf \
--n-gpu-layers 1 \
--chunk-ms 1000 --left-context-ms 5000 --right-lookahead-ms 1000 \
--accumulate --silence-flush-ms 1000
# Live diarization (same binary, Sortformer GGUF auto-detected):
./build-metal/live-mic \
--model models/sortformer-4spk-v1.f16.gguf \
--chunk-ms 2000 --history-ms 30000

Defaults are tuned for an interactive feel: first transcription
segment lands ~2 s after you start speaking
(chunk_ms + right_lookahead_ms + encoder_time), then at the
chunk_ms cadence.
examples/live-mic-attributed.cpp loads a CTC/TDT engine and a
Sortformer engine, forwards each captured batch to both, and tags
each transcript segment with the speaker whose live diarization
range overlaps it the most:
[2.10-3.00] speaker_0: hello there how are you
[3.00-4.00] speaker_0: doing today
[4.00-5.20] speaker_1: I am fine thanks
./build/live-mic-attributed \
--asr-model models/parakeet-tdt-0.6b-v3.q8_0.gguf \
--diar-model models/sortformer-4spk-v1.f16.gguf \
--asr-chunk-ms 1000 --asr-left-context-ms 5000 --asr-right-lookahead-ms 1000 \
--diar-chunk-ms 2000 --diar-history-ms 30000

Each captured audio batch is forwarded to both StreamSession
(transcription) and SortformerStreamSession (diarization). The
diarization callback maintains a sliding deque of recent
[start, end, speaker] spans; the transcription callback looks up
the most-overlapping span at the segment's time range and tags the
line. A `[diar] active speaker_N at t.ts` line is written to stderr
at speaker switches.
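
The attribution lookup can be sketched as a bounded deque plus an overlap scan. `SpeakerHistory` and `SpeakerSpan` are hypothetical names for this illustration, and the eviction rule (drop spans that ended more than `history_s` before the newest span's end) is an assumption about how the retention window is applied.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>

struct SpeakerSpan {
    double start_s;
    double end_s;
    int    speaker_id;
};

class SpeakerHistory {
public:
    explicit SpeakerHistory(double history_s) : history_s_(history_s) {
        assert(history_s > 0.0);
    }

    // Called from the diarization callback: remember the span and evict
    // spans that ended more than history_s before the newest one.
    void add(const SpeakerSpan & s) {
        spans_.push_back(s);
        while (!spans_.empty() && spans_.front().end_s < s.end_s - history_s_) {
            spans_.pop_front();
        }
    }

    // Called from the transcription callback: returns the speaker of the
    // span that overlaps [t0, t1] the most, or -1 when nothing overlaps.
    int lookup(double t0, double t1) const {
        int    best         = -1;
        double best_overlap = 0.0;
        for (const SpeakerSpan & s : spans_) {
            const double ov = std::min(t1, s.end_s) - std::max(t0, s.start_s);
            if (ov > best_overlap) {
                best_overlap = ov;
                best         = s.speaker_id;
            }
        }
        return best;
    }

private:
    double                  history_s_;
    std::deque<SpeakerSpan> spans_;
};
```

A transcript segment that straddles a speaker change is assigned to whichever span covers more of it, which matches the "overlaps it the most" wording above.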
Knobs:

- `--asr-chunk-ms` / `--asr-left-context-ms` / `--asr-right-lookahead-ms`: same as `live-mic` for transcription.
- `--diar-chunk-ms` / `--diar-history-ms`: same as `Engine::diarize_start`.
- `--speaker-history-ms` (default 60000): how much diarization history to retain for the attribution lookup. Increase for very long conversations; decrease if memory is tight.
- `--asr-n-gpu-layers` / `--diar-n-gpu-layers`: independent GPU offload knobs so you can run e.g. ASR on Metal and diarization on CPU (or vice versa) on machines with a single GPU.
- `--accumulate`: collapse output to one line per speaker and emit a newline on speaker change or after `--silence-flush-ms` of silence (default 1000). Same UX as `live-mic --accumulate`, but each line is prefixed with `speaker_N:`. Output looks like:

      speaker_0: hello there how are you doing today
      speaker_1: I am fine thanks how about yourself
      speaker_0: pretty good thanks for asking

With -DGGML_METAL=ON, the same example runs both engines on the
GPU (use independent --asr-n-gpu-layers / --diar-n-gpu-layers
to mix CPU and GPU on a single-GPU machine):
./build-metal/live-mic-attributed \
--asr-model models/parakeet-tdt-0.6b-v3.q8_0.gguf --asr-n-gpu-layers 1 \
--diar-model models/sortformer-4spk-v1.f16.gguf --diar-n-gpu-layers 1 \
--accumulate

The same Phase 11.11.1 sliding-history caveat from "Streaming -- Sortformer" applies to the speaker IDs the attribution layer sees.
CTC parity (mel + encoder + greedy decode):
python scripts/dump-ctc-reference.py \
--wav test/samples/jfk.wav \
--out artifacts/ctc-ref
./build/test-mel models/parakeet-ctc-0.6b.gguf test/samples/jfk.wav artifacts/ctc-ref/mel.npy
./build/test-encoder models/parakeet-ctc-0.6b.gguf artifacts/ctc-ref
./build/test-ctc models/parakeet-ctc-0.6b.gguf artifacts/ctc-ref/logits.npy

TDT parity (encoder per-stage; the decoder is checked end-to-end via
the CLI transcript byte-equality check on jfk.wav):
python scripts/dump-tdt-reference.py \
--wav test/samples/jfk.wav \
--out artifacts/tdt-ref
./build/test-tdt-encoder-parity \
models/parakeet-tdt-0.6b-v3.q8_0.gguf test/samples/jfk.wav artifacts/tdt-ref

Sortformer parity (mel + encoder + speaker-prob head):
python scripts/dump-sortformer-reference.py \
--wav test/samples/two-speakers-16k.wav \
--out artifacts/sortformer-ref
./build/test-sortformer-parity \
models/sortformer-4spk-v1.f16.gguf test/samples/two-speakers-16k.wav artifacts/sortformer-ref
EOU parity (mel + encoder + offline + Mode 2 / Mode 3 streaming).
The reference dump produces both an offline-pass and a streaming-pass
reference (encoder_streaming_out.npy); test-eou-streaming checks
offline transcript byte-equality + <EOU> boundary firing on the
trailing chunk:
python scripts/dump-eou-reference.py \
--wav test/samples/jfk.wav \
--out artifacts/eou-ref
./build/test-eou-streaming \
--model models/parakeet-eou-120m-v1.q8_0.gguf --wav test/samples/jfk.wav
Streaming smoke tests (Mode 1/2/3 byte-equality + WER tolerance for
CTC/TDT; Mode 2 byte-equal + Mode 3 transcript-within-tolerance for
EOU; sliding-history push API + no-duplicate + single-is_final for
Sortformer):
./build/test-streaming \
--model models/parakeet-ctc-0.6b.q8_0.gguf --wav test/samples/jfk.wav
./build/test-eou-streaming \
--model models/parakeet-eou-120m-v1.q8_0.gguf --wav test/samples/jfk.wav
./build/test-sortformer-streaming \
--model models/sortformer-4spk-v1.f16.gguf --wav test/samples/two-speakers-16k.wav
Expected per-stage rel error (NeMo PyTorch vs C++ at --quant f16):
Stage A   log_mel             ~ 1e-4 inner / ~ 2e-3 boundary (f32 FFT)
Stage B   subsampling_out     rel ~ 1e-3 (f16 quantization floor)
Stage C   block_0_out         rel ~ 1e-3
Stage D   block_last_out      rel ~ 2e-3
Stage E   ctc_logits          rel ~ 1e-3 (CTC head only)
Stage F   decoded transcript  edit distance = 0 on clean speech
Stage S   speaker_probs       rel ~ 2e-4 (Sortformer head)
Stage E2  eou_encoder_out     rel ~ 8e-3, cosine 0.999997 (EOU 17L chunked-limited)
At --quant q8_0 through q4_0 the per-stage rel inflates by ~3x
to ~25x, but the transcript stays bit-equal on clean speech for CTC,
TDT, and EOU alike. See PROGRESS.md §5.12 for the CTC quant sweep,
§10.x for TDT, §11.x for Sortformer, and §12.x for EOU.
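The figures above are relative errors plus one cosine similarity. One common way to compute such numbers is sketched below. The relative-error definition used here (L2 norm of the difference divided by the L2 norm of the reference) is an assumption: this README does not define "rel", and the parity harnesses may use a different metric. The function names are hypothetical.

```python
import math
from typing import Sequence


def rel_l2_error(test: Sequence[float], ref: Sequence[float]) -> float:
    """Relative L2 error ||test - ref|| / ||ref|| (assumed definition of 'rel')."""
    if len(test) != len(ref):
        raise ValueError("tensors must have the same number of elements")
    diff = math.sqrt(sum((t - r) ** 2 for t, r in zip(test, ref)))
    norm = math.sqrt(sum(r * r for r in ref))
    if norm == 0.0:
        # An all-zero reference: exact match is 0, anything else is unbounded.
        return 0.0 if diff == 0.0 else math.inf
    return diff / norm


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened tensors."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

In practice the inputs would be the flattened contents of a reference .npy tensor and the matching C++ output. Note that cosine similarity ignores overall scale, so a cosine close to 1 can coexist with a visibly larger relative error.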
Phases 0 through 12 have shipped (see PROGRESS.md for the full
journal). Phase 12 (EOU FastConformer-RNN-T 120M with native
<EOU> end-of-utterance token) is feature-complete on the
offline + Mode 2 + Mode 3 streaming axes: bit-equal transcripts to NeMo on
jfk.wav and the 20-second Alice-in-Wonderland clip at both f16 and
q8_0 quant tiers, encoder cosine 0.999997 vs NeMo PyTorch reference,
is_eou_boundary flag firing on the chunk that contains the trailing
<EOU> token in Mode 2.
A cache-aware streaming inference path (NeMo's
cache_aware_stream_step) was prototyped during a Phase 12.x
exploration and rejected on quality grounds: it produces NeMo's
streaming transcript, which is structurally distinct from and
meaningfully worse than the offline transcript on this model family
(~2× early-utterance WER, no <EOU> token emitted). The same
quality cliff was documented two years earlier in Phase 8.0 on the
predecessor streaming_multi checkpoint family. The exploration
branch was reverted before landing; PROGRESS.md §8.5 captures the
detailed rationale so future contributors don't re-run the same
loop.
Phase 11.12 (quantised Sortformer GGUFs) and Phase 13 (cross-engine
StreamEvent API for OnVadState / OnEndOfTurn) shipped on top
of Phase 12; both Sortformer checkpoints now have q8_0 / q4_0
tiers with quant-aware parity gates, and all four engines (CTC,
TDT, EOU, Sortformer) can opt into per-event callbacks alongside
the existing per-segment callbacks. The remaining active workstream
is Phase 11.11.2 (NeMo-style spkcache + encoder graph split for
fully stable Sortformer streaming speaker IDs).
Headline highlights (per phase, one bullet each; PROGRESS.md §N.x
has the full round-by-round journal):
- Parity (Phase 4): qvac-parakeet --model ... --wav ... produces the expected transcript end-to-end, matching NeMo PyTorch bit-equivalently on jfk.wav and sample-16k.wav at every quant tier (f16 through Q4_0) on both CPU and Metal backends. Per-stage numerical parity at the f16 quantization floor (~1-2e-3 rel vs NeMo PyTorch) on every intermediate encoder tensor.
- CPU optimisation (Phase 5): encoder runs 22x real-time on an M4 Air CPU at Q8_0, 12 % faster than ONNX f16 and 22 % slower than ONNX int8.
- Metal (Phase 6): encoder runs 73x real-time on the M4 Air GPU at Q8_0, 2.5x faster than onnxruntime int8 with 21x tighter variance (0.83 ms stdev).
- Mode 2 streaming (Phase 7): Engine::transcribe_stream() runs the offline encoder once, walks the encoder frames in chunk_ms windows, emits per-segment callbacks. Byte-equal to non-streaming on CTC; WER-bounded on TDT.
- Mode 3 live duplex (Phase 8): Engine::stream_start() -> StreamSession push API. Cache-aware inference on the existing offline GGUF (CTC or TDT); no new checkpoint needed. ~4 % WER on a 5.5 min sci-fi clip with ~4 s first-segment latency at defaults.
- Multi-model loader (Phase 9): same converter + same Engine handle the 1.1B variants alongside the 0.6B baselines.
- TDT decoder (Phase 10): nvidia/parakeet-tdt-0.6b-v3 and parakeet-tdt-1.1b ported -- multilingual transcription with punctuation + capitalisation. 2-layer LSTM prediction + joint MLP + transducer greedy decode (CPU, f32 after dequant). Mode 1, 2 and 3 all support TDT GGUFs.
- Sortformer diarization (Phase 11): diar_sortformer_4spk-v1 and diar_streaming_sortformer_4spk-v2 ported -- 4-speaker diarization with rel 2.0e-4 vs NeMo on speaker probabilities. Engine::diarize() API + CLI auto-routing. §11.10 ships transcribe_with_speakers for combined ASR + speaker attribution. §11.11.1 ships Engine::diarize_start() -> SortformerStreamSession for live diarization (sliding-history v1; Phase 11.11.2 NeMo-style spkcache streaming pending). Sortformer also ships at q8_0 (~140 MiB, 1.9x smaller than f16) and q4_0 (~75 MiB, 3.5x smaller); user-facing diarization output is identical across all three quant tiers on jfk.wav.
- EOU end-of-utterance ASR (Phase 12): nvidia/parakeet_realtime_eou_120m-v1 ported -- a streaming-trained 120M FastConformer-RNN-T English ASR with a native <EOU> end-of-utterance token that fires at natural turn boundaries (NeMo voice-agent target). The same parakeet_ctc.cpp encoder graph is reused with three structural switches gated on GGUF metadata: LayerNorm in the conv module, asymmetric (L=k-1, R=s-1) causal padding in the dw_striding subsampler, and a chunked-limited attention mask via ggml_soft_max_ext. New parakeet_eou.{h,cpp} ports the 1-layer LSTM + joint MLP RNN-T decoder with <EOU> reset semantics. StreamingSegment gains an is_eou_boundary flag + eot_confidence slot reserved for Phase 13's cross-engine OnEndOfTurn event. Encoder cosine 0.999997 vs NeMo offline at f16 quant floor; transcripts bit-equal to NeMo on jfk.wav and sample-16k.wav at both f16 and q8_0 tiers. Driving these streaming-trained weights through NeMo's chunked-limited cache_aware_stream_step was prototyped + rejected on quality grounds (PROGRESS.md §8.5 case (A)).
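The Mode 2 bullet above describes walking already-computed encoder frames in chunk_ms windows. A minimal sketch of that windowing arithmetic follows. It is not the Engine::transcribe_stream() implementation: the `chunk_windows` name is hypothetical, and the 80 ms encoder frame duration is an assumption (a 10 ms mel hop with 8x subsampling, which is typical for FastConformer but is not stated in this README).

```python
from typing import Iterator, Tuple


def chunk_windows(n_frames: int, chunk_ms: int,
                  frame_ms: int = 80) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) encoder-frame ranges covering n_frames.

    frame_ms is the assumed duration of one encoder output frame.
    """
    if chunk_ms < frame_ms:
        raise ValueError("chunk_ms must cover at least one encoder frame")
    frames_per_chunk = chunk_ms // frame_ms
    for start in range(0, n_frames, frames_per_chunk):
        # The last window is clipped so it never runs past the final frame.
        yield start, min(start + frames_per_chunk, n_frames)
```

With 10 encoder frames and chunk_ms=320 this yields (0, 4), (4, 8), (8, 10); a per-segment callback would then be invoked once per window with the text decoded from those frames.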
Next: vcpkg port for qvac-parakeet.cpp + the
qvac-lib-infer-parakeet binding swap to consume this library
instead of onnxruntime; Accelerate BLAS for the TDT/EOU decoder's
LSTM + joint gemvs and Sortformer's transformer attention;
CONV_2D_DW on Metal (upstream ggml contribution); Metal
flash-attn; Phase 11.11.2 Sortformer streaming (NeMo-style spkcache).
qvac-parakeet.cpp/
  ggml/                           pristine ggml clone (not tracked; populated
                                  by scripts/setup-ggml.sh, or skipped entirely
                                  when building with -DQVAC_PARAKEET_USE_SYSTEM_GGML=ON)
  src/
    main.cpp                      CLI (wav / raw PCM -> text or speaker segments,
                                  + Mode 2/3 transcription streaming, sliding-history
                                  diarization streaming, attribution) + qvac_parakeet_cli_main
                                  + transcribe_wav (CTC-only one-shot helper)
    cli_main.cpp                  thin main() -> qvac_parakeet_cli_main shim
    parakeet_ctc.{h,cpp}          GGUF loader + FastConformer encoder ggml graph
                                  + CTC head + greedy decode (shared by all engines;
                                  model_type field selects the decoder; LN-in-conv
                                  + causal subsampler + chunked-limited attention
                                  mask gated on EOU GGUFs)
    parakeet_tdt.{h,cpp}          TDT decoder: 2-layer LSTM prediction + joint MLP
                                  + transducer greedy decode (CPU)
    parakeet_eou.{h,cpp}          EOU decoder: 1-layer LSTM prediction + joint MLP
                                  + transducer greedy decode with `<EOU>` token
                                  reset semantics (segment flush + h/c zeroing).
                                  CPU only today.
    parakeet_sortformer.{h,cpp}   Sortformer diarization: encoder_proj + 18-layer
                                  Transformer encoder + ReLU MLP + sigmoid head + segmenter
    parakeet_engine.cpp           Engine + StreamSession + SortformerStreamSession
                                  (transcribe, transcribe_stream, stream_start, diarize,
                                  diarize_start, transcribe_with_speakers)
    mel_preprocess.{h,cpp}        wav I/O + STFT + mel + optional per-feature CMVN
                                  (skipped on EOU GGUFs that set normalize=NA)
    sentencepiece_bpe.{h,cpp}     SentencePiece BPE detokenizer (CTC + TDT + EOU)
    dr_wav.h                      vendored single-header WAV reader
    npy.h                         minimal .npy load / save + compare
    test_*.cpp                    per-stage numerical-parity harnesses (mel, encoder,
                                  ctc, tdt-encoder, sortformer) + streaming
                                  validation (test-streaming, test-eou-streaming,
                                  test-sortformer-streaming)
  include/qvac-parakeet/
    qvac-parakeet.h               CLI entry (qvac_parakeet_cli_main) + library overview
    ctc/engine.h                  persistent multi-engine Engine umbrella + StreamSession +
                                  SortformerStreamSession + transcribe_with_speakers.
                                  The header path "ctc/" is historical -- the API now
                                  covers CTC, TDT, EOU, and Sortformer GGUFs.
                                  StreamingSegment carries is_eou_boundary +
                                  eot_confidence (EOU-only fields; reserved for
                                  Phase 13 cross-engine OnEndOfTurn event).
    ctc/pipeline.h                one-shot wav -> text API (CTC GGUFs only;
                                  hard-errors on TDT/EOU/Sortformer)
  examples/
    live-mic.cpp                  live microphone -> transcription (CTC/TDT/EOU) or live
                                  diarization (Sortformer); auto-detects the GGUF.
    live-mic-attributed.cpp       live microphone -> dual-engine ASR + Sortformer
                                  with per-segment speaker attribution.
    miniaudio.h                   vendored single-header audio capture (MIT).
  scripts/
    setup-ggml.sh                 pin + clone ggml
    convert-nemo-to-gguf.py       .nemo -> GGUF (auto-detects CTC / TDT / EOU / Sortformer)
    dump-ctc-reference.py         NeMo PyTorch -> .npy reference tensors (CTC stages)
    dump-tdt-reference.py         NeMo PyTorch -> .npy reference tensors (TDT stages)
    dump-eou-reference.py         NeMo PyTorch -> .npy reference tensors (EOU stages,
                                  offline + cache-aware streaming pass)
    dump-sortformer-reference.py  NeMo PyTorch -> .npy reference tensors (Sortformer stages)
    dump-block0-substages.py      per-sub-stage timing inputs for --profile
    ref-encoder-from-gguf.py      run the GGUF encoder in PyTorch as a parity oracle
    streaming-reference.py        reference per-chunk outputs for streaming validation
    verify-gguf-roundtrip.py      load a GGUF and assert all expected tensors are present
    quantize-ctc-onnx-int8.py     int8-quantize an ONNX CTC export (for the Node binding)
    download-all-models.sh        pre-fetch every supported .nemo (and ONNX bundle)
    transcribe.sh                 wav -> text wrapper
  cmake/                          CMake package config (for vcpkg follow-up)
  test/samples/                   fixture wavs (jfk.wav, sample-16k.wav)
  artifacts/                      dumped reference tensors (.npy) per engine; not tracked
  models/                         downloaded .nemo + converted .gguf checkpoints; not tracked
  PROGRESS.md                     chronological development journal
  README.md                       this file
Released under the Apache License 2.0.
Model licenses: every NVIDIA Parakeet (CTC, TDT) and Sortformer
checkpoint listed in the model table at the top of this README ships
under CC-BY-4.0 on Hugging Face. The EOU checkpoint
(parakeet_realtime_eou_120m-v1) is distributed under the NVIDIA Open
Model License -- check each model card for the canonical attribution. This
repository only ships the inference code; model weights are
downloaded on demand by the converter / download-all-models.sh.
The bundled ggml/ is MIT-licensed (see ggml/LICENSE).