Durable reference for humans and agents maintaining parakeet.cpp.
This project follows the Linux kernel project's guidelines for AI coding assistants (the same policy LocalAI uses). Key rules for commits:
- No `Signed-off-by` from AI. Only a human submitter may sign off on the Developer Certificate of Origin.
- No `Co-Authored-By: <AI>` trailers. The human contributor owns the change.
- Use an `Assisted-by:` trailer to attribute AI involvement. Format: `Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]` (e.g. `Assisted-by: Claude:claude-opus-4-8 [Claude Code]`).
- The human submitter is responsible for reviewing, testing, and understanding every line of generated code.
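The trailer format above lends itself to a mechanical check, for example in a commit hook. The following is a minimal sketch only: `parse_assisted_by` is a hypothetical helper that does not exist in this repository, and the exact character classes in the pattern are an assumption drawn from the single example above.

```python
import re

# Format from the rules above:
#   Assisted-by: AGENT_NAME:MODEL_VERSION [TOOL1] [TOOL2]
# The character classes are an assumption made for this sketch.
ASSISTED_BY = re.compile(
    r"^Assisted-by: (?P<agent>[^\s:]+):(?P<model>\S+)"
    r"(?P<tools>(?: \[[^\[\]]+\])*)$"
)


def parse_assisted_by(line):
    """Return (agent, model, [tools]) for a well-formed trailer, else None."""
    m = ASSISTED_BY.match(line.strip())
    if m is None:
        return None
    # Pull each bracketed tool name out of the optional tool list.
    tools = re.findall(r"\[([^\[\]]+)\]", m.group("tools"))
    return m.group("agent"), m.group("model"), tools
```

A hook built on this would reject any `Assisted-by:` line for which the function returns `None`; it does not try to detect the forbidden `Signed-off-by` or `Co-Authored-By` cases, which need human judgment.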
parakeet.cpp is a C++17/ggml inference port of NVIDIA NeMo Parakeet ASR. It targets CPU (GPU backends are wired but not exercised in CI) and is designed for parity with the NeMo reference: a Python converter turns a NeMo checkpoint into a metadata-driven GGUF, and a C++ model loader + conformer inference engine run the same computation natively, with no Python dependency at inference time.
The public surface ships as a flat C-API (include/parakeet_capi.h +
libparakeet.so) suitable for dlopen/FFI/LocalAI integration.
Current status: Phase 5 complete. Supports all offline Parakeet families -
CTC, RNNT, TDT, and hybrid TDT-CTC (0.6B/1.1B/110M, EN + multilingual v3) -
validated at WER 0 vs NeMo on every published checkpoint. Quantization
(F16/Q8_0/K-quants) validated at WER 0. Cache-aware streaming + EOU decoding
(parakeet_realtime_eou_120m-v1) is implemented: pk::StreamingEncoder
(per-layer conv/attention caches) + pk::StreamingSession (carried RNN-T
state) + <EOU>/<EOB> timed events, exposed via parakeet_capi_stream_* and
parakeet-cli transcribe --stream. The streaming transcript matches NeMo's
cache-aware streaming byte-for-byte.
The design choices below are measured wins. An agent "simplifying" them has caused real regressions before, so do not change them without an A/B benchmark that proves parity.
- Keep the persistent `ggml_gallocr` in `src/backend.cpp`. Reusing one allocator across the many tiny per-utterance graphs (no per-call alloc/free) is the core throughput lever on CPU and GPU. Do NOT replace it with `ggml_backend_sched` on the fast path: sched re-plans the graph split on every call, and it regressed CUDA by 7-23% when it was used there. The scheduler is used ONLY as a per-graph fallback, when the active GPU backend lacks a kernel for some op (so the unsupported op can run on CPU); when every op is supported, the fast gallocr path runs. If you think gallocr can go, you are about to reintroduce that regression.
- Zero-copy weights. `clone_weight` returns loader tensors directly so the same device buffer is reused every utterance; do not copy weights per call.
include/ public C/C++ headers
parakeet.h , C++ API
parakeet_capi.h , flat C-API for FFI / dlopen
src/ libparakeet implementation
model.hpp/cpp , load-once pk::Model
parakeet.cpp , thin transcribe() wrapper
parakeet_capi.cpp , flat C-API implementation
common.hpp/cpp , logging helpers
audio_io.hpp/cpp , dr_wav load + linear resample to 16k
model_loader.hpp/cpp, GGUF -> ParakeetConfig + name->tensor
mel.cpp , log-mel frontend
encoder.cpp / conformer.cpp / relpos_attention.cpp
ctc_decoder.cpp , CTC head + greedy decode
prediction.cpp , stacked LSTM prediction net
joint.cpp , joint network
tdt.cpp / rnnt.cpp , TDT / RNNT greedy loops
streaming_encoder.hpp/cpp, cache-aware streaming FastConformer encoder
streaming.hpp/cpp , pk::StreamingSession (carried RNN-T + EOU events) + run_stream_over_pcm
examples/cli/ parakeet-cli binary
subcommands: info, transcribe (+ --stream), quantize
scripts/ Python tooling
convert_parakeet_to_gguf.py, .nemo/.hf -> GGUF (--dtype f32|f16|q8_0)
gen_nemo_baseline.py , NeMo intermediates -> baseline.gguf
gen_stream_baseline.py , NeMo cache-aware streaming encode+decode -> stream baseline.gguf
validate_vs_nemo.py , WER parity gate vs NeMo
publish_hf.py , convert+quantize -> HF upload (dry-run default)
requirements.txt , nemo_toolkit[asr] + gguf
tests/ ctest targets
test_smoke.cpp , version string (model-independent)
test_audio_io.cpp , wav load + resample (model-independent)
test_fft.cpp , FFT cross-check (model-independent)
test_model_loader.cpp , config + tensor map (model-dependent)
test_capi.cpp , C-API load -> transcribe -> free (model-dependent)
test_transcribe_speech.cpp, end-to-end CTC transcript (model-dependent)
test_transcribe_tdt.cpp , TDT transcript on speech fixture (model-dependent)
test_transcribe_0_6b.cpp, regression gate for 0.6B model (model-dependent)
test_transcribe_ctc.cpp , standalone CTC regression (model-dependent)
test_transcribe_rnnt.cpp, RNNT regression (model-dependent)
test_transcribe_eou.cpp , offline EOU model transcript + token ids (PARAKEET_TEST_GGUF_EOU)
test_streaming_encoder.cpp, cache-aware streaming encoder == offline + NeMo
test_streaming_decode.cpp , streaming RNN-T tokens == NeMo cache-aware streaming
test_capi_stream.cpp , streaming C-API transcript == NeMo streaming (PARAKEET_TEST_BASELINE_EOU_STREAM)
python/check_convert.py , converter round-trip (model-dependent)
python/check_baseline.py, baseline dumper (model-dependent)
fixtures/clip.wav , 2 s 16 kHz mono WAV for stage parity tests
fixtures/speech.wav , LibriSpeech 2086-149220-0033, ~7.4 s
third_party/ vendored deps
ggml/ , submodule pinned at v0.13.0
dr_wav.h , vendored single header
models/ output dir for converted GGUFs (gitignored;
MANIFEST.md tracks the expected published set)
docs/
conversion.md , GGUF schema reference
quantization.md , quantization allowlist, policy, measured size + WER per type
parity.md , full model coverage matrix + per-stage tensor parity
.github/workflows/
ci.yml , build job (per-push) + closed-loop job (pull_request + dispatch)
cmake -B build -DPARAKEET_BUILD_TESTS=ON -DGGML_NATIVE=ON && cmake --build build -j
| Option | Default | Purpose |
|---|---|---|
| `PARAKEET_BUILD_TESTS` | `OFF` | Compile and register ctest targets |
| `PARAKEET_BUILD_CLI` | `ON` | Build parakeet-cli |
| `PARAKEET_SHARED` | `OFF` | Build libparakeet as a shared library |
| `PARAKEET_GGML_CUDA` | `OFF` | Forward GGML_CUDA to the submodule |
| `PARAKEET_GGML_METAL` | `OFF` | Forward GGML_METAL to the submodule |
| `PARAKEET_GGML_VULKAN` | `OFF` | Forward GGML_VULKAN to the submodule |
| `PARAKEET_GGML_HIPBLAS` | `OFF` | Forward GGML_HIPBLAS to the submodule |
Use -DGGML_NATIVE=OFF when building for CI or portable binaries.
ctest --test-dir build --output-on-failure -LE model
Expected: test_smoke, test_audio_io, test_fft PASS.
export PARAKEET_TEST_GGUF=/tmp/pk110m.gguf
export PARAKEET_TEST_BASELINE=/tmp/baseline.gguf
export PARAKEET_TEST_BASELINE_SPEECH=/tmp/baseline_speech.gguf
ctest --test-dir build --output-on-failure
Tests return exit code 77 (ctest SKIP) when the venv or checkpoint is absent, so they never break a CI environment that lacks them.
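The skip convention above can be sketched for a Python check script as follows. This is an illustration under assumptions, not the repository's code: `resolve_fixture` and `main` are hypothetical names, and only the `PARAKEET_TEST_GGUF` variable and the exit code 77 come from the text. In CMake, mapping an exit code to SKIP is normally done with the `SKIP_RETURN_CODE` test property.

```python
import os

SKIP = 77  # exit code that ctest is configured to report as SKIP


def resolve_fixture(env_var, environ=os.environ):
    """Return the path named by env_var, or None if it is unset or missing on disk."""
    path = environ.get(env_var, "")
    return path if path and os.path.exists(path) else None


def main(environ=os.environ):
    """Return the process exit code: SKIP when the checkpoint is unavailable."""
    gguf = resolve_fixture("PARAKEET_TEST_GGUF", environ)
    if gguf is None:
        print("SKIP: PARAKEET_TEST_GGUF is not set or the file is missing")
        return SKIP
    # The real parity check against `gguf` would run here (omitted in this sketch).
    return 0

# A script using this pattern would end with: sys.exit(main())
```

Returning the code from `main` instead of calling `sys.exit` inline keeps the skip decision testable without a checkpoint on disk.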
| Label | Tests | Needs |
|---|---|---|
| (none) | test_smoke, test_audio_io, test_fft | nothing |
| `model` | test_model_loader, test_capi, test_transcribe_*, check_* | venv + checkpoint |
Set up the Python venv once:
python3 -m venv .venv
.venv/bin/pip install torch --index-url https://download.pytorch.org/whl/cpu
.venv/bin/pip install -r scripts/requirements.txt # nemo_toolkit[asr] + gguf
NeMo 2.7.3 is the validated version. The anchor checkpoint is
nvidia/parakeet-tdt_ctc-110m (~440 MB, auto-downloaded by NeMo on first use).
Convert (HuggingFace id or local .nemo):
.venv/bin/python scripts/convert_parakeet_to_gguf.py \
--model nvidia/parakeet-tdt_ctc-110m \
--dtype q8_0 \
--output models/parakeet-tdt_ctc-110m.gguf
Featurizer window and filterbank are lifted from the checkpoint at runtime; mel/fft parameters do not need to be specified manually.
See docs/quantization.md for the full policy. Summary:
Only linear ggml_mul_mat-consumed weights are quantized:
- Encoder per-layer FFN + attention projections (`feed_forward*.linear*.weight`, `self_attn.linear_{q,k,v,out,pos}.weight`)
- Subsampling output projection (`encoder.pre_encode.out.weight`)
- Joint enc/pred projections (`joint.enc.weight`, `joint.pred.weight`)
Everything else stays F32: conv kernels, LSTM weights/biases, mel featurizer,
batch_norm stats, LayerNorm gain/bias, all *.bias, pos_bias, embeddings, the
joint output projection (joint.joint_net.2.weight, hand-rolled loop), and the
CTC head (stored [1, V], block quantization impossible without transpose).
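The allowlist above amounts to a name-based predicate. Here is a minimal sketch, assuming glob-style matching on the tensor name; the real `should_quantize` heuristic lives in `scripts/convert_parakeet_to_gguf.py` and may differ in detail (for instance, it may also check shapes or block-size divisibility). The tensor names in the usage note are illustrative.

```python
import fnmatch

# Glob patterns transcribed from the allowlist above. Everything that does
# not match stays F32. This list is an illustration, not the converter's code.
QUANTIZE_PATTERNS = [
    "*feed_forward*.linear*.weight",
    "*self_attn.linear_q.weight",
    "*self_attn.linear_k.weight",
    "*self_attn.linear_v.weight",
    "*self_attn.linear_out.weight",
    "*self_attn.linear_pos.weight",
    "encoder.pre_encode.out.weight",
    "joint.enc.weight",
    "joint.pred.weight",
]


def should_quantize(name):
    """True when the tensor name is on the quantization allowlist."""
    return any(fnmatch.fnmatchcase(name, pat) for pat in QUANTIZE_PATTERNS)
```

With these patterns, a name such as `encoder.layers.0.self_attn.linear_pos.weight` is selected, while `joint.joint_net.2.weight` and every `*.bias` tensor fall through to F32.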
Supported --dtype values for the converter: f32 (default), f16, q8_0.
For K-quants (q4_k, q5_k, q6_k), re-quantize an F32 GGUF with the CLI:
parakeet-cli quantize <in.gguf> <out.gguf> <type>
All variants of the 110m anchor hold WER 0.0 vs NeMo at F16, Q8_0, Q6_K, and
Q4_K. See docs/quantization.md for size figures.
The binary is at build/examples/cli/parakeet-cli.
parakeet-cli info <model.gguf>
parakeet-cli transcribe --model <model.gguf> --input <audio.wav> [--decoder ctc|tdt] [--stream] [--timestamps] [--json]
parakeet-cli quantize <in.gguf> <out.gguf> <type>
--timestamps prints one <start>-<end> <word> (<conf>) line per word (also
works with --stream, where words print as they finalize); --json prints the
parakeet_capi_transcribe_path_json document (text + per-word/per-token
timestamps + confidence).
include/parakeet_capi.h defines the flat C-API. Build libparakeet.so with
-DPARAKEET_SHARED=ON. Verify exports with nm -D build-shared/libparakeet.so | grep parakeet_capi.
The LocalAI backend lives in the LocalAI repo and dlopens libparakeet.so.
The LocalAI side depends on the symbols below; do not remove any of them or
change a signature without a coordinated bump on the LocalAI side:
parakeet_capi_abi_version
parakeet_capi_load
parakeet_capi_free
parakeet_capi_transcribe_path
parakeet_capi_transcribe_pcm
parakeet_capi_transcribe_path_json # text + per-word/per-token timestamps + confidence as JSON
parakeet_capi_free_string
parakeet_capi_last_error
# streaming (cache-aware EOU model parakeet_realtime_eou_120m-v1):
parakeet_capi_stream_begin
parakeet_capi_stream_feed # 16k mono f32 PCM -> newly-finalized text; *eou_out=1 on <EOU>/<EOB>
parakeet_capi_stream_finalize # flush the end-of-stream tail
parakeet_capi_stream_free
parakeet_capi_transcribe_path_json(ctx, wav, decoder) returns malloc'd UTF-8
JSON {"text":..,"words":[{"w","start","end","conf"}],"tokens":[{"id","t","conf"}]}
(times in seconds, conf in (0,1]), built from
pk::Model::transcribe_path_with_timestamps. Confidence is NeMo's max_prob
method, the rescaled softmax probability of the emitted (argmax) token over the
same logit slice NeMo log-softmaxes (conf = (N·p_max − 1)/(N − 1), N = classes);
per-word conf is the min aggregate over the word's tokens. Word offsets +
confidence match NeMo transcribe(timestamps=True) exactly (see docs/parity.md).
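The confidence formula above is small enough to check by hand. The sketch below restates it in Python under the definitions given in the text; the function names are hypothetical and the code is not taken from the C++ implementation.

```python
import math


def max_prob_confidence(logits):
    """Rescaled softmax probability of the argmax class: (N*p_max - 1)/(N - 1)."""
    n = len(logits)
    if n < 2:
        raise ValueError("need at least two classes")
    peak = max(logits)
    # Numerically stable softmax: p_max = exp(0) / sum(exp(l - peak)).
    denom = sum(math.exp(l - peak) for l in logits)
    p_max = 1.0 / denom
    return (n * p_max - 1.0) / (n - 1.0)


def word_confidence(token_confs):
    """Per-word confidence is the minimum over the word's token confidences."""
    return min(token_confs)
```

The rescaling maps a uniform distribution (p_max = 1/N) to 0 and a one-hot distribution to 1, so values are comparable across heads with different class counts.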
parakeet_capi_abi_version returns an integer that LocalAI can check for
compatibility; bump it on any breaking change to the above signatures or
semantics. Additive changes (new functions) are fine without bumping.
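A consumer-side compatibility check might look like the sketch below, shown with Python's `ctypes` for brevity. Two assumptions are baked in and labeled in the code: that `parakeet_capi_abi_version` takes no arguments and returns a C `int` (the text only says it returns an integer), and that the caller carries a hypothetical `EXPECTED_ABI` constant.

```python
import ctypes


def abi_version(lib):
    """Call parakeet_capi_abi_version on an already-loaded library handle."""
    fn = lib.parakeet_capi_abi_version
    fn.restype = ctypes.c_int  # assumed: returns a plain C int
    fn.argtypes = []           # assumed: takes no arguments
    return int(fn())


def is_compatible(lib, expected):
    """True when the library reports exactly the ABI version the caller expects."""
    return abi_version(lib) == expected

# Usage (EXPECTED_ABI is a hypothetical constant owned by the caller):
#   lib = ctypes.CDLL("build-shared/libparakeet.so")
#   if not is_compatible(lib, EXPECTED_ABI):
#       raise RuntimeError("libparakeet ABI mismatch")
```

Checking once at load time, before resolving any other symbol, keeps a mismatched library from being called with stale signatures.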
Streaming semantics: parakeet_capi_stream_feed buffers PCM, decodes encoder
chunks as audio arrives (carried encoder/decoder caches), and returns the
newly-finalized text (<EOU>/<EOB> STRIPPED, surfaced via *eou_out).
parakeet_capi_stream_finalize flushes the streaming tail and does NOT
fabricate an <EOU> NeMo's cache-aware streaming would not emit (for a final
chunk whose right context is incomplete, the trailing <EOU> is dropped exactly
as NeMo does). Internally these wrap pk::StreamingSession (src/streaming.*):
feed_mel_chunk (token ids, used by test_streaming_decode), take_new_text,
drain_events (pk::EouEvent{token,is_eob,encoder_frame,time_sec}),
drain_words (pk::Word{text,start,end,conf} for words finalized since the last
drain; a word finalizes when the next ▁-token arrives, and the last word
finalizes on finalize(); reuses the offline pk::group_words grouping),
last_chunk_had_eou, finalize. The CLI --stream path uses
pk::run_stream_over_pcm (full-clip mel + the model's chunk schedule); its
on_chunk callback now also receives the per-chunk finalized pk::Words, which
--stream --timestamps prints.
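The word-finalization rule described above (a word is complete once the next ▁-token arrives, and the last word only on finalize) can be illustrated with a simplified, text-only sketch. `WordAssembler` is a hypothetical class for illustration; it ignores timestamps and confidence and is not the `pk::StreamingSession` implementation.

```python
WORD_START = "\u2581"  # SentencePiece word-boundary marker


class WordAssembler:
    """Groups streamed token pieces into words, finalizing lazily."""

    def __init__(self):
        self._pieces = []  # pieces of the word still being built
        self._ready = []   # finalized words not yet drained

    def feed(self, piece):
        # A new word-start piece finalizes the word that was in progress.
        if piece.startswith(WORD_START) and self._pieces:
            self._flush()
        self._pieces.append(piece)

    def finalize(self):
        # End of stream: the trailing word has no successor, so flush it here.
        if self._pieces:
            self._flush()

    def drain_words(self):
        """Return words finalized since the last drain and clear the queue."""
        words, self._ready = self._ready, []
        return words

    def _flush(self):
        self._ready.append("".join(self._pieces).replace(WORD_START, ""))
        self._pieces = []
```

Feeding `▁hel`, `lo`, `▁wor` makes only "hello" available to drain; "world" stays pending until more pieces and then `finalize()` arrive, which is why streamed words trail the audio by one word boundary.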
The baselines generated by the commands below are used by the Phase 1 parity tests. Generating them requires the venv and a 16 kHz mono WAV.
.venv/bin/python scripts/gen_nemo_baseline.py \
--model nvidia/parakeet-tdt_ctc-110m \
--audio tests/fixtures/clip.wav \
--output /tmp/baseline.gguf
.venv/bin/python scripts/gen_nemo_baseline.py \
--model nvidia/parakeet-tdt_ctc-110m \
--audio tests/fixtures/speech.wav \
--output /tmp/baseline_speech.gguf
scripts/publish_hf.py converts the anchor to the full variant set (F16,
Q8_0, Q4_K) and uploads each to a HF repo. It is dry-run by default; add
--upload to actually push. Requires an HF token at ~/.cache/huggingface/token
(huggingface-cli login).
.venv/bin/python scripts/publish_hf.py \
--model nvidia/parakeet-tdt_ctc-110m \
--repo mudler/parakeet.cpp-110m
# add --upload to actually push
See models/MANIFEST.md for the expected set of published GGUFs per checkpoint.
.github/workflows/ci.yml has two jobs:
- build (every push + pull_request): cmake build + `ctest -LE model`. Fast.
- closed-loop (pull_request + workflow_dispatch): converts the 110m checkpoint and asserts `parakeet-cli transcribe --decoder tdt` matches the reference transcript below. Heavy (NeMo download, ~60 min); not on every push.
tests/fixtures/speech.wav on the 110m TDT head decodes (WER 0.0 vs NeMo) to
exactly the following. This is the closed-loop assertion and the quickest smoke
test that a build is correct on any backend (CPU, Metal, CUDA):
Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait.
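For readers unfamiliar with the metric, "WER 0.0" means the word-level edit distance between the reference and the hypothesis is zero. A minimal sketch of the standard definition follows; it splits on whitespace only, whereas `scripts/validate_vs_nemo.py` may apply its own text normalization before scoring.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Rolling-row dynamic programme over (reference prefix, hypothesis prefix).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

The closed-loop job's exact string match is stricter than WER 0 under normalization, since it also pins casing and punctuation.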
See docs/conversion.md for the authoritative schema. Quick summary:
- `general.architecture = "parakeet"`
- All metadata keys use the `parakeet.*` prefix.
- Tensor names are verbatim NeMo `state_dict` keys: no remapping, no prefix stripping. This convention is load-bearing: the C++ model loader maps `name -> ggml_tensor*` by exact string. Never remap tensor names at conversion time.
Pinned at v0.13.0 in third_party/ggml. No local patches. To bump:
- Update the submodule SHA.
- Run `ctest --test-dir build --output-on-failure`.
- Fix any API breakage in `src/model_loader.cpp`.
To add a new checkpoint:
- Convert + run `parakeet-cli info` to inspect the GGUF metadata.
- Run `scripts/validate_vs_nemo.py` to get a WER figure.
- If it passes (WER 0), add a row to `docs/parity.md` and `models/MANIFEST.md`.
The C++ loader is metadata-driven (arch, d_model, layers, mel params, vocab, pred LSTM layers, xscaling, optional biases all read from GGUF KV); no source changes are typically needed.
To move to a newer NeMo version:
- Bump the venv and re-run the converter on the anchor checkpoint.
- Regenerate baselines via `scripts/gen_nemo_baseline.py`.
- Run the full test suite. Any parity drift will surface in the `test_*` targets.
To bump the ggml submodule:
- Update the submodule SHA.
- Run `ctest --output-on-failure`.
- Fix any API breakage in `src/model_loader.cpp` (gguf/ggml C API).
To add a new quantization type:
- Extend `examples/cli/main.cpp` `cmd_quantize` with the new type mapping.
- Update the `should_quantize` heuristic in `scripts/convert_parakeet_to_gguf.py` if the new type has a different block size requirement.
- Run `scripts/validate_vs_nemo.py` on the quantized GGUF and record WER + size in `docs/quantization.md`.