gpu-cli/parakeet-rs

parakeet-rs

Pure Rust implementation of NVIDIA's Parakeet-TDT-0.6B-v3 speech-to-text model, built on Candle with Metal GPU acceleration for Apple Silicon.

Highlights

  • 3.83% WER on a 500-sample subset of LibriSpeech test-clean (NVIDIA's published figure: 1.93%)
  • 7.6x real-time on Apple Silicon with Metal GPU
  • 25 languages: English, French, German, Spanish, and 21 more European languages
  • No Python runtime — single static binary, no ONNX/CoreML/MLX dependency
  • Automatic model download from HuggingFace Hub
  • Output formats: plain text, JSON, SRT, VTT subtitles

Architecture

Full implementation of the FastConformer-TDT pipeline:

Component     Description
Preprocessor  128-bin log-mel spectrogram (rustfft + GPU filterbank)
Subsampling   8x depthwise-separable conv stack (3 stride-2 layers)
Encoder       24-layer FastConformer (relative positional attention, Macaron FFN, GLU conv)
Decoder       Token-and-Duration Transducer (2-layer LSTM + joint network)
Tokenizer     8192 BPE vocabulary via SentencePiece (multilingual)
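
To illustrate how three stride-2 layers give the 8x reduction listed for the subsampling stage, here is a minimal sketch of the frame-count arithmetic. The kernel size (3), padding (1), and 10 ms mel hop are assumptions for illustration, not values taken from this repository, and the function names are hypothetical.

```rust
/// Output length of a 1-D convolution along the time axis.
/// Returns 0 when the padded input is shorter than the kernel.
pub fn conv_out_len(len: usize, kernel: usize, stride: usize, padding: usize) -> usize {
    let padded = len + 2 * padding;
    if padded < kernel {
        return 0;
    }
    (padded - kernel) / stride + 1
}

/// Number of encoder frames after three stride-2 layers.
/// Kernel 3 and padding 1 are assumed values, not read from the model config.
pub fn subsampled_frames(mel_frames: usize) -> usize {
    (0..3).fold(mel_frames, |n, _| conv_out_len(n, 3, 2, 1))
}

/// Approximate start time of an encoder frame in seconds,
/// assuming a 10 ms mel hop (so 8 x 10 ms = 80 ms per encoder frame).
pub fn frame_to_seconds(frame: usize) -> f64 {
    frame as f64 * 0.08
}
```

Under these assumptions, 1000 mel frames (about 10 seconds of audio) reduce to 125 encoder frames.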

Quick Start

# Build (requires Rust 1.85+)
cargo build --release --features metal -p parakeet

# Transcribe audio (downloads model on first run, ~2.4GB)
./target/release/parakeet --input recording.wav

# With timestamps
./target/release/parakeet --input recording.wav --timestamps

# SRT subtitles
./target/release/parakeet --input recording.wav --format srt --output recording.srt

# JSON output with word-level timestamps
./target/release/parakeet --input recording.wav --format json
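
For readers unfamiliar with the SRT format produced by --format srt, the following sketch shows how a cue is laid out: an index line, a "start --> end" line using HH:MM:SS,mmm timestamps, and the text. The function names are hypothetical and this is not the CLI's actual formatter.

```rust
/// Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm).
pub fn srt_timestamp(seconds: f64) -> String {
    // Clamp negatives to zero and round to the nearest millisecond.
    let total_ms = (seconds.max(0.0) * 1000.0).round() as u64;
    let (h, rem) = (total_ms / 3_600_000, total_ms % 3_600_000);
    let (m, rem) = (rem / 60_000, rem % 60_000);
    let (s, ms) = (rem / 1000, rem % 1000);
    format!("{h:02}:{m:02}:{s:02},{ms:03}")
}

/// Build one SRT cue; cues are separated by a blank line.
pub fn srt_cue(index: usize, start: f64, end: f64, text: &str) -> String {
    format!(
        "{index}\n{} --> {}\n{text}\n\n",
        srt_timestamp(start),
        srt_timestamp(end)
    )
}
```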

Using a local model

# Convert NeMo checkpoint to SafeTensors (one-time)
python3 scripts/convert_nemo.py

# Use local model directory
./target/release/parakeet --model-dir converted_model --input recording.wav

Supported Languages

Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Ukrainian

Requirements

  • Rust 1.85+ (edition 2024)
  • macOS with Apple Silicon (M1/M2/M3/M4) for Metal acceleration
  • 16-bit mono WAV input at any sample rate (resampled internally to 16 kHz)
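
As a rough picture of what the input requirement implies, the sketch below converts 16-bit PCM samples to floats and resamples them to a target rate. It uses plain linear interpolation as a stand-in; the resampler this project actually uses is not described here, and the function names are hypothetical.

```rust
/// Convert signed 16-bit PCM samples to floats in [-1.0, 1.0).
pub fn pcm16_to_f32(samples: &[i16]) -> Vec<f32> {
    samples.iter().map(|&s| s as f32 / 32768.0).collect()
}

/// Resample by linear interpolation between neighbouring samples.
/// Illustration only: there is no anti-aliasing filter, so a real
/// downsampler would low-pass the signal first.
pub fn resample_linear(input: &[f32], src_rate: u32, dst_rate: u32) -> Vec<f32> {
    if input.is_empty() || src_rate == dst_rate {
        return input.to_vec();
    }
    let out_len = (input.len() as u64 * dst_rate as u64 / src_rate as u64) as usize;
    let step = src_rate as f64 / dst_rate as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            // Fractional position of output sample i in the input signal.
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)]; // clamp at the final sample
            a + (b - a) * frac
        })
        .collect()
}
```

For example, `resample_linear(&pcm16_to_f32(&samples), 44_100, 16_000)` would bring 44.1 kHz audio to the 16 kHz rate named above.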

Benchmarks

Measured on Apple Silicon with --features metal:

Metric                                      Value
WER (LibriSpeech test-clean, 500 samples)   3.83%
Real-time factor                            0.131
Speed                                       7.6x real-time
Model size                                  0.6B parameters (~2.4GB)
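
The real-time factor and speed rows are two views of the same measurement: RTF is processing time divided by audio duration, and speed is its reciprocal. A small sketch with hypothetical function names:

```rust
/// Real-time factor: seconds of compute per second of audio (lower is faster).
pub fn real_time_factor(processing_secs: f64, audio_secs: f64) -> f64 {
    processing_secs / audio_secs
}

/// Speed relative to real time: the reciprocal of the real-time factor.
pub fn speedup(rtf: f64) -> f64 {
    1.0 / rtf
}
```

With the RTF of 0.131 from the table, `speedup(0.131)` is about 7.63, which matches the reported 7.6x.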

The v2 model (English-only, 2.66% WER) is also supported; the model version is auto-detected from the model directory.
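
WER figures like the ones above are conventionally computed as the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The following is a minimal sketch of that definition; it is not the code in wer.rs, it skips text normalization such as lowercasing and punctuation removal, and the function name is hypothetical.

```rust
/// Word error rate of `hypothesis` against `reference`.
/// Returns None when the reference has no words (WER is undefined).
pub fn word_error_rate(reference: &str, hypothesis: &str) -> Option<f64> {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    if r.is_empty() {
        return None;
    }
    // prev[j] = edit distance between the reference words seen so far
    // and the first j hypothesis words.
    let mut prev: Vec<usize> = (0..=h.len()).collect();
    for (i, rw) in r.iter().enumerate() {
        let mut cur = vec![i + 1; h.len() + 1];
        for (j, hw) in h.iter().enumerate() {
            let substitute = prev[j] + usize::from(rw != hw);
            let delete = prev[j + 1] + 1; // reference word missing from hypothesis
            let insert = cur[j] + 1; // extra word in hypothesis
            cur[j + 1] = substitute.min(delete).min(insert);
        }
        prev = cur;
    }
    Some(prev[h.len()] as f64 / r.len() as f64)
}
```

A corpus-level score is usually the total number of errors divided by the total number of reference words, rather than the mean of per-utterance rates.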

Project Structure

parakeet/
  src/
    audio.rs         # Log-mel spectrogram (rustfft + GPU matmul)
    subsampling.rs   # 8x depthwise-separable conv subsampling
    encoder.rs       # 24-layer FastConformer encoder
    decoder.rs       # TDT decoder (LSTM + joint network)
    decoding.rs      # Greedy TDT decoding with duration skipping
    tokenizer.rs     # BPE tokenizer
    pipeline.rs      # End-to-end transcription pipeline
    config.rs        # Model configuration (v2 + v3)
    weights.rs       # HF Hub download + SafeTensors loading
    wer.rs           # Word error rate computation
    bin/parakeet.rs  # CLI entry point
scripts/
  convert_nemo.py    # NeMo .nemo → SafeTensors converter
  benchmark_wer.py   # LibriSpeech WER benchmark
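
To give a sense of what "greedy TDT decoding with duration skipping" means, here is a simplified sketch of the control flow. It is not the code in decoding.rs: the `joint` closure is a hypothetical stand-in for the encoder, LSTM prediction network, and joint network, and it is assumed to return the already-argmaxed token and duration. The blank id and all names are illustrative.

```rust
/// Hypothetical blank id; the real id depends on the vocabulary.
pub const BLANK: u32 = 0;

/// One emitted token and the encoder frame at which it was produced.
#[derive(Debug, PartialEq)]
pub struct Emission {
    pub token: u32,
    pub frame: usize,
}

/// Greedy TDT loop. `joint(frame, last_token)` returns the best
/// (token, duration) pair for the current frame and decoder history.
pub fn greedy_tdt<F>(num_frames: usize, max_symbols_per_frame: usize, mut joint: F) -> Vec<Emission>
where
    F: FnMut(usize, Option<u32>) -> (u32, usize),
{
    let mut out = Vec::new();
    let mut last = None;
    let mut t = 0;
    let mut emitted_here = 0;
    while t < num_frames {
        let (token, mut skip) = joint(t, last);
        if token != BLANK {
            out.push(Emission { token, frame: t });
            last = Some(token);
            emitted_here += 1;
        }
        // Avoid stalling: a blank with duration 0, or too many tokens
        // on one frame, forces an advance of one frame.
        if skip == 0 && (token == BLANK || emitted_here >= max_symbols_per_frame) {
            skip = 1;
        }
        if skip > 0 {
            emitted_here = 0;
        }
        // Duration skipping: jump ahead by the predicted number of frames.
        t += skip;
    }
    out
}
```

Because the predicted duration can be greater than one, the loop can visit fewer frames than a conventional transducer that advances one frame per blank. The frame index stored with each emission is the kind of information from which token timestamps can be derived.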

License

MIT

Model weights are subject to NVIDIA's CC-BY-4.0 license.
