High-performance audio DSP and ML inference library for voice conversion — built in Mojo and Python, runs on NVIDIA DGX Spark ARM64 with zero PyTorch CUDA dependency.
mojo-audio has two layers:
DSP layer (Mojo) — low-level audio processing: FFT, mel spectrogram, resampling, VAD, pitch shifting, iSTFT. 20–40% faster than librosa through SIMD vectorization and multi-core parallelization.
ML inference layer (Python + MAX Graph) — GPU-accelerated neural network inference without PyTorch CUDA. Runs natively on DGX Spark SM_121 ARM64 via MAX Engine:
| Model | Purpose | Backend |
|---|---|---|
| AudioEncoder | HuBERT / ContentVec content features | MAX Graph GPU |
| PitchExtractor | RMVPE pitch (F0) estimation | MAX Graph GPU + numpy BiGRU |
Together these form the core of a voice conversion pipeline (content extraction → pitch extraction → synthesis) that runs fully on Spark without cloud or PyTorch.
Extracts content feature vectors from raw audio. Supports facebook/hubert-base-ls960 and lengyue233/content-vec-best. Automatically uses GPU if available.
```python
from models import AudioEncoder

model = AudioEncoder.from_pretrained("facebook/hubert-base-ls960")
features = model.encode(audio_np)  # [1, N] float32 @16kHz → [1, T, 768]
```

GPU pipeline: CNN feature extractor + positional conv (numpy, avoids MAX conv2d groups bug) + 12× transformer blocks.
Extracts F0 (fundamental frequency) per 10ms frame. No PyTorch CUDA needed — runs on DGX Spark ARM64.
```python
from models import PitchExtractor

model = PitchExtractor.from_pretrained()  # downloads lj1995/VoiceConversionWebUI/rmvpe.pt
f0_hz = model.extract(audio_np)  # [1, N] float32 @16kHz → [T] float32 Hz, 0=unvoiced
```

Architecture: U-Net MAX Graph (5-level encoder + bottleneck + 5-level decoder) → numpy BiGRU → pitch salience bins → Hz per frame.
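A synthesis stage consumes both streams, but HuBERT-style content features and the 10 ms f0 track come at different frame rates. A minimal numpy sketch of one way to align them (the 20 ms content hop and the interpolation scheme are assumptions for illustration, not the library's API):

```python
import numpy as np

def align_streams(features: np.ndarray, f0: np.ndarray):
    """Resample the f0 track onto the content-feature time grid.

    features: [T_feat, D] content vectors (20 ms hop assumed).
    f0:       [T_f0] Hz per 10 ms frame, 0 = unvoiced.
    """
    t_feat = features.shape[0]
    # Fractional position of each feature frame inside the f0 track.
    pos = np.linspace(0.0, len(f0) - 1, t_feat)
    voiced = f0 > 0
    if not voiced.any():
        return features, np.zeros(t_feat, dtype=f0.dtype)
    # Interpolate through voiced frames only, so zeros don't drag pitch down.
    f0_on_grid = np.interp(pos, np.flatnonzero(voiced), f0[voiced])
    # Re-mark frames whose nearest source frame was unvoiced.
    nearest = np.round(pos).astype(int)
    f0_on_grid[~voiced[nearest]] = 0.0
    return features, f0_on_grid

feats = np.random.rand(100, 768).astype(np.float32)  # 2 s of 20 ms frames
f0 = np.full(200, 220.0, dtype=np.float32)           # 2 s of 10 ms frames
f0[:50] = 0.0                                        # first 0.5 s unvoiced
feats_a, f0_a = align_streams(feats, f0)
print(f0_a.shape)  # (100,)
```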
```shell
# Fast tests (no download)
pixi run test-models
pixi run test-pitch-extractor

# Full correctness tests (downloads model weights ~180–360MB)
pixi run test-models-full
pixi run test-pitch-extractor-full

# GPU benchmark
pixi run bench-models
```

Whisper-compatible mel spectrogram preprocessing — 20–40% faster than librosa.
```mojo
from audio import mel_spectrogram

var mel = mel_spectrogram(audio)  # (80, 2998) for 30s @16kHz, ~12ms with -O3
```

Performance:
30-second audio @16kHz:

- librosa (Python): 15ms (1993x realtime)
- mojo-audio (-O3): 12ms (2457x realtime) ← 20–40% faster
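The (80, 2998) output shape quoted above follows directly from Whisper-style STFT parameters; as a quick sanity check (n_fft=400 and hop=160 are assumed here, matching Whisper's 25 ms window and 10 ms hop at 16 kHz):

```python
# Frame count of a non-centered STFT: 1 + (n_samples - n_fft) // hop.
# n_fft=400 and hop=160 are assumptions (Whisper's 25 ms window / 10 ms hop @16 kHz).
sr, seconds = 16_000, 30
n_fft, hop, n_mels = 400, 160, 80
n_samples = sr * seconds                   # 480_000 samples
n_frames = 1 + (n_samples - n_fft) // hop  # 1 + 479_600 // 160 = 2998
print((n_mels, n_frames))  # (80, 2998)
```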
Optimization journey: 476ms (naive) → 12ms (-O3) = 40x total speedup through iterative FFT, RFFT, twiddle caching, sparse mel filterbank, SIMD float32, radix-4 butterflies, and multi-core parallelization.
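Two of those steps, the iterative (non-recursive) FFT and twiddle caching, can be illustrated in plain Python. This is an explanatory sketch only, not the Mojo implementation, which additionally uses RFFT, radix-4 butterflies, SIMD float32, and multi-core parallelization:

```python
import numpy as np

_TWIDDLES: dict = {}  # size -> e^{-2πik/size}, computed once and reused across calls

def fft_iterative(x):
    """Iterative radix-2 Cooley-Tukey FFT with cached twiddle factors."""
    n = len(x)
    assert n and (n & (n - 1)) == 0, "length must be a power of two"
    # Bit-reversal permutation replaces the recursion of the naive FFT.
    rev = np.zeros(n, dtype=int)
    for i in range(1, n):
        rev[i] = (rev[i >> 1] >> 1) | ((i & 1) * (n >> 1))
    a = np.asarray(x, dtype=complex)[rev]
    size = 2
    while size <= n:
        if size not in _TWIDDLES:  # twiddle caching: one table per stage size
            _TWIDDLES[size] = np.exp(-2j * np.pi * np.arange(size // 2) / size)
        w = _TWIDDLES[size]
        for start in range(0, n, size):
            half = size // 2
            lo = a[start:start + half].copy()
            hi = a[start + half:start + size] * w
            a[start:start + half] = lo + hi       # butterfly: top half
            a[start + half:start + size] = lo - hi  # butterfly: bottom half
        size *= 2
    return a

x = np.random.rand(1024)
assert np.allclose(fft_iterative(x), np.fft.fft(x))
```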
| Module | What it does |
|---|---|
| resample.mojo | Lanczos resampler (48kHz → 16kHz) |
| vad.mojo | Voice activity detection / silence trimming |
| pitch.mojo | Phase vocoder pitch shifting |
| wav_io.mojo | WAV file I/O |
| ffi/ | C-compatible shared library (libmojo_audio.so) |
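As an illustration of what resample.mojo does, here is a minimal Lanczos-windowed-sinc downsampler in Python. The kernel width (a=3) and the weight normalization are assumptions for the sketch, not the Mojo implementation's exact parameters:

```python
import numpy as np

def lanczos_resample(x, sr_in, sr_out, a=3):
    """Windowed-sinc (Lanczos) resampling, e.g. 48 kHz -> 16 kHz."""
    def kernel(t):
        t = np.asarray(t, dtype=float)
        return np.where(np.abs(t) < a, np.sinc(t) * np.sinc(t / a), 0.0)

    ratio = sr_out / sr_in
    scale = min(1.0, ratio)  # widen the kernel when downsampling (anti-aliasing)
    n_out = int(round(len(x) * ratio))
    y = np.empty(n_out)
    for i in range(n_out):
        center = i / ratio  # output sample position measured in input samples
        lo = max(int(np.floor(center - a / scale)), 0)
        hi = min(int(np.ceil(center + a / scale)), len(x) - 1)
        idx = np.arange(lo, hi + 1)
        w = kernel((idx - center) * scale)
        y[i] = np.dot(x[idx], w) / np.sum(w)  # normalize for unity DC gain
    return y

sr_in, sr_out = 48_000, 16_000
t = np.arange(sr_in // 10) / sr_in  # 100 ms of audio
x = np.sin(2 * np.pi * 440 * t)     # 440 Hz tone
y = lanczos_resample(x, sr_in, sr_out)
print(len(y))  # 1600
```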
```shell
# All Mojo DSP tests
pixi run test

# Individual
pixi run test-pitch
pixi run bench-optimized  # mel spectrogram benchmark
pixi run bench-python     # librosa baseline comparison
```

Requirements: pixi, Mojo 0.26+, Linux x86_64 or aarch64
```shell
git clone https://github.com/itsdevcoffee/mojo-audio.git
cd mojo-audio
pixi install
```

Build FFI shared library (for C/Rust/Python DSP integration):

```shell
pixi run build-ffi-optimized  # → libmojo_audio.so (Linux) or .dylib (macOS)
```

See macOS Build Guide for macOS-specific setup.
```
mojo-audio/
├── src/
│   ├── audio.mojo               # Mel spectrogram, FFT, STFT, windowing
│   ├── pitch.mojo               # Phase vocoder pitch shifting
│   ├── resample.mojo            # Lanczos resampler
│   ├── vad.mojo                 # Voice activity detection
│   ├── wav_io.mojo              # WAV I/O
│   ├── ffi/                     # C-compatible shared library exports
│   └── models/                  # MAX Graph ML inference (Python)
│       ├── audio_encoder.py     # HuBERT / ContentVec via MAX Graph
│       ├── pitch_extractor.py   # RMVPE pitch extraction via MAX Graph
│       ├── _rmvpe.py            # U-Net graph + numpy BiGRU
│       ├── _rmvpe_weight_loader.py
│       └── _weight_loader.py    # HuBERT/ContentVec weight loader
├── tests/
│   ├── test_audio_encoder.py    # AudioEncoder tests (pytest)
│   ├── test_pitch_extractor.py  # PitchExtractor tests (pytest)
│   ├── test_fft.mojo            # FFT correctness
│   ├── test_mel.mojo            # Mel spectrogram
│   └── ...                      # Other Mojo DSP tests
├── experiments/
│   ├── hubert-max/              # HuBERT MAX Graph experiments
│   ├── contentvec-max/          # ContentVec benchmarks
│   └── max-bug-repro/           # MAX Engine bug reproductions
├── docs/
│   ├── plans/                   # Implementation plans
│   ├── context/                 # Architecture reference
│   └── project/                 # Roadmap
└── pixi.toml
```
| Platform | DSP | ML Inference |
|---|---|---|
| Linux x86_64 (NVIDIA RTX) | ✅ | ✅ GPU |
| Linux aarch64 (DGX Spark SM_121) | ✅ | ✅ GPU |
| macOS Apple Silicon | ✅ | ✅ CPU |
| macOS Intel | ✅ | ✅ CPU |
The next steps are tracked in docs/project/03-06-2026-roadmap.md:
- Sprint 2: Full GPU AudioEncoder (remove numpy bridge once MAX conv2d groups bug is fixed), phase-locked phase vocoder
- Sprint 3: HiFiGAN vocoder in MAX Graph
- Sprint 4: Full VITS synthesis — end-to-end voice conversion on Spark
- Sprint 5: Shade integration and demo
| Feature | mojo-audio | librosa | torchaudio | RVC / Applio | pyworld |
|---|---|---|---|---|---|
| **DSP** | | | | | |
| Mel spectrogram | ✅ | ✅ | ✅ | via librosa | ❌ |
| FFT / STFT | ✅ | ✅ | ✅ | via librosa | partial |
| Resampling | ✅ | ✅ | ✅ | via librosa | ❌ |
| Voice activity detection | ✅ | ❌ | ❌ | via silero | ❌ |
| Phase vocoder pitch shift | ✅ | ✅ | ❌ | ✅ | ❌ |
| iSTFT / Griffin-Lim | ✅ | ✅ | ✅ | ❌ | ❌ |
| WAV I/O | ✅ | ✅ | ✅ | ✅ | ❌ |
| C FFI / shared library | ✅ | ❌ | ❌ | ❌ | ❌ |
| **ML Inference** | | | | | |
| HuBERT content features | ✅ MAX Graph | ❌ | ❌ | ✅ PyTorch | ❌ |
| ContentVec content features | ✅ MAX Graph | ❌ | ❌ | ✅ PyTorch | ❌ |
| RMVPE pitch extraction | ✅ MAX Graph | ❌ | ❌ | ✅ PyTorch | ❌ |
| WORLD pitch extraction | ❌ | ❌ | ❌ | via pyworld | ✅ |
| GPU inference | ✅ MAX Engine | ❌ | ✅ CUDA | ✅ CUDA | ❌ |
| **Platform** | | | | | |
| Linux x86_64 | ✅ | ✅ | ✅ | ✅ | ✅ |
| DGX Spark ARM64 | ✅ | ✅ | ❌ | ❌ | ❌ |
| macOS Apple Silicon | ✅ | ✅ | ✅ | partial | ✅ |
| PyTorch CUDA required | ❌ | ❌ | ✅ | ✅ | ❌ |
| **Performance** | | | | | |
| Mel spec vs librosa | +20–40% | baseline | ~parity | baseline | — |
| GPU inference without CUDA | ✅ | ❌ | ❌ | ❌ | ❌ |
MAX Engine conv2d groups bug (v26.1): ops.conv2d returns incorrect results when groups > 1 and kernel size is large (K≥128). Filed as modular/modular#6129. Workaround: HuBERT's pos_conv layer runs outside the MAX Graph via numpy.
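The shape of that workaround can be sketched in numpy: a grouped 1D convolution where each output group sees only its own slice of input channels. The shapes and group count below are illustrative, not HuBERT's actual pos_conv configuration:

```python
import numpy as np

def grouped_conv1d(x, w, groups):
    """Grouped 1D convolution (cross-correlation, no padding) in numpy.

    x: [C_in, L], w: [C_out, C_in // groups, K]. Each output group sees
    only its own slice of input channels -- the groups > 1 case that
    ops.conv2d mishandled for large K (modular/modular#6129).
    """
    c_in, length = x.shape
    c_out, c_in_g, k = w.shape
    out_per_group = c_out // groups
    out = np.zeros((c_out, length - k + 1))
    for g in range(groups):
        xs = x[g * c_in_g:(g + 1) * c_in_g]  # this group's input channels
        for oc in range(g * out_per_group, (g + 1) * out_per_group):
            for t in range(out.shape[1]):
                out[oc, t] = np.sum(xs[:, t:t + k] * w[oc])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))
w = rng.standard_normal((16, 1, 8))  # groups == channels: the depthwise case
y = grouped_conv1d(x, w, groups=16)
print(y.shape)  # (16, 57)
```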
```bibtex
@software{mojo_audio_2026,
  author = {Dev Coffee},
  title  = {mojo-audio: Audio DSP and ML inference for voice conversion},
  year   = {2026},
  url    = {https://github.com/itsdevcoffee/mojo-audio}
}
```