fix: consolidate metal support by mudler · Pull Request #4 · mudler/parakeet.cpp

mudler · 2026-06-01T21:59:02Z

Two independent changes to make parakeet.cpp work on Metal:

1. GPU-only scheduler fallback (src/backend.cpp). When the selected device
   is a GPU, compute runs through ggml_backend_sched over {GPU, CPU}, so an
   op the GPU backend lacks is offloaded to the CPU instead of aborting.
2. A native Metal CONV_2D_DW kernel, plus leading-side time padding (PAD)
   for the conv module. The kernel mirrors the existing CONV_2D op but is
   depthwise: each output channel convolves only its own input channel.
   Scope is WHCN-contiguous F32, which is what subsampling uses; other
   layouts and dtypes report unsupported and fall back to CPU via change 1,
   so the kernel is correct by construction for the cases it claims.
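
To make the depthwise semantics concrete, here is a minimal CPU reference sketch in plain C++. It is not the Metal kernel from this PR: `Shape4`, `Tensor`, `offset` and `conv2d_dw_ref` are hypothetical names, the kernel storage order `[c][ky][kx]` is an assumption, and dilation is left out. It only illustrates that each output channel reads its own input channel over a contiguous WHCN F32 buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical types for this sketch; they are not ggml types.
struct Shape4 { int w, h, c, n; };
struct Tensor {
    Shape4 shape;
    std::vector<float> data;  // contiguous F32: W varies fastest, then H, C, N
};

// Flat offset of element (x, y, c, n) in a contiguous WHCN buffer.
inline std::size_t offset(const Shape4& s, int x, int y, int c, int n) {
    return ((static_cast<std::size_t>(n) * s.c + c) * s.h + y) * s.w + x;
}

// Reference depthwise 2D convolution with zero padding.
// `kernel` holds one kw*kh filter per channel, stored as [c][ky][kx] (assumed).
Tensor conv2d_dw_ref(const Tensor& in, const std::vector<float>& kernel,
                     int kw, int kh, int stride, int pad) {
    const Shape4& s = in.shape;
    assert(stride > 0 && pad >= 0);
    assert(in.data.size() == static_cast<std::size_t>(s.w) * s.h * s.c * s.n);
    assert(kernel.size() == static_cast<std::size_t>(kw) * kh * s.c);

    Tensor out;
    out.shape = {(s.w + 2 * pad - kw) / stride + 1,
                 (s.h + 2 * pad - kh) / stride + 1, s.c, s.n};
    out.data.assign(static_cast<std::size_t>(out.shape.w) * out.shape.h *
                        out.shape.c * out.shape.n, 0.0f);

    for (int n = 0; n < s.n; ++n)
    for (int c = 0; c < s.c; ++c)          // channel c only ever reads channel c
    for (int oy = 0; oy < out.shape.h; ++oy)
    for (int ox = 0; ox < out.shape.w; ++ox) {
        float acc = 0.0f;
        for (int ky = 0; ky < kh; ++ky)
        for (int kx = 0; kx < kw; ++kx) {
            const int ix = ox * stride + kx - pad;
            const int iy = oy * stride + ky - pad;
            if (ix < 0 || ix >= s.w || iy < 0 || iy >= s.h) continue;  // zero pad
            acc += in.data[offset(s, ix, iy, c, n)] *
                   kernel[(static_cast<std::size_t>(c) * kh + ky) * kw + kx];
        }
        out.data[offset(out.shape, ox, oy, c, n)] = acc;
    }
    return out;
}
```

A reference like this is mainly useful as a parity check: run the same input through the GPU kernel and compare element-wise against the CPU result.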

With change 1 alone, Metal works (the conv runs on CPU); change 2 is an optimization that regains some speed and moves all ops to Metal.
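
The fallback idea in change 1 can be sketched without ggml as a per-op placement rule. This is an illustration only: `Device`, `SupportsOp` and `place_ops` are hypothetical names, ops are modelled as plain strings, and the real ggml_backend_sched also handles buffer allocation and copies between devices, which this sketch ignores.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

enum class Device { GPU, CPU };

// Hypothetical stand-in for a backend capability query: returns true when
// the GPU backend has a kernel for the named op.
using SupportsOp = std::function<bool(const std::string&)>;

// Place each op on the GPU when it has a kernel there, otherwise on the CPU,
// so a missing kernel degrades to a slower path instead of aborting.
std::vector<Device> place_ops(const std::vector<std::string>& ops,
                              const SupportsOp& gpu_supports) {
    assert(gpu_supports && "capability predicate must be callable");
    std::vector<Device> placement;
    placement.reserve(ops.size());
    for (const std::string& op : ops) {
        placement.push_back(gpu_supports(op) ? Device::GPU : Device::CPU);
    }
    return placement;
}
```

With a predicate that rejects only CONV_2D_DW, every other op stays on the GPU and the depthwise conv lands on the CPU, which is the "change 1 alone" situation described above.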

Fixes: #2

parakeet.cpp aborted on Apple Metal (ggml's Metal backend has no CONV_2D_DW kernel, which the Conformer subsampling emits). Make it run on Metal and keep every backend fast:

- backend: GPU devices use the persistent gallocr fast path. A per-graph ggml_backend_supports_op scan only routes to ggml_backend_sched (CPU fallback) when the active GPU backend genuinely lacks a kernel for some op. CUDA covers every op, so it stays on gallocr (verified parity with master on the GB10; the earlier blanket-sched approach regressed CUDA 7-23%). The CPU path is unchanged.
- ggml: native Metal CONV_2D_DW kernel (patch 0002) and leading-side PAD support (patch 0003), the two ops the encoder needed. With these the whole encoder, down to the log-mel front end, runs on Metal.
- bench: scripts/bench_metal_dw.sh measures steady-state RTFx via parakeet-cli bench (warm up once, time inference only).
- ci: run the closed-loop end-to-end transcript assertion on pull requests, not just manual dispatch.
- docs: Apple Metal section in README and BENCHMARK; AGENTS.md gains a performance-invariant note (keep gallocr) and the reference transcript.

Metal (M4, q4_k) is about 3-5x over CPU on the larger models. CPU and CUDA are within noise of master.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
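
The per-graph scan in the backend bullet can be sketched as a single pass that picks a compute path. This is a sketch under assumptions, not the code in src/backend.cpp: `ComputePath`, `SupportsOp` and `choose_path` are hypothetical names, ops are modelled as strings, and the predicate stands in for a call to ggml_backend_supports_op on each graph node.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical labels for the two paths named in the commit message.
enum class ComputePath { Gallocr, Sched };

// Hypothetical stand-in for querying the active GPU backend about one op.
using SupportsOp = std::function<bool(const std::string&)>;

// Scan the graph once: stay on the gallocr fast path when the GPU backend
// covers every op, and route to the scheduler (CPU fallback) only when at
// least one op has no GPU kernel.
ComputePath choose_path(const std::vector<std::string>& graph_ops,
                        const SupportsOp& gpu_supports) {
    assert(gpu_supports && "capability predicate must be callable");
    const bool all_supported =
        std::all_of(graph_ops.begin(), graph_ops.end(), gpu_supports);
    return all_supported ? ComputePath::Gallocr : ComputePath::Sched;
}
```

The design point is that the slower scheduler path becomes opt-in per graph: a backend that covers every op never pays for it, which matches the CUDA parity claim above.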

mudler · 2026-06-02T13:15:13Z

On this, I've actually ported the missing ops. Metal should be fully covered.

mudler force-pushed the worktree-metal-conv2d-dw branch from 608b255 to 37986e3 Compare June 1, 2026 22:02

mudler marked this pull request as draft June 1, 2026 22:02

mudler changed the title ~~feat: metal support~~ fix: consolidate metal support Jun 1, 2026

mudler marked this pull request as ready for review June 1, 2026 22:52

mudler force-pushed the worktree-metal-conv2d-dw branch from e4eb12c to b4eb62c Compare June 1, 2026 23:25

mudler force-pushed the worktree-metal-conv2d-dw branch from b4eb62c to e52b079 Compare June 1, 2026 23:27

mudler merged commit 9edf17c into master Jun 1, 2026
4 checks passed

mudler merged 1 commit into master from worktree-metal-conv2d-dw
