Skip to content

fix: consolidate metal support#4

Merged
mudler merged 1 commit into
masterfrom
worktree-metal-conv2d-dw
Jun 1, 2026
Merged

fix: consolidate metal support#4
mudler merged 1 commit into
masterfrom
worktree-metal-conv2d-dw

Conversation

@mudler
Copy link
Copy Markdown
Owner

@mudler mudler commented Jun 1, 2026

Two independent changes to make parakeet.cpp work on metal:

  1. GPU-only scheduler fallback (src/backend.cpp). When the selected device
    is a GPU, compute runs through ggml_backend_sched over {GPU, CPU}, so an
    op the GPU lacks is offloaded to CPU instead of aborting.
  2. A native Metal CONV_2D_DW kernel and conv module's leading side time padding (PAD). It mirrors the
    existing CONV_2D op but is depthwise: each output channel convolves only
    its own input channel. Scope is WHCN-contiguous, F32, which is what
    subsampling uses; other layouts and dtypes report unsupported and fall back
    to CPU via change 1, so the kernel is correct by construction for the cases
    it claims.

With change 1 alone, Metal works (the conv runs on CPU), 2 is an optimization to gain some speed back and move all ops to Metal.

Fixes: #2

@mudler mudler force-pushed the worktree-metal-conv2d-dw branch from 608b255 to 37986e3 Compare June 1, 2026 22:02
@mudler mudler marked this pull request as draft June 1, 2026 22:02
@mudler mudler changed the title feat: metal support fix: consolidate metal support Jun 1, 2026
@mudler mudler marked this pull request as ready for review June 1, 2026 22:52
@mudler mudler force-pushed the worktree-metal-conv2d-dw branch from e4eb12c to b4eb62c Compare June 1, 2026 23:25
parakeet.cpp aborted on Apple Metal (ggml's Metal backend has no CONV_2D_DW
kernel, which the Conformer subsampling emits). Make it run on Metal and keep
every backend fast:

- backend: GPU devices use the persistent gallocr fast path. A per-graph
  ggml_backend_supports_op scan only routes to ggml_backend_sched (CPU fallback)
  when the active GPU backend genuinely lacks a kernel for some op. CUDA covers
  every op, so it stays on gallocr (verified parity with master on the GB10; the
  earlier blanket-sched approach regressed CUDA 7-23%). The CPU path is unchanged.
- ggml: native Metal CONV_2D_DW kernel (patch 0002) and leading-side PAD support
  (patch 0003), the two ops the encoder needed. With these the whole encoder,
  down to the log-mel front end, runs on Metal.
- bench: scripts/bench_metal_dw.sh measures steady-state RTFx via parakeet-cli
  bench (warm up once, time inference only).
- ci: run the closed-loop end-to-end transcript assertion on pull requests, not
  just manual dispatch.
- docs: Apple Metal section in README and BENCHMARK; AGENTS.md gains a
  performance-invariant note (keep gallocr) and the reference transcript.

Metal (M4, q4_k) is about 3-5x over CPU on the larger models. CPU and CUDA are
within noise of master.

Assisted-by: Claude:claude-opus-4-8 [Claude Code]
@mudler mudler force-pushed the worktree-metal-conv2d-dw branch from b4eb62c to e52b079 Compare June 1, 2026 23:27
@mudler mudler merged commit 9edf17c into master Jun 1, 2026
4 checks passed
@mudler
Copy link
Copy Markdown
Owner Author

mudler commented Jun 2, 2026

on this I've actually ported the missing ops. Metal should be fully covered

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Metal backend crashes on macOS arm64 with tdt-0.6b-v3-q8_0.gguf

1 participant