Skip to content

Fix Metal model view overlap for q4#58

Open
Dango233 wants to merge 2 commits intoantirez:mainfrom
Dango233:codex/metal-auto-model-overlap
Open

Fix Metal model view overlap for q4#58
Dango233 wants to merge 2 commits intoantirez:mainfrom
Dango233:codex/metal-auto-model-overlap

Conversation

@Dango233
Copy link
Copy Markdown

Summary

Fixes q4 startup on machines where the model crosses Metal's per-buffer maxBufferLength, then documents q4 usage and benchmark results for a Mac Studio M2 Ultra with 192 GB RAM.

The original implementation used a fixed 672 MiB model-view overlap. The q4 GGUF has 1152 MiB expert tensors, so on this machine one tensor can straddle two mapped Metal views and fail startup with:

ds4: Metal model range 106.89..108.01 GiB is not covered by mapped model views

This PR fixes that by deriving the Metal model-view overlap from the largest tensor described by the GGUF metadata, with a CLI/server override kept as a diagnostic lower-bound escape hatch.

192 GB q4 setup

On the tested Mac Studio M2 Ultra with 192 GB RAM, q4 runs at --ctx 32768 after raising the iGPU wired-memory limit:

sudo sysctl iogpu.wired_limit_mb=188000

Benchmark

Single-run Metal CLI numbers using:

  • --ctx 32768
  • --nothink
  • -sys ""
  • --temp 0
  • -n 256
Machine Quant Prompt Prefill Generation
Mac Studio M2 Ultra, 192 GB q4 short 50.42 t/s 29.88 t/s
Mac Studio M2 Ultra, 192 GB q4 3844 tokens 394.66 t/s 21.69 t/s

The long benchmark uses tests/test-vectors/prompts/long_code_audit.txt.

Test plan

  • make ds4 ds4-server ds4_test
  • ./ds4 --inspect -m ds4flash.gguf
  • ./ds4_test --server
  • q4 short benchmark command above
  • q4 long benchmark command above

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant