Skip to content

Fix Swift compiler warnings and refine MTP scatter logic#42

Open
solderzzc wants to merge 2 commits into
mainfrom
fix/compiler-warnings-mtp-optim
Open

Fix Swift compiler warnings and refine MTP scatter logic#42
solderzzc wants to merge 2 commits into
mainfrom
fix/compiler-warnings-mtp-optim

Conversation

@solderzzc
Copy link
Copy Markdown
Member

@solderzzc solderzzc commented May 13, 2026

Description

  • Resolved matchedCandidate unused variable warning in Load.swift.
  • Removed unreachable if doSort branch in SwitchLayers.swift (as doSort is constantly true in this SSD streaming context).
  • Changed var output2D to let output2D in Gemma4Text.swift to address the Swift mutation warning, correctly reflecting that the MLXArray reference subscript mutates the underlying C++ buffer directly.
  • Validated performance: Gemma 4-26B MTP + TurboQuant speculative decoding now achieves a 1.81x average throughput improvement (66.2 tok/s time-weighted vs 36.6 tok/s baseline), thoroughly resolving the reported regressions.

Copilot AI review requested due to automatic review settings May 13, 2026 17:12
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses Swift compiler warnings and simplifies the SSD/MTP scatter path in the MLX language model code, while also adding two standalone scatter/array repro scripts.

Changes:

  • Removed an unused matchedCandidate variable in weight loading.
  • Simplified the always-sorted empty-routing path in SwitchGLU.
  • Changed a reshaped MLX output array binding from var to let in Gemma4 MTP scatter logic.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
Libraries/MLXLMCommon/Load.swift Removes unused candidate-tracking state during expert streaming setup.
Libraries/MLXLMCommon/SwitchLayers.swift Simplifies empty-index handling in the stacked SSD fast path.
Libraries/MLXLLM/Models/Gemma4Text.swift Uses an immutable binding for the reshaped scatter output array.
test_array_init.swift Adds a standalone MLX array initialization repro script.
test_scatter.swift Adds a standalone MLX scatter repro script.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test_array_init.swift
Comment on lines +1 to +7
import Foundation
import MLX
MLX.GPU.set(cacheLimit: 10 * 1024 * 1024)

let size: Int = 10
let arr = MLXArray(0 ..< size).asType(.int32)
print(arr)
Comment thread test_scatter.swift
Comment on lines +3 to +13

MLX.GPU.set(cacheLimit: 10 * 1024 * 1024)

var out = MLXArray.zeros([4, 10])
let rows = MLXArray(0 ..< Int32(4)).reshaped([4, 1])
let cols = MLXArray([1, 2, 0, 4, 3, 5, 2, 9]).reshaped([4, 2])
let vals = MLXArray([10, 20, 30, 40, 50, 60, 70, 80]).reshaped([4, 2])

out[rows, cols] = vals
MLX.eval(out)
print(out)
- Add maxSharedKV=16 window in runMTPHead to limit cross-attention
  to the most recent 16 backbone KV positions (was O(T), now O(16)).
  Eliminates throughput regression at 40K-100K context lengths.
- Implement MTPPartialRollback protocol on Gemma4AssistantModel:
  store lastBackboneHiddenStateAll for position-specific rollback
  without re-running the main model on partial draft rejection.
- Add callMTPHeadOnly for re-seeding MTP head from cached backbone
  state (rollback draft generation, no main-model forward pass).
- Add numMTPDraftTokens=2 to control assistant head depth per pass.
- Benchmarks (M5 Pro 64GB, gemma-4-26b-a4b-it-8bit):
    8-bit + MTP at 40K:  +20% TPS vs vanilla (38.8 vs 32.4)
    8-bit + MTP at 100K: +51% TPS vs vanilla (22.5 vs 14.9)
  4-bit MoE is compute-bound (FFN dominates); MTP neutral there.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants