
Prevent allocator from operating on residency set during metal capture#3114

Draft
robertmsale wants to merge 1 commit into ml-explore:main from robertmsale:gputrace-bug

Conversation

@robertmsale

Proposed changes

In issue #2846, it was reported that when profiling a .gputrace generated by MLX, Xcode could fail to rebuild the graph due to residency-set replay errors (command queue residency set limit of 32 exceeded).

This PR makes Metal capture replay-safe by preventing residency-set queue attachment/mutation while capture is active, and restoring normal behavior after capture ends.

What changed

  • Added capture-aware residency gating in the Metal backend:
    • Introduced metal::residency_sets_enabled() / metal::set_residency_sets_enabled(bool).
    • Disabled residency sets at start_capture(...) and re-enabled them at stop_capture().
  • Updated device residency attachment flow to respect capture state:
    • Queue-level residency sets are detached at capture start.
    • Deferred re-attachment occurs after capture stop.
  • Updated allocator wired-limit handling:
    • Residency-set resize/attach operations are deferred while capture is active.
    • Deferred state is applied on capture stop.
  • Updated residency-set creation logic:
    • Avoid creating/attaching residency sets during active capture.
    • Keep allocations tracked and apply when safe.
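The gating described above can be sketched as a tiny state machine. This is a minimal illustration, not the actual MLX implementation: the names metal::residency_sets_enabled / metal::set_residency_sets_enabled come from the PR description, but the bodies, the deferred-capacity state, and the applied_capacity() accessor are hypothetical stand-ins for the real queue mutations.

```cpp
// Minimal sketch of capture-aware residency gating (illustrative only).
#include <optional>

namespace metal {
namespace {
bool residency_enabled_ = true;         // false while a capture is active
std::optional<int> deferred_capacity_;  // resize requested during capture
int applied_capacity_ = 0;              // state actually applied to the queue
} // namespace

bool residency_sets_enabled() { return residency_enabled_; }

// Called with false in start_capture(...) and true in stop_capture().
void set_residency_sets_enabled(bool on) {
  residency_enabled_ = on;
  if (on && deferred_capacity_) {
    // Apply whatever the allocator requested while capture was active.
    applied_capacity_ = *deferred_capacity_;
    deferred_capacity_.reset();
  }
}

// Allocator wired-limit path: defer instead of mutating during capture.
void resize_residency_set(int capacity) {
  if (!residency_enabled_) {
    deferred_capacity_ = capacity;  // record only; queue is left untouched
    return;
  }
  applied_capacity_ = capacity;
}

int applied_capacity() { return applied_capacity_; }
} // namespace metal
```

Outside capture the gate is a single branch on a flag, so the normal runtime path is effectively unchanged.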

These changes preserve normal runtime behavior outside capture while avoiding replay-time residency-set explosions during Xcode profiling/counters.

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

@awni
Member

awni commented Feb 8, 2026

Residency sets are off by default, so I'm slightly confused. Did you have them turned on when capturing the workload (e.g. did you call mx.set_wired_limit)?

@awni
Member

awni commented Feb 8, 2026

Oh, I guess addResidencySet is the issue (which we always call, even if the set is empty). However, I'm not crazy about adding this complexity, especially without knowing why. We should only ever add one residency set to the command queue. The fact that it's adding a lot during replay is quite strange 🤔

@robertmsale
Author

robertmsale commented Feb 8, 2026

This confused me too! In this repro, I did not call mx::set_wired_limit(...).

The workload is the stock capture example in examples/cpp/metal_capture.cpp:

  • capture starts at line 14 (mx::metal::start_capture(...))
  • then it creates GPU streams at lines 19-20 (new_stream(mx::Device::gpu))

My takeaway was that backend behavior during capture/replay was causing those errors.

What was happening pre-fix:

  1. The Metal allocator path eagerly registered queue residency state during backend setup (even when the effective wired limit was 0 / not explicitly set by the user).
  2. When streams/queues were created, queue-level residency-set operations could still be recorded during capture.
  3. Xcode counter replay replays command queue state and eventually hit the per-queue residency-set limit (addResidencySet limit exceeded), causing graph/counter profiling failures.

What changed:

  • I removed eager constructor-time queue residency attachment and made attachment lazy on first non-zero wired limit request.
  • During capture, residency-set mutations/attachments are gated off, and existing queue attachments are detached before capture to keep trace replay stable.
  • After capture stops, deferred state is restored.

I tried to structure it so that production workloads never touch any of the bookkeeping, and release builds can optimize out those branches if nobody does Metal GPU capture. I'd also point out that I tried many different strategies in metal_capture.cpp, hoping that if the test were written a certain way (e.g. creating streams and running eval before capture start) we wouldn't hit the replay issue and no working-code edits would be needed, but to no avail. Freezing residency-set edits during capture is what solved it for me.
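As a rough sketch of the lazy-attach idea (hypothetical types and names; the real change lives in the MLX Metal allocator and operates on actual MTLResidencySet/command-queue state): attachment only happens on the first non-zero wired-limit request, and is detached/deferred while capture is active.

```cpp
// Illustrative sketch: lazy, capture-aware queue residency attachment.
#include <cstddef>

struct Queue {
  bool has_residency_set = false;  // stands in for addResidencySet state
};

struct Allocator {
  bool capturing = false;
  bool attach_pending = false;

  // No eager attach at construction time; attach lazily on the first
  // non-zero wired-limit request, unless a capture is in flight.
  void set_wired_limit(Queue& q, std::size_t limit) {
    if (limit == 0) return;
    if (capturing) {
      attach_pending = true;  // defer until capture stops
      return;
    }
    q.has_residency_set = true;
  }

  void start_capture(Queue& q) {
    capturing = true;
    // Detach so the trace never records queue residency operations.
    attach_pending = attach_pending || q.has_residency_set;
    q.has_residency_set = false;
  }

  void stop_capture(Queue& q) {
    capturing = false;
    if (attach_pending) {  // restore the deferred attachment
      q.has_residency_set = true;
      attach_pending = false;
    }
  }
};
```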

@robertmsale
Author

> We should only ever add one residency set to the command queue. The fact that it's adding a lot during replay is quite strange 🤔

I'm just speculating here, but I think the problem happens when the profiler does derived counter analysis. MLX does indeed have only one residency set; everything we're doing is above board. But during derived counter analysis, GPUTools replays the passes, and if any residency-attachment operation happened even once during capture, it re-issues the queue residency attachment over and over. We hit the limit of 32 adds and then get 24 errors, i.e. 56 attempted adds in total. I think the pattern is that we have 3 GPU queues (default stream multiply, s2, s3), and the 56 attempts come from those 3 queues times the number of replay counter passes: some passes touch all queues, some only s2, some only s3, and you end up with 56 passes total.

So I think this PR is mostly a workaround for the GPU profiling limitation and is not indicative of a deeper issue, but I can't debug the profiler 😅

@robertmsale robertmsale marked this pull request as draft February 9, 2026 21:58
@robertmsale
Author

robertmsale commented Feb 9, 2026

I'm converting this to a draft because, after thinking it over, I believe this PR is a workaround for a bug in the GPU profiler. Something didn't feel right about it.

I wrote a toy benchmark:

import Foundation
import Metal

enum ResidencySetLimitError: Error {
    case noDevice
    case noQueue
    case noCommandBuffer
    case noEncoder
    case noOutputBuffer
    case invalidSetCount(String)
}

@available(macOS 15.0, *)
func makePipeline(device: MTLDevice) throws -> MTLComputePipelineState {
    let source = """
    #include <metal_stdlib>
    using namespace metal;

    kernel void write_index(device uint *out [[buffer(0)]],
                            uint gid [[thread_position_in_grid]]) {
        out[gid] = gid;
    }
    """

    let library = try device.makeLibrary(source: source, options: nil)
    guard let function = library.makeFunction(name: "write_index") else {
        fatalError("Could not build runtime kernel function.")
    }
    return try device.makeComputePipelineState(function: function)
}

@available(macOS 15.0, *)
func runExperiment(device: MTLDevice, queue: MTLCommandQueue, setCount: Int) throws {
    let descriptor = MTLResidencySetDescriptor()
    var sets: [any MTLResidencySet] = []
    sets.reserveCapacity(setCount)

    print("Adding \(setCount) residency sets to one command queue...")
    for i in 0..<setCount {
        let set = try device.makeResidencySet(descriptor: descriptor)
        set.requestResidency()
        queue.addResidencySet(set)
        sets.append(set)
        print("  added set \(i + 1)")
    }

    let pipeline = try makePipeline(device: device)
    let elementCount = 64
    let outputSize = elementCount * MemoryLayout<UInt32>.stride

    guard let output = device.makeBuffer(length: outputSize, options: .storageModeShared) else {
        throw ResidencySetLimitError.noOutputBuffer
    }

    guard let commandBuffer = queue.makeCommandBuffer() else {
        throw ResidencySetLimitError.noCommandBuffer
    }
    guard let encoder = commandBuffer.makeComputeCommandEncoder() else {
        throw ResidencySetLimitError.noEncoder
    }

    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(output, offset: 0, index: 0)

    let width = min(pipeline.maxTotalThreadsPerThreadgroup, elementCount)
    let tg = MTLSize(width: width, height: 1, depth: 1)
    let grid = MTLSize(width: elementCount, height: 1, depth: 1)
    encoder.dispatchThreads(grid, threadsPerThreadgroup: tg)
    encoder.endEncoding()

    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    let ptr = output.contents().bindMemory(to: UInt32.self, capacity: elementCount)
    let first = (0..<8).map { ptr[$0] }
    print("Kernel completed. First 8 outputs: \(first)")

    // Keep sets alive through command buffer completion.
    withExtendedLifetime(sets) {}
}

func parseSetCount(_ args: [String]) throws -> Int {
    guard let raw = args.dropFirst().first else {
        return 33
    }
    guard let value = Int(raw), value > 0 else {
        throw ResidencySetLimitError.invalidSetCount(raw)
    }
    return value
}

func main() throws {
    guard let device = MTLCreateSystemDefaultDevice() else {
        throw ResidencySetLimitError.noDevice
    }
    guard let queue = device.makeCommandQueue() else {
        throw ResidencySetLimitError.noQueue
    }

    let setCount = try parseSetCount(CommandLine.arguments)
    print("Device: \(device.name)")
    print("Set count: \(setCount)")

    if #available(macOS 15.0, *) {
        try runExperiment(device: device, queue: queue, setCount: setCount)
    } else {
        print("Residency sets require macOS 15.0+.")
    }
}

do {
    try main()
} catch ResidencySetLimitError.invalidSetCount(let raw) {
    fputs("Invalid set count: \(raw)\n", stderr)
    exit(2)
} catch {
    fputs("Experiment failed: \(error)\n", stderr)
    exit(1)
}

I wanted to see if I could hit a runtime assertion by adding 33 residency sets, so I ran it once with 1 set and once with 33. The run with 33 produced exactly the same error that appears in the GPU Tools profiler. Profiling with no residency sets would yield results nowhere near production behavior (e.g. cache evictions that would never happen if the memory were wired down), so this workaround would prevent real profiling and informed code decisions.

This leads me to believe that this is actually a bug in Xcode's GPU profiling. The metal_capture program doesn't hit this assertion at runtime, which is sufficient evidence that MLX is doing everything correctly and it is in fact the tooling that is broken. I am considering closing this PR, but I'm keeping it in draft for now so the information is out there. If MLX is interested in the workaround I can open it back up, or we can close it if the goal is to wait for the Xcode development team to fix the profiler, but I am confident this is a profiler issue. @awni what would you like me to do?

Test run:

Device: Apple M1 Ultra
Set count: 33
Adding 33 residency sets to one command queue...
  added set 1
  added set 2
  added set 3
  added set 4
  added set 5
  added set 6
  added set 7
  added set 8
  added set 9
  added set 10
  added set 11
  added set 12
  added set 13
  added set 14
  added set 15
  added set 16
  added set 17
  added set 18
  added set 19
  added set 20
  added set 21
  added set 22
  added set 23
  added set 24
  added set 25
  added set 26
  added set 27
  added set 28
  added set 29
  added set 30
  added set 31
  added set 32
EXIT:134
_addResidencySets:453: failed assertion `IOGPUMetalCommandQueue: command queue residency set limit of 32 exceeded'

@awni
Member

awni commented Feb 9, 2026

I think a simpler workaround would be to change here and add a check: if capturing is enabled on an M1 GPU, simply don't use residency sets.

But if you are not actively blocked by this then I would also leave this in draft and hopefully the underlying bug will be fixed.
