Prevent allocator from operating on residency set during metal capture#3114
Prevent allocator from operating on residency set during metal capture#3114robertmsale wants to merge 1 commit intoml-explore:mainfrom
Conversation
|
Residency sets are off by default so I'm slightly confused. Did you have them turned on when capturing the workload (e.g. did you set |
|
Oh I guess |
|
This confused me too! In this repro, I did not call The workload is the stock capture example in
My takeaway was backend behavior during capture/replay causing those errors. What was happening pre-fix:
What changed:
I tried to piece it together in a way that production workloads do not touch any of the bookkeeping and release builds can optimize out those branches if nobody does Metal GPU capture. I would like to point out that I tried many different strategies in |
I'm just speculating here, but I think the problem happens when the profiler does derived counter analysis. MLX does indeed only have 1 residency set, everything we're doing is above board. But during derived counter analysis the GPUTools replays passes and if we have any residency attachment operations during capture, it re-issues queue residency attachment over and over because it happened even one time during capture. We hit a limit of 32 adds, and get 24 errors. I think the pattern is we have 3 GPU queues: default stream multiply, s2, s3. The 56 residency sets comes from 3 queues and number of replay counter passes. Some passes touch all queues, some only touch s2, some s3, and you end up with 56 passes total from that. So I think this PR is mostly a workaround for the GPU profiling limitation and is not indicative of a deeper issue, but I can't debug the profiler 😅 |
|
I'm converting this to draft because, after thinking about this, I think this is a workaround for a bug in the GPU profiler. Something didn't feel right about it. I wrote a toy benchmark: import Foundation
import Metal
enum ResidencySetLimitError: Error {
case noDevice
case noQueue
case noCommandBuffer
case noEncoder
case noOutputBuffer
case invalidSetCount(String)
}
@available(macOS 15.0, *)
func makePipeline(device: MTLDevice) throws -> MTLComputePipelineState {
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void write_index(device uint *out [[buffer(0)]],
uint gid [[thread_position_in_grid]]) {
out[gid] = gid;
}
"""
let library = try device.makeLibrary(source: source, options: nil)
guard let function = library.makeFunction(name: "write_index") else {
fatalError("Could not build runtime kernel function.")
}
return try device.makeComputePipelineState(function: function)
}
@available(macOS 15.0, *)
func runExperiment(device: MTLDevice, queue: MTLCommandQueue, setCount: Int) throws {
let descriptor = MTLResidencySetDescriptor()
var sets: [any MTLResidencySet] = []
sets.reserveCapacity(setCount)
print("Adding \(setCount) residency sets to one command queue...")
for i in 0..<setCount {
let set = try device.makeResidencySet(descriptor: descriptor)
set.requestResidency()
queue.addResidencySet(set)
sets.append(set)
print(" added set \(i + 1)")
}
let pipeline = try makePipeline(device: device)
let elementCount = 64
let outputSize = elementCount * MemoryLayout<UInt32>.stride
guard let output = device.makeBuffer(length: outputSize, options: .storageModeShared) else {
throw ResidencySetLimitError.noOutputBuffer
}
guard let commandBuffer = queue.makeCommandBuffer() else {
throw ResidencySetLimitError.noCommandBuffer
}
guard let encoder = commandBuffer.makeComputeCommandEncoder() else {
throw ResidencySetLimitError.noEncoder
}
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(output, offset: 0, index: 0)
let width = min(pipeline.maxTotalThreadsPerThreadgroup, elementCount)
let tg = MTLSize(width: width, height: 1, depth: 1)
let grid = MTLSize(width: elementCount, height: 1, depth: 1)
encoder.dispatchThreads(grid, threadsPerThreadgroup: tg)
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
let ptr = output.contents().bindMemory(to: UInt32.self, capacity: elementCount)
let first = (0..<8).map { ptr[$0] }
print("Kernel completed. First 8 outputs: \(first)")
// Keep sets alive through command buffer completion.
withExtendedLifetime(sets) {}
}
func parseSetCount(_ args: [String]) throws -> Int {
guard let raw = args.dropFirst().first else {
return 33
}
guard let value = Int(raw), value > 0 else {
throw ResidencySetLimitError.invalidSetCount(raw)
}
return value
}
func main() throws {
guard let device = MTLCreateSystemDefaultDevice() else {
throw ResidencySetLimitError.noDevice
}
guard let queue = device.makeCommandQueue() else {
throw ResidencySetLimitError.noQueue
}
let setCount = try parseSetCount(CommandLine.arguments)
print("Device: \(device.name)")
print("Set count: \(setCount)")
if #available(macOS 15.0, *) {
try runExperiment(device: device, queue: queue, setCount: setCount)
} else {
print("Residency sets require macOS 15.0+.")
}
}
do {
try main()
} catch ResidencySetLimitError.invalidSetCount(let raw) {
fputs("Invalid set count: \(raw)\n", stderr)
exit(2)
} catch {
fputs("Experiment failed: \(error)\n", stderr)
exit(1)
}I wanted to see if I could hit a runtime assertion by adding 33 residency sets so I ran it with 1 set, and ran it with 33. The run with 33 produced the same exact error as was present in the GPU Tools profiler. Profiling with no residency sets would yield results that are nowhere near production (e.g. cache evictions that would never happen if memory was wired down) so this workaround would be preventing real profiling and smart code decisions from being made. This leads me to believe that this is actually a bug in Xcode's GPU profiling. If running that Test run: |
|
I think a simpler work-around would be to change here and add a check for if capturing is enabled it's in an M1 GPU then simply don't use residency sets. But if you are not actively blocked by this then I would also leave this in draft and hopefully the underlying bug will be fixed. |
Proposed changes
In issue #2846, it was reported that when profiling a
.gputracegenerated by MLX, Xcode could fail to rebuild the graph due to residency-set replay errors (command queue residency set limit of 32 exceeded).This PR makes Metal capture replay-safe by preventing residency-set queue attachment/mutation while capture is active, and restoring normal behavior after capture ends.
What changed
metal::residency_sets_enabled()/metal::set_residency_sets_enabled(bool).start_capture(...)and re-enabled them atstop_capture().These changes preserve normal runtime behavior outside capture while avoiding replay-time residency-set explosions during Xcode profiling/counters.
Checklist
Put an
xin the boxes that apply.pre-commit run --all-filesto format my code / installed pre-commit prior to committing changes