dkms: remove CONFIG_DMABUF_MOVENOTIFY gate for P2P enablement #210
Open
dbsanfte wants to merge 1 commit into ROCm:master
The CONFIG_DMABUF_MOVENOTIFY kernel config option was an AMD out-of-tree config that existed in older patched kernels. Since mainline kernel ~5.12, the DMA-buf move_notify callback is built-in unconditionally (part of struct dma_buf_ops) with no separate Kconfig gate.

On mainline kernels (tested on 6.14), CONFIG_DMABUF_MOVENOTIFY is never defined, which causes CONFIG_HSA_AMD_P2P to be disabled even when CONFIG_PCI_P2PDMA=y. This prevents GPU-to-GPU P2P access through PCIe switches (e.g., Broadcom PEX88096) because the IOMMU remap check in amdgpu_device_is_peer_accessible() is compiled out entirely.

Without CONFIG_HSA_AMD_P2P, the driver falls back to raw DMA mask address checking which fails for GPUs behind PCIe switches where BAR addresses (e.g., 62 TiB) exceed the GPU's 44-bit DMA mask, even though IOMMU remapping would make P2P work correctly.

The fix removes the CONFIG_DMABUF_MOVENOTIFY inner check, keeping only the CONFIG_PCI_P2PDMA gate which is the actual functional requirement for PCIe peer-to-peer DMA support.

Tested on:
- 2x AMD Instinct MI50 32GB behind Broadcom PEX88096 Gen4 switch
- Kernel 6.14.0-37-generic (Ubuntu mainline)
- ROCm 6.4.2 with amdgpu DKMS 6.12.12
- Verified: KFD p2p_links, hipDeviceCanAccessPeer, P2P memcpy, rocm-bandwidth-test bidirectional P2P all functional after fix

Signed-off-by: Daniel Sanfte <dbsanfte@users.noreply.github.com>
Collaborator
Looks like the fix got dropped during one of the kernel rebases. Amazing that no one has noticed it since it got missed back in 6.10. It should be MOVE_NOTIFY. I've reached out to the KCL team to get this fix brought back in. You can try to apply it yourself by changing CONFIG_DMABUF_MOVENOTIFY to CONFIG_DMABUF_MOVE_NOTIFY, and that should get you unblocked. I'll leave this open until we get a release done that fixes it.
Motivation
This PR removes a config gate that blocks P2P enablement for GPUs behind PCIe switches on mainline kernels (e.g. Ubuntu).
I found this was necessary to enable P2P communication between my MI50s on Ubuntu 24.04.
Technical Details
NOTE: Summary by Opus 4.6
The CONFIG_DMABUF_MOVENOTIFY kernel config option was an AMD out-of-tree config that existed in older patched kernels. Since mainline kernel ~5.12, the DMA-buf move_notify callback is built-in unconditionally (part of struct dma_buf_ops) with no separate Kconfig gate.
On mainline kernels (tested on 6.14), CONFIG_DMABUF_MOVENOTIFY is never defined, which causes CONFIG_HSA_AMD_P2P to be disabled even when CONFIG_PCI_P2PDMA=y. This prevents GPU-to-GPU P2P access through PCIe switches (e.g., Broadcom PEX88096) because the IOMMU remap check in amdgpu_device_is_peer_accessible() is compiled out entirely.
Without CONFIG_HSA_AMD_P2P, the driver falls back to raw DMA mask address checking, which fails for GPUs behind PCIe switches where BAR addresses (e.g., 62 TiB) lie above the 16 TiB range that the GPU's 44-bit DMA mask can address, even though IOMMU remapping would make P2P work correctly.
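To make the address arithmetic concrete, here is a small user-space sketch of the check described above. It is not the driver's code: `dma_bit_mask`, `bar_within_dma_mask` and `peer_accessible` are hypothetical names, and the 32 GiB BAR size is an assumption chosen to match the MI50's memory size.

```c
#include <stdbool.h>
#include <stdint.h>

#define TIB (1ULL << 40)
#define GIB (1ULL << 30)

/* Mask with the low `bits` bits set (same idea as the kernel's DMA_BIT_MASK). */
uint64_t dma_bit_mask(unsigned bits)
{
    return bits >= 64 ? UINT64_MAX : (1ULL << bits) - 1;
}

/* Raw check: the first and last byte of the BAR must both fit under the mask. */
bool bar_within_dma_mask(uint64_t bar_base, uint64_t bar_size, uint64_t mask)
{
    uint64_t bar_last = bar_base + bar_size - 1;
    return (bar_base & ~mask) == 0 && (bar_last & ~mask) == 0;
}

/*
 * Hypothetical model of the behaviour described in this PR:
 * when P2P support is compiled in and the IOMMU remaps addresses,
 * the raw BAR address does not decide the outcome; otherwise only
 * the raw DMA mask check is left.
 */
bool peer_accessible(bool p2p_compiled_in, bool iommu_remaps,
                     uint64_t bar_base, uint64_t bar_size, uint64_t mask)
{
    if (p2p_compiled_in && iommu_remaps)
        return true;
    return bar_within_dma_mask(bar_base, bar_size, mask);
}
```

With a 44-bit mask the highest addressable byte is 16 TiB - 1, so a BAR placed at 62 TiB fails the raw check regardless of its size, while the same BAR passes once the remap path is available.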
The fix removes the CONFIG_DMABUF_MOVENOTIFY inner check, keeping only the CONFIG_PCI_P2PDMA gate, which is the actual functional requirement for PCIe peer-to-peer DMA support.
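The effect of the nested gate can be modelled with plain preprocessor conditionals. This is only an illustration under assumptions: the real gate lives in the DKMS build configuration, which is not reproduced here, and the `EXAMPLE_*` macros and both functions are hypothetical names. The two `CONFIG_*` symbols are spelled as in this PR, and the simulated configuration mirrors a kernel where `CONFIG_PCI_P2PDMA` is set and `CONFIG_DMABUF_MOVENOTIFY` is never defined.

```c
#include <stdbool.h>

/* Simulated kernel configuration: P2PDMA is enabled ... */
#define CONFIG_PCI_P2PDMA 1
/* ... and CONFIG_DMABUF_MOVENOTIFY is deliberately left undefined. */

/* Before the change: P2P needs both symbols, so it ends up disabled. */
#if defined(CONFIG_PCI_P2PDMA) && defined(CONFIG_DMABUF_MOVENOTIFY)
#define EXAMPLE_P2P_NESTED_GATE 1
#else
#define EXAMPLE_P2P_NESTED_GATE 0
#endif

/* After the change: only CONFIG_PCI_P2PDMA is required. */
#if defined(CONFIG_PCI_P2PDMA)
#define EXAMPLE_P2P_SINGLE_GATE 1
#else
#define EXAMPLE_P2P_SINGLE_GATE 0
#endif

/* Hypothetical helpers exposing the compile-time result at run time. */
bool p2p_enabled_nested_gate(void) { return EXAMPLE_P2P_NESTED_GATE; }
bool p2p_enabled_single_gate(void) { return EXAMPLE_P2P_SINGLE_GATE; }
```

Because the inner symbol is never defined in this simulated configuration, the nested form silently evaluates to "disabled" without any build error, which is consistent with the gate going unnoticed.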
Test Plan
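Tested on:
- 2x AMD Instinct MI50 32GB behind Broadcom PEX88096 Gen4 switch
- Kernel 6.14.0-37-generic (Ubuntu mainline)
- ROCm 6.4.2 with amdgpu DKMS 6.12.12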
Test Result
Verified: KFD p2p_links, hipDeviceCanAccessPeer, P2P memcpy, rocm-bandwidth-test bidirectional P2P all functional after fix
Submission Checklist