[RFC] Add vIOMMU and vDevice abstractions for hardware-accelerated nested translation #5
Open
likebreath wants to merge 3 commits into cloud-hypervisor:main from
This update incorporates new features, including vIOMMU, vDevice, vEVENTQ, and HW_QUEUE. These features are required to implement a virtual IOMMU that leverages hardware IOMMU acceleration.
Signed-off-by: Bo Chen <bchen@crusoe.ai>
Extend the iommufd ioctl wrapper library to enable userspace VMMs to build virtual IOMMU that can leverage hardware IOMMU acceleration.
Added ioctl wrappers:
- IOMMU_DESTROY: Generic destruction of iommufd objects
- IOMMU_HWPT_ALLOC: Allocate hardware page tables for nested translation
- IOMMUFD_GET_HW_INFO: Query physical IOMMU hardware capabilities
- IOMMUFD_HWPT_INVALIDATE: Invalidate cached page table entries
- IOMMU_VIOMMU_ALLOC: Allocate virtual IOMMU instances backed by hardware
- IOMMU_VDEVICE_ALLOC: Allocate virtual devices under a vIOMMU
By exposing these hardware-accelerated operations to userspace, VMMs can implement nested IOMMU virtualization where the physical IOMMU hardware directly processes guest page tables, eliminating expensive emulation and VM exits for IOMMU operations.
Signed-off-by: Bo Chen <bchen@crusoe.ai>
Introduce `IommufdVIommu` and `IommufdVDevice` structs to provide high-level abstractions for managing IOMMUFD virtual IOMMU instances and virtual devices [1]. These structures encapsulate the multi-phase workflow required for hardware-accelerated nested translation (S2 HWPT infrastructure setup, vIOMMU/vDevice allocation, and runtime S1 HWPT management) into clean and ergonomic Rust APIs. This enables VMMs to build accelerated virtual IOMMUs without directly managing low-level IOMMUFD object lifecycles and sequencing constraints.
Key changes:
- `IommufdVIommu`: Manages the vIOMMU lifecycle, including Stage-2 HWPT allocation and default Stage-1 HWPT configuration (bypass/abort mode).
- `IommufdVDevice`: Represents devices attached to a vIOMMU, supporting dynamic Stage-1 HWPT allocation and hardware info queries.
- Type Safety: Add `IommufdInvalidateData`, `IommufdHwInfoData`, and `IommufdHwptData` enums to handle architecture-specific data (e.g., ARM SMMUv3, Intel VT-d).
- Public interfaces: Provide methods for physical device info retrieval, Stage-1 HWPT configuration, and invalidation.
- Resource Management: Implement `Drop` traits to ensure proper resource release within the IOMMUFD context.
Note: This implementation primarily targets ARM SMMUv3; Intel VT-d structures are currently placeholders for future implementation.
[1] https://docs.kernel.org/userspace-api/iommufd.html
Signed-off-by: Bo Chen <bchen@crusoe.ai>
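As an illustration of the "Type Safety" item, here is a minimal sketch of how an enum can tie each architecture's payload to its data-type tag, so a caller cannot pair an SMMUv3 payload with a VT-d tag. Every name below (`HwptData`, `HwptDataType`, `SmmuV3Ste`, `VtdS1`) and every field layout is a hypothetical stand-in chosen for the example; none of it is taken from the crate's actual definitions.

```rust
/// Hypothetical tag naming the vendor format of a stage-1 payload.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HwptDataType {
    ArmSmmuV3,
    VtdS1,
}

/// Hypothetical, simplified stand-in for ARM SMMUv3 stage-1 data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SmmuV3Ste {
    pub ste: [u64; 2],
}

/// Hypothetical, simplified stand-in for Intel VT-d stage-1 data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct VtdS1 {
    pub pgtbl_addr: u64,
    pub addr_width: u32,
}

/// One variant per architecture: the payload type decides the tag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HwptData {
    ArmSmmuV3(SmmuV3Ste),
    VtdS1(VtdS1),
}

impl HwptData {
    /// Tag derived from the variant, never supplied separately by the caller.
    pub fn data_type(&self) -> HwptDataType {
        match self {
            HwptData::ArmSmmuV3(_) => HwptDataType::ArmSmmuV3,
            HwptData::VtdS1(_) => HwptDataType::VtdS1,
        }
    }

    /// Size in bytes of the payload held by this variant.
    pub fn payload_len(&self) -> usize {
        match self {
            HwptData::ArmSmmuV3(d) => std::mem::size_of_val(d),
            HwptData::VtdS1(d) => std::mem::size_of_val(d),
        }
    }
}
```

Because the tag and length are computed from the variant, a wrapper built this way would only need the enum value to fill in the type and length fields of a request.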
31b9806 to eabe752
likebreath added a commit to likebreath/vfio that referenced this pull request Jan 30, 2026
Add infrastructure to enable VFIO devices to leverage hardware IOMMU
acceleration through iommufd's uAPIs. This allows userspace VMMs to
attach VFIO devices to hardware-accelerated virtual IOMMUs, particularly
enabling userspace to configure stage-1 (guest-managed) page tables that
are composed with stage-2 (host-managed) page tables in hardware.
This depends on the `IommufdVIommu` and `IommufdVDevice` abstractions
introduced in the iommufd-ioctls crate [1].
New Public Interfaces:
1. VfioIommufd::new() signature change:
- Added `s1_hwpt_data_type: Option<iommu_hwpt_data_type>` parameter
- When `Some`, enables nested translation mode for subsequently attached
VFIO devices
- Supported types: IOMMU_HWPT_DATA_ARM_SMMUV3, IOMMU_HWPT_DATA_VTD_S1
2. VfioDevice::new_with_iommufd():
- New constructor for vfio devices backed by iommufd with
hardware-accelerated nested HWPT support
- Automatically creates IommufdVIommu/IommufdVDevice when nested mode
is enabled via `VfioIommufd`
- Supports sharing a single `IommufdVIommu` instance across multiple
VFIO devices
- Returns `IommufdVDevice` handle for subsequent S1 HWPT operations
- Attaches device to bypass HWPT by default (until guest enables IOMMU)
3. VfioDevice::install_s1_hwpt():
- Install guest-configured stage-1 page tables into hardware
- Called when guest writes to virtual IOMMU stream table entries
- Atomically replaces existing S1 HWPT if present
- Uses `IommufdHwptData` enum for type-safe hardware-specific configuration
4. VfioDevice::uninstall_s1_hwpt():
- Revert device to bypass or abort mode
- abort=true: Use abort HWPT (fault all DMA)
- abort=false: Use bypass HWPT (passthrough translation)
- Called during guest IOMMU reset or shutdown
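The install/uninstall behavior listed above can be pictured as a small state machine per device: bypass by default, nested once a stage-1 HWPT is installed, and bypass or abort after uninstall. The sketch below models only that bookkeeping. `S1Attachment`, `DeviceS1State`, the ID counter, and the return values are assumptions made for illustration; the real methods take hardware-specific data and talk to the kernel.

```rust
/// Hypothetical model of what a device's stage-1 is currently attached to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum S1Attachment {
    /// Default after attach: DMA passes through stage-1 untranslated.
    Bypass,
    /// All DMA faults.
    Abort,
    /// A guest-configured stage-1 HWPT is installed.
    Nested { hwpt_id: u32 },
}

/// Hypothetical per-device bookkeeping; IDs are handed out by a counter
/// here, whereas real IDs would come from the kernel.
pub struct DeviceS1State {
    current: S1Attachment,
    next_id: u32,
}

impl DeviceS1State {
    /// Devices start in bypass until the guest enables its IOMMU.
    pub fn new() -> Self {
        Self { current: S1Attachment::Bypass, next_id: 1 }
    }

    pub fn current(&self) -> S1Attachment {
        self.current
    }

    /// Switch to a new nested HWPT. Returns the ID of the HWPT that was
    /// replaced, if any, so the caller can destroy it afterwards.
    pub fn install_s1_hwpt(&mut self) -> Option<u32> {
        let new_id = self.next_id;
        self.next_id += 1;
        let old = std::mem::replace(
            &mut self.current,
            S1Attachment::Nested { hwpt_id: new_id },
        );
        match old {
            S1Attachment::Nested { hwpt_id } => Some(hwpt_id),
            _ => None,
        }
    }

    /// Revert to abort (`abort == true`) or bypass (`abort == false`).
    /// Returns the ID of the nested HWPT that was detached, if any.
    pub fn uninstall_s1_hwpt(&mut self, abort: bool) -> Option<u32> {
        let target = if abort { S1Attachment::Abort } else { S1Attachment::Bypass };
        match std::mem::replace(&mut self.current, target) {
            S1Attachment::Nested { hwpt_id } => Some(hwpt_id),
            _ => None,
        }
    }
}
```

Returning the replaced ID keeps the switch and the cleanup separate, which matches the "replace, then release" ordering implied by replacing an existing S1 HWPT.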
Dependencies on iommufd-ioctls:
This implementation builds upon three types from iommufd-ioctls [1]:
- `IommufdVIommu`: Represents a physical IOMMU slice managing S2 HWPT
and default S1 HWPTs (bypass/abort). Shared across devices behind the
same virtual IOMMU.
- `IommufdVDevice`: Represents a device attached to a `IommufdVIommu`.
Handles dynamic S1 HWPT allocation and lifecycle management.
- `IommufdHwptData`: Type-safe enum for architecture-specific HWPT
configuration (SMMUv3 STE data, VT-d context entries).
Integration Notes for VMMs:
1. VMM creates `VfioIommufd` with `s1_hwpt_data_type` set to `Some`
when hardware-accelerated virtual IOMMUs are used to manage
VFIO devices
2. VMM calls `VfioDevice::new_with_iommufd()` per passthrough device
- All devices behind the same virtual IOMMU should share one
`IommufdVIommu` instance
- Each VFIO device has its own `VfioDevice` and `IommufdVDevice`
instance
3. VMM needs to make sure the virtual IOMMU is compatible with the
physical IOMMU:
- `IommufdVDevice::get_hw_info` retrieves hardware
information about the physical IOMMU
4. VMM traps guest IOMMU commands and calls:
- `install_s1_hwpt()` when the guest enables the IOMMU
- `uninstall_s1_hwpt()` when the guest disables the IOMMU
- `IommufdVIommu::invalidate_hwpt()` when the guest invalidates IOTLB
entries
This lets a VMM use a hardware-accelerated IOMMU to manage VFIO
devices, with the physical IOMMU hardware processing guest page
tables directly.
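The trap-and-forward step in the integration notes might look roughly like the dispatcher below. The three method names come from the notes above, but their signatures, the `GuestIommuEvent` enum, the `NestedBackend` trait, and the `RecordingBackend` test double are all hypothetical and exist only to make the control flow visible.

```rust
/// Hypothetical events a VMM could decode from trapped guest IOMMU accesses.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum GuestIommuEvent {
    EnableTranslation { device: u32 },
    DisableTranslation { device: u32, abort: bool },
    InvalidateIotlb,
}

/// Hypothetical backend trait. Method names follow the integration notes;
/// the parameter lists are assumptions made for this sketch.
pub trait NestedBackend {
    fn install_s1_hwpt(&mut self, device: u32);
    fn uninstall_s1_hwpt(&mut self, device: u32, abort: bool);
    fn invalidate_hwpt(&mut self);
}

/// Map one decoded guest event to the matching backend call.
pub fn dispatch<B: NestedBackend>(backend: &mut B, event: &GuestIommuEvent) {
    match event {
        GuestIommuEvent::EnableTranslation { device } => {
            backend.install_s1_hwpt(*device)
        }
        GuestIommuEvent::DisableTranslation { device, abort } => {
            backend.uninstall_s1_hwpt(*device, *abort)
        }
        GuestIommuEvent::InvalidateIotlb => backend.invalidate_hwpt(),
    }
}

/// Test double that records calls instead of touching hardware.
#[derive(Default)]
pub struct RecordingBackend {
    pub calls: Vec<String>,
}

impl NestedBackend for RecordingBackend {
    fn install_s1_hwpt(&mut self, device: u32) {
        self.calls.push(format!("install_s1_hwpt({device})"));
    }
    fn uninstall_s1_hwpt(&mut self, device: u32, abort: bool) {
        self.calls.push(format!("uninstall_s1_hwpt({device}, abort={abort})"));
    }
    fn invalidate_hwpt(&mut self) {
        self.calls.push("invalidate_hwpt()".to_string());
    }
}
```

Keeping the decode step (producing `GuestIommuEvent`) apart from the dispatch step would let the same dispatcher serve different virtual IOMMU models.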
[1] cloud-hypervisor/iommufd#5
Signed-off-by: Bo Chen <bchen@crusoe.ai>
Member
@likebreath sorry if that's a question that has already been asked, but why isn't this repo (iommufd) living inside the rust-vmm organization? This is similar to kvm-ioctls/bindings and vfio-ioctls/bindings, right?
Member Author
Hi @sboeuf, great to hear from you! You are right - that is the plan once the crate is mature and stable. For now, however, I prefer to decouple active development from the rust-vmm repo integration, particularly given the ongoing efforts to switch rust-vmm to a monorepo structure [1].
This RFC introduces high-level abstractions for building hardware-accelerated virtual IOMMUs using the kernel's iommufd nested translation capabilities. The implementation targets userspace VMMs (e.g., Cloud Hypervisor) that want to expose virtual IOMMUs to guests while leveraging physical IOMMU hardware for performance.
Motivation
The Linux kernel's iommufd subsystem (v6.15+) now supports nested translation, enabling VMMs to offload guest stage-1 page table processing to physical IOMMU hardware. However, the low-level iommufd uAPI involves complex multi-phase workflows that are error-prone to implement correctly.
This RFC proposes ergonomic Rust abstractions that encapsulate this complexity.
Architecture Overview
The patch series consists of three commits building from low-level to high-level:
1. Bindings Update (Commit 1/3)
Updates bindings to include new kernel structures:
- `iommu_viommu_alloc` / `iommu_vdevice_alloc`
- `iommu_hwpt_invalidate` with vIOMMU support
2. Low-Level ioctl Wrappers (Commit 2/3)
Adds safe Rust wrappers for 6 new ioctls:
- IOMMU_HWPT_ALLOC: Allocate nested HWPTs (stage-1 and stage-2)
- IOMMU_VIOMMU_ALLOC: Create vIOMMU instances backed by hardware
- IOMMU_VDEVICE_ALLOC: Allocate virtual devices under a vIOMMU
- IOMMUFD_GET_HW_INFO: Query physical IOMMU capabilities
- IOMMUFD_HWPT_INVALIDATE: Forward TLB/cache invalidation to hardware
- IOMMU_DESTROY: Generic cleanup for iommufd objects
3. High-Level Abstractions (Commit 3/3)
Introduces two key abstractions:
- `IommufdVIommu`: Manages the vIOMMU lifecycle
  - `invalidate_hwpt()` for forwarding guest TLB commands
  - `Drop` for automatic resource cleanup
- `IommufdVDevice`: Represents devices attached to a vIOMMU
  - `get_device_hw_info()`
  - `Drop` for proper vDevice teardown
Type-safe enums:
- `IommufdHwptData`: ARM SMMUv3 vs Intel VT-d HWPT data
- `IommufdInvalidateData`: Architecture-specific invalidation commands
- `IommufdHwInfoData`: Hardware capability queries
Design Rationale
Multi-Phase Workflow Encapsulation
The kernel's nested translation workflow involves two distinct phases:
Phase 1: Infrastructure Setup (typically at VM boot)
Phase 2: Runtime Operations (triggered by guest IOMMU commands)
`IommufdVIommu` and `IommufdVDevice` hide this complexity behind simple constructors and methods, handling proper sequencing and error recovery automatically.
Resource Ownership and Safety
- `Arc<IommuFd>` for shared iommufd file descriptor ownership
- `IommufdVDevice` holds `Arc<IommufdVIommu>` to enforce lifecycle dependencies
- `Drop` implementations ensure kernel resources are released in correct order
Current Limitations
`IOMMUFD_OBJ_VEVENTQ` (for reporting hardware events/errors to guest), `IOMMUFD_OBJ_HW_QUEUE` (for direct guest command queue access).
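The ownership scheme described under Resource Ownership and Safety, where a vDevice holds an `Arc` to its vIOMMU so the parent cannot be released first, can be demonstrated with plain standard-library types. In the sketch below, `VIommu`, `VDevice`, `DropLog`, and `teardown_order` are hypothetical names, and the `Drop` bodies only append to a log where a real implementation would destroy the kernel object.

```rust
use std::sync::{Arc, Mutex};

/// Shared log used only to observe the order in which `Drop` runs.
type DropLog = Arc<Mutex<Vec<&'static str>>>;

/// Hypothetical stand-in for a vIOMMU handle.
pub struct VIommu {
    log: DropLog,
}

impl Drop for VIommu {
    fn drop(&mut self) {
        // A real implementation would destroy the kernel vIOMMU object here.
        self.log.lock().unwrap().push("viommu destroyed");
    }
}

/// Hypothetical stand-in for a vDevice handle. Holding `Arc<VIommu>`
/// keeps the parent alive for as long as any child exists.
pub struct VDevice {
    _viommu: Arc<VIommu>,
    log: DropLog,
}

impl Drop for VDevice {
    fn drop(&mut self) {
        // Runs before the fields (including `_viommu`) are dropped.
        self.log.lock().unwrap().push("vdevice destroyed");
    }
}

/// Drop the owner's vIOMMU handle first, then the device, and report
/// the order in which the two objects were actually torn down.
pub fn teardown_order() -> Vec<&'static str> {
    let log: DropLog = Arc::new(Mutex::new(Vec::new()));
    let viommu = Arc::new(VIommu { log: Arc::clone(&log) });
    let dev = VDevice {
        _viommu: Arc::clone(&viommu),
        log: Arc::clone(&log),
    };

    drop(viommu); // the device still holds a reference, so nothing is destroyed yet
    drop(dev); // the device goes first, then the last vIOMMU reference

    let order = log.lock().unwrap().clone();
    order
}
```

Even though the caller releases its vIOMMU handle before the device, the child is torn down first, because Rust runs a value's `Drop::drop` before dropping its fields.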