[RFC] Add vIOMMU and vDevice abstractions for hardware-accelerated nested translation #5

Open
likebreath wants to merge 3 commits into cloud-hypervisor:main from likebreath:0129/rfc_viommu_vdevice

likebreath (Member) commented Jan 30, 2026

This RFC introduces high-level abstractions for building hardware-accelerated virtual IOMMUs using the kernel's iommufd nested translation capabilities. The implementation targets userspace VMMs (e.g., Cloud Hypervisor) that want to expose virtual IOMMUs to guests while leveraging physical IOMMU hardware for performance.

Motivation

The Linux kernel's iommufd subsystem (v6.15+) now supports nested translation, enabling VMMs to offload guest stage-1 page table processing to physical IOMMU hardware. However, the low-level iommufd uAPI involves complex multi-phase workflows that are difficult to implement correctly.

This RFC proposes ergonomic Rust abstractions that encapsulate this complexity.

Architecture Overview

The patch series consists of three commits building from low-level to high-level:

1. Bindings Update (Commit 1/3)

iommufd-bindings: Regenerate bindings from kernel v6.15

Updates bindings to include new kernel structures:

  • iommu_viommu_alloc / iommu_vdevice_alloc
  • iommu_hwpt_invalidate with vIOMMU support
  • ARM SMMUv3 and Intel VT-d specific data structures

2. Low-Level ioctl Wrappers (Commit 2/3)

iommufd-ioctls: Add ioctl wrappers for vIOMMU/vDevice operations

Adds safe Rust wrappers for 6 new ioctls:

  • IOMMU_HWPT_ALLOC: Allocate nested HWPTs (stage-1 and stage-2)
  • IOMMU_VIOMMU_ALLOC: Create vIOMMU instances backed by hardware
  • IOMMU_VDEVICE_ALLOC: Allocate virtual devices under a vIOMMU
  • IOMMU_GET_HW_INFO: Query physical IOMMU capabilities
  • IOMMU_HWPT_INVALIDATE: Forward TLB/cache invalidation to hardware
  • IOMMU_DESTROY: Generic cleanup for iommufd objects
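
As a rough illustration of the wrapper pattern (not the crate's actual code), a safe wrapper typically fills in the argument struct, including its leading `size` field, issues the ioctl, and converts the result into an `io::Result` carrying the new object ID. In the sketch below, `ViommuAllocArgs` is a simplified stand-in for the kernel's `iommu_viommu_alloc`, and `IoctlBackend` and `FakeKernel` are hypothetical names that replace the real `ioctl(2)` call so the example runs without iommufd hardware.

```rust
use std::cell::Cell;
use std::io;
use std::mem::size_of;

/// Simplified stand-in for the kernel's `iommu_viommu_alloc` argument struct.
#[repr(C)]
#[derive(Debug, Default)]
pub struct ViommuAllocArgs {
    pub size: u32,
    pub flags: u32,
    pub viommu_type: u32,
    pub dev_id: u32,
    pub hwpt_id: u32,
    pub out_viommu_id: u32,
}

/// Hypothetical seam standing in for the raw ioctl call.
/// Returns 0 on success or a negative errno value.
pub trait IoctlBackend {
    fn viommu_alloc(&self, args: &mut ViommuAllocArgs) -> i32;
}

/// Safe wrapper: callers pass typed IDs and get back the new vIOMMU ID
/// or an `io::Error`, never a half-initialized struct.
pub fn viommu_alloc<B: IoctlBackend>(
    backend: &B,
    viommu_type: u32,
    dev_id: u32,
    hwpt_id: u32,
) -> io::Result<u32> {
    let mut args = ViommuAllocArgs {
        size: size_of::<ViommuAllocArgs>() as u32,
        viommu_type,
        dev_id,
        hwpt_id,
        ..Default::default()
    };
    let ret = backend.viommu_alloc(&mut args);
    if ret < 0 {
        return Err(io::Error::from_raw_os_error(-ret));
    }
    Ok(args.out_viommu_id)
}

/// Fake kernel that hands out sequential object IDs.
pub struct FakeKernel {
    pub next_id: Cell<u32>,
}

impl IoctlBackend for FakeKernel {
    fn viommu_alloc(&self, args: &mut ViommuAllocArgs) -> i32 {
        const EINVAL: i32 = 22;
        // Reject a wrong struct size or a missing parent HWPT
        // (this fake treats ID 0 as invalid).
        if args.size as usize != size_of::<ViommuAllocArgs>() || args.hwpt_id == 0 {
            return -EINVAL;
        }
        let id = self.next_id.get();
        self.next_id.set(id + 1);
        args.out_viommu_id = id;
        0
    }
}
```

Keeping the raw call behind a small seam like this is one way to unit-test wrapper logic on machines without nested-translation hardware; whether the crate does so is not stated in this RFC.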

3. High-Level Abstractions (Commit 3/3)

iommufd-ioctls: Add vIOMMU and vDevice abstraction layer

Introduces two key abstractions:

IommufdVIommu: Manages the vIOMMU lifecycle

  • Allocates stage-2 HWPT (hypervisor-controlled, linked to IOAS)
  • Pre-allocates bypass and abort stage-1 HWPTs for default modes
  • Provides invalidate_hwpt() for forwarding guest TLB commands
  • Implements Drop for automatic resource cleanup

IommufdVDevice: Represents devices attached to a vIOMMU

  • Dynamically allocates stage-1 HWPTs from guest STE (Stream Table Entry) data
  • Supports hot-swapping HWPT configurations at runtime
  • Queries physical IOMMU hardware capabilities via get_device_hw_info()
  • Implements Drop for proper vDevice teardown

Type-safe enums:

  • IommufdHwptData: ARM SMMUv3 vs Intel VT-d HWPT data
  • IommufdInvalidateData: Architecture-specific invalidation commands
  • IommufdHwInfoData: Hardware capability queries
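
To make the intent of the type-safe enums concrete, here is a minimal sketch under stated assumptions. The names `HwptData`, `HwptDataType`, `ArchMismatch`, and `check_hwpt_data` are hypothetical, and the variant fields are simplified stand-ins for the kernel's per-architecture structures; the real `IommufdHwptData` may be shaped differently.

```rust
/// Which vendor format a stage-1 HWPT payload uses.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HwptDataType {
    ArmSmmuv3,
    VtdS1,
}

/// Architecture-specific stage-1 payload. Each variant carries only the
/// fields that make sense for its architecture (simplified here).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum HwptData {
    ArmSmmuv3 { ste: [u64; 2] },
    VtdS1 { flags: u64, pgtbl_addr: u64, addr_width: u32 },
}

impl HwptData {
    /// The data type tag is derived from the variant, so the tag and the
    /// payload cannot disagree.
    pub fn data_type(&self) -> HwptDataType {
        match self {
            HwptData::ArmSmmuv3 { .. } => HwptDataType::ArmSmmuv3,
            HwptData::VtdS1 { .. } => HwptDataType::VtdS1,
        }
    }
}

/// Error returned when a payload does not match the vIOMMU's architecture.
#[derive(Debug, PartialEq, Eq)]
pub struct ArchMismatch {
    pub expected: HwptDataType,
    pub got: HwptDataType,
}

/// Reject a payload before it would reach the kernel.
pub fn check_hwpt_data(expected: HwptDataType, data: &HwptData) -> Result<(), ArchMismatch> {
    let got = data.data_type();
    if got == expected {
        Ok(())
    } else {
        Err(ArchMismatch { expected, got })
    }
}
```

Because the tag is computed from the variant rather than passed separately, a caller cannot pair an SMMUv3 tag with VT-d data by accident.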

Design Rationale

Multi-Phase Workflow Encapsulation

The kernel's nested translation workflow involves two distinct phases:

Phase 1: Infrastructure Setup (typically at VM boot)

  1. Create IOAS for stage-2 mappings (GPA → HPA)
  2. Allocate parent (stage-2) HWPT
  3. Allocate vIOMMU instance
  4. Allocate vDevice and bind physical device
  5. Attach device to bypass HWPT (until guest enables IOMMU)

Phase 2: Runtime Operations (triggered by guest IOMMU commands)

  1. Trap guest writes to virtual IOMMU command queue
  2. Allocate nested HWPT with guest stage-1 configuration
  3. Re-attach device to nested HWPT
  4. Forward TLB invalidation commands to hardware

IommufdVIommu and IommufdVDevice hide this complexity behind simple constructors and methods, handling proper sequencing and error recovery automatically.
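
The sequencing and error-recovery idea for Phase 1 can be sketched as follows. This is a mock, not the crate's implementation: `Step`, `MockKernel`, and `setup_phase1` are hypothetical names, and the "kernel" only records what was done so that the unwind order on failure is visible.

```rust
/// The five Phase 1 steps, in the order listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Step {
    AllocIoas,
    AllocS2Hwpt,
    AllocViommu,
    AllocVdevice,
    AttachBypass,
}

pub const PHASE1: [Step; 5] = [
    Step::AllocIoas,
    Step::AllocS2Hwpt,
    Step::AllocViommu,
    Step::AllocVdevice,
    Step::AttachBypass,
];

/// Mock kernel: records completed steps and can be told to fail at one step.
#[derive(Default)]
pub struct MockKernel {
    pub fail_at: Option<Step>,
    pub done: Vec<Step>,
    pub log: Vec<String>,
}

impl MockKernel {
    pub fn run(&mut self, step: Step) -> Result<(), String> {
        if self.fail_at == Some(step) {
            return Err(format!("{:?} failed", step));
        }
        self.done.push(step);
        self.log.push(format!("do {:?}", step));
        Ok(())
    }

    /// Undo completed steps in reverse order, so children go before parents.
    pub fn undo_all(&mut self) {
        while let Some(step) = self.done.pop() {
            self.log.push(format!("undo {:?}", step));
        }
    }
}

/// Run Phase 1 in order; on any failure, unwind everything done so far.
pub fn setup_phase1(kernel: &mut MockKernel) -> Result<(), String> {
    for &step in PHASE1.iter() {
        if let Err(e) = kernel.run(step) {
            kernel.undo_all();
            return Err(e);
        }
    }
    Ok(())
}
```

If vDevice allocation fails in this mock, the vIOMMU, the stage-2 HWPT, and the IOAS are released in that order, which is the kind of bookkeeping a constructor can take off the caller's hands.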

Resource Ownership and Safety

  • Uses Arc<IommuFd> for shared iommufd file descriptor ownership
  • IommufdVDevice holds Arc<IommufdVIommu> to enforce lifecycle dependencies
  • Drop implementations ensure kernel resources are released in correct order
  • Type-safe enums prevent architecture mismatches (e.g., passing VT-d data to SMMUv3)
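
A small sketch of how `Arc` plus `Drop` can enforce release order. The `Mock*` types and `demo_drop_order` are hypothetical and only log messages; the real types would issue IOMMU_DESTROY and close the file descriptor instead.

```rust
use std::sync::{Arc, Mutex};

pub type Log = Arc<Mutex<Vec<&'static str>>>;

/// Stand-in for the shared iommufd file descriptor.
pub struct MockIommuFd {
    pub log: Log,
}

impl Drop for MockIommuFd {
    fn drop(&mut self) {
        self.log.lock().unwrap().push("close iommufd");
    }
}

/// Stand-in for a vIOMMU; keeps the fd alive for as long as it exists.
pub struct MockVIommu {
    pub fd: Arc<MockIommuFd>,
}

impl Drop for MockVIommu {
    fn drop(&mut self) {
        self.fd.log.lock().unwrap().push("destroy viommu");
    }
}

/// Stand-in for a vDevice; keeps its parent vIOMMU alive.
pub struct MockVDevice {
    pub viommu: Arc<MockVIommu>,
}

impl Drop for MockVDevice {
    fn drop(&mut self) {
        self.viommu.fd.log.lock().unwrap().push("destroy vdevice");
    }
}

/// Drops the handles in the "wrong" order and returns what was released.
pub fn demo_drop_order() -> Vec<&'static str> {
    let log: Log = Arc::new(Mutex::new(Vec::new()));
    let fd = Arc::new(MockIommuFd { log: log.clone() });
    let viommu = Arc::new(MockVIommu { fd: fd.clone() });
    let dev = MockVDevice { viommu: viommu.clone() };

    // The caller lets go of the parents first; nothing is released yet
    // because the vDevice still holds references to them.
    drop(fd);
    drop(viommu);
    assert!(log.lock().unwrap().is_empty());

    // Dropping the vDevice runs its Drop first, then releases its fields.
    drop(dev);
    let released = log.lock().unwrap().clone();
    released
}
```

Rust runs a value's own `Drop::drop` before dropping its fields, so the child is always torn down before the parent it references, regardless of the order in which the caller drops its handles.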

Current Limitations

  1. ARM SMMUv3 Focus: Intel VT-d support is stubbed with unimplemented!()
  2. Nested Translation Focus: Support for IOMMUFD_OBJ_VEVENTQ (reporting hardware events and errors to the guest) and IOMMUFD_OBJ_HW_QUEUE (direct guest command queue access) is left to future work.

This update incorporates features such as vIOMMU, vDevice, vEVENTQ, and
HW_QUEUE. These features are required to implement a virtual IOMMU
that leverages hardware IOMMU acceleration.

Signed-off-by: Bo Chen <bchen@crusoe.ai>

Extend the iommufd ioctl wrapper library to enable userspace VMMs to
build a virtual IOMMU that can leverage hardware IOMMU acceleration.

Added ioctl wrappers:
- IOMMU_DESTROY: Generic destruction of iommufd objects
- IOMMU_HWPT_ALLOC: Allocate hardware page tables for nested translation
- IOMMU_GET_HW_INFO: Query physical IOMMU hardware capabilities
- IOMMU_HWPT_INVALIDATE: Invalidate cached page table entries
- IOMMU_VIOMMU_ALLOC: Allocate virtual IOMMU instances backed by hardware
- IOMMU_VDEVICE_ALLOC: Allocate virtual devices under a vIOMMU

By exposing these hardware-accelerated operations to userspace, VMMs can
implement nested IOMMU virtualization where the physical IOMMU hardware
directly processes guest page tables, eliminating expensive emulation and
VM exits for IOMMU operations.

Signed-off-by: Bo Chen <bchen@crusoe.ai>

Introduce `IommufdVIommu` and `IommufdVDevice` structs to provide
high-level abstractions for managing IOMMUFD virtual IOMMU instances and
virtual devices [1]. These structures encapsulate the multi-phase workflow
required for hardware-accelerated nested translation — including S2 HWPT
infrastructure setup, vIOMMU/vDevice allocation, and runtime S1 HWPT
management — into clean and ergonomic Rust APIs. This enables VMMs to
build accelerated virtual IOMMUs without directly managing low-level
IOMMUFD object lifecycles and sequencing constraints.

Key changes:
- `IommufdVIommu`: Manages the vIOMMU lifecycle, including Stage-2 HWPT
  allocation and default Stage-1 HWPT configuration (bypass/abort mode).
- `IommufdVDevice`: Represents devices attached to a vIOMMU, supporting
  dynamic Stage-1 HWPT allocation and hardware info queries.
- Type Safety: Add `IommufdInvalidateData`, `IommufdHwInfoData`, and
  `IommufdHwptData` enums to handle architecture-specific data
  (e.g., ARM SMMUv3, Intel VT-d).
- Public interfaces: Provide methods for physical device info retrieval,
  Stage-1 HWPT configuration, and invalidation.
- Resource Management: Implement `Drop` traits to ensure proper resource
  release within the IOMMUFD context.

Note: This implementation primarily targets ARM SMMUv3; Intel VT-d
structures are currently placeholders for future implementation.

[1] https://docs.kernel.org/userspace-api/iommufd.html

Signed-off-by: Bo Chen <bchen@crusoe.ai>

likebreath force-pushed the 0129/rfc_viommu_vdevice branch from 31b9806 to eabe752 on January 30, 2026 18:49

likebreath added a commit to likebreath/vfio that referenced this pull request Jan 30, 2026

Add infrastructure to enable VFIO devices to leverage hardware IOMMU
acceleration through iommufd's uAPIs. This allows userspace VMMs to
attach VFIO devices to hardware-accelerated virtual IOMMUs, particularly
enabling userspace to configure stage-1 (guest-managed) page tables that
are composed with stage-2 (host-managed) page tables in hardware.

This depends on the `IommufdVIommu` and `IommufdVDevice` abstractions
introduced in the iommufd-ioctls crate [1].

New Public Interfaces:

1. VfioIommufd::new() signature change:
   - Added `s1_hwpt_data_type: Option<iommu_hwpt_data_type>` parameter
   - When `Some`, enables nested translation mode for subsequently attached
     VFIO devices
   - Supported types: IOMMU_HWPT_DATA_ARM_SMMUV3, IOMMU_HWPT_DATA_VTD_S1

2. VfioDevice::new_with_iommufd():
   - New constructor for VFIO devices backed by iommufd with
     hardware-accelerated nested HWPT support
   - Automatically creates IommufdVIommu/IommufdVDevice when nested mode
     is enabled via `VfioIommufd`
   - Supports sharing a single `IommufdVIommu` instance across multiple
     VFIO devices
   - Returns `IommufdVDevice` handle for subsequent S1 HWPT operations
   - Attaches device to bypass HWPT by default (until guest enables IOMMU)

3. VfioDevice::install_s1_hwpt():
   - Install guest-configured stage-1 page tables into hardware
   - Called when guest writes to virtual IOMMU stream table entries
   - Atomically replaces existing S1 HWPT if present
   - Uses `IommufdHwptData` enum for type-safe hardware-specific configuration

4. VfioDevice::uninstall_s1_hwpt():
   - Revert device to bypass or abort mode
   - abort=true: Use abort HWPT (fault all DMA)
   - abort=false: Use bypass HWPT (passthrough translation)
   - Called during guest IOMMU reset or shutdown
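
The attachment transitions described for `install_s1_hwpt()` and `uninstall_s1_hwpt()` can be modeled with a small state machine. This is only a sketch: `MockDevice` and `Attachment` are hypothetical, the method signatures are simplified (no HWPT data argument), and the rule that the old nested HWPT is destroyed only after the device is re-attached is an assumption about sensible ordering, not something this commit message states.

```rust
/// What the device is currently attached to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Attachment {
    Bypass,
    Abort,
    Nested { hwpt_id: u32 },
}

/// Mock device that tracks its attachment and which HWPTs were destroyed.
#[derive(Debug)]
pub struct MockDevice {
    pub attachment: Attachment,
    pub next_hwpt_id: u32,
    pub destroyed: Vec<u32>,
}

impl MockDevice {
    /// Devices start on the bypass HWPT until the guest enables the IOMMU.
    pub fn new() -> Self {
        MockDevice {
            attachment: Attachment::Bypass,
            next_hwpt_id: 1,
            destroyed: Vec::new(),
        }
    }

    /// Allocate a new nested HWPT, attach to it, then release the old one.
    pub fn install_s1_hwpt(&mut self) -> u32 {
        let new_id = self.next_hwpt_id;
        self.next_hwpt_id += 1;
        let old = std::mem::replace(&mut self.attachment, Attachment::Nested { hwpt_id: new_id });
        self.release(old);
        new_id
    }

    /// Fall back to the abort HWPT (`abort == true`) or the bypass HWPT.
    pub fn uninstall_s1_hwpt(&mut self, abort: bool) {
        let target = if abort { Attachment::Abort } else { Attachment::Bypass };
        let old = std::mem::replace(&mut self.attachment, target);
        self.release(old);
    }

    /// Only nested HWPTs are per-device; bypass and abort are shared defaults.
    fn release(&mut self, old: Attachment) {
        if let Attachment::Nested { hwpt_id } = old {
            self.destroyed.push(hwpt_id);
        }
    }
}
```

In this model the device is never left detached: every transition swaps the attachment first and releases the previous nested HWPT afterwards.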

Dependencies on iommufd-ioctls:

This implementation builds upon three types from iommufd-ioctls [1]:

- `IommufdVIommu`: Represents a physical IOMMU slice managing S2 HWPT
  and default S1 HWPTs (bypass/abort). Shared across devices behind the
  same virtual IOMMU.

- `IommufdVDevice`: Represents a device attached to an `IommufdVIommu`.
  Handles dynamic S1 HWPT allocation and lifecycle management.

- `IommufdHwptData`: Type-safe enum for architecture-specific HWPT
  configuration (SMMUv3 STE data, VT-d context entries).

Integration Notes for VMMs:

1. VMM creates `VfioIommufd` with `s1_hwpt_data_type` set if
   hardware-accelerated virtual IOMMUs are enabled and used to manage
   VFIO devices
2. VMM calls `VfioDevice::new_with_iommufd()` per passthrough device
   - Devices behind the same virtual IOMMU should share the same
     `IommufdVIommu` instance
   - Each VFIO device has its own `VfioDevice` and `IommufdVDevice`
     instance
3. VMM needs to make sure the virtual IOMMU is compatible with the
   physical IOMMU:
   - `IommufdVDevice::get_hw_info` is used to retrieve hardware
     information of the physical IOMMU
4. VMM traps guest IOMMU commands and calls:
   - `install_s1_hwpt()` when the guest enables the IOMMU
   - `uninstall_s1_hwpt()` when the guest disables the IOMMU
   - `IommufdVIommu::invalidate_hwpt()` when the guest invalidates
     IOTLB entries

This allows a VMM to manage VFIO devices through a hardware-accelerated
virtual IOMMU, with the physical IOMMU hardware processing guest page
tables directly.

[1] cloud-hypervisor/iommufd#5

Signed-off-by: Bo Chen <bchen@crusoe.ai>
likebreath marked this pull request as ready for review January 31, 2026 05:05

sboeuf (Member) commented Feb 2, 2026

@likebreath sorry if that's a question that has already been asked, but why isn't this repo (iommufd) living inside the rust-vmm organization? This is similar to kvm-ioctls/bindings and vfio-ioctls/bindings, right?

likebreath (Member, Author) commented

> @likebreath sorry if that's a question that has already been asked, but why isn't this repo (iommufd) living inside the rust-vmm organization? This is similar to kvm-ioctls/bindings and vfio-ioctls/bindings, right?

Hi @sboeuf, great to hear from you!

You are right - that is the plan once the crate is mature and stable. For now, however, I prefer to decouple active development from the rust-vmm repo integration, particularly given the ongoing efforts to switch rust-vmm to a monorepo structure [1].

[1] https://github.com/rust-vmm/rust-vmm
