Description
System Info:
CPU: Intel Core Ultra 7 (Arrow Lake 255H / ThinkBook 16 G8)
GPU: Intel Arc iGPU (Xe-LPG / 140T)
OS: Ubuntu 24.04 (Kernel 6.17). (Issue also verified reproducible on Windows 11 with latest drivers)
Driver: Intel Compute Runtime (Level Zero / NEO)
Issue Description:
I am encountering two distinct, reproducible failure modes on this Arrow Lake platform under compute workloads (PyTorch/OpenVINO):
- Allocation Failure (OOM): The runtime fails to allocate any single contiguous memory block larger than ~4 GB, even though the system has 60 GB+ of free RAM. Allocating the same total amount in smaller chunks succeeds.
- System Instability (Hard Freeze / Kernel Panic): During heavy compute tasks involving high-bandwidth memory access (e.g., VAE decode, large-context LLM inference), the system hard-freezes or kernel-panics. I suspect GTT thrashing, but have not confirmed it.
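For reference, the allocation failure can be probed with something like the sketch below. It tries one contiguous allocation and then the same total in smaller chunks. The helper names (`probe_allocation`, `xpu_alloc`) are hypothetical, the 6 GiB / 1 GiB sizes are arbitrary, and `xpu_alloc` assumes a PyTorch build with XPU support; the exact exception type raised by the runtime may differ.

```python
from typing import Callable, Dict

GIB = 1024 ** 3


def probe_allocation(alloc: Callable[[int], object],
                     total_bytes: int,
                     chunk_bytes: int) -> Dict[str, str]:
    """Try one contiguous allocation, then the same total split into chunks."""
    result = {}

    # Attempt 1: a single contiguous block of total_bytes.
    try:
        block = alloc(total_bytes)
        result["single"] = "ok"
        del block  # release before the chunked attempt
    except (RuntimeError, MemoryError) as exc:
        result["single"] = f"failed: {exc}"

    # Attempt 2: the same total, held simultaneously as smaller chunks.
    chunks = []
    try:
        remaining = total_bytes
        while remaining > 0:
            size = min(chunk_bytes, remaining)
            chunks.append(alloc(size))
            remaining -= size
        result["chunked"] = "ok"
    except (RuntimeError, MemoryError) as exc:
        result["chunked"] = f"failed: {exc}"
    finally:
        chunks.clear()

    return result


def xpu_alloc(nbytes: int):
    """Allocate nbytes on the Intel GPU (assumes PyTorch with XPU support)."""
    import torch  # imported lazily so the probe itself has no hard dependency
    return torch.empty(nbytes, dtype=torch.uint8, device="xpu")
```

Calling `probe_allocation(xpu_alloc, 6 * GIB, 1 * GIB)` on the affected machine should show whether the single allocation fails while the chunked one succeeds, which is the behavior described above.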
Cross-Validation:
Both behaviors (hard freezes under heavy load and the allocation limit) occur on Windows 11 as well as Linux, which suggests a platform-level firmware constraint rather than an OS-specific driver bug.
Root Cause Investigation:
lspci indicates that the device supports Physical Resizable BAR. However, the OEM firmware (Lenovo) locks the CPU-visible aperture to the legacy 256 MB size and exposes no option to enable or resize it.
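The assigned BAR sizes can also be cross-checked from sysfs on Linux. The sketch below parses the per-device `resource` file, where each line holds a start address, end address, and flags in hex. The function names are hypothetical, and `0000:00:02.0` is only an assumed address for the iGPU; it should be confirmed with lspci.

```python
from pathlib import Path
from typing import Dict

MIB = 1024 ** 2


def parse_pci_resource(text: str) -> Dict[int, int]:
    """Map each populated resource index to its size in bytes."""
    sizes = {}
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, _flags = (int(field, 16) for field in fields)
        if end > start:  # unused entries are all zeros
            sizes[index] = end - start + 1
    return sizes


def read_bar_sizes(bdf: str = "0000:00:02.0") -> Dict[int, int]:
    """Read BAR sizes for a PCI device (the default address is an assumption)."""
    path = Path(f"/sys/bus/pci/devices/{bdf}/resource")
    return parse_pci_resource(path.read_text())


def format_sizes(sizes: Dict[int, int]) -> str:
    """Render sizes in MiB, one resource per line."""
    return "\n".join(f"resource {i}: {s / MIB:.0f} MiB" for i, s in sorted(sizes.items()))
```

Running `print(format_sizes(read_bar_sizes()))` lists each populated region; a 256 MiB entry for the aperture BAR would match what lspci shows here.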
Context:
My understanding is that Intel Arc iGPUs (Xe-LPG) share the same Arc driver stack, virtual memory model, and BAR-style aperture management as discrete Arc GPUs. Discrete Arc GPUs are documented as requiring ReBAR for optimal performance and stability.
Questions:
- Architecture: Does the Arrow Lake Arc iGPU share the architectural requirement for Large/Resizable BAR to ensure stability under heavy compute workloads?
- Compliance: Is the Compute Runtime expected to handle >4 GB contiguous allocations and heavy memory thrashing gracefully within a 256 MB aperture, or is this considered an unsupported or out-of-spec firmware configuration for this platform?
- Triage: Should these crashes be filed as a memory-management bug in the driver, or is this a platform limitation that must be resolved by the OEM firmware?
Goal:
I am trying to determine whether to open a bug report against the driver's memory manager or to escalate this to the OEM as a firmware defect.
Any clarification on the architectural expectations for BAR sizing on Arrow Lake would be greatly appreciated.