[Arrow Lake] System Hangs & >4GB Allocation Failures on OEM Device with Fixed 256MB BAR #890

@psilofski

System Info:

CPU: Intel Core Ultra 7 (Arrow Lake 255H / ThinkBook 16 G8)
GPU: Intel Arc iGPU (Xe-LPG / 140T)
OS: Ubuntu 24.04 (kernel 6.17). The issue also reproduces on Windows 11 with the latest drivers.
Driver: Intel Compute Runtime (Level Zero / NEO)

Issue Description:
I am encountering two distinct, reproducible failure modes on this Arrow Lake platform under compute workloads (PyTorch/OpenVINO):

  1. Allocation Failure (OOM): The runtime fails to allocate a single contiguous memory block larger than ~4GB, even though the system has 60GB+ of free RAM. Allocating the same total amount in smaller chunks succeeds.
  2. System Instability (Kernel Panic): During heavy compute tasks with high-bandwidth memory access (e.g., VAE Decode, Large Context LLM), the system hard-freezes or kernel-panics. I suspect GTT thrashing, but have not confirmed it.
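For anyone trying to reproduce the first failure mode, here is a minimal sketch of the "single block vs. chunks" comparison. The helper name `probe_allocation` and the `alloc` callable are hypothetical; the allocator is passed in so the same logic can be pointed at whichever framework is in use. It assumes allocation failures surface as `MemoryError` or `RuntimeError`, which may not hold for every runtime.

```python
from typing import Callable, Dict, List

GIB = 1024 ** 3


def probe_allocation(
    alloc: Callable[[int], object], total_bytes: int, chunk_bytes: int
) -> Dict[str, object]:
    """Try one contiguous allocation, then the same total split into chunks.

    `alloc` is any callable that takes a byte count and returns a buffer
    (hypothetical interface, not part of any driver or framework API).
    """
    result: Dict[str, object] = {"single": False, "chunked": False}

    # Attempt 1: one contiguous block of the full size.
    try:
        block = alloc(total_bytes)
        result["single"] = True
        del block  # release before the chunked attempt
    except (MemoryError, RuntimeError) as exc:
        result["single_error"] = str(exc)

    # Attempt 2: the same total, held simultaneously as smaller chunks.
    held: List[object] = []
    try:
        remaining = total_bytes
        while remaining > 0:
            size = min(chunk_bytes, remaining)
            held.append(alloc(size))
            remaining -= size
        result["chunked"] = True
    except (MemoryError, RuntimeError) as exc:
        result["chunked_error"] = str(exc)
    finally:
        held.clear()

    return result
```

With a PyTorch build that has XPU support (an assumption about the local setup), this could be called as `probe_allocation(lambda n: torch.empty(n, dtype=torch.uint8, device="xpu"), 6 * GIB, 1 * GIB)`. On the affected machine I would expect `single` to be `False` and `chunked` to be `True`.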

Cross-Validation:
This behavior (hard freezes under heavy load, allocation limits) is observed on both Windows 11 and Linux, which strongly suggests a platform-level firmware constraint rather than an OS-specific driver bug.

Root Cause Investigation:
lspci indicates that the device supports Physical Resizable BAR. However, the OEM firmware (Lenovo) locks the CPU-visible aperture to a legacy 256 MB, with no exposed option to enable or resize it.
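The BAR sizes can also be cross-checked from sysfs on Linux. The sketch below parses the PCI device's `resource` file, where each line holds a start address, end address, and flags in hex. The function names are hypothetical, and the default address `0000:00:02.0` is only the usual location of an Intel iGPU, so it should be confirmed with lspci first.

```python
from pathlib import Path
from typing import List, Tuple


def parse_bar_sizes(resource_text: str) -> List[Tuple[int, int]]:
    """Return (region_index, size_in_bytes) for each populated PCI region.

    Expects the format of /sys/bus/pci/devices/<bdf>/resource:
    one "start end flags" triple of hex values per line.
    """
    sizes = []
    for index, line in enumerate(resource_text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        start, end, _flags = (int(field, 16) for field in fields)
        if start == 0 and end == 0:
            continue  # unused region slot
        sizes.append((index, end - start + 1))
    return sizes


def read_bar_sizes(bdf: str = "0000:00:02.0") -> List[Tuple[int, int]]:
    """Read region sizes for a device; the default address is an assumption."""
    path = Path(f"/sys/bus/pci/devices/{bdf}/resource")
    return parse_bar_sizes(path.read_text())


if __name__ == "__main__":
    for index, size in read_bar_sizes():
        print(f"region {index}: {size / 2**20:.0f} MiB")
```

A region reported as 256 MiB here would match the fixed aperture described above. Which region index corresponds to the aperture is not something this script determines.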

Context:
My understanding is that Intel Arc iGPUs (Xe-LPG) share the same Arc driver stack, virtual memory model, and BAR-style aperture management as discrete Arc GPUs. Discrete Arc GPUs are documented as requiring ReBAR for optimal performance and stability.

Questions:

  1. Architecture: Does the Arrow Lake Arc iGPU share the architectural requirement for Large/Resizable BAR to ensure stability under heavy compute workloads?
  2. Compliance: Is the Compute Runtime expected to handle >4GB contiguous allocations and heavy thrashing gracefully within a 256 MB aperture, or is this considered an unsupported or out-of-spec firmware configuration for this platform?
  3. Triage: Should these crashes be filed as a memory-management bug in the driver, or is this a platform limitation that must be resolved by the OEM firmware?

Goal:
I am trying to determine whether to open a bug report against the driver's memory manager or if I have grounds to escalate this as a firmware defect to the OEM.

Any clarification on the architectural expectations for BAR sizing on Arrow Lake would be greatly appreciated.
