GPU driver installation fails on 2.2.52-debian12 #1318

@zafercavdar

The initialization script fails with the following output:

nvidia-smi not installed
/etc/google-dataproc/startup-scripts/dataproc-initialization-script-0: line 293: is_cudnn8: command not found
Files removed: 308 (2317.2 MB)
Writing to /root/.config/pip/pip.conf
gpg: keybox '/usr/share/keyrings/adoptium.gpg' created
gpg: directory '/root/.gnupg' created
gpg: /root/.gnupg/trustdb.gpg: trustdb created
gpg: key 843C48A565F8F04B: public key "Adoptium GPG Key (DEB/RPM Signing Key) <temurin-dev@eclipse.org>" imported
gpg: Total number processed: 1
gpg:               imported: 1
gpg: key C0BA5CE6DC6315A3: public key "Artifact Registry Repository Signer <artifact-registry-repository-signer@google.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1
gpg: keybox '/usr/share/keyrings/docker-keyring.gpg' created
gpg: key 8D81803C0EBFCD88: public key "Docker Release (CE deb) <docker@docker.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1
/etc/apt/sources.list.d/google-cloud.list
gpg: keybox '/usr/share/keyrings/cloud.google.gpg' created
gpg: key C0BA5CE6DC6315A3: public key "Artifact Registry Repository Signer <artifact-registry-repository-signer@google.com>" imported
gpg: Total number processed: 1
gpg:               imported: 1
Reading package lists...
Building dependency tree...
Reading state information...
0 upgraded, 0 newly installed, 0 to remove and 31 not upgraded.
Canceled hold on systemd.
Canceled hold on libsystemd0.

real	0m8.038s
user	0m4.309s
sys	0m1.458s
nvidia-smi not installed
acl:
- entity: project-owners-****************
  projectTeam:
    projectNumber: '****************'
    team: owners
  role: OWNER
- entity: project-editors-****************
  projectTeam:
    projectNumber: '****************'
    team: editors
  role: OWNER
- entity: project-viewers-****************
  projectTeam:
    projectNumber: '****************'
    team: viewers
  role: READER
- email: ****************-compute@developer.gserviceaccount.com
  entity: user-****************-compute@developer.gserviceaccount.com
  role: OWNER
bucket: dataproc-temp-europe-west1-****************-jupf0b8s
component_count: 7
content_type: application/octet-stream
crc32c_hash: pOhoiw==
creation_time: 2025-04-22T07:38:19+0000
etag: CPzuiYyR64wDEAE=
generation: '1745307499132796'
metageneration: 1
name: dpgce-packages/nvidia/NVIDIA-Linux-x86_64-550.142.run
size: 307296728
storage_class: STANDARD
storage_class_update_time: 2025-04-22T07:38:19+0000
storage_url: gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/NVIDIA-Linux-x86_64-550.142.run#1745307499132796
update_time: 2025-04-22T07:38:19+0000
Copying gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/NVIDIA-Linux-x86_64-550.142.run to file:///mnt/shm/userspace.run
  
....

Average throughput: 691.9MiB/s

real	0m2.103s
user	0m2.386s
sys	0m1.030s

real	0m20.568s
user	0m5.847s
sys	0m5.105s
/opt/install-dpgce /
acl:
- entity: project-owners-****************
  projectTeam:
    projectNumber: '****************'
    team: owners
  role: OWNER
- entity: project-editors-****************
  projectTeam:
    projectNumber: '****************'
    team: editors
  role: OWNER
- entity: project-viewers-****************
  projectTeam:
    projectNumber: '****************'
    team: viewers
  role: READER
- email: ****************-compute@developer.gserviceaccount.com
  entity: user-****************-compute@developer.gserviceaccount.com
  role: OWNER
bucket: dataproc-temp-europe-west1-****************-jupf0b8s
content_type: application/x-tar
crc32c_hash: hUkg3A==
creation_time: 2025-04-22T07:40:18+0000
etag: CKHpisWR64wDEAE=
generation: '1745307618686113'
md5_hash: u5lHOXdDD2qH/CYP1wGVxw==
metageneration: 1
name: dpgce-packages/nvidia/kmod/debian12/6.1.0-32-cloud-amd64/unsigned/kmod_debian12_550.142.tar.gz
size: 25508565
storage_class: STANDARD
storage_class_update_time: 2025-04-22T07:40:18+0000
storage_url: gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/kmod/debian12/6.1.0-32-cloud-amd64/unsigned/kmod_debian12_550.142.tar.gz#1745307618686113
update_time: 2025-04-22T07:40:18+0000
cache hit
opt/install-dpgce/open-gpu-kernel-modules/kernel-open/build.log
opt/install-dpgce/open-gpu-kernel-modules/kernel-open/build_error.log
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-uvm.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-drm.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-peermem.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia-modeset.ko
lib/modules/6.1.0-32-cloud-amd64/kernel/drivers/video/nvidia.ko
/
NVIDIA GPU driver provided by NVIDIA was installed successfully
acl:
- entity: project-owners-****************
  projectTeam:
    projectNumber: '****************'
    team: owners
  role: OWNER
- entity: project-editors-****************
  projectTeam:
    projectNumber: '****************'
    team: editors
  role: OWNER
- entity: project-viewers-****************
  projectTeam:
    projectNumber: '****************'
    team: viewers
  role: READER
- email: ****************-compute@developer.gserviceaccount.com
  entity: user-****************-compute@developer.gserviceaccount.com
  role: OWNER
bucket: dataproc-temp-europe-west1-****************-jupf0b8s
component_count: 32
content_type: application/octet-stream
crc32c_hash: ROiILQ==
creation_time: 2025-04-22T07:54:44+0000
etag: CL+w9OGU64wDEAE=
generation: '1745308484442175'
metageneration: 1
name: dpgce-packages/nvidia/cuda_12.6.3_560.35.05_linux.run
size: 4446722669
storage_class: STANDARD
storage_class_update_time: 2025-04-22T07:54:44+0000
storage_url: gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/cuda_12.6.3_560.35.05_linux.run#1745308484442175
update_time: 2025-04-22T07:54:44+0000
Copying gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/cuda_12.6.3_560.35.05_linux.run to file:///mnt/shm/cuda.run
  
.................

Average throughput: 1.3GiB/s

real	0m4.643s
user	0m16.077s
sys	0m20.338s

real	2m39.479s
user	2m19.921s
sys	0m48.780s
Selecting previously unselected package cuda-keyring.
(Reading database ... 166259 files and directories currently installed.)
Preparing to unpack /mnt/shm/cuda-keyring.deb ...
Unpacking cuda-keyring (1.1-1) ...
Setting up cuda-keyring (1.1-1) ...
unable to rmmod nvidia_uvm
unable to rmmod nvidia_drm
unable to rmmod nvidia_modeset
unable to rmmod nvidia
/opt/install-dpgce /
ERROR: (gcloud.storage.objects.describe) gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages%2Fnvidia%2Fnccl%2Fdebian12%2Fnccl-build_debian12_2.23.4-1%2Bcuda12.6.tar.gz not found: 404.
Copying file:///opt/install-dpgce/nccl-build_debian12_2.23.4-1+cuda12.6.tar.gz.building to gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages/nvidia/nccl/debian12/nccl-build_debian12_2.23.4-1+cuda12.6.tar.gz.building
  

/opt/install-dpgce/nccl /opt/install-dpgce /

real	0m57.433s
user	0m14.465s
sys	0m5.950s

The script reports that gs://dataproc-temp-europe-west1-****************-jupf0b8s/dpgce-packages%2Fnvidia%2Fnccl%2Fdebian12%2Fnccl-build_debian12_2.23.4-1%2Bcuda12.6.tar.gz cannot be found (404); however, that file exists in the bucket.
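The object name in the 404 message is percent-encoded, so it helps to decode it before comparing it with what is in the bucket. Below is a minimal sketch of that check, using only the two names that appear in the log above. The helper decode_object_name and the variable names are hypothetical, and this is not part of the initialization script. It only shows which name the lookup refers to and that the object uploaded in the next log line carries a .building suffix, which is a different object name.

```python
from urllib.parse import unquote

# Object name exactly as it appears (percent-encoded) in the gcloud 404 message.
encoded_name = (
    "dpgce-packages%2Fnvidia%2Fnccl%2Fdebian12%2F"
    "nccl-build_debian12_2.23.4-1%2Bcuda12.6.tar.gz"
)

# Object name taken from the "Copying ... to gs://..." line in the same log.
uploaded_name = (
    "dpgce-packages/nvidia/nccl/debian12/"
    "nccl-build_debian12_2.23.4-1+cuda12.6.tar.gz.building"
)


def decode_object_name(name: str) -> str:
    """Hypothetical helper: decode %2F to '/' and %2B to '+'.

    unquote() is used rather than unquote_plus() so that a literal '+'
    in the name is never turned into a space.
    """
    return unquote(name)


decoded_name = decode_object_name(encoded_name)
print("looked up:", decoded_name)
print("uploaded: ", uploaded_name)
print("same object name:", decoded_name == uploaded_name)
```

If the object visible in the bucket is the one ending in .building, the lookup for the name without that suffix would still return 404; whether that is what happened here is an assumption to verify against the bucket listing.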

Image
