config_file_lock() fails with OSError: [Errno 37] No locks available on NFS-backed HF cache #11958

@sara4dev

Description

ModelConfig.from_pretrained() calls config_file_lock(), which acquires a filelock.FileLock on $HF_MODULES_CACHE/_remote_code.lock. When HF_HOME points to an NFS mount (common for shared model caches on Kubernetes), locking fails intermittently with OSError: [Errno 37] No locks available (ENOLCK) or OSError: [Errno 116] Stale file handle (ESTALE), because NFSv3 locking depends on the unreliable Network Lock Manager (NLM).

Steps to Reproduce

  1. Mount a shared NFS volume (e.g. at /model-cache)
  2. Set HF_HOME=/model-cache
  3. Deploy multi-node TRT-LLM with trust_remote_code=True
  4. Multiple MPI ranks on different nodes simultaneously call ModelConfig.from_pretrained()

Error

[TRT-LLM] [RANK 2] [E] Failed to initialize executor on rank 2: [Errno 37] No locks available
    with config_file_lock():
  File "tensorrt_llm/_torch/model_config.py", line 49, in config_file_lock
    with lock:
  File "filelock/_api.py", line 567, in __exit__
  File "filelock/_api.py", line 538, in release
  File "filelock/_unix.py", line 102, in _release
    fcntl.flock(fd, fcntl.LOCK_UN)
OSError: [Errno 37] No locks available

Root Cause

config_file_lock() catches PermissionError and filelock.Timeout to fall back to a tempdir lock, but it does not catch the more general OSError. On NFSv3, fcntl.flock() is emulated via the NLM protocol, which is unreliable for cross-node locking: both lock acquisition (ESTALE) and lock release (ENOLCK) fail intermittently.

I tested with 10 pods across 10 different nodes:

| Test | Error rate |
| --- | --- |
| filelock on NFS | 100% (50/50 per pod) |
| filelock on local /tmp | 0% (0/50 per pod) |
| Plain file I/O on NFS (no locking) | 0% (0/50 per pod) |

NFS file reads/writes work perfectly — the issue is exclusively with filelock/NLM.
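
A probe along the following lines could reproduce the comparison. It is a sketch, not the script that produced the numbers above. It calls fcntl.flock() directly (the call at the bottom of the traceback) instead of going through filelock, so it needs only the standard library. The function name flock_error_rate and the file name probe.lock are made up for illustration.

```python
import contextlib
import fcntl
import os


def flock_error_rate(directory, attempts=50):
    """Acquire and release an exclusive flock `attempts` times.

    Returns (failures, attempts). Any OSError raised while opening,
    locking, or unlocking counts as one failure.
    """
    path = os.path.join(directory, "probe.lock")  # hypothetical file name
    failures = 0
    for _ in range(attempts):
        fd = None
        try:
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
            fcntl.flock(fd, fcntl.LOCK_EX)  # acquisition
            fcntl.flock(fd, fcntl.LOCK_UN)  # release
        except OSError:
            failures += 1
        finally:
            if fd is not None:
                # Closing may also fail on a stale handle; ignore that here.
                with contextlib.suppress(OSError):
                    os.close(fd)
    return failures, attempts
```

Running it once with the NFS mount (for example /model-cache) and once with /tmp from several pods at the same time would give the two locking rows of the table.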

Suggested Fix

Add OSError to the exception handler in config_file_lock():

# Current (broken on NFS):
except (PermissionError, filelock.Timeout):

# Fixed (PermissionError is a subclass of OSError, so it is now redundant but harmless):
except (PermissionError, OSError, filelock.Timeout):

This allows the existing fallback to /tmp to handle NFS lock failures gracefully.
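
One detail worth checking when applying this: the traceback above fails in release (inside __exit__), not in acquisition, so the handler has to cover both phases. The sketch below illustrates one way to structure that. It is not the TRT-LLM implementation: it uses fcntl.flock() directly instead of filelock to stay dependency-free, and the name lock_with_fallback is hypothetical.

```python
import contextlib
import fcntl
import os
import tempfile


@contextlib.contextmanager
def lock_with_fallback(primary_dir, lock_name="_remote_code.lock"):
    """Hold an exclusive flock, preferring primary_dir, else the local temp dir.

    Yields the path of the lock file that was actually locked.
    """
    fd = None
    path = None
    for directory in (primary_dir, tempfile.gettempdir()):
        path = os.path.join(directory, lock_name)
        try:
            os.makedirs(directory, exist_ok=True)
            fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
            fcntl.flock(fd, fcntl.LOCK_EX)
            break  # acquired
        except OSError:
            # Covers ENOLCK, ESTALE, PermissionError, and similar failures.
            if fd is not None:
                os.close(fd)
                fd = None
    if fd is None:
        raise OSError("could not acquire a lock in any candidate directory")
    try:
        yield path
    finally:
        try:
            fcntl.flock(fd, fcntl.LOCK_UN)
        except OSError:
            pass  # release failed; closing the descriptor drops the lock
        finally:
            os.close(fd)
```

A lock in a node-local temp directory only serializes processes on the same node. That seems acceptable when the protected data also lives on local disk, but it does not provide cross-node exclusion for files on the shared mount.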

Workaround

Set HF_MODULES_CACHE to a local (non-NFS) path:

env:
  - name: HF_MODULES_CACHE
    value: /tmp/hf_modules

This moves only the lock file and the small remote code cache (~56 KB) to local disk. Model weights remain on the shared NFS mount. The remote code files are already cached alongside the model weights in $HF_HUB_CACHE — the modules cache is just a copy used for Python imports.
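
Where editing the pod spec is not convenient, the same workaround can be applied from Python. This is a minimal sketch, assuming the variable is read when the library is first imported, so it has to run before transformers or tensorrt_llm is imported. The helper name use_local_modules_cache is hypothetical.

```python
import os
import tempfile


def use_local_modules_cache(subdir="hf_modules"):
    """Point HF_MODULES_CACHE at a node-local directory unless already set.

    Returns the value that ends up in the environment.
    """
    local_dir = os.path.join(tempfile.gettempdir(), subdir)
    os.makedirs(local_dir, exist_ok=True)
    # setdefault keeps an explicit value from the pod spec if one exists.
    return os.environ.setdefault("HF_MODULES_CACHE", local_dir)


# Assumption: call this at the very top of the entry point, before any
# import that reads HF_MODULES_CACHE.
use_local_modules_cache()
```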
