Description
`ModelConfig.from_pretrained()` uses `config_file_lock()`, which acquires a `filelock.FileLock` on `$HF_MODULES_CACHE/_remote_code.lock`. When `HF_HOME` points to an NFS mount (common in shared model cache setups on Kubernetes), the lock fails intermittently with `OSError: [Errno 37] No locks available` (ENOLCK) and `OSError: [Errno 116] Stale file handle` (ESTALE) due to NFSv3's unreliable Network Lock Manager (NLM).
Steps to Reproduce
- Mount a shared NFS volume (e.g. at `/model-cache`)
- Set `HF_HOME=/model-cache`
- Deploy multi-node TRT-LLM with `trust_remote_code=True`
- Multiple MPI ranks on different nodes simultaneously call `ModelConfig.from_pretrained()` (a standalone sketch of the same failure mode follows below)
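The failure can also be reproduced without TRT-LLM by using `filelock` directly. The sketch below is illustrative, not the project's code: it assumes `/model-cache` is the NFSv3 mount, and the lock filename and iteration count are arbitrary. Run it concurrently from pods on different nodes; the same loop against a local path does not fail.

```python
# Standalone reproduction sketch (illustrative): exercises filelock the same way
# config_file_lock() does, but against an arbitrary lock file on the NFS mount.
import filelock

LOCK_PATH = "/model-cache/_repro.lock"  # assumed NFSv3 mount; compare with /tmp/_repro.lock

for i in range(50):
    try:
        # Acquire + release; on NFSv3 either step can raise ENOLCK or ESTALE.
        with filelock.FileLock(LOCK_PATH, timeout=30):
            pass
    except OSError as exc:  # filelock.Timeout is also an OSError subclass
        print(f"iteration {i}: {exc!r}")
```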
Error
```
[TRT-LLM] [RANK 2] [E] Failed to initialize executor on rank 2: [Errno 37] No locks available
    with config_file_lock():
  File "tensorrt_llm/_torch/model_config.py", line 49, in config_file_lock
    with lock:
  File "filelock/_api.py", line 567, in __exit__
  File "filelock/_api.py", line 538, in release
  File "filelock/_unix.py", line 102, in _release
    fcntl.flock(fd, fcntl.LOCK_UN)
OSError: [Errno 37] No locks available
```
Root Cause
`config_file_lock()` catches `PermissionError` and `filelock.Timeout` and falls back to a tempdir lock, but it does not catch `OSError`. On NFSv3, `fcntl.flock()` is emulated via the NLM protocol, which is unreliable for cross-node locking: both lock acquisition (ESTALE) and release (ENOLCK) fail intermittently.
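For context, the handler's structure is roughly as follows. This is a paraphrased sketch, not the exact TRT-LLM source: the signature, timeout, and default paths are assumptions. It only shows why the plain `OSError` raised by `fcntl.flock()` escapes the fallback.

```python
# Paraphrased sketch of the current behavior (not the exact TRT-LLM source).
import contextlib
import os
import tempfile

import filelock

# transformers' usual default when HF_MODULES_CACHE is unset
HF_MODULES_CACHE = os.environ.get(
    "HF_MODULES_CACHE", os.path.expanduser("~/.cache/huggingface/modules"))


@contextlib.contextmanager
def config_file_lock(timeout: int = 10):
    lock = filelock.FileLock(
        os.path.join(HF_MODULES_CACHE, "_remote_code.lock"), timeout=timeout)
    try:
        lock.acquire()
    except (PermissionError, filelock.Timeout):
        # Fallback: use a lock file in the system temp dir instead.
        lock = filelock.FileLock(
            os.path.join(tempfile.gettempdir(), "_remote_code.lock"), timeout=timeout)
        lock.acquire()
    # A plain OSError from flock() (ESTALE on acquire, ENOLCK on release) is not
    # caught above, so it propagates out of from_pretrained() and kills the rank.
    try:
        yield
    finally:
        lock.release()
```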
I tested with 10 pods across 10 different nodes:
| Test | Error Rate |
| --- | --- |
| filelock on NFS | 100% (50/50 per pod) |
| filelock on local /tmp | 0% (0/50 per pod) |
| Plain file I/O on NFS (no locking) | 0% (0/50 per pod) |
NFS file reads/writes work perfectly — the issue is exclusively with filelock/NLM.
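For completeness, the per-pod measurement behind the table was essentially a loop of acquire/release cycles and plain write/read cycles. The sketch below illustrates that kind of test; the paths, timeout, and attempt count are assumptions chosen to match the table, not the exact script used.

```python
# Illustrative per-pod test (paths and counts are assumptions matching the table).
import filelock


def lock_failures(path: str, attempts: int = 50) -> int:
    """Count acquire/release cycles that raise OSError (covers ENOLCK/ESTALE)."""
    failures = 0
    for _ in range(attempts):
        try:
            with filelock.FileLock(path, timeout=30):
                pass
        except OSError:
            failures += 1
    return failures


def plain_io_failures(path: str, attempts: int = 50) -> int:
    """Count plain write/read cycles that raise OSError (no locking involved)."""
    failures = 0
    for i in range(attempts):
        try:
            with open(path, "w") as f:
                f.write(str(i))
            with open(path) as f:
                f.read()
        except OSError:
            failures += 1
    return failures


print("filelock on NFS:        ", lock_failures("/model-cache/_lock_test.lock"))
print("filelock on local /tmp: ", lock_failures("/tmp/_lock_test.lock"))
print("plain file I/O on NFS:  ", plain_io_failures("/model-cache/_io_test.txt"))
```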
Suggested Fix
Add `OSError` to the exception handler in `config_file_lock()`:

```python
# Current (broken on NFS):
except (PermissionError, filelock.Timeout):

# Fixed:
except (PermissionError, OSError, filelock.Timeout):
```
This allows the existing fallback to /tmp to handle NFS lock failures gracefully.
Workaround
Set `HF_MODULES_CACHE` to a local (non-NFS) path:

```yaml
env:
  - name: HF_MODULES_CACHE
    value: /tmp/hf_modules
```
This moves only the lock file and the small remote code cache (~56 KB) to local disk. Model weights remain on the shared NFS mount. The remote code files are already cached alongside the model weights in $HF_HUB_CACHE — the modules cache is just a copy used for Python imports.
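As a quick pre-flight check, the probe below (a sketch; the probe filename is arbitrary and it assumes `filelock` is importable in the serving image) confirms that flock-style locking works under the new `HF_MODULES_CACHE` location before the workload starts.

```python
# Pre-flight probe (sketch): verify flock works under the configured modules cache.
import os

import filelock

modules_dir = os.environ.get("HF_MODULES_CACHE", "/tmp/hf_modules")
os.makedirs(modules_dir, exist_ok=True)

probe = os.path.join(modules_dir, "_lock_probe.lock")  # arbitrary probe filename
with filelock.FileLock(probe, timeout=5):
    print(f"flock OK under {modules_dir}")
```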