`config_file_lock()` fails with `OSError: [Errno 37] No locks available` on NFS-backed HF cache #11958

## Description

`ModelConfig.from_pretrained()` calls `config_file_lock()`, which acquires a `filelock.FileLock` on `$HF_MODULES_CACHE/_remote_code.lock`. When `HF_HOME` points to an NFS mount (common in shared model cache setups on Kubernetes) and `HF_MODULES_CACHE` is left at its default under `HF_HOME`, locking fails intermittently with `OSError: [Errno 37] No locks available` (`ENOLCK`) or `OSError: [Errno 116] Stale file handle` (`ESTALE`), because NFSv3 locking goes through the Network Lock Manager (NLM), which is unreliable across nodes.

## Steps to Reproduce

1. Mount a shared NFS volume (e.g. at `/model-cache`)
2. Set `HF_HOME=/model-cache`
3. Deploy multi-node TRT-LLM with `trust_remote_code=True`
4. Multiple MPI ranks on different nodes simultaneously call `ModelConfig.from_pretrained()`

## Error

```
[TRT-LLM] [RANK 2] [E] Failed to initialize executor on rank 2: [Errno 37] No locks available
    with config_file_lock():
  File "tensorrt_llm/_torch/model_config.py", line 49, in config_file_lock
    with lock:
  File "filelock/_api.py", line 567, in __exit__
  File "filelock/_api.py", line 538, in release
  File "filelock/_unix.py", line 102, in _release
    fcntl.flock(fd, fcntl.LOCK_UN)
OSError: [Errno 37] No locks available
```
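For anyone triaging similar logs, here is a minimal sketch of classifying these failures by errno rather than by message text. The helper name `is_nfs_lock_error` is hypothetical and is not part of TRT-LLM or `filelock`; the symbolic constants resolve to 37 and 116 on Linux, matching the numbers reported in this issue.

```python
import errno

# errno values this issue associates with NFS lock failures.
NFS_LOCK_ERRNOS = frozenset({errno.ENOLCK, errno.ESTALE})


def is_nfs_lock_error(exc: BaseException) -> bool:
    """Return True if exc is an OSError carrying ENOLCK or ESTALE."""
    return isinstance(exc, OSError) and exc.errno in NFS_LOCK_ERRNOS
```

A check like this could back a narrower handler than a blanket `except OSError`, if that turns out to be preferable.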

## Root Cause

`config_file_lock()` catches `PermissionError` and `filelock.Timeout` to fall back to a tempdir lock, but does **not** catch other `OSError`s such as `ENOLCK` or `ESTALE` (`PermissionError` is only one `OSError` subclass). On NFSv3, `fcntl.flock()` is emulated via the NLM protocol, which is unreliable for cross-node locking: both lock acquisition (`ESTALE`) and release (`ENOLCK`) fail intermittently.

I tested with 10 pods across 10 different nodes:

| Test | Error Rate |
|------|-----------|
| `filelock` on NFS | **100%** (50/50 per pod) |
| `filelock` on local `/tmp` | **0%** (0/50 per pod) |
| Plain file I/O on NFS (no locking) | **0%** (0/50 per pod) |

NFS file reads/writes work perfectly — the issue is exclusively with `filelock`/NLM.
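A probe along the following lines can be used to check a given mount. This is a sketch, not the exact script behind the table above: it calls `fcntl.flock()` directly (the call that fails in the traceback) instead of going through `filelock`, and the function name `count_flock_errors` is made up for illustration.

```python
import fcntl
import os


def count_flock_errors(lock_path: str, attempts: int = 50) -> int:
    """Acquire and release an exclusive flock `attempts` times; count OSErrors.

    Failures to open the file are counted too, so a missing directory
    shows up as `attempts` errors rather than as a crash.
    """
    errors = 0
    for _ in range(attempts):
        fd = None
        try:
            fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
            fcntl.flock(fd, fcntl.LOCK_EX)  # acquisition path (ESTALE seen here)
            fcntl.flock(fd, fcntl.LOCK_UN)  # release path (ENOLCK seen here)
        except OSError:
            errors += 1
        finally:
            if fd is not None:
                os.close(fd)
    return errors
```

Running it once against a file under the NFS mount (for example `/model-cache/_probe.lock`) and once against a file under `/tmp`, from several nodes at the same time, should show whether locking on that mount is affected.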

## Suggested Fix

Add `OSError` to the exception handler in `config_file_lock()`:

```python
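# Current (broken on NFS): only these two exceptions trigger the fallback.
except (PermissionError, filelock.Timeout):

# Fixed: OSError also covers ENOLCK/ESTALE. PermissionError is a subclass of
# OSError, so keeping it in the tuple is redundant but harmless.
except (PermissionError, OSError, filelock.Timeout):
```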
```

This allows the existing fallback to `/tmp` to handle NFS lock failures gracefully.
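To illustrate the fallback pattern in isolation, here is a self-contained sketch. It is not the actual `config_file_lock()` implementation: it uses `fcntl.flock()` directly instead of `filelock`, and the names `lock_with_local_fallback` and `_open_and_flock` are hypothetical. Because the traceback above fails during release, the sketch handles release errors separately from acquisition errors, so the protected body is never entered twice.

```python
import contextlib
import fcntl
import os
import tempfile


def _open_and_flock(path):
    """Return an fd holding an exclusive flock on path, or None on any OSError."""
    fd = None
    try:
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        fcntl.flock(fd, fcntl.LOCK_EX)
        return fd
    except OSError:  # covers ENOLCK, ESTALE, and PermissionError
        if fd is not None:
            os.close(fd)
        return None


@contextlib.contextmanager
def lock_with_local_fallback(primary_path, fallback_dir=None):
    """Hold a lock at primary_path, or in fallback_dir if that fails."""
    fallback_dir = fallback_dir or tempfile.gettempdir()
    fallback_path = os.path.join(fallback_dir, os.path.basename(primary_path))
    used_path = primary_path
    fd = _open_and_flock(primary_path)
    if fd is None:
        used_path = fallback_path
        fd = _open_and_flock(fallback_path)
    if fd is None:
        raise OSError(f"could not lock {primary_path} or {fallback_path}")
    try:
        yield used_path  # tells the caller which lock file is held
    finally:
        try:
            fcntl.flock(fd, fcntl.LOCK_UN)
        except OSError:
            pass  # release failed (e.g. ENOLCK); closing the fd drops the lock
        finally:
            os.close(fd)
```

One caveat with any local fallback: a lock under `/tmp` only serializes processes on the same node, so it protects a shared NFS directory less strictly than a working cross-node lock would.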

## Workaround

Set `HF_MODULES_CACHE` to a local (non-NFS) path:

```yaml
env:
  - name: HF_MODULES_CACHE
    value: /tmp/hf_modules
```

This moves only the lock file and the small remote code cache (~56 KB) to local disk. Model weights remain on the shared NFS mount. The remote code files are already cached alongside the model weights in `$HF_HUB_CACHE` — the modules cache is just a copy used for Python imports.
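Where editing the pod spec is not convenient, the same variable can be set from Python at process start. This is a sketch under the assumption that `HF_MODULES_CACHE` is read when the Hugging Face libraries are imported, so it needs to run before those imports; the helper name `use_local_modules_cache` is hypothetical.

```python
import os
import tempfile


def use_local_modules_cache(local_dir=None):
    """Point HF_MODULES_CACHE at a local directory and return its path."""
    # Default mirrors the /tmp/hf_modules value used in the YAML workaround.
    local_dir = local_dir or os.path.join(tempfile.gettempdir(), "hf_modules")
    os.makedirs(local_dir, exist_ok=True)
    os.environ["HF_MODULES_CACHE"] = local_dir
    return local_dir
```

Each MPI rank would need to call this itself, since environment changes made this way affect only the current process and the children it spawns afterwards.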
