Reduce JsonlDataset memory by using mmap array and use npy to store jsonl meta#1604
Reduce JsonlDataset memory by using mmap array and use npy to store jsonl meta#1604jayhenry wants to merge 6 commits intoInternLM:mainfrom
Conversation
|
@claude review |
xtuner/v1/datasets/jsonl.py
Outdated
| fpath = os.path.join(dir_path, fname) | ||
| arr = np.load(fpath, mmap_mode="r" if mmap else None, allow_pickle=True) | ||
| if arr.dtype == object: | ||
| arr = np.load(fpath, allow_pickle=True) |
There was a problem hiding this comment.
Claude: Warning: allow_pickle=True on npy files from disk
np.load(..., allow_pickle=True) can execute arbitrary code during deserialization of object-dtype arrays. While these files are typically written by the same process or a trusted pipeline, if the cache directory is on a shared filesystem, a malicious actor could place a crafted .npy file there.
Consider: (1) only using allow_pickle=True when arr.dtype == object is confirmed (which you do on the retry), and removing it from the first np.load call, or (2) documenting that cache directories must be trusted.
|
Claude: ## Summary This PR reduces JsonlDataset memory by switching IssuesCritical
Warning
Nit
VerdictREQUEST_CHANGES |
… np.array for sampled indices
- Introduced `enable_mmap_shared` parameter in `DatasetConfig` and `JsonlDataset` to allow shared memory mapping for improved memory efficiency. - Updated smoke tests to validate the impact of shared memory on PSS and RSS metrics. - Enhanced dataset building process to utilize mmap for loading data, reducing memory footprint during distributed training. - Refactored test cases to include new functionality and ensure correctness of data loading and memory usage.
…ndly loading - Replace save_dict_to_npz/load_dict_from_npz with save_dict_to_npy_dir/load_dict_from_npy_dir (each _meta key stored as a separate .npy file under jsonl_meta/ dir) - load_dict_from_npy_dir uses mmap_mode='r' by default, enabling OS page sharing - mmap shared-memory branch now uses save/load_dict_to/from_npy_dir for self._meta (stored under tmp_dir/meta/ subdir to separate from offsets/num_tokens) - Refactor _set_meta_attrs as shared method called by both mmap and non-mmap paths - Add tests: npy-dir round-trip, cache_dir consistency, cache_tag consistency Made-with: Cursor
|
@claude review |
| np.save(os.path.join(tmp_dir, "offsets.npy"), self.offsets) | ||
| if self.num_tokens is not None: |
There was a problem hiding this comment.
Claude: Warning — atexit race condition with mmap files.
atexit.register(shutil.rmtree, tmp_dir, True) is only registered by local rank 0. But all ranks mmap from this directory. If rank 0's process exits and its atexit handler fires before other ranks finish, the mmap backing files are deleted.
On Linux, open mmap file descriptors survive unlink, so this usually works — but it's fragile. If any rank hasn't opened the files yet when cleanup runs, it will fail. Consider:
- Using a
dist.barrier()before cleanup, or - Having each rank register its own cleanup, or
- Not auto-cleaning
/tmp(it's a tmpdir, the OS handles it).
| import time | ||
| import random | ||
|
|
||
| import tracemalloc |
There was a problem hiding this comment.
Claude: Nit — import parametrize uses a third-party package.
This imports the parametrize package (pypi: parametrize) instead of using the standard pytest.mark.parametrize. The project already depends on pytest per CLAUDE.md guidelines.
| import tracemalloc | |
| import pytest |
And change @parametrize.parametrize(...) to @pytest.mark.parametrize(...) accordingly.
| result = {} | ||
| for fname in os.listdir(dir_path): | ||
| if not fname.endswith(".npy"): | ||
| continue |
There was a problem hiding this comment.
Claude: Nit — save_dict_to_npy_dir silently accepts non-ndarray values.
The old save_dict_to_npz had explicit type checks (value must be np.ndarray, key must be str). The new save_dict_to_npy_dir passes anything to np.save without validation, which could silently produce pickle-based .npy files or unexpected behavior.
|
Claude: ## Summary This PR reduces IssuesCritical
Warning
Nit
VerdictREQUEST_CHANGES |
No description provided.