
Reduce JsonlDataset memory by using mmap array and use npy to store jsonl meta#1604

Open
jayhenry wants to merge 6 commits into InternLM:main from jayhenry:json_npy

Conversation

@jayhenry
Collaborator

No description provided.

@jayhenry
Collaborator Author

@claude review

```python
fpath = os.path.join(dir_path, fname)
arr = np.load(fpath, mmap_mode="r" if mmap else None, allow_pickle=True)
if arr.dtype == object:
    arr = np.load(fpath, allow_pickle=True)
```

Claude: Warning: allow_pickle=True on npy files from disk

np.load(..., allow_pickle=True) can execute arbitrary code during deserialization of object-dtype arrays. While these files are typically written by the same process or a trusted pipeline, if the cache directory is on a shared filesystem, a malicious actor could place a crafted .npy file there.

Consider: (1) only using allow_pickle=True when arr.dtype == object is confirmed (which you do on the retry), and removing it from the first np.load call, or (2) documenting that cache directories must be trusted.
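Option (1) could look like the following sketch (a hypothetical helper, not the PR's actual code): attempt the load with pickle disabled, and only retry with `allow_pickle=True` after NumPy itself rejects the file as an object-dtype array.

```python
import numpy as np


def load_array(fpath, mmap=True):
    """Load a .npy file without enabling pickle up front.

    Plain numeric arrays load (optionally memory-mapped) with pickle
    disabled. Only when NumPy refuses the file as an object array do we
    retry with allow_pickle=True, which is safe only if the cache
    directory is trusted.
    """
    try:
        return np.load(fpath, mmap_mode="r" if mmap else None)
    except ValueError:
        # np.load raises ValueError both for object arrays when
        # allow_pickle=False and for attempts to memory-map an
        # object-dtype file; object arrays cannot be mmapped anyway.
        return np.load(fpath, allow_pickle=True)
```

This keeps the fast mmap path pickle-free, so a crafted object-dtype `.npy` dropped into a shared cache can no longer execute code during the first load.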

@claude

claude bot commented Mar 19, 2026

Claude: ## Summary

This PR reduces JsonlDataset memory by switching self.sampled from a Python list to numpy arrays (removing it as an instance attribute), replacing NPZ cache with per-key NPY directories for mmap-friendly loading, and adding cross-rank mmap sharing via /tmp npy files.

Issues

Critical

  • jsonl.py:82: Lost _has_chunk guard on max_length filtering. The original code explicitly skips max_length filtering when LongTextPretrainTokenizeFunction is used (not self._has_chunk). The new _filter_sampled_indices function always applies max_length filtering, which will incorrectly discard valid long-text samples that have been split into chunks.

Warning

  • custom_pack.py:209-211: Pack validation significantly weakened — sample_idx range checks and token range validation removed. token_end=0 sentinel resolution also removed.
  • jsonl.py:73: allow_pickle=True on first np.load call — security concern if cache dir is on shared filesystem.
  • jsonl.py:466: MD5-based tmp_dir can collide across jobs when tokenize_fn is not CachableTokenizeFunction.
  • test_jsonl_dataset.py:5: import parametrize — unusual third-party package instead of pytest.mark.parametrize.

Nit

  • jsonl.py:220: Commented-out code # return dist.local_rank() == 0 should be removed.

Verdict

REQUEST_CHANGES

- Introduced `enable_mmap_shared` parameter in `DatasetConfig` and `JsonlDataset` to allow shared memory mapping for improved memory efficiency.
- Updated smoke tests to validate the impact of shared memory on PSS and RSS metrics.
- Enhanced dataset building process to utilize mmap for loading data, reducing memory footprint during distributed training.
- Refactored test cases to include new functionality and ensure correctness of data loading and memory usage.
…ndly loading

- Replace save_dict_to_npz/load_dict_from_npz with save_dict_to_npy_dir/load_dict_from_npy_dir
  (each _meta key stored as a separate .npy file under jsonl_meta/ dir)
- load_dict_from_npy_dir uses mmap_mode='r' by default, enabling OS page sharing
- mmap shared-memory branch now uses save/load_dict_to/from_npy_dir for self._meta
  (stored under tmp_dir/meta/ subdir to separate from offsets/num_tokens)
- Refactor _set_meta_attrs as shared method called by both mmap and non-mmap paths
- Add tests: npy-dir round-trip, cache_dir consistency, cache_tag consistency

Made-with: Cursor
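A minimal sketch of the pair described in the commit message above (hypothetical and simplified; object-dtype handling omitted):

```python
import os

import numpy as np


def save_dict_to_npy_dir(d, dir_path):
    """Store each ndarray value of d as <key>.npy under dir_path."""
    os.makedirs(dir_path, exist_ok=True)
    for key, value in d.items():
        if not isinstance(key, str) or not isinstance(value, np.ndarray):
            raise TypeError("keys must be str and values np.ndarray")
        np.save(os.path.join(dir_path, key + ".npy"), value)


def load_dict_from_npy_dir(dir_path, mmap=True):
    """Load every .npy file in dir_path; mmap_mode='r' lets the OS share
    pages across ranks instead of each process holding a private copy."""
    result = {}
    for fname in os.listdir(dir_path):
        if not fname.endswith(".npy"):
            continue
        fpath = os.path.join(dir_path, fname)
        result[fname[: -len(".npy")]] = np.load(
            fpath, mmap_mode="r" if mmap else None
        )
    return result
```

Compared to a single `.npz`, per-key `.npy` files can each be memory-mapped independently, which is what enables the cross-rank page sharing this PR targets.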
@jayhenry
Collaborator Author

@claude review

Comment on lines +482 to +483
```python
np.save(os.path.join(tmp_dir, "offsets.npy"), self.offsets)
if self.num_tokens is not None:
```

Claude: Warning — atexit race condition with mmap files.

atexit.register(shutil.rmtree, tmp_dir, True) is only registered by local rank 0. But all ranks mmap from this directory. If rank 0's process exits and its atexit handler fires before other ranks finish, the mmap backing files are deleted.

On Linux, open mmap file descriptors survive unlink, so this usually works — but it's fragile. If any rank hasn't opened the files yet when cleanup runs, it will fail. Consider:

  1. Using a dist.barrier() before cleanup, or
  2. Having each rank register its own cleanup, or
  3. Not auto-cleaning /tmp (it's a tmpdir, the OS handles it).
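Option 2 could look like the following sketch, where `local_rank` and `barrier` are stand-ins for e.g. `dist.local_rank()` and `dist.barrier()` (hypothetical, not the PR's code):

```python
import atexit
import shutil


def register_tmpdir_cleanup(tmp_dir, local_rank, barrier):
    """Every rank registers a handler that first synchronizes, then only
    local rank 0 removes the shared mmap directory, so no backing file
    disappears while another rank might still need to open it.

    Caveat: collectives run from atexit can hang if a peer already died,
    so explicit teardown before exit is more robust than relying on this.
    """
    def _cleanup():
        barrier()  # wait until every rank has reached shutdown
        if local_rank == 0:
            shutil.rmtree(tmp_dir, ignore_errors=True)

    atexit.register(_cleanup)
    return _cleanup
```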

```python
import time
import random

import tracemalloc
```

Claude: Nit — import parametrize uses a third-party package.

This imports the parametrize package (pypi: parametrize) instead of using the standard pytest.mark.parametrize. The project already depends on pytest per CLAUDE.md guidelines.

Suggested change

```python
import tracemalloc
import pytest
```

And change @parametrize.parametrize(...) to @pytest.mark.parametrize(...) accordingly.
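For reference, the standard form looks like this:

```python
import pytest


# Runs the test once per parameter value; no third-party `parametrize`
# package is needed.
@pytest.mark.parametrize("mmap", [True, False])
def test_load_roundtrip(mmap):
    assert isinstance(mmap, bool)
```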

```python
result = {}
for fname in os.listdir(dir_path):
    if not fname.endswith(".npy"):
        continue
```

Claude: Nit — save_dict_to_npy_dir silently accepts non-ndarray values.

The old save_dict_to_npz had explicit type checks (value must be np.ndarray, key must be str). The new save_dict_to_npy_dir passes anything to np.save without validation, which could silently produce pickle-based .npy files or unexpected behavior.

@claude

claude bot commented Mar 19, 2026

Claude: ## Summary

This PR reduces JsonlDataset memory by switching self.sampled from a Python list to numpy arrays, replacing NPZ cache with per-key NPY directories for mmap-friendly loading, and adding cross-rank mmap sharing via /tmp npy files. It also removes several validation checks from CustomPackDataset.

Issues

Critical

  • custom_pack.py:208-211: token_end=0 sentinel resolution removed. The module docstring still documents token_end == 0 as "take to end of sample", but the resolution code is gone. __getitem__ will slice input_ids[t_start:0] → empty list, silently producing zero-length sequences.

Warning

  • jsonl.py:82: _has_chunk guard removed from max_length filtering, violating documented architecture contract.
  • config.py:51: enable_mmap_shared defaults to True — behavior change for all existing users on upgrade.
  • jsonl.py:482-483: atexit cleanup race — rank 0 may delete mmap backing files before other ranks open them.

Nit

  • test_jsonl_dataset.py:7: import parametrize — third-party package; prefer pytest.mark.parametrize.
  • jsonl.py:68: save_dict_to_npy_dir lost input type validation from old save_dict_to_npz.

Verdict

REQUEST_CHANGES
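The Critical item could be fixed by restoring the sentinel resolution. A sketch, with the signature assumed rather than taken from the PR:

```python
def resolve_token_slice(input_ids, token_start, token_end):
    """Hypothetical sketch: token_end == 0 is documented as 'take to end
    of sample', so resolve it to len(input_ids) before slicing. Without
    this, input_ids[token_start:0] silently yields an empty sequence."""
    if token_end == 0:
        token_end = len(input_ids)
    return input_ids[token_start:token_end]
```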
