feat(miles): per-engine process-resident GPU residual gate + forward MILES_MAX_RESIDUAL_GPU_MEM_GB by howard989 · Pull Request #17 · rlops/rlix

howard989 · 2026-05-25T08:38:13Z

What

Replace the hardcoded free-memory gate in MilesPipeline._wait_for_overlap_engines_offloaded() with a configurable residual gate delegated to MILES shrink_engines.

The threshold is controlled by:

MILES_MAX_RESIDUAL_GPU_MEM_GB

Default: 3.0 GiB

Sender side: rlops/miles branch howard/m11-forward-residual-gpu-env-v2.

Why

Per @taoluo review (R02-01): "free memory is gpu-model dependent e.g. 24gb vs 80gb gpu. it would be more robust to check the residual memory allocation."

The base used target_free_gb = 20.0 against nvidia-smi --query-gpu=memory.free, which is GPU-capacity dependent and not portable. The condition we need before wake_up is not "at least N GB free"; it is "the previous tenant released its GPU memory", i.e. residual allocation.
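To make the difference concrete, here is a minimal sketch contrasting the two gate conditions. The function names and the example memory figures are illustrative assumptions, not code from this PR; only the 20.0 and 3.0 values come from the description above.

```python
# Illustrative only: these helpers are hypothetical, not the rlix/MILES code.

def free_memory_gate(total_gb: float, used_gb: float, target_free_gb: float = 20.0) -> bool:
    """Old-style gate: pass when the whole GPU has at least target_free_gb free."""
    return (total_gb - used_gb) >= target_free_gb


def residual_gate(tenant_resident_gb: float, max_residual_gb: float = 3.0) -> bool:
    """New-style gate: pass when the previous tenant holds at most max_residual_gb."""
    return tenant_resident_gb <= max_residual_gb


# 24 GB card: the tenant offloaded down to 1.8 GB, a neighbor holds 4.0 GB.
# The free-memory gate can never pass here, even though the tenant released its memory.
small_gpu_free_ok = free_memory_gate(total_gb=24.0, used_gb=1.8 + 4.0)
small_gpu_residual_ok = residual_gate(tenant_resident_gb=1.8)

# 80 GB card: the tenant still holds 40 GB (nothing was offloaded).
# The free-memory gate passes anyway because the card is large.
large_gpu_free_ok = free_memory_gate(total_gb=80.0, used_gb=40.0)
large_gpu_residual_ok = residual_gate(tenant_resident_gb=40.0)

print(small_gpu_free_ok, small_gpu_residual_ok)  # False True
print(large_gpu_free_ok, large_gpu_residual_ok)  # True False
```

In both hypothetical cases the free-memory gate gives the wrong answer and the residual gate gives the intended one, which is the portability problem raised in R02-01.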

We iterated through a few signals:

- memory.free: depends on GPU capacity.
- whole-GPU memory.used: also counts the train actor, neighbor pipelines, and unrelated CUDA contexts.
- SGLang /server_info weight+kvcache+graph: reports accounting/static-pool size, not the memory still resident after a torch_memory_saver pause.

On Vast, a slept SGLang engine reported ~9.32 GiB in /server_info accounting but only ~1.81 GiB real process-resident memory in nvidia-smi. Therefore the paired MILES PR now gates on each engine's per-process resident GPU memory inside shrink_engines.
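One way to read per-process resident GPU memory is nvidia-smi's `--query-compute-apps=pid,used_memory` query. The sketch below is an assumption about how such a measurement could be taken; the description does not say which query the paired MILES PR uses, and the helper names and sample PIDs are hypothetical.

```python
import subprocess


def query_compute_apps() -> str:
    """Return raw CSV lines of 'pid, used_memory_MiB' for all compute processes."""
    return subprocess.check_output(
        [
            "nvidia-smi",
            "--query-compute-apps=pid,used_memory",
            "--format=csv,noheader,nounits",
        ],
        text=True,
    )


def parse_compute_apps(csv_text: str) -> dict[int, float]:
    """Map PID -> resident GPU memory in GiB, summed across GPUs."""
    usage: dict[int, float] = {}
    for line in csv_text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        try:
            pid, used_mib = int(fields[0]), float(fields[1])
        except ValueError:
            continue  # e.g. "[N/A]" when the driver cannot report usage
        usage[pid] = usage.get(pid, 0.0) + used_mib / 1024.0
    return usage


def engine_residual_gib(engine_pids: list[int], usage: dict[int, float]) -> float:
    """Sum the resident memory of the processes that belong to one engine."""
    return sum(usage.get(pid, 0.0) for pid in engine_pids)


# Parsing a captured sample instead of calling nvidia-smi, so this runs anywhere.
sample = "1234, 1853\n5678, 20480\n"
usage = parse_compute_apps(sample)
print(engine_residual_gib([1234], usage) <= 3.0)  # True: within the default threshold
print(engine_residual_gib([5678], usage) <= 3.0)  # False: 20 GiB still resident
```

Because the measurement is keyed by PID, memory held by the train actor or a neighbor pipeline on the same GPU does not count against the engine being checked.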

What This PR Does

rlix/utils/env.py
- adds parse_env_positive_float
rlix/pipeline/miles_coordinator.py
- forwards MILES_MAX_RESIDUAL_GPU_MEM_GB into per-pipeline runtime env
- parses it with default 3.0
- passes it to shrink_engines(post_sleep_vram_threshold_gb=...)
rlix/pipeline/miles_pipeline.py
- removes the old target_free_gb = 20.0 free-memory hard gate
- keeps state == offloaded polling as the liveness gate
- keeps raw nvidia-smi --query-gpu=memory.used as diagnostic logging only
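As a rough illustration of the env parsing step, a helper like parse_env_positive_float could look as follows. Only the helper's name, the env var, and the 3.0 default come from this PR; the signature and the error-handling behavior shown here are assumptions, not the actual rlix/utils/env.py implementation.

```python
import math
import os


def parse_env_positive_float(name: str, default: float) -> float:
    """Read a strictly positive, finite float from the environment.

    Assumed behavior: fall back to `default` when the variable is unset or
    empty, and raise ValueError on anything that is not a positive number.
    """
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError as exc:
        raise ValueError(f"{name} must be a number, got {raw!r}") from exc
    if not math.isfinite(value) or value <= 0:
        raise ValueError(f"{name} must be a positive finite number, got {raw!r}")
    return value


# Hypothetical call site: resolve the threshold once, then hand it on to
# shrink_engines(post_sleep_vram_threshold_gb=...).
threshold_gb = parse_env_positive_float("MILES_MAX_RESIDUAL_GPU_MEM_GB", default=3.0)
```

Rejecting zero, negative, and non-finite values up front keeps a typo in the environment from silently disabling the gate.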

Default 3.0 Rationale

Vast smoke with Qwen2.5-0.5B on RTX 5090 / CUDA 12.9 measured 1.81-1.83 GiB per-engine resident memory after offload. This is mostly non-offloadable CUDA/runtime baseline, not model memory.

2.0 leaves only ~0.17 GiB margin, which is too tight across GPU/driver/SGLang versions. 3.0 leaves ~1.2 GiB margin while still catching large residuals such as an unoffloaded KV pool.

This is a smoke-measured heuristic, not a model-derived value. It remains overridable via MILES_MAX_RESIDUAL_GPU_MEM_GB.
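The margins quoted above follow directly from the larger of the two smoke measurements (1.83 GiB). A quick check of the arithmetic:

```python
# Largest per-engine resident memory measured after offload, in GiB (from the smoke run).
measured_max_gib = 1.83

# Headroom left by each candidate threshold.
margins = {threshold: threshold - measured_max_gib for threshold in (2.0, 3.0)}

for threshold, margin in margins.items():
    print(f"threshold={threshold:.1f} GiB -> margin={margin:.2f} GiB")
# threshold=2.0 GiB -> margin=0.17 GiB
# threshold=3.0 GiB -> margin=1.17 GiB
```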

Diff Baseline Note

This is a clean branch off latest zhenyu/miles-mvp-e2e. The closed #11 used 10.0 as an intermediate whole-GPU residual threshold. This PR's effective change is 20.0 free memory -> 3.0 per-engine process-resident residual.

Tests

python -m pytest -q tests/test_env_utils.py tests/test_miles_residual_threshold_wiring.py

Result:

6 passed in 0.05s

E2E Verification

Vast dual smoke with paired MILES branch:

post-sleep process-resident GPU residual max=1.809 / 1.828 GiB
threshold=3.000 GiB
mp1 / mp2 training loop complete
shutdown_hard complete for both pipelines
EXIT_CODE=0

The known RolloutManager 500 / RemoteProtocolError noise appears during teardown, while residual /generate requests are being cancelled. It did not affect the run: training completed and both pipelines reached shutdown_hard with EXIT_CODE=0.

Scope

Gate signal + configurability only. No model-size-derived threshold. Option Beta / hooks are already upstream and untouched.

Refs: plans/m11-review.review-report/R02.md (R02-01, MEDIUM).

howard989 added 2 commits May 25, 2026 00:12

fix(rlix): gate MILES wake on residual SGLang memory

3407747

fix(rlix): use 3GB per-process residual threshold default

ac9312d

howard989 mentioned this pull request May 25, 2026

fix(rlix): gate MILES wake on per-process residual GPU memory (R02-01) rlops/miles#5

Open

howard989 wants to merge 2 commits into rlops:zhenyu/miles-mvp-e2e from howard989:howard/m11-residual-gpu-threshold-v2
