test(vllm): add EFA test for vLLM Ubuntu (NCCL + NIXL)#6113
Open
Yadan-Wei wants to merge 31 commits into
Wires the existing EFA test harness into pr-vllm-ec2.yml. The same 2x p4d fixture covers three checks against EFA:
- NCCL collectives over EFA (existing test, now image-agnostic).
- NIXL libfabric plugin packaging smoke (new, single-process).
- NIXL disaggregated prefill/decode across nodes with the LIBFABRIC backend (new, two vLLM servers + minimal proxy, gated on RUN_NIXL_TESTS=1).
setup_nccl_tests.sh is a no-op on PyTorch DLCs (binary preinstalled) and compiles nccl-tests against the nvidia-nccl-cu12 wheel on vLLM Ubuntu. nccl_allreduce.sh now defaults CUDA_HOME=/usr/local/cuda for images that do not export it.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Without a dedicated change filter, build-change stayed false on EFA-only edits (test/efa/**, reusable-efa-tests.yml), build-image was skipped, and efa-test never fired. Mirrors the sanity-test/telemetry-test/model-tests pattern: fall back to the prod image when no rebuild happened. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
verifiable.cu compiled for nccl-tests' default 9 archs peaks at 8+GB and got SIGKILL'd on the test host (exit 137). Detect the compute capability via nvidia-smi and build only for that one arch; serialize make to keep memory pressure flat. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
nccl-tests' verifiable.cu compiled with nvcc OOM-kills the build on the test host even with single-arch and -j1 (>8GB peak per file). It's also unconditionally linked into every _perf binary, so we can't strip it.
Drop the build entirely. Use torch.distributed all_reduce when the preinstalled binary isn't present (vLLM image): same NCCL→aws-ofi-nccl→EFA path, no compile step. nccl_allreduce.sh picks the implementation at runtime; existing log validators (aws-ofi-nccl, "Selected provider is efa", Libfabric, GDRDMA) work for both since they parse NCCL_DEBUG=INFO output. Bandwidth threshold (3 GB/s) is now read from a JSON line that torch_allreduce.py emits on rank 0.
PyTorch DLC path unchanged: setup_nccl_tests.sh short-circuits, the existing all_reduce_perf binary is used, the existing perf check parses the same column 11 it always did.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
master's V1 EFA test (test/dlc_tests/container_tests/bin/efa/build_all_reduce_perf.sh) builds nccl-tests inside the test container with NCCL_HOME=/usr/local and default NVCC_GENCODE, and works on p4d. Earlier OOMs were from a different NCCL_HOME path (the python wheel) which can produce different template instantiation against alternate headers.
Match master:
- NCCL_HOME=/usr/local first; fall back to the nvidia-nccl-cu12 wheel only when /usr/local/include/nccl.h is absent (vLLM image case).
- No NVCC_GENCODE override; let nccl-tests use its defaults.
- Drop the torch.distributed fallback and revert nccl_allreduce.sh to the original single-binary path. Removes torch_allreduce.py.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Last run failed at `python3 -c "import nvidia.nccl"` because import succeeded but `nvidia.nccl.__file__` was None: namespace package behavior differs across cu12 and cu13 wheel builds. Replace the import probe with a path-walk over Python's site-packages / dist-packages dirs, looking for the canonical nvidia/nccl/include/nccl.h layout that both cu12 and cu13 wheels use. Add diagnostic output (sys.path + find) so the next failure (if any) surfaces actual evidence instead of an empty WHEEL_DIR. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
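The path-walk described in this commit can be sketched in Python as below. This is a minimal illustration, not the code in setup_nccl_tests.sh: the function names `candidate_package_dirs` and `find_nccl_wheel_dir` are hypothetical, and only the `nvidia/nccl/include/nccl.h` layout comes from the commit message.

```python
import site
import sysconfig
from pathlib import Path
from typing import Iterable, List, Optional


def candidate_package_dirs() -> List[Path]:
    """Collect the site-packages / dist-packages dirs this interpreter knows about."""
    dirs: List[str] = []
    # getsitepackages() is missing in some very old virtualenv layouts.
    if hasattr(site, "getsitepackages"):
        dirs.extend(site.getsitepackages())
    dirs.append(site.getusersitepackages())
    dirs.append(sysconfig.get_paths()["purelib"])
    # De-duplicate while preserving order.
    return list(dict.fromkeys(Path(d) for d in dirs))


def find_nccl_wheel_dir(search_dirs: Iterable[Path]) -> Optional[Path]:
    """Return the first <dir>/nvidia/nccl that contains include/nccl.h, else None."""
    for base in search_dirs:
        candidate = Path(base) / "nvidia" / "nccl"
        if (candidate / "include" / "nccl.h").is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Print the search path first so an empty result still leaves evidence in the log.
    dirs = candidate_package_dirs()
    print("searched:", [str(d) for d in dirs])
    print("WHEEL_DIR:", find_nccl_wheel_dir(dirs))
```

Checking for the header file on disk avoids relying on `__file__`, which is `None` for namespace packages.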
verifiable.cu compiled against nvidia-nccl-cu13 wheel headers OOMs nvcc (SIGKILL at exit 137) because the wheel ships heavier-templated headers from a different NCCL build than master uses. master's V1 PyTorch DLC builds NCCL from source into /usr/local, producing the thin headers nccl-tests expects. Cheapest reliable fix: apt-install libnccl-dev at test time. Same NCCL version the image runs against (Ubuntu repo tracks the upstream NCCL release), thin headers, ~50MB transient. Use NCCL_HOME=/usr which is where apt puts it. The container is torn down after the test so this doesn't persist into the published image. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Previous run was killed at ~787s with exit 137. DEFAULT_TIMEOUT (600s) applied to setup_nccl_tests.sh wasn't enough — verifiable.cu is template-heavy and the build legitimately takes ~13 min on a p4d. Earlier hypothesis was nvcc OOM, but the log shows nccl.h was already at /usr/include before the apt install (libnccl-dev was already present from the image's EFA install path), so the apt-headers-fix attempt didn't change the compile inputs at all. The kill was a fabric/invoke timeout sending SIGKILL to docker exec, surfacing as 137. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
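Exit status 137 recurs throughout these commits. Under the usual shell convention, a status above 128 means the process died from signal number (status - 128), so 137 corresponds to signal 9. A small sketch of that arithmetic follows; the helper name `describe_exit_status` is hypothetical.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Decode a shell-style exit status; values above 128 encode a fatal signal."""
    if status > 128:
        signum = status - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform.
            return f"killed by unknown signal {signum}"
    return f"exited with status {status}"


if __name__ == "__main__":
    print(describe_exit_status(137))
```

The status alone identifies the signal, not its sender, which is why a timeout kill and an OOM kill look identical here.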
Latest CI run failed with exit 137 from docker exec: UnexpectedExit from Fabric, not CommandTimedOut, so the 1500s timeout wasn't hit. The container itself was SIGKILL'd. p4d has 1.1TB RAM and devbox repro peaked at 150MB, so plain OOM is unlikely; remaining hypotheses are container cgroup limit, nvidia-runtime hook, or disk exhaustion.
Two changes:
- Build for the host's actual SM arch only (typically compute_80 on p4d) with -j1. Default GENCODE targets ~9 archs and at peak runs multiple nvcc forks; even with each fork small, the aggregate may trip an unknown limit.
- Print free/df/nproc/cgroup memory.max BEFORE the build, and add a trap EXIT that dumps the same plus memory.peak and partial build artifacts AFTER any failure. The trap still runs when the script exits after a child such as nvcc is SIGKILL'd; a SIGKILL delivered to the script's own shell cannot be trapped.
Next failure (if any) will have actionable evidence in the captured log: which file was being compiled, how much memory the cgroup allowed, how full /tmp was when killed.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
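The single-arch build described above amounts to turning the GPU's compute capability into one `-gencode` flag for nccl-tests' `NVCC_GENCODE` make variable. The sketch below shows that mapping in Python. The function names are hypothetical, and it assumes a driver recent enough for `nvidia-smi --query-gpu=compute_cap`; the actual script may derive the value differently.

```python
import subprocess


def gencode_for_capability(compute_cap: str) -> str:
    """Map a compute capability such as '8.0' to a single nvcc -gencode flag."""
    major, minor = compute_cap.strip().split(".")
    arch = f"{int(major)}{int(minor)}"
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"


def detect_compute_capability() -> str:
    """Ask nvidia-smi for the first GPU's compute capability (needs a GPU host)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return out.splitlines()[0].strip()


if __name__ == "__main__":
    # On a p4d (A100) host this is expected to report 8.0.
    flag = gencode_for_capability(detect_compute_capability())
    print(f"make -j1 NVCC_GENCODE='{flag}'")
```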
PyTorch's efa-test was gated on a fresh image build via 'if: success()' of build-images + sanity + security + unit-test. So edits to test/efa/** or reusable-efa-tests.yml that don't touch the Dockerfile would skip PyTorch's EFA validation entirely, even though those test files are shared with vLLM. Mirror the vllm workflow: a new efa-test-change paths filter, and the efa-test job now fires on (build OR efa-test-change), falling back to the prod image when no rebuild happened. This way a PR like the current one (which only edits test/efa/scripts/setup_nccl_tests.sh) actually validates that change against both PyTorch and vLLM DLCs. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
vLLM image's ENTRYPOINT is dockerd_entrypoint.sh which invokes 'python3 -m vllm.entrypoints.openai.api_server "$@"'. Running `docker run -id <vllm-image> bash` becomes `api_server bash`, vllm treats 'bash' as a model_tag and tries to download it from HF. Eventually it crashes and the container exits; by the time the slower side of the test (worker) runs setup_nccl_tests.sh, the container is gone and dockerd returns "container is not running" with exit 1.
Fix: pass --entrypoint /bin/bash and 'sleep infinity' so the container stays alive regardless of which framework-specific entrypoint the image ships. PyTorch DLC's entrypoint is also bypassed, but that's harmless since the test only uses docker exec.
This was hidden behind the prior failure mode (137 from setup_nccl_tests hitting the master container before its api_server crashed). Worker containers had time to fully crash by the time the loop reached them.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The reusable efa-test workflow uses 'efa-test-global / cancel-in-progress: false' to serialize p4d capacity. But GitHub allows only one *pending* run per group: when N>=2 callers (PyTorch + vLLM) target the same group on a new commit, the second pending one displaces the first with "Canceling since a higher priority waiting request for efa-test-global exists". Same effect on every push: existing pending efa-test gets cancelled by the next caller pushing into queue.
Add a per-caller concurrency group keyed by workflow + PR number with cancel-in-progress: true on the calling efa-test job. Now:
- Same PR re-pushed → previous efa-test of that workflow is cancelled (which is what we want, since the old commit is obsolete).
- Different workflow (PyTorch vs vLLM) → different group → both queue on efa-test-global without displacing each other.
- Different PR → different group → same.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Last commit overrode entrypoint to /bin/bash for both image families.
That fixed vLLM (whose dockerd_entrypoint.sh execs `vllm serve "$@"`
and crashes on `bash`). But it broke PyTorch: its entrypoint sets
LD_LIBRARY_PATH=/usr/local/cuda/compat:... when the host's nvidia
driver is older than the cuda-compat version. Without that env var,
NCCL fails to initialize on p4d hosts ("aws-ofi-nccl is not working").
Detect vLLM by image URI substring; only override the entrypoint
there. PyTorch keeps its original `entrypoint.sh bash` invocation.
Both paths still produce a long-lived container that docker exec can
land on.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
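The per-image branching this commit describes could look roughly like the following. It is a sketch only: `docker_run_command` is a hypothetical helper, the `"vllm" in image_uri` check is an assumed form of the "image URI substring" test, and the GPU/EFA device flags the real fixture would pass are omitted.

```python
from typing import List


def docker_run_command(image_uri: str, name: str) -> List[str]:
    """Build a docker run argv that leaves a long-lived container for docker exec."""
    base = ["docker", "run", "-id", "--name", name]
    if "vllm" in image_uri:
        # vLLM's entrypoint would treat "bash" as a model name, so bypass it
        # and keep the container alive with sleep.
        return base + ["--entrypoint", "/bin/bash", image_uri, "-c", "sleep infinity"]
    # PyTorch keeps its own entrypoint, which sets up LD_LIBRARY_PATH for cuda-compat.
    return base + [image_uri, "bash"]


if __name__ == "__main__":
    # Both image URIs below are placeholders for illustration.
    print(docker_run_command("example.registry/vllm:tag", "efa-master"))
    print(docker_run_command("example.registry/pytorch:tag", "efa-master"))
```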
Adding efa-test-change to the PyTorch workflow caused efa-test to run against the prod image on this PR (which only edits test/efa/**). The prod pytorch:2.11-cu130-amzn2023 image fails the EFA test with "aws-ofi-nccl is not working" — a preexisting issue unrelated to this PR's changes. Surfacing it here red-blocks merging. Revert to the original pattern: PyTorch efa-test only runs when the image was actually rebuilt by this PR. The prod image is presumed validated at its release. The vLLM workflow keeps the efa-test-change trigger because we're still iterating on the test against the vLLM prod image. Keep the per-caller concurrency block — it's still needed to prevent multiple workflows from cross-cancelling on efa-test-global. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Test failures show only 2 lines from mpirun (Test failure common.cu:1218 + "Process exited with code 2"), no NCCL_DEBUG output despite -x NCCL_DEBUG=INFO. Need actual evidence of:
- whether libnccl is on ldconfig path
- whether all_reduce_perf links against the expected libnccl
- whether libfabric finds the EFA provider
- whether aws-ofi-nccl plugin .so is in place
Also cat the captured TRAINING_LOG file at the end since the master container is torn down on test exit and the file is otherwise lost.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…'t drop them Previous diagnostic dump (nvidia-smi, ldd, fi_info, etc.) was printed inline to stdout before mpirun, but Fabric/invoke's UnexpectedExit only keeps the last few KB of stdout in the failure trace — those lines got truncated from the captured output. Write diagnostics to /test/efa/logs/diagnostics.log first, then cat that file at the very end (right before validators), so the diagnostic content lands in the tail of stdout that survives truncation. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
ca89ff0 wrote diagnostics to a file and cat'd them, but Fabric still truncated to the last ~3KB of stdout — and the captured tail showed only the testEFA.log content + "aws-ofi-nccl is not working", with the diagnostics block dropped from the head of the truncation window. Reorder so diagnostics + final probes are the LAST output before the validators run. The probes are minimal: ldd output, libnccl paths, aws-ofi-nccl .so paths. Those three are enough to diagnose the runtime linker's view. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The current SSM-resolved latest AL2023 base-with-single-cuda AMI ships NVIDIA driver 580.150, but no matching nvidia-fabricmanager package exists in NVIDIA's cuda-rhel9 repo (only 580.65/580.95/580.159). On NVSwitch systems (p4d.24xlarge) cuInit returns CUDA_ERROR_SYSTEM_NOT_YET_INITIALIZED (802) without a matching FM running, which makes the nccl-tests harness fail before NCCL even initializes and surfaces as the misleading "aws-ofi-nccl is not working" validator message.
Pin to ami-0d2923a2dd541bdeb (us-west-2, 2026-05-01 build, driver 580.126.09 + matching FM preinstalled). Verified locally on 2026-05-20: cuInit succeeds, NCCL initializes, EFA provider selected and GDRDMA channels established between two p4d instances.
Drop the pin once AWS DLAMI publishes an AMI where the bundled driver matches an available FM package.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
run_on_container wraps the command in `bash -c '<cmd>'`. Any single
quotes inside cmd (cut -d ' ', awk 'NR==2{...}', etc.) break the
outer wrapping and the docker exec sees a malformed command:
cut: option requires an argument -- 'd'
Replace the inline `sed | cut` with `cat` of the whole file, then
split lines and fields in Python — no shell quotes to manage.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
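The quoting failure and the Python-side replacement can be illustrated as follows. `shlex.split` stands in for how a POSIX shell would tokenize the wrapped command. The helper `field_at` and the sample file content are hypothetical; the commit does not show which file or field the test reads.

```python
import shlex
from typing import Optional


def field_at(text: str, line_no: int, field_no: int, sep: Optional[str] = None) -> str:
    """Return the 1-based field of the 1-based line, like `sed -n Np | cut -fM`.

    sep=None splits on runs of whitespace; pass sep=" " to mimic `cut -d ' '`.
    """
    lines = text.splitlines()
    return lines[line_no - 1].split(sep)[field_no - 1]


if __name__ == "__main__":
    # Why the naive wrapper breaks: the inner quotes end the outer quoting early.
    cmd = "cut -d ' ' -f1 /tmp/example.txt"
    naive = f"bash -c '{cmd}'"
    print(shlex.split(naive))  # bash receives only the fragment "cut -d " as its script

    # Quote-free alternative: cat the file remotely, then slice it locally.
    file_content = "alpha beta gamma\ndelta epsilon zeta\n"
    print(field_at(file_content, 2, 3))
```

Only `cat <file>` crosses the `bash -c` boundary in this approach, so nothing in the remote command needs quoting.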
…logs
Last green run hid all NIXL output: pytest's captured-log section only
shows the LOGGER.info "Running on ..." lines, not the result.stdout of
each run_on_container call. So we couldn't see the libfabric_smoke
output, the disagg PD orchestrator's prefill/decode handshake, or the
completion request response.
Wrap each NIXL step in a small _run_nixl helper that prints the cmd,
the result.stdout, the result.stderr, and the exit code with
========== markers — pytest is run with -s, so prints land in the
captured stdout that survives Fabric truncation.
Also cat /test/efa/logs/{prefill,decode,proxy}.log from inside the
containers at the end of the NIXL block. Those files are written by
nixl_disagg_pd*.sh and disappear when the EC2 fixture terminates the
instances; without dumping them here, debugging a failure means
relaunching by hand.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Same quiet-on-success problem the NIXL block had also affected the upstream NCCL path: a green run hid mpirun output, NCCL_DEBUG, the EFA provider selection, the diagnostics block, and the bandwidth extraction. So we couldn't see what actually happened, only "test passed in 1034s" with no audit trail.
Hoist the _step helper to the top of the test and apply it to:
- setup_nccl_tests (master + worker)
- efa_sanity
- nccl_allreduce
- nixl_libfabric_smoke (already wrapped, just renamed)
- nixl decode_launch + disagg_pd_orchestrator (renamed)
Every step now prints cmd, stdout, stderr, exit code with ====== markers so green runs leave evidence we can grep for "Selected provider is efa", "NET/Libfabric/0/GDRDMA", "nixl_disagg_pd test passed", etc.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
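A verbose step wrapper along these lines might look like the sketch below. It is not the `_step` helper from test_efa.py: the `step` signature, the `FakeResult` class, and the marker layout are assumptions. The `stdout`, `stderr` and `exited` attribute names mirror those of an invoke/Fabric result object.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeResult:
    """Hypothetical stand-in for the result object a command runner returns."""
    stdout: str
    stderr: str
    exited: int


def step(name: str, cmd: str, runner: Callable[[str], FakeResult]) -> FakeResult:
    """Run cmd through runner and print cmd, stdout, stderr, and exit code."""
    bar = "=" * 10
    print(f"{bar} {name}: cmd {bar}\n{cmd}")
    result = runner(cmd)
    print(f"{bar} {name}: stdout {bar}\n{result.stdout}")
    print(f"{bar} {name}: stderr {bar}\n{result.stderr}")
    print(f"{bar} {name}: exit={result.exited} {bar}")
    return result


if __name__ == "__main__":
    # A canned runner so the sketch runs without a container.
    def canned(_cmd: str) -> FakeResult:
        return FakeResult("Selected provider is efa\n", "", 0)

    step("nccl_allreduce", "bash /test/efa/scripts/nccl_allreduce.sh", canned)
```

With pytest run using `-s`, these prints go straight to the job log, so a passing run can still be grepped for the expected markers.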
…bric
Switch kv_role from kv_both → kv_producer (prefill) / kv_consumer (decode)
and assert via Prometheus /metrics that decode never ran a local prefill.
Before: both sides as kv_both meant decode could silently fall back to
re-prefilling the prompt locally if the libfabric KV-transfer channel was
broken. The completion-text check would still pass and we'd never notice.
After:
- decode (kv_consumer) refuses to prefill locally; if KV bytes don't
arrive from prefill over libfabric, the decode request hangs and the
orchestrator's curl times out → hard failure.
- After the completion succeeds, scrape /metrics from both servers and
assert:
prefill: vllm:prompt_tokens_total >= 6 (it did the prefill)
decode: vllm:prompt_tokens_total == 0 (it did NOT prefill — proof
KV came over the wire)
decode: vllm:generation_tokens_total >= 1 (decode produced tokens)
The decode.prompt_tokens_total == 0 line is the smoking-gun assertion:
the only way decode can generate tokens for a 6-token prompt without
seeing those 6 tokens locally is if the prefill node shipped its KV
cache to decode over libfabric/EFA.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Merge of main into vllm-efa-test re-introduced 'import pytest' from upstream, but no @pytest.fixture / pytest.* call exists in this file — the verbose _step refactor removed the only usage. ruff hook fails CI's check-changes job; remove the import to make pre-commit green. Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The merge of main (PR #6114, 956819d) added warn=True to a run_on_container() call. Our verbose-_step refactor (9522545) wraps that same call inside a local _step() helper which doesn't accept warn, so the merge produced _step(..., warn=True) — TypeError at runtime. _step already prints stdout/stderr/exit on success and pytest captures the UnexpectedExit on failure, so warn=True is no longer needed (it was only useful when the broken DLAMI was masking real errors). Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Root cause of the previous run's failure (decode.prompt_tokens=6 instead
of 0): NixlConnector in vLLM 0.21.0 does NOT enforce kv_role at the
engine level. Whether the decoder fetches remote KV vs re-prefills is
driven entirely by the per-request `kv_transfer_params` dict — which
our toy proxy was not shipping. So the kv_role=kv_consumer setup was a
no-op and decode silently re-prefilled the full prompt locally, even
while libfabric came up cleanly.
Fixes:
- toy_proxy_server.py: do the upstream-style 2-step handshake. Inject
placeholder {do_remote_decode: True, ...} into the prefill body with
max_tokens=1, read remote_engine_id/block_ids/host/port from prefill's
response, then forward those into the decode request so D's scheduler
uses the remote-pull path.
- nixl_disagg_pd.sh + nixl_disagg_pd_decode.sh: revert kv_role to
kv_both (matches upstream tests/v1/kv_connector/nixl_integration/
run_accuracy_test.sh). Also export VLLM_NIXL_SIDE_CHANNEL_HOST set
to the box's primary IP so the cross-host side channel is reachable.
- nixl_disagg_pd.sh metrics assertion: relax decode.prompt_tokens from
==0 to <=1. With the handshake, decode legitimately processes the
last token before generation; only the full-prompt re-prefill (==6)
indicates the connector failed.
Reference: upstream's tests/v1/kv_connector/nixl_integration/
toy_proxy_server.py at vllm-project/vllm.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
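The two-step handshake described in this commit can be sketched without any networking as two pure functions that build the prefill and decode request bodies. The function names are hypothetical. The placeholder holds only `do_remote_decode`, because the commit message elides the remaining fields. The sketch also assumes the prefill response exposes `kv_transfer_params` at its top level.

```python
import copy
from typing import Any, Dict


def build_prefill_body(request_body: Dict[str, Any]) -> Dict[str, Any]:
    """Step 1: ask prefill for one token and flag that decode happens remotely."""
    body = copy.deepcopy(request_body)
    body["max_tokens"] = 1
    # The upstream placeholder carries further fields that are not shown here.
    body["kv_transfer_params"] = {"do_remote_decode": True}
    return body


def build_decode_body(request_body: Dict[str, Any],
                      prefill_response: Dict[str, Any]) -> Dict[str, Any]:
    """Step 2: forward the populated kv_transfer_params from prefill into decode."""
    params = prefill_response.get("kv_transfer_params")
    if not params:
        # Without these, decode would re-prefill locally instead of pulling KV.
        raise ValueError("prefill returned no kv_transfer_params")
    body = copy.deepcopy(request_body)
    body["kv_transfer_params"] = params
    return body


if __name__ == "__main__":
    # All values below are made up for illustration.
    original = {"model": "example-model", "prompt": "hello", "max_tokens": 16}
    fake_prefill_response = {"kv_transfer_params": {"remote_engine_id": "engine-0"}}
    print(build_prefill_body(original))
    print(build_decode_body(original, fake_prefill_response))
```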
Wrap the orchestrator _step in try/finally so the prefill/proxy/decode log dumps fire even when the orchestrator script raises UnexpectedExit. Today's failure (decode.prompt_tokens=6, meaning KV transfer over libfabric did not happen) is opaque without those logs because we can't tell whether:
- prefill returned populated kv_transfer_params or null,
- the proxy successfully extracted remote_engine_id/block_ids/etc,
- decode's NixlConnector saw the do_remote_prefill flag.
The proxy log in particular is on the master container and would show which side of the handshake silently dropped, but the previous code only dumped on success, which is exactly when we don't need the diagnostics.
Diagnostic-only change; no behavior modifications.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
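The try/finally pattern described here can be sketched as below. `run_with_log_dump` and its callable parameters are hypothetical; the log paths follow the `/test/efa/logs/{prefill,decode,proxy}.log` names mentioned earlier in this PR.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_log_dump(
    run_orchestrator: Callable[[], T],
    dump_log: Callable[[str], None],
    log_names: Sequence[str] = ("prefill", "decode", "proxy"),
) -> T:
    """Run the orchestrator; dump every log afterwards whether it passed or raised."""
    try:
        return run_orchestrator()
    finally:
        for name in log_names:
            try:
                dump_log(f"/test/efa/logs/{name}.log")
            except Exception as exc:
                # A missing log must not mask the orchestrator's own exception.
                print(f"could not dump {name}.log: {exc}")


if __name__ == "__main__":
    def failing_orchestrator() -> None:
        raise RuntimeError("orchestrator failed")

    try:
        run_with_log_dump(failing_orchestrator, lambda path: print("dumping", path))
    except RuntimeError as exc:
        print("propagated:", exc)
```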
Apply the launch flags that the upstream tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh uses on both prefill and decode:
- --block-size 128: must match across P and D for remote_block_ids to map correctly. OPT defaults to 16; upstream pins 128 explicitly so the lookup at scheduler.py works regardless of model.
- VLLM_KV_CACHE_LAYOUT=HND: required by NixlConnector. Without it the attention backend can pick a layout NIXL doesn't support, silently falling back to local prefill.
- kv_load_failure_policy=fail in kv_connector_extra_config: turns a missing/invalid KV handoff into a hard error instead of a silent re-prefill. Symptoms become loud (HTTP 5xx) instead of "passed but decode.prompt_tokens=6".
- Drop UCX_NET_DEVICES=all on both: UCX-only env, no-op for the LIBFABRIC backend we use.
Also enrich the proxy diagnostic so the next failure (if any) is trivially debuggable: dump the full kv_transfer_params dict and the sorted key list returned by prefill, so we can see at a glance whether all required keys (do_remote_prefill, remote_block_ids, remote_engine_id, remote_request_id, remote_host, remote_port, tp_size, remote_num_tokens) are present.
Reference: vllm/tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh at the v0.21.0 tag.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
added 2 commits
May 21, 2026 22:17
Replace the prompt_tokens_total / cache-hit assertion with a direct NixlConnector counter check. The previous run proved (via vLLM's own log line "KV Transfer metrics: Num successful transfers=1, Throughput 1003 MB/s") that NIXL+LIBFABRIC over EFA is fully working — the test was failing on a wrong proof metric. vllm:prompt_tokens_total counts prompt tokens regardless of whether the KV was reused, so it can never distinguish "KV pulled from remote" from "KV recomputed locally". vllm:nixl_xfer_time_seconds_count is a histogram counter that increments per successful NIXL transfer; > 0 proves blocks crossed the wire from prefill to decode. Pair with vllm:nixl_num_failed_transfers == 0 to fail loudly on transport errors. Drop the prefill.prompt_tokens assertion (no longer needed — the NIXL counter on decode is the authoritative proof). Signed-off-by: Yadan Wei <yadanwei@amazon.com>
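The counter check this commit describes can be sketched as a small parser over the Prometheus text that `/metrics` returns. The metric names are taken from the commit message as written. The helper names are hypothetical, and the parser assumes sample lines end with the value and carry no trailing timestamp.

```python
def metric_total(metrics_text: str, name: str) -> float:
    """Sum every sample of the metric `name` across all label sets."""
    total = 0.0
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        if series.split("{", 1)[0] == name:
            total += float(value)
    return total


def assert_kv_transferred(decode_metrics: str) -> None:
    """Fail unless decode saw at least one NIXL transfer and no failed ones."""
    transfers = metric_total(decode_metrics, "vllm:nixl_xfer_time_seconds_count")
    failed = metric_total(decode_metrics, "vllm:nixl_num_failed_transfers")
    if transfers <= 0:
        raise AssertionError("no NIXL transfer was recorded on decode")
    if failed != 0:
        raise AssertionError(f"{failed:g} NIXL transfers failed")


if __name__ == "__main__":
    # Sample text is made up; real output carries whatever labels vLLM attaches.
    sample = (
        "# TYPE vllm:nixl_xfer_time_seconds histogram\n"
        'vllm:nixl_xfer_time_seconds_count{model_name="m"} 1.0\n'
        'vllm:nixl_num_failed_transfers{model_name="m"} 0.0\n'
    )
    assert_kv_transferred(sample)
    print("KV transfer check passed")
```

Matching on the exact metric name before the `{` keeps the `_count` series from being confused with the `_sum` or `_bucket` series of the same histogram.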
Two related changes now that AWS DLAMI shipped a working AL2023 base image (matching nvidia-fabricmanager + driver pair as of 2026-05-21):
1. .github/scripts/efa/ec2_helpers.py: revert the ami-0d2923a2dd541bdeb us-west-2 pin and go back to aws_session.get_latest_ami() everywhere. The pin was a workaround for a 12-day window (May 8-19) when DLAMI bumped to driver 580.150 without a matching fabricmanager package, which broke cuInit on p4d (NVSwitch) and surfaced as misleading "aws-ofi-nccl is not working" failures. AWS fixed it in the 2026-05-21 build, verified by the 23 GB/s NCCL allreduce + working NIXL transfer in the passing CI run.
2. .github/workflows/pr-pytorch-ec2-cuda.yml: add efa-test-change to check-changes (paths: test/efa/**, .github/scripts/efa/**, .github/workflows/reusable-efa-tests.yml). Mirror the sanity-test fallback so efa-test runs against the prod PyTorch image when build-images is skipped (PRs that touch only test/efa/** without the docker context). This means changes to the EFA test fixture itself get verified on PyTorch too, not just vLLM.
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Summary
- […] `pr-vllm-ec2.yml` so vLLM Ubuntu images get NCCL-over-EFA validation on every build.
- […] `RUN_NIXL_TESTS=1` (vLLM only): a libfabric plugin packaging smoke, and a multi-node disaggregated prefill/decode test that strictly verifies KV cache transfer over libfabric/EFA via NixlConnector Prometheus metrics.
Why
PyTorch DLCs already have `test/efa/test_efa.py` running against every image. vLLM ships the same EFA stack (installer, OpenMPI, aws-ofi-nccl) but never had its own EFA test. NIXL with the libfabric backend is the path vLLM uses for disaggregated PD over EFA, so it deserves coverage too.
What changed
Reuses the existing PyTorch EFA fixture, doesn't duplicate it.
- `test/efa/test_efa.py`: […] `RUN_NIXL_TESTS=1`. Wraps each step in a verbose `_step` helper that prints stdout/stderr/exit, and dumps prefill/proxy/decode logs even on failure (try/finally).
- `test/efa/scripts/nccl_allreduce.sh`: […] `CUDA_HOME=/usr/local/cuda` instead of erroring out (PyTorch sets it; vLLM doesn't).
- `test/efa/scripts/setup_nccl_tests.sh` (new): […] `apt install libnccl-dev` if missing and compiles `all_reduce_perf` against `NCCL_HOME=/usr`.
- `test/efa/scripts/nixl_libfabric_smoke.py` (new): […] `nixl`, instantiates an agent, calls `create_backend("LIBFABRIC", {"provider": "efa"})`. Catches packaging regressions on the `nixl-cu*` wheel.
- `test/efa/scripts/nixl_disagg_pd.sh` (new): […] `vllm serve` with `NixlConnector + LIBFABRIC + kv_load_failure_policy:fail`, polls worker decode server, launches proxy, sends a `/v1/completions` request, then scrapes decode's `/metrics` and asserts `vllm:nixl_xfer_time_seconds_count >= 1` and `vllm:nixl_num_failed_transfers == 0`, which directly proves at least one KV cache block crossed prefill→decode over libfabric/EFA.
- `test/efa/scripts/nixl_disagg_pd_decode.sh` (new): […] `kv_role:"kv_both"` (NixlConnector ignores `kv_role`; per-request `kv_transfer_params` does the routing) with matching `--block-size 128` and `VLLM_KV_CACHE_LAYOUT=HND` env var.
- `test/efa/scripts/toy_proxy_server.py` (new): […] `kv_transfer_params={do_remote_decode:True,...}` into prefill body with `max_tokens=1`, extract populated `remote_engine_id/block_ids/host/port/request_id` from prefill response, forward those into decode's request body. ~110 lines, modeled after upstream's `tests/v1/kv_connector/nixl_integration/toy_proxy_server.py`.
- `.github/workflows/reusable-efa-tests.yml`: […] `run-nixl-tests` boolean input; passes through to pytest as `RUN_NIXL_TESTS`. Per-PR concurrency group keyed on workflow + PR number.
- `.github/workflows/pr-vllm-ec2.yml`: […] `efa-test` job (gated on `build-image + sanity-test + security-test`) calling the existing reusable workflow with `run-nixl-tests: true`.
Verified on real p4d (CI run 26269852212)
- […] Backend LIBFABRIC was instantiated.
- […] KV Transfer metrics: Num successful transfers=1, Avg MB per transfer=4.5, Throughput (MB/s)=1003, and `vllm:nixl_xfer_time_seconds_count` confirms at least one transfer arrived at decode.
Test plan
- […] `pr-vllm-ec2` `efa-test` job runs against the new vLLM image and is green.
- […] (`pr-pytorch-ec2-cuda`) still passes: `setup_nccl_tests.sh` short-circuits on the preinstalled binary; `RUN_NIXL_TESTS` defaults to `0`.
- […] `kv_load_failure_policy:fail` makes NIXL transport errors return 5xx (instead of silently re-prefilling); the `vllm:nixl_*` metric assertions catch a missing transfer even when the completion text is coherent.
Notes
- `kv_role` is advisory at the engine level in vLLM 0.21.0; what actually triggers the remote-pull path is the per-request `kv_transfer_params` dict the proxy injects. Both servers run with `kv_both`, matching upstream's `tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh`.
- […] `vllm:prompt_tokens_total` (which always counts the full prompt, regardless of whether KV came from cache or recompute); it's `vllm:nixl_xfer_time_seconds_count`, the histogram counter NixlConnector increments per successful transfer.