
test(vllm): add EFA test for vLLM Ubuntu (NCCL + NIXL)#6113

Open
Yadan-Wei wants to merge 31 commits into main from vllm-efa-test


@Yadan-Wei Yadan-Wei commented May 19, 2026

Summary

  • Wires the existing 2x p4d EFA fixture into pr-vllm-ec2.yml so vLLM Ubuntu images get NCCL-over-EFA validation on every build.
  • Adds two NIXL EFA checks gated on RUN_NIXL_TESTS=1 (vLLM only): a libfabric plugin packaging smoke, and a multi-node disaggregated prefill/decode test that strictly verifies KV cache transfer over libfabric/EFA via NixlConnector Prometheus metrics.

Why

PyTorch DLCs already have test/efa/test_efa.py running against every image. vLLM ships the same EFA stack (installer, OpenMPI, aws-ofi-nccl) but never had its own EFA test. NIXL with the libfabric backend is the path vLLM uses for disaggregated PD over EFA — it deserves coverage too.

What changed

Reuses the existing PyTorch EFA fixture rather than duplicating it.

| File | Change |
| --- | --- |
| test/efa/test_efa.py | Adds optional NIXL smoke + disaggregated PD steps after the existing NCCL test, gated on RUN_NIXL_TESTS=1. Wraps each step in a verbose _step helper that prints stdout/stderr/exit, and dumps prefill/proxy/decode logs even on failure (try/finally). |
| test/efa/scripts/nccl_allreduce.sh | Defaults CUDA_HOME=/usr/local/cuda instead of erroring out (PyTorch sets it; vLLM doesn't). |
| test/efa/scripts/setup_nccl_tests.sh (new) | No-op on PyTorch DLCs. On vLLM Ubuntu, runs apt install libnccl-dev if missing and compiles all_reduce_perf against NCCL_HOME=/usr. |
| test/efa/scripts/nixl_libfabric_smoke.py (new) | Imports nixl, instantiates an agent, calls create_backend("LIBFABRIC", {"provider": "efa"}). Catches packaging regressions on the nixl-cu* wheel. |
| test/efa/scripts/nixl_disagg_pd.sh (new) | Master orchestrator. Launches prefill vllm serve with NixlConnector + LIBFABRIC + kv_load_failure_policy:fail, polls worker decode server, launches proxy, sends a /v1/completions request, then scrapes decode's /metrics and asserts vllm:nixl_xfer_time_seconds_count >= 1 and vllm:nixl_num_failed_transfers == 0. This directly proves at least one KV cache block crossed prefill→decode over libfabric/EFA. |
| test/efa/scripts/nixl_disagg_pd_decode.sh (new) | Daemonized worker decode endpoint. Uses kv_role:"kv_both" (NixlConnector ignores kv_role; per-request kv_transfer_params does the routing) with matching --block-size 128 and VLLM_KV_CACHE_LAYOUT=HND env var. |
| test/efa/scripts/toy_proxy_server.py (new) | Minimal FastAPI proxy doing the upstream-style 2-step handshake: inject kv_transfer_params={do_remote_decode:True,...} into prefill body with max_tokens=1, extract populated remote_engine_id/block_ids/host/port/request_id from prefill response, forward those into decode's request body. ~110 lines, modeled after upstream's tests/v1/kv_connector/nixl_integration/toy_proxy_server.py. |
| .github/workflows/reusable-efa-tests.yml | New run-nixl-tests boolean input; passes through to pytest as RUN_NIXL_TESTS. Per-PR concurrency group keyed on workflow + PR number. |
| .github/workflows/pr-vllm-ec2.yml | Adds efa-test job (gated on build-image + sanity-test + security-test) calling the existing reusable workflow with run-nixl-tests: true. |

Verified on real p4d (CI run 26269852212)

  • NCCL all_reduce_perf across 2x p4d.24xlarge: 23 GB/s busbw at 1 GiB, EFA provider selected (4 NICs, GDRDMA on all 4 channels).
  • NIXL libfabric smoke: Backend LIBFABRIC was instantiated.
  • NIXL disagg PD: prefill→decode completion returns coherent text, NixlConnector emits KV Transfer metrics: Num successful transfers=1, Avg MB per transfer=4.5, Throughput (MB/s)=1003, and vllm:nixl_xfer_time_seconds_count confirms at least one transfer arrived at decode.

Test plan

  • pr-vllm-ec2 efa-test job runs against the new vLLM image and is green.
  • PyTorch CI path (pr-pytorch-ec2-cuda) still passes — setup_nccl_tests.sh short-circuits on the preinstalled binary; RUN_NIXL_TESTS defaults to 0.
  • Each calling workflow gets its own per-PR concurrency group (keyed on workflow + PR number), so PyTorch and vLLM EFA runs within the same PR queue for p4d capacity instead of cancelling each other.
  • Failure modes are loud: kv_load_failure_policy:fail makes NIXL transport errors return 5xx (instead of silently re-prefilling); the vllm:nixl_* metric assertions catch a missing transfer even when the completion text is coherent.

Notes

  • NixlConnector's kv_role is advisory at the engine level in vLLM 0.21.0 — what actually triggers the remote-pull path is the per-request kv_transfer_params dict the proxy injects. Both servers run with kv_both matching upstream's tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh.
  • The proof metric is not vllm:prompt_tokens_total (that always counts the full prompt, regardless of whether KV came from cache or recompute) — it's vllm:nixl_xfer_time_seconds_count, the histogram counter NixlConnector increments per successful transfer.
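
The decode-side proof can be sketched as a small parser over the Prometheus text format. This is an illustration only, not the check in nixl_disagg_pd.sh: the helper names `metric_total` and `assert_kv_transfer` and the sample payload are hypothetical, and the metric names are assumed to be exposed exactly as written in this description.

```python
import re

# One sample line of the Prometheus text format: name{labels} value
# (label values containing "}" are not handled by this simple pattern).
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')

# Hypothetical /metrics payload used only to exercise the helpers.
SAMPLE_METRICS = """\
# TYPE vllm:nixl_xfer_time_seconds histogram
vllm:nixl_xfer_time_seconds_count{engine="0"} 1.0
vllm:nixl_xfer_time_seconds_sum{engine="0"} 0.0045
vllm:nixl_num_failed_transfers{engine="0"} 0.0
vllm:prompt_tokens_total{engine="0"} 6.0
"""


def metric_total(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` across label sets; 0.0 if it is absent."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # HELP / TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total


def assert_kv_transfer(metrics_text: str) -> None:
    """Fail unless at least one NIXL transfer succeeded and none failed."""
    xfers = metric_total(metrics_text, "vllm:nixl_xfer_time_seconds_count")
    failed = metric_total(metrics_text, "vllm:nixl_num_failed_transfers")
    assert xfers >= 1, f"no NIXL transfer recorded on decode (count={xfers})"
    assert failed == 0, f"{failed} NIXL transfers failed"


assert_kv_transfer(SAMPLE_METRICS)
```

Matching on the exact metric name (rather than a prefix) keeps the `_sum` and `_bucket` series of the same histogram out of the count.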

Wires the existing EFA test harness into pr-vllm-ec2.yml. Same 2x p4d
fixture covers three checks against EFA:

- NCCL collectives over EFA (existing test, now image-agnostic).
- NIXL libfabric plugin packaging smoke (new, single-process).
- NIXL disaggregated prefill/decode across nodes with LIBFABRIC backend
  (new, two vLLM servers + minimal proxy, gated on RUN_NIXL_TESTS=1).

setup_nccl_tests.sh is a no-op on PyTorch DLCs (binary preinstalled) and
compiles nccl-tests against the nvidia-nccl-cu12 wheel on vLLM Ubuntu.
nccl_allreduce.sh now defaults CUDA_HOME=/usr/local/cuda for images that
do not export it.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Yadan Wei and others added 28 commits May 19, 2026 22:34
Without a dedicated change filter, build-change stayed false on EFA-only
edits (test/efa/**, reusable-efa-tests.yml), build-image was skipped, and
efa-test never fired. Mirrors the sanity-test/telemetry-test/model-tests
pattern: fall back to the prod image when no rebuild happened.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
verifiable.cu compiled for nccl-tests' default 9 archs peaks at 8+GB
and got SIGKILL'd on the test host (exit 137). Detect the compute
capability via nvidia-smi and build only for that one arch; serialize
make to keep memory pressure flat.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
nccl-tests' verifiable.cu compiled with nvcc OOM-kills the build on the
test host even with single-arch and -j1 (>8GB peak per file). It's also
unconditionally linked into every _perf binary, so we can't strip it.

Drop the build entirely. Use torch.distributed all_reduce when the
preinstalled binary isn't present (vLLM image): same NCCL→aws-ofi-nccl→
EFA path, no compile step. nccl_allreduce.sh picks the implementation at
runtime; existing log validators (aws-ofi-nccl, "Selected provider is
efa", Libfabric, GDRDMA) work for both since they parse NCCL_DEBUG=INFO
output. Bandwidth threshold (3 GB/s) is now read from a JSON line that
torch_allreduce.py emits on rank 0.

PyTorch DLC path unchanged: setup_nccl_tests.sh short-circuits, the
existing all_reduce_perf binary is used, the existing perf check parses
the same column 11 it always did.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
master's V1 EFA test (test/dlc_tests/container_tests/bin/efa/build_all_reduce_perf.sh)
builds nccl-tests inside the test container with NCCL_HOME=/usr/local and
default NVCC_GENCODE — and works on p4d. Earlier OOMs were from a different
NCCL_HOME path (the python wheel) which can produce different template
instantiation against alternate headers.

Match master:
- NCCL_HOME=/usr/local first; fall back to the nvidia-nccl-cu12 wheel only
  when /usr/local/include/nccl.h is absent (vLLM image case).
- No NVCC_GENCODE override — let nccl-tests use its defaults.
- Drop the torch.distributed fallback and revert nccl_allreduce.sh to the
  original single-binary path.

Removes torch_allreduce.py.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Last run failed at `python3 -c "import nvidia.nccl"` because import
succeeded but `nvidia.nccl.__file__` was None; namespace package
behavior differs across cu12 and cu13 wheel builds.

Replace the import probe with a path-walk over Python's site-packages /
dist-packages dirs, looking for the canonical nvidia/nccl/include/nccl.h
layout that both cu12 and cu13 wheels use. Add diagnostic output (sys.path
+ find) so the next failure (if any) surfaces actual evidence instead of
an empty WHEEL_DIR.
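
The path-walk described above might look roughly like the following. It is a sketch under assumptions: the function names are hypothetical, and the only layout it relies on is the nvidia/nccl/include/nccl.h path named in this commit message.

```python
import site
import sys
import sysconfig
from pathlib import Path
from typing import Iterable, List, Optional


def candidate_package_dirs() -> List[Path]:
    """Collect site-packages / dist-packages directories without importing nvidia.nccl."""
    dirs = [sysconfig.get_paths()["purelib"], sysconfig.get_paths()["platlib"]]
    getsite = getattr(site, "getsitepackages", None)  # missing in some old virtualenvs
    if getsite is not None:
        dirs.extend(getsite())
    dirs.extend(p for p in sys.path if p.endswith(("site-packages", "dist-packages")))
    # De-duplicate while preserving search order.
    return list(dict.fromkeys(Path(d) for d in dirs))


def find_nccl_wheel_home(search_dirs: Iterable) -> Optional[Path]:
    """Return the first <dir>/nvidia/nccl that contains include/nccl.h, else None."""
    for directory in search_dirs:
        home = Path(directory) / "nvidia" / "nccl"
        if (home / "include" / "nccl.h").is_file():
            return home
    return None


if __name__ == "__main__":
    # Diagnostic output so an empty result still leaves evidence in the log.
    for d in candidate_package_dirs():
        print("searched:", d)
    print("NCCL wheel home:", find_nccl_wheel_home(candidate_package_dirs()))
```

Checking for the header file directly sidesteps the namespace-package case where the import succeeds but `__file__` is None.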

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
verifiable.cu compiled against nvidia-nccl-cu13 wheel headers OOMs nvcc
(SIGKILL at exit 137) because the wheel ships heavier-templated headers
from a different NCCL build than master uses. master's V1 PyTorch DLC
builds NCCL from source into /usr/local, producing the thin headers
nccl-tests expects.

Cheapest reliable fix: apt-install libnccl-dev at test time. Same NCCL
version the image runs against (Ubuntu repo tracks the upstream NCCL
release), thin headers, ~50MB transient. Use NCCL_HOME=/usr which is
where apt puts it. The container is torn down after the test so this
doesn't persist into the published image.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Previous run was killed at ~787s with exit 137. DEFAULT_TIMEOUT (600s)
applied to setup_nccl_tests.sh wasn't enough — verifiable.cu is
template-heavy and the build legitimately takes ~13 min on a p4d.

Earlier hypothesis was nvcc OOM, but the log shows nccl.h was already
at /usr/include before the apt install (libnccl-dev was already present
from the image's EFA install path), so the apt-headers-fix attempt
didn't change the compile inputs at all. The kill was a fabric/invoke
timeout sending SIGKILL to docker exec, surfacing as 137.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Latest CI run failed with exit 137 from docker exec — UnexpectedExit
from Fabric, not CommandTimedOut, so the 1500s timeout wasn't hit. The
container itself was SIGKILL'd. p4d has 1.1TB RAM and devbox repro
peaked at 150MB, so plain OOM is unlikely; remaining hypotheses are
container cgroup limit, nvidia-runtime hook, or disk exhaustion.

Two changes:
- Build for the host's actual SM arch only (typically compute_80 on
  p4d) with -j1. Default GENCODE targets ~9 archs and at peak runs
  multiple nvcc forks; even with each fork small, the aggregate may
  trip an unknown limit.
- Print free/df/nproc/cgroup memory.max BEFORE the build, and a trap
  EXIT that dumps the same plus memory.peak and partial build artifacts
  AFTER any failure (including a SIGKILL'd child such as nvcc; the trap
  cannot fire if the script's own shell receives SIGKILL).

Next failure (if any) will have actionable evidence in the captured
log: which file was being compiled, how much memory the cgroup
allowed, how full /tmp was when killed.
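
As an illustration of the single-arch build, the compute capability reported by nvidia-smi can be mapped to one NVCC_GENCODE value. This is a Python sketch of logic the PR implements in a shell script; the function names are hypothetical, the `compute_cap` query field assumes a reasonably recent driver, and `nccl_home` is a parameter because the right value depends on the image.

```python
import subprocess
from typing import List


def detect_compute_cap() -> str:
    """Ask nvidia-smi for the first GPU's compute capability, e.g. '8.0'."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.splitlines()[0].strip()


def gencode_for(compute_cap: str) -> str:
    """Turn '8.0' into a single -gencode flag targeting only that architecture."""
    major, minor = compute_cap.strip().split(".")
    arch = f"{int(major)}{int(minor)}"
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"


def make_args(compute_cap: str, nccl_home: str) -> List[str]:
    """Serialized (-j1) nccl-tests build restricted to one architecture."""
    return [
        "make", "-j1",
        f"NCCL_HOME={nccl_home}",
        f"NVCC_GENCODE={gencode_for(compute_cap)}",
    ]


# Example for an A100 (compute capability 8.0), without touching the GPU:
print(make_args("8.0", "/usr"))
```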

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
PyTorch's efa-test was gated on a fresh image build via 'if: success()'
of build-images + sanity + security + unit-test. So edits to test/efa/**
or reusable-efa-tests.yml that don't touch the Dockerfile would skip
PyTorch's EFA validation entirely — even though those test files are
shared with vLLM. Mirror the vllm workflow: a new efa-test-change paths
filter, and the efa-test job now fires on (build OR efa-test-change),
falling back to the prod image when no rebuild happened.

This way a PR like the current one (which only edits test/efa/scripts/
setup_nccl_tests.sh) actually validates that change against both
PyTorch and vLLM DLCs.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
vLLM image's ENTRYPOINT is dockerd_entrypoint.sh which invokes
'python3 -m vllm.entrypoints.openai.api_server "$@"'. Running
`docker run -id <vllm-image> bash` becomes `api_server bash`, so vllm
treats 'bash' as a model_tag and tries to download it from HF.
Eventually it crashes and the container exits — by the time the slower
side of the test (worker) runs setup_nccl_tests.sh, the container is
gone and dockerd returns "container is not running" with exit 1.

Fix: pass --entrypoint /bin/bash and 'sleep infinity' so the container
stays alive regardless of which framework-specific entrypoint the
image ships. PyTorch DLC's entrypoint is also bypassed, but that's
harmless since the test only uses docker exec.

This was hidden behind the prior failure mode (137 from setup_nccl_tests
hitting the master container before its api_server crashed). Worker
containers had time to fully crash by the time the loop reached them.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The reusable efa-test workflow uses 'efa-test-global / cancel-in-progress: false'
to serialize p4d capacity. But GitHub allows only one *pending* run per
group: when N>=2 callers (PyTorch + vLLM) target the same group on a new
commit, the second pending one displaces the first with "Canceling since
a higher priority waiting request for efa-test-global exists".

Same effect on every push: existing pending efa-test gets cancelled by
the next caller pushing into queue.

Add a per-caller concurrency group keyed by workflow + PR number with
cancel-in-progress: true on the calling efa-test job. Now:
- Same PR re-pushed → previous efa-test of that workflow is cancelled
  (which is what we want — old commit is obsolete).
- Different workflow (PyTorch vs vLLM) → different group → both queue
  on efa-test-global without displacing each other.
- Different PR → different group → same.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Last commit overrode entrypoint to /bin/bash for both image families.
That fixed vLLM (whose dockerd_entrypoint.sh execs `vllm serve "$@"`
and crashes on `bash`). But it broke PyTorch: its entrypoint sets
LD_LIBRARY_PATH=/usr/local/cuda/compat:... when the host's nvidia
driver is older than the cuda-compat version. Without that env var,
NCCL fails to initialize on p4d hosts ("aws-ofi-nccl is not working").

Detect vLLM by image URI substring; only override the entrypoint
there. PyTorch keeps its original `entrypoint.sh bash` invocation.

Both paths still produce a long-lived container that docker exec can
land on.
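
The image-dependent container launch could be sketched as below. The function name and the example image URIs are hypothetical, and the GPU, device, and network flags a real EFA run needs are deliberately left out; only the entrypoint handling described above is shown.

```python
from typing import List


def container_start_cmd(image_uri: str, name: str) -> List[str]:
    """Build a `docker run` argv that leaves a long-lived container for docker exec."""
    base = ["docker", "run", "-id", "--name", name]
    if "vllm" in image_uri.lower():
        # vLLM's entrypoint would treat "bash" as a model tag, so bypass it
        # and keep the container alive with a plain sleep.
        return base + ["--entrypoint", "/bin/bash", image_uri, "-c", "sleep infinity"]
    # PyTorch keeps its own entrypoint, which sets LD_LIBRARY_PATH for cuda-compat.
    return base + [image_uri, "bash"]


# Hypothetical image URIs, for illustration only.
print(container_start_cmd("example.registry/vllm:tag", "efa-master"))
print(container_start_cmd("example.registry/pytorch-training:tag", "efa-master"))
```

Keying on a URI substring is brittle if repository names change; an explicit framework parameter passed from the workflow would be a sturdier alternative.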

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Adding efa-test-change to the PyTorch workflow caused efa-test to run
against the prod image on this PR (which only edits test/efa/**). The
prod pytorch:2.11-cu130-amzn2023 image fails the EFA test with
"aws-ofi-nccl is not working" — a preexisting issue unrelated to this
PR's changes. Surfacing it here red-blocks merging.

Revert to the original pattern: PyTorch efa-test only runs when the
image was actually rebuilt by this PR. The prod image is presumed
validated at its release. The vLLM workflow keeps the efa-test-change
trigger because we're still iterating on the test against the vLLM
prod image.

Keep the per-caller concurrency block — it's still needed to prevent
multiple workflows from cross-cancelling on efa-test-global.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Test failures show only 2 lines from mpirun (Test failure common.cu:1218
+ "Process exited with code 2"), no NCCL_DEBUG output despite
-x NCCL_DEBUG=INFO. Need actual evidence of:
- whether libnccl is on ldconfig path
- whether all_reduce_perf links against the expected libnccl
- whether libfabric finds the EFA provider
- whether aws-ofi-nccl plugin .so is in place

Also cat the captured TRAINING_LOG file at the end since the master
container is torn down on test exit and the file is otherwise lost.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…'t drop them

Previous diagnostic dump (nvidia-smi, ldd, fi_info, etc.) was printed
inline to stdout before mpirun, but Fabric/invoke's UnexpectedExit only
keeps the last few KB of stdout in the failure trace — those lines got
truncated from the captured output.

Write diagnostics to /test/efa/logs/diagnostics.log first, then cat that
file at the very end (right before validators), so the diagnostic
content lands in the tail of stdout that survives truncation.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
ca89ff0 wrote diagnostics to a file and cat'd them, but Fabric still
truncated to the last ~3KB of stdout — and the captured tail showed
only the testEFA.log content + "aws-ofi-nccl is not working", with
the diagnostics block dropped from the head of the truncation window.

Reorder so diagnostics + final probes are the LAST output before the
validators run. The probes are minimal: ldd output, libnccl paths,
aws-ofi-nccl .so paths. Those three are enough to diagnose the runtime
linker's view.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The current SSM-resolved latest AL2023 base-with-single-cuda AMI ships
NVIDIA driver 580.150, but no matching nvidia-fabricmanager package
exists in NVIDIA's cuda-rhel9 repo (only 580.65/580.95/580.159). On
NVSwitch systems (p4d.24xlarge) cuInit returns CUDA_ERROR_SYSTEM_NOT_
YET_INITIALIZED (802) without a matching FM running, which makes the
nccl-tests harness fail before NCCL even initializes — surfaces as the
misleading "aws-ofi-nccl is not working" validator message.

Pin to ami-0d2923a2dd541bdeb (us-west-2, 2026-05-01 build, driver
580.126.09 + matching FM preinstalled). Verified locally on 2026-05-20:
cuInit succeeds, NCCL initializes, EFA provider selected and GDRDMA
channels established between two p4d instances.

Drop the pin once AWS DLAMI publishes an AMI where the bundled driver
matches an available FM package.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
run_on_container wraps the command in `bash -c '<cmd>'`. Any single
quotes inside cmd (cut -d ' ', awk 'NR==2{...}', etc.) break the
outer wrapping and the docker exec sees a malformed command:

    cut: option requires an argument -- 'd'

Replace the inline `sed | cut` with `cat` of the whole file, then
split lines and fields in Python — no shell quotes to manage.
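
The failure is easy to reproduce with the standard library. The sketch below shows how the naive wrapping splits the command, and one alternative fix (shlex.quote) that is not what this PR chose; the PR avoids inner quotes altogether. The wrapper names and the file path are hypothetical.

```python
import shlex


def wrap_naive(cmd: str) -> str:
    """What the broken wrapping amounts to: inner single quotes end the outer ones."""
    return f"bash -c '{cmd}'"


def wrap_quoted(cmd: str) -> str:
    """Alternative fix: let shlex.quote escape the command for the outer shell."""
    return f"bash -c {shlex.quote(cmd)}"


cmd = "cut -d ' ' -f 2 /tmp/hosts"

# shlex.split tokenizes the way a POSIX shell would.
naive_argv = shlex.split(wrap_naive(cmd))
quoted_argv = shlex.split(wrap_quoted(cmd))

print(naive_argv)   # bash -c receives only "cut -d " as its script
print(quoted_argv)  # bash -c receives the full original command
```

In the naive form, `cut` runs with `-d` and no delimiter, which matches the "option requires an argument" error quoted above.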

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…logs

Last green run hid all NIXL output: pytest's captured-log section only
shows the LOGGER.info "Running on ..." lines, not the result.stdout of
each run_on_container call. So we couldn't see the libfabric_smoke
output, the disagg PD orchestrator's prefill/decode handshake, or the
completion request response.

Wrap each NIXL step in a small _run_nixl helper that prints the cmd,
the result.stdout, the result.stderr, and the exit code with
========== markers — pytest is run with -s, so prints land in the
captured stdout that survives Fabric truncation.

Also cat /test/efa/logs/{prefill,decode,proxy}.log from inside the
containers at the end of the NIXL block. Those files are written by
nixl_disagg_pd*.sh and disappear when the EC2 fixture terminates the
instances; without dumping them here, debugging a failure means
relaunching by hand.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Same quiet-on-success problem the NIXL block had also affected the
upstream NCCL path: a green run hid mpirun output, NCCL_DEBUG, the
EFA provider selection, the diagnostics block, and the bandwidth
extraction. So we couldn't see what actually happened — only "test
passed in 1034s" with no audit trail.

Hoist the _step helper to the top of the test and apply it to:
- setup_nccl_tests (master + worker)
- efa_sanity
- nccl_allreduce
- nixl_libfabric_smoke (already wrapped, just renamed)
- nixl decode_launch + disagg_pd_orchestrator (renamed)

Every step now prints cmd, stdout, stderr, exit code with ====== markers
so green runs leave evidence we can grep for "Selected provider is efa",
"NET/Libfabric/0/GDRDMA", "nixl_disagg_pd test passed", etc.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
…bric

Switch kv_role from kv_both → kv_producer (prefill) / kv_consumer (decode)
and assert via Prometheus /metrics that decode never ran a local prefill.

Before: both sides as kv_both meant decode could silently fall back to
re-prefilling the prompt locally if the libfabric KV-transfer channel was
broken. The completion-text check would still pass and we'd never notice.

After:
- decode (kv_consumer) refuses to prefill locally; if KV bytes don't
  arrive from prefill over libfabric, the decode request hangs and the
  orchestrator's curl times out → hard failure.
- After the completion succeeds, scrape /metrics from both servers and
  assert:
    prefill: vllm:prompt_tokens_total >= 6   (it did the prefill)
    decode:  vllm:prompt_tokens_total == 0   (it did NOT prefill — proof
                                              KV came over the wire)
    decode:  vllm:generation_tokens_total >= 1 (decode produced tokens)

The decode.prompt_tokens_total == 0 line is the smoking-gun assertion:
the only way decode can generate tokens for a 6-token prompt without
seeing those 6 tokens locally is if the prefill node shipped its KV
cache to decode over libfabric/EFA.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Merge of main into vllm-efa-test re-introduced 'import pytest' from
upstream, but no @pytest.fixture / pytest.* call exists in this file —
the verbose _step refactor removed the only usage. ruff hook fails CI's
check-changes job; remove the import to make pre-commit green.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
The merge of main (PR #6114, 956819d) added warn=True to a
run_on_container() call. Our verbose-_step refactor (9522545) wraps
that same call inside a local _step() helper which doesn't accept warn,
so the merge produced _step(..., warn=True) — TypeError at runtime.

_step already prints stdout/stderr/exit on success and pytest captures
the UnexpectedExit on failure, so warn=True is no longer needed (it was
only useful when the broken DLAMI was masking real errors).

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Root cause of the previous run's failure (decode.prompt_tokens=6 instead
of 0): NixlConnector in vLLM 0.21.0 does NOT enforce kv_role at the
engine level. Whether the decoder fetches remote KV vs re-prefills is
driven entirely by the per-request `kv_transfer_params` dict — which
our toy proxy was not shipping. So the kv_role=kv_consumer setup was a
no-op and decode silently re-prefilled the full prompt locally, even
while libfabric came up cleanly.

Fixes:
- toy_proxy_server.py: do the upstream-style 2-step handshake. Inject
  placeholder {do_remote_decode: True, ...} into the prefill body with
  max_tokens=1, read remote_engine_id/block_ids/host/port from prefill's
  response, then forward those into the decode request so D's scheduler
  uses the remote-pull path.
- nixl_disagg_pd.sh + nixl_disagg_pd_decode.sh: revert kv_role to
  kv_both (matches upstream tests/v1/kv_connector/nixl_integration/
  run_accuracy_test.sh). Also export VLLM_NIXL_SIDE_CHANNEL_HOST set
  to the box's primary IP so the cross-host side channel is reachable.
- nixl_disagg_pd.sh metrics assertion: relax decode.prompt_tokens from
  ==0 to <=1. With the handshake, decode legitimately processes the
  last token before generation; only the full-prompt re-prefill (==6)
  indicates the connector failed.

Reference: upstream's tests/v1/kv_connector/nixl_integration/
toy_proxy_server.py at vllm-project/vllm.
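
The two-step handshake reduces to two body transformations, sketched here as pure functions. This is not the proxy code from the PR: the function names are hypothetical, the HTTP layer is omitted, and the placeholder keys that this description elides after do_remote_decode are intentionally not reproduced.

```python
import copy


def build_prefill_body(client_body: dict) -> dict:
    """Step 1: ask prefill for one token and flag the request for remote decode."""
    body = copy.deepcopy(client_body)
    body["max_tokens"] = 1
    # Only the key named in the description; the remaining placeholder keys
    # are elided there and are left out of this sketch.
    body["kv_transfer_params"] = {"do_remote_decode": True}
    return body


def build_decode_body(client_body: dict, prefill_response: dict) -> dict:
    """Step 2: forward prefill's populated kv_transfer_params into the decode request."""
    params = prefill_response.get("kv_transfer_params")
    if not params:
        raise ValueError("prefill returned no kv_transfer_params; decode would re-prefill")
    body = copy.deepcopy(client_body)
    body["kv_transfer_params"] = params
    return body


# Hypothetical request and prefill response, for illustration only.
request = {"model": "some-model", "prompt": "hello", "max_tokens": 16}
prefill_response = {"kv_transfer_params": {"remote_engine_id": "abc", "remote_host": "10.0.0.1"}}

print(build_prefill_body(request))
print(build_decode_body(request, prefill_response))
```

Raising on an empty kv_transfer_params in the proxy surfaces a broken handshake immediately instead of letting decode fall back to a local prefill.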

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Wrap the orchestrator _step in try/finally so the prefill/proxy/decode
log dumps fire even when the orchestrator script raises UnexpectedExit.
Today's failure (decode.prompt_tokens=6 — KV transfer over libfabric did
not happen) is opaque without those logs because we can't tell whether:

- prefill returned populated kv_transfer_params or null,
- the proxy successfully extracted remote_engine_id/block_ids/etc,
- decode's NixlConnector saw the do_remote_prefill flag.

The proxy log in particular is on the master container and would show
which side of the handshake silently dropped, but the previous code
only dumped on success — exactly when we don't need the diagnostics.
Diagnostic-only change; no behavior modifications.
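
The try/finally shape can be sketched as follows, assuming the log paths named earlier in this PR (/test/efa/logs/{prefill,proxy,decode}.log). The function name and the `dump_log` callable are hypothetical stand-ins for the real `cat` through docker exec.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

LOG_NAMES = ("prefill", "proxy", "decode")


def run_with_log_dump(orchestrate: Callable[[], T], dump_log: Callable[[str], None]) -> T:
    """Run the orchestrator and dump all three logs whether it passes or raises."""
    try:
        return orchestrate()
    finally:
        for name in LOG_NAMES:
            try:
                dump_log(f"/test/efa/logs/{name}.log")
            except Exception as exc:  # a failed dump must not mask the real error
                print(f"could not dump {name}.log: {exc}")


# Usage with stand-ins: the dump still runs when the orchestrator fails.
dumped = []

def failing_orchestrator():
    raise RuntimeError("decode.prompt_tokens=6")

try:
    run_with_log_dump(failing_orchestrator, dumped.append)
except RuntimeError as err:
    print("orchestrator failed:", err, "| dumped:", dumped)
```

Catching exceptions inside the finally block matters: an exception raised there would otherwise replace the orchestrator's original failure in the traceback.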

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Apply the launch flags that the upstream tests/v1/kv_connector/
nixl_integration/run_accuracy_test.sh uses on both prefill and decode:

- --block-size 128 — must match across P and D for remote_block_ids
  to map correctly. OPT defaults to 16; upstream pins 128 explicitly so
  the lookup at scheduler.py works regardless of model.
- VLLM_KV_CACHE_LAYOUT=HND — required by NixlConnector. Without it the
  attention backend can pick a layout NIXL doesn't support, silently
  falling back to local prefill.
- kv_load_failure_policy=fail in kv_connector_extra_config — turns a
  missing/invalid KV handoff into a hard error instead of a silent
  re-prefill. Symptoms become loud (HTTP 5xx) instead of "passed but
  decode.prompt_tokens=6".
- Drop UCX_NET_DEVICES=all on both — UCX-only env, no-op for the
  LIBFABRIC backend we use.

Also enrich the proxy diagnostic so the next failure (if any) is
trivially debuggable: dump the full kv_transfer_params dict and the
sorted key list returned by prefill, so we can see at a glance whether
all required keys (do_remote_prefill, remote_block_ids, remote_engine_id,
remote_request_id, remote_host, remote_port, tp_size, remote_num_tokens)
are present.

Reference: vllm/tests/v1/kv_connector/nixl_integration/run_accuracy_test.sh
at the v0.21.0 tag.
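
Put together, the launch settings listed above might be assembled like this. It is a sketch, not the contents of nixl_disagg_pd.sh: the function name, model, port, and IP are hypothetical, and the `backends` key used to select LIBFABRIC inside kv_connector_extra_config is an assumption about how the backend is chosen.

```python
import json
from typing import Dict, List, Tuple


def vllm_serve_launch(model: str, port: int, side_channel_host: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (extra env vars, argv) for one side of the disaggregated PD pair."""
    kv_transfer_config = {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",  # advisory; per-request kv_transfer_params does the routing
        "kv_connector_extra_config": {
            "backends": ["LIBFABRIC"],  # assumption: key name for backend selection
            "kv_load_failure_policy": "fail",  # hard error instead of silent re-prefill
        },
    }
    env = {
        "VLLM_KV_CACHE_LAYOUT": "HND",
        "VLLM_NIXL_SIDE_CHANNEL_HOST": side_channel_host,
    }
    argv = [
        "vllm", "serve", model,
        "--port", str(port),
        "--block-size", "128",  # must match on prefill and decode
        "--kv-transfer-config", json.dumps(kv_transfer_config),
    ]
    return env, argv


# Hypothetical values, for illustration only.
env, argv = vllm_serve_launch("some-model", 8100, "10.0.0.1")
print(env)
print(argv)
```

Generating both sides from one function is one way to keep the block size and layout identical on prefill and decode, which is the property the commit message says the transfer depends on.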

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Yadan Wei added 2 commits May 21, 2026 22:17
Replace the prompt_tokens_total / cache-hit assertion with a direct
NixlConnector counter check. The previous run proved (via vLLM's own
log line "KV Transfer metrics: Num successful transfers=1, Throughput
1003 MB/s") that NIXL+LIBFABRIC over EFA is fully working — the test
was failing on a wrong proof metric.

vllm:prompt_tokens_total counts prompt tokens regardless of whether
the KV was reused, so it can never distinguish "KV pulled from remote"
from "KV recomputed locally". vllm:nixl_xfer_time_seconds_count is a
histogram counter that increments per successful NIXL transfer; > 0
proves blocks crossed the wire from prefill to decode. Pair with
vllm:nixl_num_failed_transfers == 0 to fail loudly on transport
errors. Drop the prefill.prompt_tokens assertion (no longer needed —
the NIXL counter on decode is the authoritative proof).

Signed-off-by: Yadan Wei <yadanwei@amazon.com>
Two related changes now that AWS DLAMI shipped a working AL2023 base
image (matching nvidia-fabricmanager + driver pair as of 2026-05-21):

1. .github/scripts/efa/ec2_helpers.py — revert the
   ami-0d2923a2dd541bdeb us-west-2 pin and go back to
   aws_session.get_latest_ami() everywhere. The pin was a workaround
   for a 12-day window (May 8-19) when DLAMI bumped to driver 580.150
   without a matching fabricmanager package, which broke cuInit on
   p4d (NVSwitch) and surfaced as misleading "aws-ofi-nccl is not
   working" failures. AWS fixed it in the 2026-05-21 build, verified
   by the 23 GB/s NCCL allreduce + working NIXL transfer in the
   passing CI run.

2. .github/workflows/pr-pytorch-ec2-cuda.yml — add efa-test-change to
   check-changes (paths: test/efa/**, .github/scripts/efa/**,
   .github/workflows/reusable-efa-tests.yml). Mirror the sanity-test
   fallback so efa-test runs against the prod PyTorch image when
   build-images is skipped (PRs that touch only test/efa/** without
   the docker context). This means changes to the EFA test fixture
   itself get verified on PyTorch too — not just vLLM.

Signed-off-by: Yadan Wei <yadanwei@amazon.com>