fix(topology): prevent source pump deadlock during sink config reload#25330
Open
joshcoughlan wants to merge 5 commits intovectordotdev:masterfrom
Open
fix(topology): prevent source pump deadlock during sink config reload#25330joshcoughlan wants to merge 5 commits intovectordotdev:masterfrom
joshcoughlan wants to merge 5 commits intovectordotdev:masterfrom
Conversation
During config reload, when a sink in wait_for_sinks is being changed, remove_inputs previously sent Pause to the upstream fanout. This caused the source pump to block in wait_for_replacements (waiting for a Replace message that only arrives after connect_diff, which runs after shutdown_diff completes). Since shutdown_diff waits for the old sink to finish, this created a circular dependency that stalled all sources. The fix uses Remove instead of Pause for sinks in wait_for_sinks during reload, and correspondingly uses Add instead of Replace when reconnecting them in connect_diff. This allows the source pump to continue sending events to other sinks while the old sink drains, breaking the circular dependency. This is buffer-type agnostic — it fixes the deadlock for memory and disk buffered sinks alike, and only affects the reload path. Refs: vectordotdev#24125 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Multi-sink scenario mirroring the production failure: demo_logs -> route -> two http sinks where one downstream returns 429 after a delay, backing up events at the sink. After a soak period, an encoding.except_fields edit + SIGHUP triggers the deadlock. The orchestrator probes tap output and prometheus metrics at four checkpoints (initial, post-degradation, post-baseline-reload, post-config-change-reload) and prints a colorized PASS/FAIL summary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add a minimal Dockerfile so the repro can build a Vector image from the current source tree on host arch (no cross-compile, no Rosetta) when no prebuilt image is supplied. - Split the merged vector.yaml into base-config.yaml (bind-mounted from the repo) and a rendered sinks.yaml in $WORKDIR (also bind-mounted). - Pass --config explicitly per file at docker run + validate time so the test works against any image, including timberio/vector:* images that ship a default config in /etc/vector. - After rewriting sinks.yaml, gate the SIGHUP on "vector validate" succeeding inside the container (1s sleep + 5 attempts with exponential backoff). Absorbs Docker for Mac bind-mount propagation lag that was racing the signal and causing Vector to read a truncated file. - Misc template/comment cleanups so the renderer doesn't substitute its own placeholder docs into the rendered YAML. Verified: post-edit reload reproduces the bug on timberio/vector:0.50.0-debian and completes successfully on the patched local build. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reload's shutdown_diff awaits each old sink task in wait_for_sinks
(buffer-reuse and resource-conflict cases) so that ownership of the
buffer / port can be reclaimed before the new sink starts. The await
had no upper bound: if the old sink's downstream was permanently
failing — e.g. an http sink wedged on retriable HTTP 429 — the task
never finished, shutdown_diff blocked forever, connect_diff never
ran, the new sinks never got wired up, and backpressure from the
zero-consumer fanout halted the source. Recovery required restarting
the process.
Wrap both wait_for_sinks awaits in a 60s tokio::time::timeout
(RELOAD_DRAIN_TIMEOUT). On expiry:
- Emit a structured warn with component_id, timeout_secs, and
(for changed sinks) reuse_buffers, so the event is observable
and operators can see which sink stranded events.
- Detach the old task; it continues retrying in the background
and exits whenever its in-flight work finally fails.
- For sinks that were buffer-reuse candidates, drop the buffer_tx
entry so the builder's existing fallback constructs a fresh
buffer instead of waiting forever to reclaim the old one.
The timeout strictly improves on prior behaviour: an unbounded
wedge currently is recovered by SIGTERM/SIGKILL, which loses every
in-flight event across every pipeline. The timeout scopes loss to
the one stuck sink and lets the rest of the topology resume.
Verified against testing/github-24125: the post-edit reload now
completes ~60s after the SIGHUP (warn fires, new sinks wire up,
source resumes) where it previously hung indefinitely.
Refs: vectordotdev#24125
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a deadlock during config reload that stalls all sources indefinitely when a changed sink shares a resource (e.g. a bound port) with the new sink, or when the sink is in `reuse_buffers` on the SIGHUP path.
Root cause: During reload, `remove_inputs` sent `Pause` to the upstream fanout for sinks in `wait_for_sinks`. The source pump then blocked in `wait_for_replacements` waiting for a `Replace` message that only arrives after `connect_diff`, which runs after `shutdown_diff`, which itself was waiting for the old sink to finish. Circular dependency, every source stalls.
Fix: For sinks in `wait_for_sinks`, use a full `Remove` (not `Pause`) from the upstream fanout, with a corresponding `Add` (not `Replace`) when reconnecting in `connect_diff`. The source pump can keep sending events to other sinks while the old sink drains. Also bounds `wait_for_sinks` with a drain timeout so a misbehaving sink can't hold reload forever.
Buffer-type agnostic — fixes the deadlock for memory and disk buffered sinks alike, only affects the reload path.
Vector configuration
Reproduction harness included under `testing/github-24125/` — uses a `demo_logs` source fanning out to two `http` sinks listening on the same port range, then SIGHUPs Vector to trigger the reload path. Without the fix, all sources stall; with the fix, reload completes within the timeout.
How did you test this PR?
Change Type
Is this a breaking change?
Does this PR include user facing changes?
References