Fix workflow polling hang when a step stays new behind a paused branch #1648

Open
mvdbeek wants to merge 1 commit into galaxyproject:master from mvdbeek:fix-invocation-polling-terminal-latch

@mvdbeek mvdbeek commented May 26, 2026

When an upstream job errors, Galaxy pauses the downstream step and can leave a further step in the new state. On a poll where the invocation is ready with a paused and errored job, invocation_scheduling_terminal returns True, so _poll_main_workflow stops refreshing the invocation and new_steps stays frozen and non-empty. all_subworkflows_complete() then blocks on new_steps forever, so the poll loop never returns (observed in IWC CI, where a run lasted ~5.5h until it was cancelled).
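
The failure mode can be illustrated with a minimal sketch. This is not the project's actual code: `InvocationSnapshot`, its fields, and both `subworkflows_complete_*` functions are hypothetical stand-ins, assuming the completion check only needs the last fetched list of new steps and a flag saying whether scheduling is terminal.

```python
from dataclasses import dataclass, field


@dataclass
class InvocationSnapshot:
    """Hypothetical stand-in for the last invocation state the poller fetched."""

    scheduling_terminal: bool
    new_steps: list = field(default_factory=list)
    # Assumed field: subworkflow invocations that have not finished yet.
    pending_subworkflows: list = field(default_factory=list)


def subworkflows_complete_old(snapshot):
    # Behaviour described above: any step still "new" blocks completion,
    # even when the snapshot will never be refreshed again.
    return not snapshot.new_steps and not snapshot.pending_subworkflows


def subworkflows_complete_fixed(snapshot):
    if snapshot.pending_subworkflows:
        return False
    # Once scheduling is terminal the snapshot is frozen, so new_steps
    # cannot shrink; skip the check instead of waiting on it forever.
    if snapshot.scheduling_terminal:
        return True
    return not snapshot.new_steps


# A step left "new" behind a paused branch, with scheduling already terminal.
frozen = InvocationSnapshot(scheduling_terminal=True, new_steps=["step_3"])
print(subworkflows_complete_old(frozen))    # the loop would keep waiting
print(subworkflows_complete_fixed(frozen))  # the loop can return
```

With the frozen snapshot, the old check can never become true because nothing refreshes `new_steps`, while the fixed check treats terminal scheduling as the point where waiting on new steps stops being meaningful.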

Skip the new_steps check once scheduling is terminal, since new_steps can no longer change at that point. Add a simulation scenario and a regression test, with a tick cap in the test polling tracker so that a future non-terminating loop fails instead of hanging.
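
The tick cap idea can be sketched as follows. The names `CappedPollingTracker`, `TickCapExceeded`, and `poll_until` are hypothetical and not taken from the test suite; the sketch assumes the tracker is called once per polling iteration in place of a real sleep.

```python
class TickCapExceeded(AssertionError):
    """Raised when a simulated poll loop runs longer than the test allows."""


class CappedPollingTracker:
    """Hypothetical test tracker that counts ticks instead of sleeping."""

    def __init__(self, max_ticks=100):
        self.max_ticks = max_ticks
        self.ticks = 0

    def tick(self):
        # Called once per loop iteration; fails loudly past the cap.
        self.ticks += 1
        if self.ticks > self.max_ticks:
            raise TickCapExceeded(
                f"poll loop still running after {self.max_ticks} ticks"
            )


def poll_until(done, tracker):
    # Simplified poll loop: keep ticking until the predicate reports done.
    while not done():
        tracker.tick()
    return tracker.ticks


# A loop that would otherwise hang now fails the test quickly.
tracker = CappedPollingTracker(max_ticks=5)
try:
    poll_until(lambda: False, tracker)
except TickCapExceeded as exc:
    print(f"failed fast: {exc}")
```

Raising an `AssertionError` subclass means a regression shows up as an ordinary test failure rather than a CI job that runs until it is cancelled.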
