fix: Add cleanup steps to prevent kuttl namespace deletion timeouts #789

Merged
adwk67 merged 1 commit into main from fix/test-namespace-deletion-timeouts
May 8, 2026

adwk67 commented May 8, 2026

Summary

  • Add a final cleanup test step to all 13 kuttl test suites that use (or may use)
    the KubernetesExecutor, to prevent namespace deletion timeouts.
  • The cleanup deletes the AirflowCluster CR (triggering orderly StatefulSet scale-down),
    waits up to 120s for pods to terminate, then force-deletes any stragglers.
  • Conditional (.yaml.j2) for tests that parameterise the executor type; unconditional
    (.yaml) for tests that always use the KubernetesExecutor.
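
The three phases described above can be sketched roughly as follows. This is an illustration, not the exact step added in this PR: the function name `cleanup_airflow_namespace`, passing the namespace as an argument, and selecting resources with `--all` are assumptions.

```shell
# Hypothetical helper illustrating the three cleanup phases.
# Assumption: the namespace is passed in, and every AirflowCluster and pod in it may be removed.
cleanup_airflow_namespace() {
  ns="$1"

  # 1. Delete the AirflowCluster CR so the StatefulSets scale down in an orderly way.
  kubectl delete airflowcluster --all -n "$ns"

  # 2. Give pods up to 120s to terminate; a timeout (or no pods at all) is not fatal here.
  kubectl wait --for=delete pod --all -n "$ns" --timeout=120s || true

  # 3. Force-delete whatever is still around so namespace deletion is not blocked.
  kubectl delete pod --all -n "$ns" --force --grace-period=0
}
```

Calling `cleanup_airflow_namespace my-test-namespace` would run the three commands in order. The `|| true` on the wait is a deliberate choice in this sketch, so that a straggler leads to the force-delete rather than to a failed step.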

Root cause

The Vector sidecar container in KubernetesExecutor DAG task pods runs Vector as a
background process (vector ... &), with bash as PID 1. When SIGTERM arrives during
namespace deletion:

  1. Bash (PID 1) receives SIGTERM but has no trap, so it does not exit and keeps waiting for its children.
  2. Vector (not PID 1) ignores SIGTERM entirely.
  3. The _STACKABLE_POST_HOOK in the base container (sleep 10; touch shutdown) fails
    because sleep is itself killed by SIGTERM, so the shutdown file is never created.
  4. The pod sits in Terminating for the full terminationGracePeriodSeconds (300s).
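
One part of this chain can be checked locally: a SIGTERM sent to a wrapper shell is not passed on to a child started with `&`. The sketch below is a simplified stand-in, not the sidecar's actual command. It uses `sh` and `sleep 30` in place of bash and Vector, and the function name `signal_wrapper_only` is hypothetical. Outside a container the wrapper is not PID 1, so it dies on SIGTERM instead of lingering; only the lack of forwarding is demonstrated.

```shell
# Start a wrapper shell that backgrounds a child and waits, then SIGTERM only the wrapper.
# Prints "child survived" if the background child is still running afterwards.
signal_wrapper_only() {
  pidfile=$(mktemp)

  # The wrapper records the child's PID, then waits on it.
  sh -c 'sleep 30 & echo "$!" > "$1"; wait' _ "$pidfile" >/dev/null 2>&1 &
  wrapper=$!

  # Wait until the wrapper has written the child's PID.
  while [ ! -s "$pidfile" ]; do sleep 0.1; done
  child=$(cat "$pidfile")

  # Signal the wrapper only; nothing forwards the signal to the child.
  kill -TERM "$wrapper"
  wait "$wrapper" 2>/dev/null || true

  if kill -0 "$child" 2>/dev/null; then
    echo "child survived"
    kill -KILL "$child"   # clean up the orphaned sleep
  else
    echo "child exited"
  fi
  rm -f "$pidfile"
}
```

On a typical Linux system, running `signal_wrapper_only` should report that the child survived, which is why a trap in the wrapper or an `exec` of the child is needed for the signal to reach it.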

This was not visible before the kuttl v0.11.1 → v0.20.0 bump (2026-04-22), because
kuttl v0.11.1 fired namespace deletion and moved on without waiting.

Proper fix (operator-rs)

This PR is a workaround. The proper fix belongs in operator-rs
(crates/stackable-operator/src/product_logging/framework.rs, around line 1444).
There is already a commented-out alternative in the code (lines 1440–1443) that
uses exec to make Vector PID 1 so it receives and handles SIGTERM directly:

// bash -c 'sleep 1 && if [ ! -f "...shutdown" ]; then mkdir -p ... && inotifywait ...; fi && kill 1' &
// exec vector --config ...

This approach should be completed and enabled. Once that fix lands in operator-rs,
these cleanup steps can be removed.
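
Why `exec` changes the picture can be seen in isolation: an exec'd command replaces the shell and keeps its PID, so in a container where the wrapper is PID 1 the exec'd process becomes PID 1. The sketch below is illustrative only. An inner `sh` stands in for Vector, and the function names are hypothetical.

```shell
# Each function prints two PIDs: the wrapper shell's, then the command's.

# Plain invocation: the command runs as a child with a different PID.
# The trailing `true` keeps the shell from exec-ing its last command as an optimisation.
pids_without_exec() {
  sh -c 'echo "$$"; sh -c "echo \$\$"; true'
}

# With exec: the command replaces the wrapper and keeps its PID.
pids_with_exec() {
  sh -c 'echo "$$"; exec sh -c "echo \$\$"'
}
```

`pids_with_exec` prints the same PID twice, while `pids_without_exec` prints two different ones.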

Test plan

  • Ran all 11 KubernetesExecutor test variants locally with cleanup steps
  • For the critical test (KubernetesExecutor + Vector logging), the time the namespace spent in Terminating dropped from ~7 min to ~54s
  • CeleryExecutor variants unaffected (Jinja2 conditional renders empty file)
  • No test failures introduced
--- PASS: kuttl (3524.76s)
    --- PASS: kuttl/harness (0.00s)
        --- PASS: kuttl/harness/mount-dags-configmap_airflow-latest-3.1.6_openshift-false_executor-kubernetes (318.34s)
        --- PASS: kuttl/harness/ldap_airflow-latest-3.1.6_ldap-authentication-server-verification-tls_openshift-false_executor-kubernetes (223.52s)
        --- PASS: kuttl/harness/remote-logging_airflow-latest-3.1.6_openshift-false_executor-kubernetes (339.06s)
        --- PASS: kuttl/harness/external-access_airflow-3.1.6_openshift-false_executor-kubernetes (175.36s)
        --- PASS: kuttl/harness/triggerer_airflow-latest-3.1.6_openshift-false_executor-kubernetes (251.92s)
        --- PASS: kuttl/harness/smoke_airflow-3.1.6_openshift-false_executor-kubernetes (211.90s)
        --- PASS: kuttl/harness/logging_airflow-3.1.6_openshift-false_executor-kubernetes (615.99s)
        --- PASS: kuttl/harness/ldap_airflow-latest-3.1.6_ldap-authentication-no-tls_openshift-false_executor-kubernetes (212.74s)
        --- PASS: kuttl/harness/ldap_airflow-latest-3.1.6_ldap-authentication-insecure-tls_openshift-false_executor-kubernetes (338.48s)
        --- PASS: kuttl/harness/mount-dags-gitsync_airflow-latest-3.1.6_openshift-false_executor-kubernetes_access-ssh (420.55s)
        --- PASS: kuttl/harness/mount-dags-gitsync_airflow-latest-3.1.6_openshift-false_executor-kubernetes_access-https (416.58s)
PASS

--- PASS: kuttl (433.99s)
    --- PASS: kuttl/harness (0.00s)
        --- PASS: kuttl/harness/versioning_airflow-latest-3.1.6_openshift-false (433.64s)
PASS

🤖 Generated with Claude Code

KubernetesExecutor DAG task pods with a Vector sidecar do not shut down
gracefully on SIGTERM — Vector runs as a background process (not PID 1)
and ignores the signal, causing pods to wait out the full 300s
terminationGracePeriodSeconds before being force-killed. Since kuttl
v0.15.0 waits for namespace deletion to complete, this blocks the test
run past kuttl's timeout.

Add a cleanup step to all KubernetesExecutor tests that deletes the
AirflowCluster CR and force-deletes any remaining pods before kuttl
tears down the namespace.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
adwk67 self-assigned this May 8, 2026

NickLarsenNZ left a comment

LGTM

The --force would normally mask the issue, but it's fine until the real fix is in.

adwk67 added this pull request to the merge queue May 8, 2026
Merged via the queue into main with commit 4ff4771 May 8, 2026
10 checks passed
adwk67 deleted the fix/test-namespace-deletion-timeouts branch May 8, 2026 14:55