
fix: Add orderly shutdown steps to all kuttl tests #963

Closed
adwk67 wants to merge 1 commit into main from fix/test-namespace-deletion-timeouts



Member

adwk67 commented May 7, 2026

Summary

  • Adds a 90-shutdown-kafka.yaml step to all 11 kuttl test suites that gracefully scales down Kafka before kuttl deletes the namespace
  • ZK-mode tests (7 suites): patches the KafkaCluster CR to set broker replicas to 0, with a force-delete fallback if the pods don't terminate within 120s
  • KRaft-mode tests (4 suites): scales brokers to 0 via the CR, deletes the KafkaCluster CR to stop operator reconciliation, then force-deletes the controller pods (controllers can't be scaled to 0 via the CR because the operator fails with a "no Kraft controllers found to build ConfigMap" error)
  • Sets gracefulShutdownTimeout: 60s in install templates to bound the worst-case shutdown wait (default is 30 minutes)
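
As a rough illustration of the two shutdown variants summarised above, a 90-shutdown-kafka.yaml step might look something like the sketch below. This is not the file from the PR: the cluster name `test-kafka`, the role group name `default`, the pod label selectors, and the per-command timeouts are all assumptions made for the example, and each suite would carry only one of the two variants.

```yaml
---
# Variant A (ZK-mode suites). Sketch only, names and labels are assumed.
apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  - timeout: 180
    script: |
      # Scale the (assumed) "default" broker role group to 0 via the KafkaCluster CR.
      kubectl -n "$NAMESPACE" patch kafkacluster test-kafka --type merge \
        -p '{"spec":{"brokers":{"roleGroups":{"default":{"replicas":0}}}}}'
      # Wait up to 120s for the broker pods to go away; force-delete them otherwise.
      SELECTOR="app.kubernetes.io/name=kafka,app.kubernetes.io/component=broker"
      if ! kubectl -n "$NAMESPACE" wait --for=delete pod -l "$SELECTOR" --timeout=120s; then
        kubectl -n "$NAMESPACE" delete pod -l "$SELECTOR" --force --grace-period=0
      fi
---
# Variant B (KRaft-mode suites). Sketch only, names and labels are assumed.
apiVersion: kuttl.dev/v1beta1
kind: TestStep
commands:
  - timeout: 180
    script: |
      # 1. Scale brokers to 0 via the CR (controllers are left alone).
      kubectl -n "$NAMESPACE" patch kafkacluster test-kafka --type merge \
        -p '{"spec":{"brokers":{"roleGroups":{"default":{"replicas":0}}}}}'
      kubectl -n "$NAMESPACE" wait --for=delete pod \
        -l app.kubernetes.io/name=kafka,app.kubernetes.io/component=broker \
        --timeout=120s || true
      # 2. Delete the CR so the operator stops reconciling the cluster.
      kubectl -n "$NAMESPACE" delete kafkacluster test-kafka --wait=false
      # 3. Force-delete the controller pods (assumed component label "controller").
      kubectl -n "$NAMESPACE" delete pod \
        -l app.kubernetes.io/name=kafka,app.kubernetes.io/component=controller \
        --force --grace-period=0
```

The `$NAMESPACE` variable is the one kuttl exports to script commands; everything else in the selectors and patch paths would need to be checked against the actual test templates.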

Problem

During namespace deletion, ZooKeeper/controllers and Kafka brokers are terminated simultaneously. In ZK-mode, Kafka's controlled-shutdown retries ZK connections indefinitely, keeping the process alive for the full terminationGracePeriodSeconds (up to 30 minutes) and blocking namespace deletion well past kuttl's 300s timeout.
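
The grace period described above is what the gracefulShutdownTimeout setting from the summary is meant to bound. A minimal sketch of how that might appear in an install template follows; the placement under the role-level `config` block, the role group name `default`, and the replica count are assumptions for illustration, not copied from the PR.

```yaml
# Fragment of a KafkaCluster install template (sketch, field placement assumed).
spec:
  brokers:
    config:
      # Bound the worst-case shutdown wait to 60s instead of the 30-minute default.
      gracefulShutdownTimeout: 60s
    roleGroups:
      default:
        replicas: 3
```

With a 60s bound, a broker stuck retrying ZK connections would be killed well inside kuttl's 300s namespace-deletion timeout instead of holding the namespace for up to 30 minutes.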

Test plan

  • Full nightly suite run (26/26 PASS) on Replicated k3s cluster (K8s 1.35.3)

Related: #956

Definition of Done Checklist

  • Not all of these items are applicable to all PRs; the author should update this template to leave in only the boxes that are relevant
  • Please make sure all these things are done and tick the boxes

Author

  • Changes are OpenShift compatible
  • CRD changes approved
  • CRD documentation for all fields, following the style guide.
  • Helm chart can be installed and deployed operator works
  • Integration tests passed (for non trivial changes)
  • Changes need to be "offline" compatible
  • Links to generated (nightly) docs added
  • Release note snippet added

Reviewer

  • Code contains useful comments
  • Code contains useful logging statements
  • (Integration-)Test cases added
  • Documentation added or updated. Follows the style guide.
  • Changelog updated
  • Cargo.toml only contains references to git tags (not specific commits or branches)

Acceptance

  • Feature Tracker has been updated
  • Proper release label has been added
  • Links to generated (nightly) docs added
  • Release note snippet added
  • Add type/deprecation label & add to the deprecation schedule
  • Add type/experimental label & add to the experimental features tracker

…ce deletion timeouts

During namespace deletion, ZooKeeper and Kafka are terminated
simultaneously. Kafka's controlled-shutdown retries ZK connections
indefinitely, keeping the process alive for the full grace period and
blocking namespace deletion past kuttl's 300s timeout.

For ZK-mode tests: scale brokers to 0 via the CRD so the operator
performs an orderly shutdown before ZK is removed.

For KRaft-mode tests: scale brokers to 0, delete the KafkaCluster CR
to stop reconciliation, then force-delete controller pods. Controllers
cannot be scaled via the CRD due to a "no Kraft controllers found to
build ConfigMap" error.

All tests also set gracefulShutdownTimeout: 60s to bound the worst-case
wait. Validated with a full nightly suite run (26/26 PASS).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Member Author

adwk67 commented May 8, 2026

Closing in favour of #956 (and #955) as those PRs are less invasive and thus less prone to masking issues.

adwk67 closed this May 8, 2026