Skip to content

fix: resolve readiness WaitReady blocking for 5 minutes on startup#408

Open
AlinsRan wants to merge 1 commit intomasterfrom
fix/readiness-wait-ready-block-v2
Open

fix: resolve readiness WaitReady blocking for 5 minutes on startup#408
AlinsRan wants to merge 1 commit intomasterfrom
fix/readiness-wait-ready-block-v2

Conversation

@AlinsRan
Copy link
Copy Markdown
Contributor

@AlinsRan AlinsRan commented May 8, 2026

Root Cause

readiness.Start() is asynchronous. A controller's reconcile loop can call Done() before Start() finishes registering resources into r.state. When this happens:

  1. Done() checks r.state[gvk] — entry doesn't exist yet (race condition)
  2. Done() returns early without doing anything
  3. Start() later registers the resource into r.state
  4. Nobody ever calls Done() again for that resource
  5. WaitReady blocks until the full 5-minute timeout

Changes

  • Core fix: Done() now blocks on <-r.started before touching r.state, ensuring Start() has finished registering all resources before any Done() call can proceed
  • Semantic fix: WaitReady() returns false on timeout instead of true — timed out ≠ ready
  • Cleanup: Remove unnecessary mutex in registerState() — safe because Done() is now guaranteed to run after Start() closes r.started, so registerState() is always called before any concurrent Done() access
  • Observability: Add log statements to trace readiness lifecycle

Testing

  • go build ./... passes
  • go vet ./... passes

Summary by CodeRabbit

  • Bug Fixes

    • Fixed readiness timeout logic to return correct status
    • Improved readiness state initialization to ensure proper synchronization
  • Chores

    • Enhanced readiness state logging with more detailed diagnostic information

Backport fixes from upstream apache/apisix-ingress-controller#2663.

Root cause: readiness.Start() is asynchronous. If a controller's
reconcile loop calls Done() before Start() finishes registering
resources, Done() finds no state entry and returns early. The resource
is never removed from state, causing WaitReady to block until the
5-minute timeout.

Changes:
- Done() now waits for Start() to complete (<-r.started) before
  operating on state, eliminating the race condition
- WaitReady() returns false on timeout instead of true (semantic fix:
  timed-out != ready)
- Remove unnecessary mutex in registerState() since Done() is now
  guaranteed to run after Start() closes r.started
- Add log statements for easier debugging of readiness lifecycle
Copilot AI review requested due to automatic review settings May 8, 2026 03:31
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 8, 2026

Review Change Stack

Caution

Review failed

Pull request was closed or merged during review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9463509e-f0b3-43b2-8247-79ccb5efa967

📥 Commits

Reviewing files that changed from the base of the PR and between b02e842 and e269dd2.

📒 Files selected for processing (1)
  • internal/manager/readiness/manager.go

📝 Walkthrough

Walkthrough

The readiness manager now synchronizes initialization and resource cleanup: Done() waits for Start() to complete before modifying state. WaitReady() correctly returns false on timeout instead of true. Logging is enhanced with explicit registration counts and resource details.

Changes

Readiness Manager Lifecycle Fixes

Layer / File(s) Summary
Initialization Synchronization
internal/manager/readiness/manager.go
Done() blocks on <-r.started before acquiring the lock and removing resource state, ensuring the manager is initialized before cleanup proceeds.
Timeout Handling Correction
internal/manager/readiness/manager.go
WaitReady() timeout branch now returns false instead of true when the timeout expires.
Logging Enhancements
internal/manager/readiness/manager.go
Start() logging replaces expected field with registered_count and adds V(1)-level output of full resources list; "readiness manager started" is logged on completion.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
E2e Test Quality Review ⚠️ Warning No E2E tests included. The PR fixes a critical concurrency bug in readiness manager startup but provides zero test coverage to prevent regression. Add E2E tests: (1) startup initialization, (2) Done() called before/after Start(), (3) WaitReady() timeout returns false, (4) state cleanup when all resources done. Simulate controller reconciliation calling Done().
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly addresses the main bug fix: resolving WaitReady blocking for 5 minutes on startup, which is the primary issue fixed in this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Check ✅ Passed No security vulnerabilities found. Readiness manager logs only non-sensitive metadata (namespace, name, counts). No credential handling, database ops, authorization logic, or crypto operations.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/readiness-wait-ready-block-v2

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR backports an upstream fix to the readiness manager to prevent startup full-sync from being blocked for the full 5-minute timeout due to a race between asynchronous Start() initialization and early Done() calls from controller reconciles.

Changes:

  • Gate Done() execution on readiness initialization by waiting on r.started before accessing r.state.
  • Fix WaitReady() timeout semantics to return false (timeout ≠ ready).
  • Adjust readiness lifecycle logging and remove the mutex from registerState().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

r.isReady.Store(true)
close(r.done)
}
r.log.Info("readiness manager started")
Comment on lines 155 to +159
if r.IsReady() {
return
}
<-r.started

r.mu.Lock()
defer r.mu.Unlock()
gvk := types.GvkOf(obj)
r.log.Info("marking resource as done", "gvk", gvk, "name", nn, "state_count", len(r.state[gvk]))
@AlinsRan AlinsRan reopened this May 8, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

conformance test report - apisix-standalone mode

apiVersion: gateway.networking.k8s.io/v1
date: "2026-05-08T03:51:52Z"
gatewayAPIChannel: experimental
gatewayAPIVersion: v1.3.0
implementation:
  contact: null
  organization: APISIX
  project: apisix-ingress-controller
  url: https://github.com/apache/apisix-ingress-controller.git
  version: v2.0.0
kind: ConformanceReport
mode: default
profiles:
- core:
    result: success
    statistics:
      Failed: 0
      Passed: 12
      Skipped: 0
  name: GATEWAY-GRPC
  summary: Core tests succeeded.
- core:
    result: partial
    skippedTests:
    - HTTPRouteHTTPSListener
    statistics:
      Failed: 0
      Passed: 32
      Skipped: 1
  extended:
    result: partial
    skippedTests:
    - HTTPRouteRedirectPortAndScheme
    statistics:
      Failed: 0
      Passed: 11
      Skipped: 1
    supportedFeatures:
    - GatewayAddressEmpty
    - GatewayPort8080
    - HTTPRouteBackendProtocolWebSocket
    - HTTPRouteDestinationPortMatching
    - HTTPRouteHostRewrite
    - HTTPRouteMethodMatching
    - HTTPRoutePathRewrite
    - HTTPRoutePortRedirect
    - HTTPRouteQueryParamMatching
    - HTTPRouteRequestMirror
    - HTTPRouteResponseHeaderModification
    - HTTPRouteSchemeRedirect
    unsupportedFeatures:
    - GatewayHTTPListenerIsolation
    - GatewayInfrastructurePropagation
    - GatewayStaticAddresses
    - HTTPRouteBackendProtocolH2C
    - HTTPRouteBackendRequestHeaderModification
    - HTTPRouteBackendTimeout
    - HTTPRouteParentRefPort
    - HTTPRoutePathRedirect
    - HTTPRouteRequestMultipleMirrors
    - HTTPRouteRequestPercentageMirror
    - HTTPRouteRequestTimeout
  name: GATEWAY-HTTP
  summary: Core tests partially succeeded with 1 test skips. Extended tests partially
    succeeded with 1 test skips.
- core:
    result: partial
    skippedTests:
    - TLSRouteSimpleSameNamespace
    statistics:
      Failed: 0
      Passed: 10
      Skipped: 1
  name: GATEWAY-TLS
  summary: Core tests partially succeeded with 1 test skips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

conformance test report - apisix mode

apiVersion: gateway.networking.k8s.io/v1
date: "2026-05-08T03:52:31Z"
gatewayAPIChannel: experimental
gatewayAPIVersion: v1.3.0
implementation:
  contact: null
  organization: APISIX
  project: apisix-ingress-controller
  url: https://github.com/apache/apisix-ingress-controller.git
  version: v2.0.0
kind: ConformanceReport
mode: default
profiles:
- core:
    result: success
    statistics:
      Failed: 0
      Passed: 12
      Skipped: 0
  name: GATEWAY-GRPC
  summary: Core tests succeeded.
- core:
    failedTests:
    - HTTPRouteInvalidBackendRefUnknownKind
    result: failure
    skippedTests:
    - HTTPRouteHTTPSListener
    statistics:
      Failed: 1
      Passed: 31
      Skipped: 1
  extended:
    result: partial
    skippedTests:
    - HTTPRouteRedirectPortAndScheme
    statistics:
      Failed: 0
      Passed: 11
      Skipped: 1
    supportedFeatures:
    - GatewayAddressEmpty
    - GatewayPort8080
    - HTTPRouteBackendProtocolWebSocket
    - HTTPRouteDestinationPortMatching
    - HTTPRouteHostRewrite
    - HTTPRouteMethodMatching
    - HTTPRoutePathRewrite
    - HTTPRoutePortRedirect
    - HTTPRouteQueryParamMatching
    - HTTPRouteRequestMirror
    - HTTPRouteResponseHeaderModification
    - HTTPRouteSchemeRedirect
    unsupportedFeatures:
    - GatewayHTTPListenerIsolation
    - GatewayInfrastructurePropagation
    - GatewayStaticAddresses
    - HTTPRouteBackendProtocolH2C
    - HTTPRouteBackendRequestHeaderModification
    - HTTPRouteBackendTimeout
    - HTTPRouteParentRefPort
    - HTTPRoutePathRedirect
    - HTTPRouteRequestMultipleMirrors
    - HTTPRouteRequestPercentageMirror
    - HTTPRouteRequestTimeout
  name: GATEWAY-HTTP
  summary: Core tests failed with 1 test failures. Extended tests partially succeeded
    with 1 test skips.
- core:
    result: partial
    skippedTests:
    - TLSRouteSimpleSameNamespace
    statistics:
      Failed: 0
      Passed: 10
      Skipped: 1
  name: GATEWAY-TLS
  summary: Core tests partially succeeded with 1 test skips.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

conformance test report

apiVersion: gateway.networking.k8s.io/v1
date: "2026-05-08T04:13:28Z"
gatewayAPIChannel: experimental
gatewayAPIVersion: v1.3.0
implementation:
  contact: null
  organization: APISIX
  project: apisix-ingress-controller
  url: https://github.com/apache/apisix-ingress-controller.git
  version: v2.0.0
kind: ConformanceReport
mode: default
profiles:
- core:
    failedTests:
    - GRPCExactMethodMatching
    - GRPCRouteHeaderMatching
    - GRPCRouteListenerHostnameMatching
    - GatewayModifyListeners
    result: failure
    statistics:
      Failed: 4
      Passed: 8
      Skipped: 0
  name: GATEWAY-GRPC
  summary: Core tests failed with 4 test failures.
- core:
    failedTests:
    - GatewayModifyListeners
    result: failure
    skippedTests:
    - HTTPRouteHTTPSListener
    statistics:
      Failed: 1
      Passed: 31
      Skipped: 1
  extended:
    failedTests:
    - HTTPRouteBackendProtocolWebSocket
    result: failure
    skippedTests:
    - HTTPRouteRedirectPortAndScheme
    statistics:
      Failed: 1
      Passed: 10
      Skipped: 1
    supportedFeatures:
    - GatewayAddressEmpty
    - GatewayPort8080
    - HTTPRouteBackendProtocolWebSocket
    - HTTPRouteDestinationPortMatching
    - HTTPRouteHostRewrite
    - HTTPRouteMethodMatching
    - HTTPRoutePathRewrite
    - HTTPRoutePortRedirect
    - HTTPRouteQueryParamMatching
    - HTTPRouteRequestMirror
    - HTTPRouteResponseHeaderModification
    - HTTPRouteSchemeRedirect
    unsupportedFeatures:
    - GatewayHTTPListenerIsolation
    - GatewayInfrastructurePropagation
    - GatewayStaticAddresses
    - HTTPRouteBackendProtocolH2C
    - HTTPRouteBackendRequestHeaderModification
    - HTTPRouteBackendTimeout
    - HTTPRouteParentRefPort
    - HTTPRoutePathRedirect
    - HTTPRouteRequestMultipleMirrors
    - HTTPRouteRequestPercentageMirror
    - HTTPRouteRequestTimeout
  name: GATEWAY-HTTP
  summary: Core tests failed with 1 test failures. Extended tests failed with 1 test
    failures.
- core:
    failedTests:
    - GatewayModifyListeners
    - TLSRouteSimpleSameNamespace
    result: failure
    statistics:
      Failed: 2
      Passed: 9
      Skipped: 0
  name: GATEWAY-TLS
  summary: Core tests failed with 2 test failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants