feat: Virtual cluster lifecycle state model #89
SamBarker wants to merge 16 commits into kroxylicious:main
Conversation
tombentley
left a comment
Thanks @SamBarker I think this is useful work, but have doubts that we really need to have "healthy" as a state.
1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully are taken down with it.

2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down.
It's not immediately apparent to me how this would work in practice. Suppose I have a producer pipelining Produce requests. How do we intend to stop that flow in such a way that the producer doesn't end up with an unacknowledged request?
You mention later on about timeout, but:
- the broker doesn't know what the client's timeout is configured to, so it doesn't know how long it needs to wait
- if the client times out request 1 while thinking that 2, 3 and 4 are in flight, then it will send request 5
The exact mechanics of how the proxy tracks in-flight requests per connection are an implementation concern rather than a design one. The design's position is that draining is best-effort: clients are given a time window to complete naturally, with the drain timeout as the hard backstop after which connections are closed regardless. Kafka clients are required to handle connection loss, so forced closure after timeout is protocol-compliant — the goal is to avoid unnecessary disruption, not to guarantee zero errors. I've clarified this in the proposal.
Sorry @SamBarker, but I think this does need to be explained.
In the reloading proposal it's been observed that idempotent (but not transactional) producers get their idempotency from the scope of a connection. So trying to minimise duplicate records is clearly a concern for some users, and I think it's only fair that you at least sketch how you think this will work, so everyone can see what the trade-offs are.
clients are given a time window to complete naturally
This is simply not the case, at least as far as I understand what's proposed.
For a real Kafka broker, the clean shutdown process involves the broker telling the controller that it needs to stop, the controller selects replacement leaders for all the partitions that that broker is currently leading, and only once the broker is not leading any partitions does it shut down. In that way producers avoid having any requests in-flight to the broker that's shutting down.
The proxy lacks that mechanism, which means that we are delivering a weaker idempotency guarantee under disconnections than Kafka does. That's an existing problem. But it will be made worse by building a dynamic restart feature that brings more frequent disconnections.
To me it looks like this is the proposal where we should be figuring out a solution and everyone can assure themselves that it will work well enough.
Sorry, I've got local changes that were meant to go out in sync with my comment.
Also, I thought you were pushing for the opposite solution: let the protocol handle it all and skip the drain.
I'll push the changes shortly and come back to this.
The idempotent producer concern is real and the proposal now acknowledges it explicitly. However, this is an existing proxy limitation that predates this proposal — any proxy restart today drops all connections with the same effect. A reload mechanism building on this lifecycle model actually improves the situation by confining disruption to a single virtual cluster rather than the whole process.
Solving the fundamental problem — replicating broker-style partition leader migration to avoid session disruption entirely — is a substantial piece of work well beyond the scope of a lifecycle model, with a multitude of possible solutions that the proxy cannot currently express. The proposal notes the limitation and directs users to consider how analogous situations (broker crash, network split) are handled without the proxy, since clients must already be prepared for those. Users requiring exactly-once guarantees across reconnections should use transactional producers.
The draining section has been updated to clarify in-flight semantics and correct an earlier misstatement about idempotent producers.
On the specific concern about idempotent producers: a proxy-initiated drain is no different from any other connection loss event from the client's perspective. Idempotent producers retain their Producer ID and sequence numbers in client memory across reconnections — a connection drop does not start a new session. When the client reconnects after a drain, it retries unacknowledged requests with the same (PID, epoch, sequence) tuple, and the broker deduplicates them using its existing producer state. A new PID is only assigned on producer process restart, not on reconnection. The drain timeout is the hard backstop after which connections close, but this is protocol-compliant — Kafka clients are required to handle connection loss, and the idempotent producer's deduplication mechanism is designed to survive exactly this scenario.
References:
- KIP-360 documents how the broker retains producer state and how idempotent producers can safely bump their epoch when broker state is lost — the key point being that the PID is retained client-side across reconnections.
- The Idempotent Producer design doc confirms that deduplication is based on (PID, epoch, sequence) per partition, not per connection: "The broker maintains in memory the sequence numbers it receives for each topic partition from every PID."
- KIP-854 covers broker-side producer state expiry — the state is retained well beyond any reasonable drain timeout.
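To make the deduplication argument above concrete, here is a minimal sketch of broker-side bookkeeping keyed on producer ID and sequence number for a single notional partition. This is a hypothetical illustration, not Kafka's actual implementation; the class and method names are invented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the broker remembers the last sequence number it has
// accepted from each producer ID for a partition, and rejects replays. This
// state lives broker-side, so it survives any client reconnection.
class ProducerDedup {

    // key: producer ID (PID), value: last sequence number accepted
    private final Map<Long, Integer> lastSequence = new HashMap<>();

    /** Returns true if the batch is new, false if it is a duplicate retry. */
    boolean accept(long producerId, int sequence) {
        Integer last = lastSequence.get(producerId);
        if (last != null && sequence <= last) {
            // The client retried after a connection drop; the broker has
            // already appended this batch, so it is deduplicated.
            return false;
        }
        lastSequence.put(producerId, sequence);
        return true;
    }
}
```

Because the map is keyed on the PID rather than the connection, a drain-induced reconnect followed by a retry with the same (PID, sequence) is absorbed without a duplicate append.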
The lifecycle model actually improves the posture for idempotent producers compared to the status quo. Today, any configuration change requires a full process restart, dropping every connection across every virtual cluster. With per-cluster lifecycle and reload, only the affected cluster's connections drain — the blast radius shrinks from "all clusters" to "one cluster" and planned restarts become less frequent.
I think the proposal goes far enough on draining. The lifecycle model defines that draining exists as a state and its high-level semantics (reject new connections, reject new requests, allow in-flight to complete, timeout as backstop). The detailed mechanics — specific error codes for rejected requests, interaction with pipelining, TCP-level behaviour — are implementation concerns that don't change the lifecycle state model and are better resolved in the reload proposal where draining becomes a frequent operation rather than a rare shutdown event.
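The high-level drain semantics described above (reject new connections and new requests, let in-flight requests finish, timeout as the hard backstop) can be sketched per connection roughly as follows. All names here are hypothetical, not the Kroxylicious API; real code would park or poll rather than spin.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of best-effort draining for one connection: once
// draining starts, new requests are rejected; in-flight requests may
// complete until the drain timeout, after which the caller closes the
// connection regardless.
class DrainingConnection {

    private final AtomicInteger inFlight = new AtomicInteger();
    private volatile boolean draining;

    /** Called when a request arrives. Returns false once draining has begun. */
    boolean tryStartRequest() {
        if (draining) {
            return false; // new requests rejected during drain
        }
        inFlight.incrementAndGet();
        return true;
    }

    /** Called when the response has been written back to the client. */
    void completeRequest() {
        inFlight.decrementAndGet();
    }

    /** Returns true if the connection quiesced, false if the backstop fired. */
    boolean drain(long timeoutMillis) {
        draining = true;
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (inFlight.get() > 0 && System.currentTimeMillis() < deadline) {
            Thread.onSpinWait(); // sketch only; real code would not busy-wait
        }
        return inFlight.get() == 0; // either way, the caller closes afterwards
    }
}
```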
Some configuration changes will likely always require draining — for example, changes to the upstream cluster identity or TLS configuration that invalidate existing connections. The optimisation is about identifying changes where draining can be safely skipped, not eliminating it.

### Proxy-level lifecycle
You've defined how proxy startup works wrt the VC state machines. It's not really clear to me whether we need a proposal about the rest, unless you think that that state is also exposed via metrics/mgmt etc.
Yes — proxy-level state would also be exposed via metrics and management endpoints, which is the argument for capturing it as future work. The per-cluster metrics defined in this proposal are one layer; a proxy-level lifecycle would add an aggregate layer above them covering port binding, process startup/shutdown sequencing, and overall proxy health. Defining it now felt premature, but it's a natural next step once the per-cluster model is established.
The future enhancements section flags proxy-level lifecycle as a problem space that exists, not as work we're committed to. Whether proxy-level state needs its own formal model, observable via metrics/management endpoints, or is simply emergent from the per-VC lifecycles plus standard process management is an open question — and one that might not need answering.
Every proxy has some form of process-level lifecycle (HAProxy expresses it through old/new process coexistence during reload, nginx through worker generations, httpd through child worker replacement), but none of them formalise it as a state machine — it's just how startup and shutdown work. Kroxylicious may end up the same way: the proxy starts VCs, manages their lifecycles, and exits. If that's sufficient, no proposal is needed.
I've reworded the section to make it clearer that these are genuine open questions without obvious answers (port binding significance, health aggregation across deployment models) rather than planned future work.
@tombentley similar question here, are we happy?
- Fix startup wording: clusters that fail to start never become available (they cannot be "taken down" if they never came up)
- Expand shutdown description to note Kafka protocol resilience
- Replace "not modelled" with "notional" for clarity
- Disambiguate "operator" → "Kubernetes operator"
- Remove internal Java representation section — implementation detail better suited to PR review, not design proposal

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Adds a paragraph explaining the value of draining (reducing unnecessary errors during planned shutdown) and its ecosystem context — brokers drain before stopping and operators expect the same. Makes explicit that the drain timeout is the hard backstop and that forced closure after timeout remains protocol-compliant. In response to PR kroxylicious#89 review feedback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Virtual clusters are entities (in the Domain-Driven Design sense) whose identifier is their name. The state machine tracks the lifecycle of the entity, not a configuration instance — a reload is a transition the entity passes through, not a replacement of it.

Consequences made explicit:
- stopped means permanently removed from configuration, not just idle
- reload transitions happen on the same entity; recovery transition covers any retry regardless of what configuration is supplied
- renaming a cluster is a destructive operation; documentation should warn users of this

In response to PR kroxylicious#89 review feedback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Replace vague "implementation concern" deferral with concrete commitments:
- Management endpoint must expose per-cluster state and failure reason
- Metrics must capture current state, time in state, and transition count (with suggested names following Prometheus/Micrometer conventions)
- Proposal is intentionally non-exhaustive; implementations may add more

In response to PR kroxylicious#89 review feedback.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The policy applies whenever clusters initialise — on first startup or during reload — so startupPolicy was too narrow a name. Renamed to partialInitialisationPolicy to reflect that it governs the proxy's behaviour when initialisation is partial (some clusters failed, some succeeded). Values changed from fail-fast/best-effort to serve-none/serve-others, which describe the outcome from the operator's perspective rather than the mechanism. In response to PR kroxylicious#89 review feedback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
…oxylicious#89 Circuit-breaking: note under the 'runtime health as lifecycle state' rejected alternative that circuit-breaking is a manifestation of runtime health concerns — the same reasoning that excludes runtime health from the lifecycle model excludes circuit-breaking with it. In-flight: clarify that the proxy takes the same view of in-flight as the Kafka client (sent but not yet acknowledged). Draining rejects both new connections and new requests on existing connections — only existing in-flight requests are completed, ensuring the cluster quiesces. In response to PR kroxylicious#89 review feedback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The proxy cannot replicate broker clean shutdown (partition leader migration before connection close), so connection drops during draining cause idempotent producers to lose session state on reconnect. This is an existing proxy limitation; a reload mechanism building on this lifecycle model can improve the situation by confining disruption to a single virtual cluster rather than a full process restart. The proposal now notes this explicitly and directs readers to consider analogous broker crash/netsplit scenarios, which clients must already handle. Users requiring exactly-once guarantees across reconnections should use transactional producers. In response to PR kroxylicious#89 review feedback. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Introduce proposal 014 defining a lifecycle state machine for virtual clusters: initializing, degraded, healthy, draining, failed, and stopped. This provides a foundation for resilient startup (best-effort mode), graceful shutdown with drain timeouts, runtime health distinction (degraded vs healthy), and configuration reload via the hot reload proposal (012). Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
014 and 015 are already claimed by other proposals. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
…ve tenancy motivation

- Replace healthy/degraded states with single 'accepting' state that makes no health claims — lifecycle tracks what the proxy is doing, not runtime health
- Reframe motivation around failure domains rather than multi-tenancy
- Replace ASCII state diagram with excalidraw PNG
- Add 'Runtime health as lifecycle state' rejected alternative explaining why health is orthogonal to lifecycle

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Health criteria vary by deployment and may become per-destination once request-level routing is in play. Defining a health model prematurely would constrain future design without immediate value. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
'Accepting' implies the proxy is merely open to new connections. 'Serving' better captures what the proxy is actually doing in this state — actively handling traffic for the cluster, not just admitting connections. This also aligns terminology with the partialInitialisationPolicy values (serve-none/serve-others). Also update compatibility section to reference the new policy name and values, and fix a stale description that referred to "connections completing" rather than "in-flight requests completing". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The future enhancements section now acknowledges that proxy-level lifecycle involves genuinely unsettled questions (port binding failure significance, health aggregation across deployment models) rather than presenting it as committed future work. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The previous text incorrectly stated that connection drops during draining cause idempotent producers to lose their session state. In fact, producers retain their PID and sequence numbers in client memory across reconnections — the broker deduplicates retries using (PID, epoch, sequence) per partition, not per connection. A new PID is only assigned on producer process restart. References: KIP-360, Idempotent Producer design doc, KIP-854. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
tombentley
left a comment
Thanks @SamBarker, I think this is looking pretty good. I had a question about the configuration API for the termination behaviour.
```yaml
proxy:
  partialInitialisationPolicy: serve-none # default — any cluster failure prevents serving traffic
```
I'm on-board with the broad concept, but:
- I don't find this name to be very intuitive
- Is there an equivalent thing which defines how the proxy handles failures during reload? I don't think we'd really need a separate config for that, (or did you have a use case in mind?)
This really seems to be about specifying the user's tolerance for virtual clusters entering/remaining in the failed state. Currently the state machine diagram shows a retry loop, but the way you're describing it here is that loop is, in actual fact, the kubernetes pod crash loop. And therefore that control loop does not exist when the proxy is not running in kube.
Perhaps we should borrow some ideas from how Kubernetes allows pods to fail, but apply them directly in the proxy (whether or not it happens to be running in Kube). Specifically, some number of retries, a retry delay ± a backoff. And implicitly we terminate when we run out of retries for any VC. (You can configure the number of retries as Integer.MAX_VALUE if you don't want to terminate). We can add some other policy than terminate later on if it's needed.
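The retry scheme suggested above (a retry count, a delay, and a multiplicative backoff, terminating when retries run out) could be sketched as follows. The names and defaults are illustrative only, not proposed configuration or API.

```java
// Hypothetical sketch of a Kubernetes-style retry policy applied inside the
// proxy: each failed initialisation attempt waits a little longer, and once
// maxRetries is exhausted the proxy terminates (or applies another policy).
class RetryPolicy {

    final int maxRetries;           // Integer.MAX_VALUE to retry forever
    final long initialDelayMillis;  // delay before the first retry
    final double backoffMultiplier; // growth factor per attempt

    RetryPolicy(int maxRetries, long initialDelayMillis, double backoffMultiplier) {
        this.maxRetries = maxRetries;
        this.initialDelayMillis = initialDelayMillis;
        this.backoffMultiplier = backoffMultiplier;
    }

    /** Delay before the given retry attempt (1-based), or -1 when exhausted. */
    long delayFor(int attempt) {
        if (attempt > maxRetries) {
            return -1; // out of retries for this VC: escalate/terminate
        }
        return (long) (initialDelayMillis * Math.pow(backoffMultiplier, attempt - 1));
    }
}
```

For example, a policy of 3 retries starting at 1s with a 2.0 multiplier waits 1s, 2s, then 4s before giving up.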
I can see where you're coming from. I had hoped that initialisation would be read as VC initialisation rather than proxy startup.
How about restructuring as:

```yaml
proxy:
  onVirtualClusterFailure:
    serve: none # default
    # serve: remaining # keep serving healthy clusters
```

onVirtualClusterFailure avoids ambiguity with target clusters. The values describe the proxy-level consequence rather than the individual cluster's fate.
I think retry is orthogonal to this policy. serve answers "what about the rest of the proxy?" while retry answers "what about the failed cluster?" They compose independently. By making onVirtualClusterFailure a structured block rather than a flat string, new keys can be added alongside serve in future proposals without breaking existing configs. For example:

```yaml
proxy:
  onVirtualClusterFailure:
    serve: remaining
    # future extensions — new keys, existing configs unaffected:
    # retry:
    #   maxRetries: 3
    #   backoff: 5s
    # escalate:
    #   url: http://example.slack.com/
    #   messageFormat: "%s failed with %s"
    #   args: [VIRTUAL_CLUSTER, EXCEPTION_MESSAGE]
```

This proposal defines serve with its two values. The rejected alternative on automatic retry still applies — we're not committing to implementing retry here, but the config structure leaves room for it without a breaking change.
1. All `serving` clusters transition to `draining`.
2. All `failed` clusters transition directly to `stopped`.
3. New connections are rejected for draining clusters. New requests on existing connections are also rejected.
4. For each existing connection, the proxy waits for in-flight requests to complete, up to a configurable drain timeout. The proxy takes the same view of in-flight as the Kafka client: a request is in-flight from the moment the client sends it until the client receives a response.
The proxy cannot know when/whether the client receives a response that's been forwarded to it (at least not without inferring that fact from the receipt of another request, but that defeats the point).
In any case, I just don't see that this can really be completely solved by the proxy in its current shape. To solve it properly I think we'd need to "virtualise the brokers". That would allow the proxy to leverage Kafka's own technique for shutting down brokers cleanly: Tell the client that this broker is no longer the leader or coordinator for anything, so that all the clients update their metadata to find the new leader/coordinator. Doing that would require a proper cluster-of-proxies, which is not something we have today.
So I think what you're describing here is not perfect, but it's as good as we're likely to get for now, and we shouldn't block this proposal while we figure out a more perfect solution.
Agreed on both points. "As best it can" now hedges the in-flight definition — the proxy observes its own reads and writes, not what the client has actually sent or received.
And yes, virtualising the brokers would let the proxy leverage Kafka's own clean shutdown mechanism (leader migration via metadata updates). That would be a really compelling capability if we can figure out how to make it work. For now, best-effort draining with a timeout backstop is as good as we can get without it.
Add liveness probes alongside readiness probes in the observability discussion. Hedge in-flight definition to acknowledge the proxy observes its own reads/writes, not the client's. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Restructure as a block to separate orthogonal concerns: `serve` controls proxy-level behaviour (none vs remaining), leaving room for future extensions like retry and escalation as additional keys. Rename avoids ambiguity with target clusters and better describes the trigger (a virtual cluster failure) rather than the mechanism. Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com> Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
```yaml
    # serve: remaining # serve clusters that initialised successfully
```

When `serve` is set to `remaining`, the proxy serves traffic for clusters that initialised successfully, while reporting failed clusters via health endpoints and logs. Kubernetes readiness probes, liveness probes, or monitoring systems can apply their own thresholds (e.g. "all clusters must be serving" vs "at least one cluster must not be failed"). The Kubernetes operator would typically set this policy.
Can you help me understand how the experience will look with rollback on failure for hot-reload?
This is what I think would happen if we mark it as "remaining"
Eg:-
- There are 3 virtual clusters running (cluster-1, cluster-2, cluster-3)
- We apply a change to cluster-2
- cluster-2 fails to come up, initiate rollback
- cluster-1, cluster-3 will still continue to serve the traffic during this rollback phase
- cluster-2 is rolled back to previous state and is up and running
The only risk I see with "remaining" is during app startup: we will have to put in some extra logic to ensure that the entire start state is stable.
Summary
Introduces proposal 016 defining a lifecycle state model for virtual clusters. Virtual clusters are modelled as entities in the Domain-Driven Design sense — each has a persistent identity (its name) that carries through state changes. Reload is a transition the entity passes through, not a replacement of it.
The model defines five states: `initializing`, `serving`, `draining`, `failed`, and `stopped`.
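The five states (initializing, serving, draining, failed, stopped) and the transitions described in this PR can be sketched as a small state machine. This is illustrative only; the proposal deliberately omits the internal Java representation, and the transition set here is assumed from the summary (reload routes through draining then initializing; failed clusters may retry or stop; stopped is terminal).

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the per-virtual-cluster lifecycle, not the
// proposal's API. Transitions are inferred from the PR discussion.
enum LifecycleState {
    INITIALIZING, SERVING, DRAINING, FAILED, STOPPED;

    Set<LifecycleState> successors() {
        switch (this) {
            case INITIALIZING: return EnumSet.of(SERVING, FAILED);
            case SERVING:      return EnumSet.of(DRAINING);
            // Reload re-initialises after draining; shutdown stops.
            case DRAINING:     return EnumSet.of(INITIALIZING, STOPPED);
            // A failed cluster may be retried, or stopped on shutdown/removal.
            case FAILED:       return EnumSet.of(INITIALIZING, STOPPED);
            case STOPPED:      return EnumSet.noneOf(LifecycleState.class); // terminal
            default:           throw new AssertionError(this);
        }
    }

    boolean canTransitionTo(LifecycleState next) {
        return successors().contains(next);
    }
}
```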
Key Design Decisions
- `serving` not `accepting` — the state name reflects that the proxy is actively serving traffic, not merely open to new connections
- `serving` is not split into healthy/degraded; runtime health is orthogonal to lifecycle and better addressed by readiness probes and metrics. Circuit-breaking is noted as a manifestation of the same concern.
- `partialInitialisationPolicy: serve-none | serve-others` — governs proxy behaviour when cluster initialisation fails, whether on startup or reload. Default is `serve-none` (matches current behaviour).
- `stopped` is terminal — reload routes through `draining` → `initializing`, never through `stopped`

Relationship to Other Proposals
This provides a foundation for 012 - Hot Reload, which can define reload transitions on this state model. Note: the Hot Reload proposal's "Cluster modification: remove + add" framing describes internal runtime mechanics; from the lifecycle model's perspective the named entity persists through modification.
Test plan