feat: Virtual cluster lifecycle state model #89

Open
SamBarker wants to merge 16 commits into kroxylicious:main from SamBarker:virtual-cluster-lifecycle

Conversation

Member

@SamBarker SamBarker commented Feb 17, 2026

Summary

Introduces proposal 016 defining a lifecycle state model for virtual clusters. Virtual clusters are modelled as entities in the Domain-Driven Design sense — each has a persistent identity (its name) that carries through state changes. Reload is a transition the entity passes through, not a replacement of it.

The model defines five states:

  • initializing — cluster being set up; not yet serving traffic
  • serving — proxy has completed setup and is serving traffic
  • draining — no new connections or requests accepted; existing in-flight requests given the opportunity to complete
  • failed — configuration not viable; all partially-acquired resources released
  • stopped — terminal; cluster permanently removed from configuration
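
Read as a state machine, the five states above could be sketched as follows. The transition set is my reading of the proposal text (reload routing serving → draining → initializing, failed retrying or stopping, stopped terminal) — an illustration, not Kroxylicious source code.

```java
// Sketch of the five-state lifecycle model. The transition set is an
// interpretation of the proposal text, not Kroxylicious source code.
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

public class LifecycleSketch {
    enum State { INITIALIZING, SERVING, DRAINING, FAILED, STOPPED }

    // stopped is terminal; reload routes through draining -> initializing.
    static final Map<State, EnumSet<State>> LEGAL = new EnumMap<>(State.class);
    static {
        LEGAL.put(State.INITIALIZING, EnumSet.of(State.SERVING, State.FAILED));
        LEGAL.put(State.SERVING, EnumSet.of(State.DRAINING));
        LEGAL.put(State.DRAINING, EnumSet.of(State.INITIALIZING, State.STOPPED));
        LEGAL.put(State.FAILED, EnumSet.of(State.INITIALIZING, State.STOPPED));
        LEGAL.put(State.STOPPED, EnumSet.noneOf(State.class));
    }

    static boolean canTransition(State from, State to) {
        return LEGAL.get(from).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(State.SERVING, State.DRAINING));     // true
        System.out.println(canTransition(State.STOPPED, State.INITIALIZING)); // false
    }
}
```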

Key Design Decisions

  • Virtual clusters are entities — each cluster's identifier is its name; the state machine tracks the entity, not a configuration instance. Renaming a cluster is a destructive operation.
  • serving not accepting — the state name reflects that the proxy is actively serving traffic, not merely open to new connections
  • Runtime health is out of scope — serving is not split into healthy/degraded; runtime health is orthogonal to lifecycle and better addressed by readiness probes and metrics. Circuit-breaking is noted as a manifestation of the same concern.
  • Draining takes the client's view of in-flight — a request is in-flight from the moment the client sends it until it receives a response. New requests on existing connections are rejected during drain to ensure the cluster quiesces.
  • Graceful drain is best-effort — the drain timeout is the hard backstop; aligns with Kafka ecosystem norms (brokers drain before stopping)
  • partialInitialisationPolicy: serve-none | serve-others — governs proxy behaviour when cluster initialisation fails, whether on startup or reload. Default is serve-none (matches current behaviour).
  • stopped is terminal — reload routes through draining → initializing, never through stopped
  • Observability is public API — management endpoint and metrics (current state, time in state, transition count) are committed to in the proposal

Relationship to Other Proposals

This provides a foundation for 012 - Hot Reload, which can define reload transitions on this state model. Note: the Hot Reload proposal's "Cluster modification: remove + add" framing describes internal runtime mechanics; from the lifecycle model's perspective the named entity persists through modification.

Test plan

  • Review state definitions for clarity and completeness
  • Verify state transition diagram consistency with transition descriptions
  • Review rejected alternatives for completeness
  • Consider whether future enhancements section adequately captures known follow-up work

@SamBarker SamBarker requested a review from a team as a code owner February 17, 2026 03:43
Member

@tombentley tombentley left a comment


Thanks @SamBarker, I think this is useful work, but I have doubts that we really need to have "healthy" as a state.


1. **Startup is all-or-nothing.** If one virtual cluster fails to start (e.g. port conflict, filter initialisation failure), the entire proxy process fails. Other clusters that could have started successfully are taken down with it.

2. **Shutdown is unstructured.** The proxy stops accepting connections and closes channels, but there is no formal draining phase that ensures in-flight Kafka requests complete before the connection is torn down.

Member


It's not immediately apparent to me how this would work in practice. Suppose I have a producer pipelining Produce requests. How do we intend to stop that flow in such a way that the producer doesn't end up with an unacknowledged request?

You mention later on about timeout, but:

  • the broker doesn't know what the client's timeout is configured to, so it doesn't know how long it needs to wait
  • if the client times out request 1 while thinking that 2, 3 and 4 are in flight, then it will send request 5

Member Author


The exact mechanics of how the proxy tracks in-flight requests per connection are an implementation concern rather than a design one. The design's position is that draining is best-effort: clients are given a time window to complete naturally, with the drain timeout as the hard backstop after which connections are closed regardless. Kafka clients are required to handle connection loss, so forced closure after timeout is protocol-compliant — the goal is to avoid unnecessary disruption, not to guarantee zero errors. I've clarified this in the proposal.
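
The best-effort-with-backstop shape described here can be sketched as follows; the polling loop and names are illustrative assumptions, not the proxy's actual in-flight tracking.

```java
// Sketch of best-effort draining: give in-flight requests a window to
// complete, with the drain timeout as the hard backstop. Illustrative
// only — not how Kroxylicious tracks in-flight requests internally.
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicInteger;

public class DrainSketch {
    // Returns true if the cluster quiesced naturally, false if the
    // backstop fired and connections must be closed regardless.
    static boolean awaitQuiescence(AtomicInteger inFlight, Duration timeout)
            throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (inFlight.get() > 0) {
            if (Instant.now().isAfter(deadline)) {
                return false; // hard backstop: force-close; clients must handle connection loss
            }
            Thread.sleep(5); // polling; a real implementation would use completion callbacks
        }
        return true;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(awaitQuiescence(new AtomicInteger(0), Duration.ofMillis(50))); // true
        System.out.println(awaitQuiescence(new AtomicInteger(3), Duration.ofMillis(50))); // false
    }
}
```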

Member


Sorry @SamBarker, but I think this does need to be explained.

In the reloading proposal it's been observed that idempotent (but not transactional) producers get their idempotency from the scope of a connection. So trying to minimise duplicate records is clearly a concern for some users, and I think it's only fair that you at least sketch how you think this will work, so everyone can see what the trade-offs are.

clients are given a time window to complete naturally

This is simply not the case, at least as far as I understand what's proposed.

For a real Kafka broker, the clean shutdown process involves the broker telling the controller that it needs to stop, the controller selects replacement leaders for all the partitions that that broker is currently leading, and only once the broker is not leading any partitions does it shut down. In that way producers avoid having any requests in-flight to the broker that's shutting down.

The proxy lacks that mechanism, which means that we are delivering a weaker idempotency guarantee under disconnections than Kafka does. That's an existing problem. But it will be made worse by building a dynamic restart feature which causes more frequent disconnections.

To me it looks like this is the proposal where we should be figuring out a solution and everyone can assure themselves that it will work well enough.

Member Author


Sorry, I've got local changes that were meant to go out in sync with this comment.

Also, I thought you were pushing for the opposite solution: let the protocol handle it all and skip the drain.

I'll push the changes shortly and come back to this.

Member Author


The idempotent producer concern is real and the proposal now acknowledges it explicitly. However, this is an existing proxy limitation that predates this proposal — any proxy restart today drops all connections with the same effect. A reload mechanism building on this lifecycle model actually improves the situation by confining disruption to a single virtual cluster rather than the whole process.

Solving the fundamental problem — replicating broker-style partition leader migration to avoid session disruption entirely — is a substantial piece of work well beyond the scope of a lifecycle model, with a multitude of possible solutions that the proxy cannot currently express. The proposal notes the limitation and directs users to consider how analogous situations (broker crash, network split) are handled without the proxy, since clients must already be prepared for those. Users requiring exactly-once guarantees across reconnections should use transactional producers.

Member Author


The draining section has been updated to clarify in-flight semantics and correct an earlier misstatement about idempotent producers.

On the specific concern about idempotent producers: a proxy-initiated drain is no different from any other connection loss event from the client's perspective. Idempotent producers retain their Producer ID and sequence numbers in client memory across reconnections — a connection drop does not start a new session. When the client reconnects after a drain, it retries unacknowledged requests with the same (PID, epoch, sequence) tuple, and the broker deduplicates them using its existing producer state. A new PID is only assigned on producer process restart, not on reconnection. The drain timeout is the hard backstop after which connections close, but this is protocol-compliant — Kafka clients are required to handle connection loss, and the idempotent producer's deduplication mechanism is designed to survive exactly this scenario.

References:

  • KIP-360 documents how the broker retains producer state and how idempotent producers can safely bump their epoch when broker state is lost — the key point being that the PID is retained client-side across reconnections.
  • The Idempotent Producer design doc confirms that deduplication is based on (PID, epoch, sequence) per partition, not per connection: "The broker maintains in memory the sequence numbers it receives for each topic partition from every PID."
  • KIP-854 covers broker-side producer state expiry — the state is retained well beyond any reasonable drain timeout.
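
As a toy illustration of the deduplication mechanism those references describe — broker state keyed by producer identity rather than by connection — consider the following sketch. It is deliberately simplified: real brokers retain metadata for a window of recent batches, not a single last sequence number.

```java
// Toy model of broker-side idempotent-producer dedup: state is keyed by
// (PID, epoch, partition) with a last-seen sequence, not by connection,
// so a retry after reconnect is recognised as a duplicate. Simplified —
// real brokers keep a window of recent batch metadata.
import java.util.HashMap;
import java.util.Map;

public class DedupSketch {
    record Key(long pid, short epoch, String topicPartition) {}
    private final Map<Key, Integer> lastSeq = new HashMap<>();

    // Returns true if the batch is appended, false if deduplicated.
    boolean append(long pid, short epoch, String tp, int seq) {
        Key key = new Key(pid, epoch, tp);
        Integer last = lastSeq.get(key);
        if (last != null && seq <= last) {
            return false; // duplicate retry of an already-appended batch
        }
        lastSeq.put(key, seq);
        return true;
    }

    public static void main(String[] args) {
        DedupSketch broker = new DedupSketch();
        System.out.println(broker.append(42L, (short) 0, "t-0", 0)); // true: first append
        // The client reconnects after a drain and retries the same batch:
        System.out.println(broker.append(42L, (short) 0, "t-0", 0)); // false: deduplicated
    }
}
```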

The lifecycle model actually improves the posture for idempotent producers compared to the status quo. Today, any configuration change requires a full process restart, dropping every connection across every virtual cluster. With per-cluster lifecycle and reload, only the affected cluster's connections drain — the blast radius shrinks from "all clusters" to "one cluster" and planned restarts become less frequent.

I think the proposal goes far enough on draining. The lifecycle model defines that draining exists as a state and its high-level semantics (reject new connections, reject new requests, allow in-flight to complete, timeout as backstop). The detailed mechanics — specific error codes for rejected requests, interaction with pipelining, TCP-level behaviour — are implementation concerns that don't change the lifecycle state model and are better resolved in the reload proposal where draining becomes a frequent operation rather than a rare shutdown event.

Member Author


@tombentley can we resolve this thread?


Some configuration changes will likely always require draining — for example, changes to the upstream cluster identity or TLS configuration that invalidate existing connections. The optimisation is about identifying changes where draining can be safely skipped, not eliminating it.

### Proxy-level lifecycle

Member


You've defined how proxy startup works with respect to the VC state machines. It's not really clear to me whether we need a proposal about the rest, unless you think that that state is also exposed via metrics/mgmt etc.

Member Author


Yes — proxy-level state would also be exposed via metrics and management endpoints, which is the argument for capturing it as future work. The per-cluster metrics defined in this proposal are one layer; a proxy-level lifecycle would add an aggregate layer above them covering port binding, process startup/shutdown sequencing, and overall proxy health. Defining it now felt premature, but it's a natural next step once the per-cluster model is established.

Member Author


The future enhancements section flags proxy-level lifecycle as a problem space that exists, not as work we're committed to. Whether proxy-level state needs its own formal model, observable via metrics/management endpoints, or is simply emergent from the per-VC lifecycles plus standard process management is an open question — and one that might not need answering.

Every proxy has some form of process-level lifecycle (HAProxy expresses it through old/new process coexistence during reload, nginx through worker generations, httpd through child worker replacement), but none of them formalise it as a state machine — it's just how startup and shutdown work. Kroxylicious may end up the same way: the proxy starts VCs, manages their lifecycles, and exits. If that's sufficient, no proposal is needed.

I've reworded the section to make it clearer that these are genuine open questions without obvious answers (port binding significance, health aggregation across deployment models) rather than planned future work.

Member Author


@tombentley similar question here, are we happy?

SamBarker added a commit to SamBarker/kroxylicious-design that referenced this pull request Mar 17, 2026
- Fix startup wording: clusters that fail to start never become available
  (they cannot be "taken down" if they never came up)
- Expand shutdown description to note Kafka protocol resilience
- Replace "not modelled" with "notional" for clarity
- Disambiguate "operator" → "Kubernetes operator"
- Remove internal Java representation section — implementation detail
  better suited to PR review, not design proposal

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker added a commit to SamBarker/kroxylicious-design that referenced this pull request Mar 17, 2026
Adds a paragraph explaining the value of draining (reducing unnecessary
errors during planned shutdown) and its ecosystem context — brokers drain
before stopping and operators expect the same. Makes explicit that the
drain timeout is the hard backstop and that forced closure after timeout
remains protocol-compliant.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker added a commit to SamBarker/kroxylicious-design that referenced this pull request Mar 17, 2026
Virtual clusters are entities (in the Domain-Driven Design sense) whose
identifier is their name. The state machine tracks the lifecycle of the
entity, not a configuration instance — a reload is a transition the
entity passes through, not a replacement of it.

Consequences made explicit:
- stopped means permanently removed from configuration, not just idle
- reload transitions happen on the same entity; recovery transition
  covers any retry regardless of what configuration is supplied
- renaming a cluster is a destructive operation; documentation should
  warn users of this

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker added a commit to SamBarker/kroxylicious-design that referenced this pull request Mar 17, 2026
Replace vague "implementation concern" deferral with concrete commitments:
- Management endpoint must expose per-cluster state and failure reason
- Metrics must capture current state, time in state, and transition count
  (with suggested names following Prometheus/Micrometer conventions)
- Proposal is intentionally non-exhaustive; implementations may add more

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker added a commit to SamBarker/kroxylicious-design that referenced this pull request Mar 17, 2026
The policy applies whenever clusters initialise — on first startup or
during reload — so startupPolicy was too narrow a name. Renamed to
partialInitialisationPolicy to reflect that it governs the proxy's
behaviour when initialisation is partial (some clusters failed, some
succeeded).

Values changed from fail-fast/best-effort to serve-none/serve-others,
which describe the outcome from the operator's perspective rather than
the mechanism.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker added a commit to SamBarker/kroxylicious-design that referenced this pull request Mar 17, 2026
…oxylicious#89

Circuit-breaking: note under the 'runtime health as lifecycle state'
rejected alternative that circuit-breaking is a manifestation of runtime
health concerns — the same reasoning that excludes runtime health from
the lifecycle model excludes circuit-breaking with it.

In-flight: clarify that the proxy takes the same view of in-flight as
the Kafka client (sent but not yet acknowledged). Draining rejects both
new connections and new requests on existing connections — only
existing in-flight requests are completed, ensuring the cluster quiesces.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
@SamBarker SamBarker requested a review from a team as a code owner March 17, 2026 00:43
SamBarker added a commit to SamBarker/kroxylicious-design that referenced this pull request Mar 17, 2026
The proxy cannot replicate broker clean shutdown (partition leader
migration before connection close), so connection drops during draining
cause idempotent producers to lose session state on reconnect. This is
an existing proxy limitation; a reload mechanism building on this
lifecycle model can improve the situation by confining disruption to a
single virtual cluster rather than a full process restart.

The proposal now notes this explicitly and directs readers to consider
analogous broker crash/netsplit scenarios, which clients must already
handle. Users requiring exactly-once guarantees across reconnections
should use transactional producers.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
SamBarker and others added 12 commits March 31, 2026 11:09
Introduce proposal 014 defining a lifecycle state machine for virtual
clusters: initializing, degraded, healthy, draining, failed, and stopped.

This provides a foundation for resilient startup (best-effort mode),
graceful shutdown with drain timeouts, runtime health distinction
(degraded vs healthy), and configuration reload via the hot reload
proposal (012).

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
014 and 015 are already claimed by other proposals.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
…ve tenancy motivation

- Replace healthy/degraded states with single 'accepting' state that makes no
  health claims — lifecycle tracks what the proxy is doing, not runtime health
- Reframe motivation around failure domains rather than multi-tenancy
- Replace ASCII state diagram with excalidraw PNG
- Add 'Runtime health as lifecycle state' rejected alternative explaining why
  health is orthogonal to lifecycle

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Health criteria vary by deployment and may become per-destination
once request-level routing is in play. Defining a health model
prematurely would constrain future design without immediate value.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
- Fix startup wording: clusters that fail to start never become available
  (they cannot be "taken down" if they never came up)
- Expand shutdown description to note Kafka protocol resilience
- Replace "not modelled" with "notional" for clarity
- Disambiguate "operator" → "Kubernetes operator"
- Remove internal Java representation section — implementation detail
  better suited to PR review, not design proposal

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Adds a paragraph explaining the value of draining (reducing unnecessary
errors during planned shutdown) and its ecosystem context — brokers drain
before stopping and operators expect the same. Makes explicit that the
drain timeout is the hard backstop and that forced closure after timeout
remains protocol-compliant.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Virtual clusters are entities (in the Domain-Driven Design sense) whose
identifier is their name. The state machine tracks the lifecycle of the
entity, not a configuration instance — a reload is a transition the
entity passes through, not a replacement of it.

Consequences made explicit:
- stopped means permanently removed from configuration, not just idle
- reload transitions happen on the same entity; recovery transition
  covers any retry regardless of what configuration is supplied
- renaming a cluster is a destructive operation; documentation should
  warn users of this

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Replace vague "implementation concern" deferral with concrete commitments:
- Management endpoint must expose per-cluster state and failure reason
- Metrics must capture current state, time in state, and transition count
  (with suggested names following Prometheus/Micrometer conventions)
- Proposal is intentionally non-exhaustive; implementations may add more

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The policy applies whenever clusters initialise — on first startup or
during reload — so startupPolicy was too narrow a name. Renamed to
partialInitialisationPolicy to reflect that it governs the proxy's
behaviour when initialisation is partial (some clusters failed, some
succeeded).

Values changed from fail-fast/best-effort to serve-none/serve-others,
which describe the outcome from the operator's perspective rather than
the mechanism.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
…oxylicious#89

Circuit-breaking: note under the 'runtime health as lifecycle state'
rejected alternative that circuit-breaking is a manifestation of runtime
health concerns — the same reasoning that excludes runtime health from
the lifecycle model excludes circuit-breaking with it.

In-flight: clarify that the proxy takes the same view of in-flight as
the Kafka client (sent but not yet acknowledged). Draining rejects both
new connections and new requests on existing connections — only
existing in-flight requests are completed, ensuring the cluster quiesces.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
'Accepting' implies the proxy is merely open to new connections.
'Serving' better captures what the proxy is actually doing in this
state — actively handling traffic for the cluster, not just admitting
connections. This also aligns terminology with the
partialInitialisationPolicy values (serve-none/serve-others).

Also update compatibility section to reference the new policy name and
values, and fix a stale description that referred to "connections
completing" rather than "in-flight requests completing".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The proxy cannot replicate broker clean shutdown (partition leader
migration before connection close), so connection drops during draining
cause idempotent producers to lose session state on reconnect. This is
an existing proxy limitation; a reload mechanism building on this
lifecycle model can improve the situation by confining disruption to a
single virtual cluster rather than a full process restart.

The proposal now notes this explicitly and directs readers to consider
analogous broker crash/netsplit scenarios, which clients must already
handle. Users requiring exactly-once guarantees across reconnections
should use transactional producers.

In response to PR kroxylicious#89 review feedback.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The future enhancements section now acknowledges that proxy-level
lifecycle involves genuinely unsettled questions (port binding failure
significance, health aggregation across deployment models) rather than
presenting it as committed future work.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
The previous text incorrectly stated that connection drops during
draining cause idempotent producers to lose their session state.
In fact, producers retain their PID and sequence numbers in client
memory across reconnections — the broker deduplicates retries using
(PID, epoch, sequence) per partition, not per connection. A new PID
is only assigned on producer process restart.

References: KIP-360, Idempotent Producer design doc, KIP-854.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
@SamBarker SamBarker force-pushed the virtual-cluster-lifecycle branch from 2c59c54 to c22b1c8 Compare March 31, 2026 00:32
Member

@tombentley tombentley left a comment


Thanks @SamBarker, I think this is looking pretty good. I had a question about the configuration API for the termination behaviour.


```yaml
proxy:
  partialInitialisationPolicy: serve-none # default — any cluster failure prevents serving traffic
```
Member


I'm on-board with the broad concept, but:

  • I don't find this name to be very intuitive
  • Is there an equivalent thing which defines how the proxy handles failures during reload? I don't think we'd really need a separate config for that, (or did you have a use case in mind?)

This really seems to be about specifying the user's tolerance for virtual clusters entering/remaining in the failed state. Currently the state machine diagram shows a retry loop, but the way you're describing it here, that loop is, in actual fact, the Kubernetes pod crash loop. And therefore that control loop does not exist when the proxy is not running in Kube.

Perhaps we should borrow some ideas from how Kubernetes allows pods to fail, but apply them directly in the proxy (whether or not it happens to be running in Kube). Specifically, some number of retries, a retry delay ± a backoff. And implicitly we terminate when we run out of retries for any VC. (You can configure the number of retries as Integer.MAX_VALUE if you don't want to terminate.) We can add some other policy than terminate later on if it's needed.
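
The retry budget suggested above could look roughly like this as an in-proxy loop (illustrative names, not a proposed API):

```java
// Sketch of the suggested in-proxy retry loop: a bounded number of
// retries with a delay (optionally growing), terminating when the
// budget runs out. Names are illustrative, not a proposed API.
import java.util.function.BooleanSupplier;

public class RetrySketch {
    static boolean retryInit(int maxRetries, long delayMillis, BooleanSupplier init)
            throws InterruptedException {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (init.getAsBoolean()) {
                return true; // cluster initialised
            }
            Thread.sleep(delayMillis); // a backoff factor could grow this per attempt
        }
        return false; // budget exhausted: terminate, or apply another policy
    }

    public static void main(String[] args) throws InterruptedException {
        int[] attempts = {0};
        // Initialisation succeeds on the third attempt:
        boolean ok = retryInit(5, 1, () -> ++attempts[0] >= 3);
        System.out.println(ok); // true
    }
}
```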

Member Author


I can see where you're coming from. I had hoped that initialisation would be read as VC initialisation rather than proxy startup.

How about restructuring as:

```yaml
proxy:
  onVirtualClusterFailure:
    serve: none          # default
    # serve: remaining   # keep serving healthy clusters
```

onVirtualClusterFailure avoids ambiguity with target clusters. The values describe the proxy-level consequence rather than the individual cluster's fate.

I think retry is orthogonal to this policy. serve answers "what about the rest of the proxy?" while retry answers "what about the failed cluster?" They compose independently. By making onVirtualClusterFailure a structured block rather than a flat string, new keys can be added alongside serve in future proposals without breaking existing configs. For example:

```yaml
proxy:
  onVirtualClusterFailure:
    serve: remaining
    # future extensions — new keys, existing configs unaffected:
    # retry:
    #   maxRetries: 3
    #   backoff: 5s
    # escalate:
    #   url: http://example.slack.com/
    #   messageFormat: "%s failed with %s"
    #   args: [VIRTUAL_CLUSTER, EXCEPTION_MESSAGE]
```

This proposal defines serve with its two values. The rejected alternative on automatic retry still applies — we're not committing to implementing retry here, but the config structure leaves room for it without a breaking change.

1. All `serving` clusters transition to `draining`.
2. All `failed` clusters transition directly to `stopped`.
3. New connections are rejected for draining clusters. New requests on existing connections are also rejected.
4. For each existing connection, the proxy waits for in-flight requests to complete, up to a configurable drain timeout. The proxy takes the same view of in-flight as the Kafka client: a request is in-flight from the moment the client sends it until the client receives a response.
Member


The proxy cannot know when/whether the client receives a response that's been forwarded to it (at least not without inferring that fact from the receipt of another request, but that defeats the point).

In any case, I just don't see that this can really be completely solved by the proxy in its current shape. To solve it properly I think we'd need to "virtualise the brokers". That would allow the proxy to leverage Kafka's own technique for shutting down brokers cleanly: Tell the client that this broker is no longer the leader or coordinator for anything, so that all the clients update their metadata to find the new leader/coordinator. Doing that would require a proper cluster-of-proxies, which is not something we have today.

So I think what you're describing here is not perfect, but it's as good as we're likely to get for now, and we shouldn't block this proposal while we figure out a more perfect solution.

Member Author


Agreed on both points. "As best it can" now hedges the in-flight definition — the proxy observes its own reads and writes, not what the client has actually sent or received.

And yes, virtualising the brokers would let the proxy leverage Kafka's own clean shutdown mechanism (leader migration via metadata updates). That would be a really compelling capability if we can figure out how to make it work. For now, best-effort draining with a timeout backstop is as good as we can get without it.

Add liveness probes alongside readiness probes in the observability
discussion. Hedge in-flight definition to acknowledge the proxy
observes its own reads/writes, not the client's.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
Restructure as a block to separate orthogonal concerns: `serve`
controls proxy-level behaviour (none vs remaining), leaving room
for future extensions like retry and escalation as additional keys.

Rename avoids ambiguity with target clusters and better describes
the trigger (a virtual cluster failure) rather than the mechanism.

Assisted-by: Claude claude-opus-4-6 <noreply@anthropic.com>
Signed-off-by: Sam Barker <sam@quadrocket.co.uk>
```yaml
# serve: remaining # serve clusters that initialised successfully
```

When `serve` is set to `remaining`, the proxy serves traffic for clusters that initialised successfully, while reporting failed clusters via health endpoints and logs. Kubernetes readiness probes, liveness probes, or monitoring systems can apply their own thresholds (e.g. "all clusters must be serving" vs "at least one cluster must not be failed"). The Kubernetes operator would typically set this policy.
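
The two example thresholds in that paragraph could be expressed as simple predicates over per-cluster states; the policy names below are illustrative, not part of the proposal.

```java
// Sketch of applying monitoring-side thresholds over per-cluster
// lifecycle states. The two policies mirror the examples in the text;
// names are illustrative, not part of the proposal's API.
import java.util.List;

public class ReadinessSketch {
    enum State { INITIALIZING, SERVING, DRAINING, FAILED, STOPPED }

    // Strict: every configured cluster must be serving.
    static boolean allServing(List<State> clusters) {
        return clusters.stream().allMatch(s -> s == State.SERVING);
    }

    // Lenient: ready as long as at least one cluster is not failed.
    static boolean anyNotFailed(List<State> clusters) {
        return clusters.stream().anyMatch(s -> s != State.FAILED);
    }

    public static void main(String[] args) {
        List<State> states = List.of(State.SERVING, State.FAILED);
        System.out.println(allServing(states));   // false: one cluster failed
        System.out.println(anyNotFailed(states)); // true: one cluster still serving
    }
}
```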

Can you help me understand how the experience will look with rollback on failure for hot-reload?

This is what I think would happen if we mark it as "remaining"

E.g.:

  1. There are 3 virtual clusters running (cluster-1, cluster-2, cluster-3)
  2. We apply a change to cluster-2
  3. cluster-2 fails to come up, initiate rollback
  4. cluster-1, cluster-3 will still continue to serve the traffic during this rollback phase
  5. cluster-2 is rolled back to previous state and is up and running

The only risk I see with "remaining" is during app startup: we will have to put in some extra logic to ensure that the entire start state is stable.
