[Feature] Readiness Probe in multi-node etcd  #7

@ishan16696

Feature (What you would like to be added):
Currently, the readinessProbe of etcd points to the /healthz endpoint of the HTTP server running in the backup sidecar.
This behaviour needs to change: the readinessProbe of a clustered etcd should depend on whether an etcd leader is present, and a member should serve incoming write requests only when one is.

Motivation (Why is this needed?):

Approach/Hint to the implement solution (optional):
Approaches :

  1. ETCDCTL_API=3 etcdctl endpoint health --endpoints=${ENDPOINTS} --command-timeout=Xs
    The etcdctl endpoint health command performs a GET on the "health" key (source).

    • It fails when there is no etcd leader or when quorum is lost, because the GET request cannot be served without an etcd leader.

    Advantages of this method (etcdctl endpoint health):

    • We no longer have to worry about a snapshotter failure causing an outage, because it won't fail the readinessProbe of etcd.
    • If quorum is lost, the kubelet will mark the etcd members as NotReady, and they won't be able to serve write or read requests.

    Disadvantages of this method (etcdctl endpoint health):

    • The owner check feature depends on the /healthz endpoint of the HTTP server: when the owner check fails, it fails the readinessProbe of etcd by setting the HTTP status to 503. However, the owner check in the multi-node scenario is already being discussed here.
    • It completely decouples the snapshotter of the backup sidecar from the readinessProbe of etcd, so the backup sidecar won't be able to control when to let traffic in.
  2. /health endpoint of etcd.
    The /health endpoint returns false if one of the following conditions is met (source):

    • There is no etcd leader, or a leader election is in progress.
    • The latency of a QGET request exceeds 1 second.

    Advantages and disadvantages of method 2 (/health endpoint):

    • Similar to method 1.
  3. Use the /healthz endpoint of the HTTP server running in the backup sidecar, modified so that whenever a backup-restore leader is elected it sets the HTTP server status to 200 for itself as well as for all backup-restore followers, and sets the HTTP server status to 503 when there is no etcd leader present.
    Advantages of this method (/healthz):

    • We keep some coupling between the snapshotter of the backup sidecar and the readinessProbe of etcd, so the backup sidecar will still be able to control when to let traffic in for etcd.

    Disadvantages of this method (/healthz):

    • It will take time to implement and to handle the edge cases.

    Future Scope:

    • Go with method 2, as it gives us the flexibility to set the readinessProbe from the backup sidecar and to switch to gRPC instead of sending REST requests.
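For method 2, whatever performs the probe has to turn the /health response into a ready or not-ready decision. The Go sketch below shows one way to do that, assuming the response body is JSON with a string field such as `{"health":"true"}` and that any non-200 status means not ready. The names `healthResponse`, `isReady` and `probe` are hypothetical, and the server in `main` is a local stand-in, not a real etcd member.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// healthResponse mirrors the JSON body assumed for the /health endpoint,
// e.g. {"health":"true"}. Any additional fields are ignored.
type healthResponse struct {
	Health string `json:"health"`
}

// isReady reports whether a /health response counts as ready:
// HTTP 200 and a body whose "health" field equals "true".
func isReady(statusCode int, body []byte) (bool, error) {
	if statusCode != http.StatusOK {
		return false, nil
	}
	var hr healthResponse
	if err := json.Unmarshal(body, &hr); err != nil {
		return false, fmt.Errorf("decoding health response: %w", err)
	}
	return hr.Health == "true", nil
}

// probe sends a GET to url and applies isReady to the response.
func probe(client *http.Client, url string) (bool, error) {
	resp, err := client.Get(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return isReady(resp.StatusCode, body)
}

func main() {
	// Local stand-in for a member that has no leader (assumption: such a
	// member answers 503 with {"health":"false"}).
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprint(w, `{"health":"false"}`)
	}))
	defer srv.Close()

	// The timeout plays the role of the probe timeout.
	client := &http.Client{Timeout: 2 * time.Second}
	ready, err := probe(client, srv.URL+"/health")
	fmt.Println("ready:", ready, "err:", err)
}
```

Keeping the decision in `isReady`, separate from the HTTP call, would make it easier to swap the transport later, for example if the check moves from REST to gRPC as mentioned above.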

Metadata

    Labels

    • kind/enhancement: Enhancement, improvement, extension
    • lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
    • priority/4: Priority (lower number equals higher priority)
    • status/accepted: Issue was accepted as something we need to work on