[Feature] Readiness Probe in multi-node etcd 

**Feature (What you would like to be added):**
Currently, the [readinessProbe of etcd](https://github.com/gardener/etcd-druid/blob/master/charts/etcd/templates/etcd-statefulset.yaml#L55-L60) points to the `/healthz` endpoint of the [HTTP server](https://github.com/gardener/etcd-backup-restore/blob/322d3c74fcd5ea7be9961ce5e9d7ffed48ecc34e/pkg/server/httpAPI.go#L168) running in the backup sidecar.
This behaviour needs to be improved: the readinessProbe of a clustered (multi-node) etcd should depend on whether an etcd leader is present, and a member should serve incoming write requests only when one is.

**Motivation (Why is this needed?):**

**Approach/Hint to implement the solution (optional):**
Approaches:
1.  `ETCDCTL_API=3 etcdctl endpoint health --endpoints=${ENDPOINTS} --command-timeout=Xs`
The `etcdctl endpoint health` command performs a GET on the "health" key ([source](https://github.com/etcd-io/etcd/blob/48ac46dab517cb5f97905e19a31ca7dfdbf265a4/etcdctl/ctlv3/command/ep_command.go#L134-L137)).
    - It fails when there is no etcd leader or when quorum is lost, because the GET request fails if no etcd leader is present.

     **Advantages** of this method (`etcdctl endpoint health`):
       -  We no longer have to worry about [scenarios such as this one](https://github.com/gardener/etcd-druid/issues/147) causing an outage, because a snapshotter failure no longer fails the readinessProbe of etcd.
       -  If quorum is lost, kubelet marks the `etcd-members` as `NotReady`, and they cannot serve write or read requests.
       
      **Disadvantages** of this method (`etcdctl endpoint health`):
      - The [owner check feature](https://github.com/gardener/etcd-backup-restore/blob/322d3c74fcd5ea7be9961ce5e9d7ffed48ecc34e/pkg/server/backuprestoreserver.go#L261-L267) depends on the `/healthz` endpoint of the [HTTP server](https://github.com/gardener/etcd-backup-restore/blob/322d3c74fcd5ea7be9961ce5e9d7ffed48ecc34e/pkg/server/httpAPI.go#L168): when the owner check fails, it fails the readinessProbe of etcd by setting the [HTTP status to 503](https://github.com/gardener/etcd-backup-restore/blob/322d3c74fcd5ea7be9961ce5e9d7ffed48ecc34e/pkg/server/backuprestoreserver.go#L267). However, the owner check in the multi-node scenario is already being discussed [here](https://github.com/gardener/etcd-druid/issues/242).
      -  It completely decouples the snapshotter of the backup sidecar from the readinessProbe of etcd, so the backup sidecar can no longer control when traffic is let in.
     
2.  `/health` endpoint of etcd.
The `/health` endpoint returns `false` if one of the following conditions is met ([source](https://github.com/etcd-io/etcd/blob/v3.4.14/etcdserver/api/etcdhttp/metrics.go#L106-L119)):
    - There is no etcd leader, or a leader election is in progress.
    - The latency of a QGET request exceeds 1 second.

     **Advantages** and **disadvantages** of method 2 (`/health` endpoint):
       -  Similar to method 1.
  
3.  Use the `/healthz` endpoint of the [HTTP server](https://github.com/gardener/etcd-backup-restore/blob/322d3c74fcd5ea7be9961ce5e9d7ffed48ecc34e/pkg/server/httpAPI.go#L168) running in the backup sidecar, modified so that whenever a `backup-restore leader` is elected, it sets the HTTP status to `200` for itself as well as for all `backup-restore followers`, and sets the HTTP status to `503` when no etcd leader is present.

    **Advantages** of this method (`/healthz`):
     -  We still have some coupling between the `snapshotter` of the backup sidecar and the `readinessProbe` of etcd, so the backup sidecar can control when traffic is let in to etcd.

    **Disadvantages** of this method (`/healthz`):
    - It will take time to implement and to handle the edge cases.

    **Future Scope:**
    - Go with method 2, as it gives us the flexibility to set the `readinessProbe` from the backup sidecar and to switch to gRPC instead of sending REST requests.
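
To make method 2 more concrete, here is a minimal Go sketch of a client-side check against etcd's `/health` endpoint. It assumes the response body has the form `{"health":"true"}` or `{"health":"false"}`, as in the etcd v3.4 source linked above. The names `healthFromBody` and `checkEtcdHealth` are hypothetical and not part of etcd or etcd-backup-restore, and the stub server in `main` only stands in for a real etcd member.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// healthFromBody reports whether an etcd /health response body signals a
// healthy member. Assumed body shape: {"health":"true"} or {"health":"false"}.
func healthFromBody(body []byte) (bool, error) {
	var resp struct {
		Health string `json:"health"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decode /health response: %w", err)
	}
	return resp.Health == "true", nil
}

// checkEtcdHealth (hypothetical helper) performs one GET against
// baseURL + "/health" and interprets the body, bounded by timeout.
func checkEtcdHealth(baseURL string, timeout time.Duration) (bool, error) {
	client := &http.Client{Timeout: timeout}
	res, err := client.Get(baseURL + "/health")
	if err != nil {
		return false, err
	}
	defer res.Body.Close()

	body, err := io.ReadAll(res.Body)
	if err != nil {
		return false, err
	}
	return healthFromBody(body)
}

func main() {
	// Stub standing in for an etcd member that currently has no leader.
	stub := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusServiceUnavailable)
		fmt.Fprint(w, `{"health":"false"}`)
	}))
	defer stub.Close()

	healthy, err := checkEtcdHealth(stub.URL, 2*time.Second)
	fmt.Println("ready:", healthy, "err:", err)
}
```

The sketch decides on the body rather than on the HTTP status code, so it does not rely on which status code etcd sends alongside an unhealthy response.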

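Method 3 could look roughly like the following Go sketch: a `/healthz` handler whose status code is switched between `503` and `200` depending on whether a leader is known. The type `healthzState` and its methods are hypothetical and do not reflect the actual etcd-backup-restore code. How the `backup-restore leader` would propagate the status to the followers is not specified in this issue and is left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// healthzState (hypothetical) holds the HTTP status that /healthz reports.
type healthzState struct {
	status int32 // read and written atomically
}

// newHealthzState starts in 503 so the probe fails until a leader is known.
func newHealthzState() *healthzState {
	return &healthzState{status: http.StatusServiceUnavailable}
}

// SetLeaderPresent switches the reported status to 200 when a leader is
// present and back to 503 when it is not.
func (h *healthzState) SetLeaderPresent(present bool) {
	if present {
		atomic.StoreInt32(&h.status, http.StatusOK)
		return
	}
	atomic.StoreInt32(&h.status, http.StatusServiceUnavailable)
}

// Status returns the status code the handler would currently send.
func (h *healthzState) Status() int {
	return int(atomic.LoadInt32(&h.status))
}

// ServeHTTP answers a probe request with the current status code.
func (h *healthzState) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(h.Status())
}

func main() {
	state := newHealthzState()

	// Probe before any leader is known.
	rec := httptest.NewRecorder()
	state.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	fmt.Println("before election:", rec.Code)

	// Probe after a leader has been reported.
	state.SetLeaderPresent(true)
	rec = httptest.NewRecorder()
	state.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	fmt.Println("after election:", rec.Code)
}
```

Starting in `503` is a design assumption of this sketch: a member stays `NotReady` until something positively reports a leader.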
