This repository was archived by the owner on Dec 31, 2025. It is now read-only.
Kube-apiserver sends two requests to etcd, one second apart, every 10 seconds [0]. When the etcd certs on disk change (e.g. after `etcdadm reset` and `etcdadm init` are invoked), these requests are rejected [1]. This appears to have some impact on etcd performance (still investigating).
When cctl recovers an etcd cluster, it first brings down the existing (potentially degraded) cluster and then brings it back up, which changes the CA certs on disk. When etcd performance is impacted, adding a third member can fail (though it typically succeeds on retry).
The workaround may require
Action Items:
- Investigate whether the rejected requests impact etcd performance. If they do, cctl can stop all kube-apiserver instances before (not after) recovering the etcd cluster.
- Consider adding retries to etcdadm's etcd API calls.
- Modify the recovery test to use three masters instead of two, and use the etcd benchmark tool to increase the size of the database.
[0] etcd-io/etcd#9285.
[1]