[bug]: Leadership election faulty when network timeout issues present #784

@slacki123

Describe the bug

This is still related to the recently closed #677.

It is possible to get the operator into a state where leader election with high availability does not behave as expected.

To reproduce

  1. create an operator in Kubernetes with two pods
  2. let the operator do some work on custom resources in a namespace other than its own
  3. apply a network policy to the namespace the operator is deployed in, to stop its network traffic:
cat <<EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-kube-api
  namespace: my-operator-namespace
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 10.0.0.1/32
    ports:
    - protocol: TCP
      port: 443
EOF
  4. wait for a while, until the timeout errors appear (takes about 10 minutes or so)
  5. remove the network policy

At this point you will see that either both pods are acting as leaders and both are processing resources, or neither pod does any work until one of them is restarted.

Expected behavior

When the deny network policy is applied for 15+ minutes and then removed, only one pod should continue processing while the other pod stays idle.

It would also be okay if the process exited with an error after a number of unsuccessful retries, but that is up for a wider discussion.
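
To make the expected behavior concrete, here is a minimal sketch of a lease-style election. It is a toy model only, not the operator's actual implementation. `Lease`, `Candidate`, `simulate`, and the 15 s / 10 s timings are all hypothetical names and values chosen for illustration. The rule it models: a pod that cannot reach the API gives up leadership once its renew deadline passes, so after connectivity returns at most one pod holds the lease.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Lease:
    """Shared lock record (a stand-in for a Kubernetes Lease object)."""
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration: float = 15.0  # illustrative: seconds a renewal stays valid


class Candidate:
    """One operator pod taking part in the election (hypothetical model)."""

    def __init__(self, name: str, lease: Lease, renew_deadline: float = 10.0):
        self.name = name
        self.lease = lease
        self.renew_deadline = renew_deadline  # illustrative value
        self.is_leader = False
        self.last_renew = 0.0

    def tick(self, now: float, api_reachable: bool) -> None:
        if not api_reachable:
            # The lease cannot be renewed: step down once the deadline passes
            # instead of continuing to act as leader on stale information.
            if self.is_leader and now - self.last_renew > self.renew_deadline:
                self.is_leader = False
            return
        lease = self.lease
        expired = lease.holder is None or now - lease.renew_time > lease.duration
        if lease.holder == self.name or expired:
            # Acquire a free/expired lease, or renew our own.
            lease.holder = self.name
            lease.renew_time = now
            self.last_renew = now
            self.is_leader = True
        else:
            # Another pod holds a valid lease: stay idle.
            self.is_leader = False


def simulate(outage_start: int = 20, outage_end: int = 920,
             end: int = 960, step: int = 2) -> List[Tuple[int, List[str]]]:
    """Run two pods through a ~15 minute API outage; return leaders per tick."""
    lease = Lease()
    pods = [Candidate("pod-a", lease), Candidate("pod-b", lease)]
    history = []
    for now in range(0, end, step):
        reachable = not (outage_start <= now < outage_end)
        for pod in pods:
            pod.tick(float(now), reachable)
        history.append((now, [p.name for p in pods if p.is_leader]))
    return history


if __name__ == "__main__":
    for now, leaders in simulate():
        print(now, leaders)
```

In this model there is never more than one leader at any tick, and exactly one pod resumes processing once the outage ends. The bug described above corresponds to a pod that keeps `is_leader` set, or never re-enters the acquire path, after the outage.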

Screenshots

No response

Additional Context

No response
