Investigate: Oracle Linux 9/10 multi-master k3s fails on OCI — port 2380 blocked between nodes #26

@lexfrei

Summary

When provisioning a 3-node multi-master Cozystack cluster on Oracle Cloud Infrastructure (OCI) with examples/rhel/site.yml, the k3s embedded etcd cannot establish peer communication: the two joining server nodes cannot reach port 2380 (etcd peer) on the bootstrap server. Port 6443 (kube-apiserver) is reachable over the same VNIC. The same Tofu/OCI configuration with Ubuntu 22.04 / 24.04 brings all 87/87 HelmReleases to Ready.

Reproduction

  • 3-node OCI cluster (VM.Standard3.Flex, 4 OCPU / 32 GiB / 256 GB), Oracle Linux 9.7 or 10.1 with default UEK kernel
  • Same VCN, same subnet; NSG allows all ingress from 0.0.0.0/0 and all egress to 0.0.0.0/0
  • All three nodes in the server inventory group
  • cozystack_flush_iptables: true
  • Run examples/rhel/site.yml

Result: the joining nodes (server[1], server[2]) fail with: MemberAdd request timed out, transport: authentication handshake failed: context deadline exceeded. From one of the OL joining nodes, bash -c 'echo > /dev/tcp/<server-private-ip>/2380' reports BLOCKED (the TCP SYN times out with no SYN-ACK). The same test against :6443 succeeds.
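The probe above can be wrapped so that a silent drop (timeout) is told apart from an active refusal. This is only a sketch: the probe function, the SERVER_IP variable and the 3-second timeout are assumptions made for illustration, not part of the playbook.

```shell
#!/usr/bin/env bash
# Sketch: probe a TCP port through bash's /dev/tcp pseudo-device.
# SERVER_IP is a hypothetical variable holding the bootstrap server's private IP.

probe() {
  local host=$1 port=$2 rc
  # timeout(1) bounds the connect attempt; a dropped SYN would otherwise hang for minutes.
  timeout 3 bash -c 'echo > "/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
  rc=$?
  case $rc in
    0)   echo "OPEN" ;;     # three-way handshake completed
    124) echo "TIMEOUT" ;;  # timeout(1) killed the attempt: no SYN-ACK and no RST seen
    *)   echo "FAILED" ;;   # connect failed quickly, e.g. connection refused
  esac
}

# Only probe when a target was supplied, so sourcing this file has no side effects.
if [ -n "${SERVER_IP:-}" ]; then
  for port in 6443 2380; do
    printf '%s:%s %s\n' "$SERVER_IP" "$port" "$(probe "$SERVER_IP" "$port")"
  done
fi
```

Run it as SERVER_IP=<server-private-ip> bash probe.sh from a joining node. TIMEOUT on 2380 next to OPEN on 6443 would match the behaviour described here.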

What is not the cause

  • iptables INPUT is empty / policy ACCEPT after the playbook flushes (verified)
  • firewalld and nftables services are inactive (verified)
  • NSG / security list permit all (verified — same as the working Ubuntu cluster)
  • etcd is listening on the external IP (ss -lnt confirms LISTEN ... <ip>:2380)
  • Local connect to <own-ip>:2380 from the same node works
  • iptables-save shows no REJECT/DROP except harmless KUBE-FIREWALL (loopback only) and OVN-POSTROUTING (set-bound)
  • iptables-save does emit "# Warning: iptables-legacy tables present, use iptables-legacy-save to see them", but iptables-legacy-save is not shipped in the package, so the legacy tables could not be inspected
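Without iptables-legacy-save, the names of the registered legacy tables may still be readable from /proc. The sketch below assumes, to my knowledge of the iptables-nft frontend, that the warning is triggered by entries in /proc/net/ip_tables_names (and the IPv6 equivalent). The legacy_tables helper is hypothetical, and the files are normally readable by root only.

```shell
#!/usr/bin/env bash
# Sketch: list tables registered with the legacy x_tables backend.
# Assumption: these /proc files are what iptables-nft inspects before printing its warning.

legacy_tables() {
  local f=$1
  # Print the table names, one per line; print nothing if the file is missing or unreadable.
  if [ -r "$f" ]; then cat "$f" 2>/dev/null; fi
}

for f in /proc/net/ip_tables_names /proc/net/ip6_tables_names; do
  tables=$(legacy_tables "$f")
  if [ -n "$tables" ]; then
    # $tables is left unquoted on purpose so the names are joined on one line.
    echo "legacy tables in $f:" $tables
  fi
done
```

This only shows which legacy tables exist (for example filter or nat), not the rules inside them, so it narrows the hypothesis rather than settling it.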

Hypotheses to investigate

  • SELinux in Enforcing mode on OL is blocking the unprivileged TLS handshake to the etcd peer port (Ubuntu has AppArmor with permissive defaults)
  • A hidden iptables-legacy ruleset is filtering at the kernel netfilter layer
  • oracle-cloud-agent adds packet filtering or a network policy that is not visible via standard tooling
  • The OCI VNIC source/destination check or stateful tracking interacts with the OL kernel differently than with Ubuntu
  • A Path MTU / TCP MSS issue specific to the OL kernel breaks the etcd peer TLS handshake (large packets)
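A few of these hypotheses can be checked from a joining node with standard tools. The sketch below is a starting point under stated assumptions: PEER_IP and IFACE are hypothetical variables, icmp_payload is a helper introduced here, and getenforce, ausearch and the iputils ping are assumed to be installed. Since a bare SYN is small, an MTU problem alone would not normally explain a missing SYN-ACK, so the MTU check is mainly there to rule that hypothesis out.

```shell
#!/usr/bin/env bash
# Sketch: quick checks for the SELinux and MTU hypotheses.
# PEER_IP (bootstrap server's private IP) and IFACE (primary NIC name) are placeholders.

# Largest ICMP echo payload that fits in one IPv4 packet of the given MTU:
# subtract the 20-byte IPv4 header and the 8-byte ICMP header.
icmp_payload() {
  echo $(( $1 - 28 ))
}

if [ -n "${PEER_IP:-}" ] && [ -n "${IFACE:-}" ]; then
  # SELinux: print the current mode, then recent AVC denials (needs root and auditd).
  command -v getenforce >/dev/null && getenforce
  command -v ausearch >/dev/null && ausearch -m avc -ts recent

  # Path MTU: send full-size packets with the Don't Fragment bit set.
  mtu=$(cat "/sys/class/net/${IFACE}/mtu")
  ping -c 3 -M do -s "$(icmp_payload "$mtu")" "$PEER_IP"

  # To see whether the SYN reaches the bootstrap server at all, run there:
  #   tcpdump -ni <iface> 'tcp port 2380'
fi
```

If the SYN shows up in tcpdump on the bootstrap server but no SYN-ACK leaves, the drop is local to that host (netfilter, SELinux or an agent); if it never arrives, the OCI network path is the more likely suspect.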

What works

Ubuntu 22.04 / 24.04 on OCI with the same module: 3-node multi-master, 87/87 HelmReleases Ready. Documented in README.

Scope

Out of scope for the feat/node-prerequisites PR: the playbook prepares the nodes correctly, and the failure is at the OS/cloud interaction layer. Filed for separate investigation.
