-
Notifications
You must be signed in to change notification settings - Fork 12
ISSUE-186: Set CNI_MTU=1400 on kindnet daemonset to fix TCP DNS issues #193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the 📝 WalkthroughWalkthroughInserts a CI step in the CNCF conformance workflow that configures node networking to MTU=1400, updates interface MTUs, restarts kube-proxy and CoreDNS with rollouts and waits, pauses for stabilization, and verifies TCP-based DNS resolution from both nodes. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/cncf-conformance.yaml:
- Around line 98-116: Increase the rollout timeout to 300s and replace the final
verification command so it asserts the CNI_MTU env var is set to 1400 (don’t
just print the env array and avoid the broken jsonpath escaping). Update the
kubectl commands that target daemonset/kube-kindnet-ds and CNI_MTU via the
existing make env CMD wrapper: change the rollout status timeout from 120s to
300s and change the verification step to run a command that exits non‑zero on
failure (for example use kubectl get ... -o json piped to grep -q
"CNI_MTU.*1400" or to jq -e to select the
.spec.template.spec.containers[0].env[] | select(.name=="CNI_MTU") | .value ==
"1400"), ensuring correct shell quoting so the make env CMD receives valid
syntax.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/cncf-conformance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
.github/workflows/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
CI/CD workflows should validate builds with
builders.yaml, test quickstart scripts withinstallers.yaml, support manual release workflow viarelease.yaml, and run daily OKD ARM builds viarelease-okd.yaml
Files:
.github/workflows/cncf-conformance.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: quick-rpm-and-clean (ubuntu-24.04)
- GitHub Check: isolated-network (kindnet, ubuntu-24.04, 0)
- GitHub Check: fedora-bootc (ubuntu-24.04-arm)
- GitHub Check: isolated-network (ovnk, ubuntu-24.04, 1)
- GitHub Check: fedora-bootc (ubuntu-24.04)
- GitHub Check: ubuntu-rpm2deb
- GitHub Check: centos10-bootc
- GitHub Check: centos9-bootc
- GitHub Check: quick-start-and-clean (ubuntu-24.04)
- GitHub Check: quick-start-and-clean (ubuntu-24.04-arm)
- GitHub Check: quick-rpm-and-clean (ubuntu-24.04-arm)
c8010c9 to
c656aad
Compare
bd32a1c to
f7dbad0
Compare
| if [ -f "$CNI_CONFIG" ]; then | ||
| # Replace any existing mtu value with 1400, or add it if not present | ||
| if grep -q "\"mtu\"" "$CNI_CONFIG"; then | ||
| sed -i "s/\"mtu\": *[0-9]*/\"mtu\": 1400/g" "$CNI_CONFIG" | ||
| else | ||
| # Add mtu to the ptp plugin config | ||
| sed -i "s/\"type\": *\"ptp\"/\"type\": \"ptp\", \"mtu\": 1400/g" "$CNI_CONFIG" | ||
| fi | ||
| echo " Updated CNI config:" | ||
| grep -o "\"mtu\": *[0-9]*" "$CNI_CONFIG" || echo " (mtu not found in config)" | ||
| else | ||
| echo " Warning: CNI config not found at $CNI_CONFIG" | ||
| fi |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file has JSON format. We must use jq for querying and updating it.
| for iface in $(ip -o link show | grep -E "cni|kindnet|bridge" | awk -F": " "{print \$2}" | cut -d@ -f1); do | ||
| ip link set dev "$iface" mtu 1400 2>/dev/null && echo " Set MTU=1400 on $iface" || true | ||
| done | ||
| ' || true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we allow this to fail?
| # causing TCP packets (larger than UDP) to be fragmented/dropped. | ||
| # Kindnet auto-detects MTU but may get it wrong in nested container environments. | ||
| echo "Patching CNI config with MTU=1400 on all nodes..." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code is complex enough that it "deserves" to be moved into a script
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In @.github/workflows/cncf-conformance.yaml:
- Around line 147-161: The Step 6 TCP DNS check currently only logs per-attempt
success/failure; change it to track failures and exit non-zero if any node fails
all attempts. Inside the podman exec block for nodes microshift-okd-1 and
microshift-okd-2, introduce a dns_fail flag (e.g., dns_fail=1 then set
dns_fail=0 on any successful result) that is checked after the 3 attempts: if
dns_fail is still set, echo a clear error and exit 1 so the workflow fails;
ensure the outer script propagates that non-zero exit (or use set -o
pipefail/set -e) so the CI run fails when TCP DNS cannot be resolved.
🧹 Nitpick comments (2)
.github/workflows/cncf-conformance.yaml (2)
107-121: Consider usingjqfor JSON manipulation.The sed-based approach with double-quote patterns works but is fragile. Using
jqwould be more robust and idiomatic for modifying JSON configuration.♻️ Proposed refactor using jq
echo "=== Step 1: Configure MTU on CNI config ===" for node in microshift-okd-1 microshift-okd-2; do echo " - Configuring MTU on ${node}" sudo podman exec "${node}" bash -c ' CNI_CONFIG="/etc/cni/net.d/10-kindnet.conflist" if [ -f "$CNI_CONFIG" ]; then - if grep -q "\"mtu\"" "$CNI_CONFIG"; then - sed -i "s/\"mtu\": *[0-9]*/\"mtu\": 1400/g" "$CNI_CONFIG" - else - sed -i "s/\"type\": *\"ptp\"/\"type\": \"ptp\", \"mtu\": 1400/g" "$CNI_CONFIG" - fi - grep -o "\"mtu\": *[0-9]*" "$CNI_CONFIG" || echo " (mtu not in config)" + jq ".plugins[0].mtu = 1400" "$CNI_CONFIG" > "${CNI_CONFIG}.tmp" && mv "${CNI_CONFIG}.tmp" "$CNI_CONFIG" + jq -r ".plugins[0].mtu // \"(mtu not set)\"" "$CNI_CONFIG" fi ' done
153-153: Redundant empty-string check.The condition checks non-empty twice:
[ -n "$result" ]and[ "$result" != "" ]are equivalent.♻️ Simplified condition
- if [ -n "$result" ] && [ "$result" != "" ]; then + if [ -n "$result" ]; then
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.github/workflows/cncf-conformance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
.github/workflows/*.yaml
📄 CodeRabbit inference engine (CLAUDE.md)
CI/CD workflows should validate builds with
builders.yaml, test quickstart scripts withinstallers.yaml, support manual release workflow viarelease.yaml, and run daily OKD ARM builds viarelease-okd.yaml
Files:
.github/workflows/cncf-conformance.yaml
| echo "=== Step 6: Verify TCP DNS works ===" | ||
| for node in microshift-okd-1 microshift-okd-2; do | ||
| echo " Testing TCP DNS from ${node}..." | ||
| sudo podman exec "${node}" bash -c ' | ||
| for i in 1 2 3; do | ||
| result=$(dig +tcp +short kubernetes.default.svc.cluster.local @10.43.0.10 2>&1) | ||
| if [ -n "$result" ] && [ "$result" != "" ]; then | ||
| echo " Attempt $i: OK ($result)" | ||
| else | ||
| echo " Attempt $i: FAILED" | ||
| fi | ||
| sleep 1 | ||
| done | ||
| ' | ||
| done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion | 🟠 Major
DNS verification doesn't fail on errors.
Step 6 validates TCP DNS but only logs results without failing the workflow if DNS doesn't work. Since this PR specifically addresses TCP DNS issues, the verification should fail the workflow if DNS resolution fails after all the applied fixes.
🔒 Proposed fix to fail on DNS errors
echo "=== Step 6: Verify TCP DNS works ==="
for node in microshift-okd-1 microshift-okd-2; do
echo " Testing TCP DNS from ${node}..."
sudo podman exec "${node}" bash -c '
+ failed=0
for i in 1 2 3; do
result=$(dig +tcp +short kubernetes.default.svc.cluster.local @10.43.0.10 2>&1)
if [ -n "$result" ]; then
echo " Attempt $i: OK ($result)"
+ exit 0
else
echo " Attempt $i: FAILED"
+ failed=1
fi
sleep 1
done
+ if [ $failed -eq 1 ]; then
+ echo " ERROR: TCP DNS verification failed after 3 attempts"
+ exit 1
+ fi
'
done🤖 Prompt for AI Agents
In @.github/workflows/cncf-conformance.yaml around lines 147 - 161, The Step 6
TCP DNS check currently only logs per-attempt success/failure; change it to
track failures and exit non-zero if any node fails all attempts. Inside the
podman exec block for nodes microshift-okd-1 and microshift-okd-2, introduce a
dns_fail flag (e.g., dns_fail=1 then set dns_fail=0 on any successful result)
that is checked after the 3 attempts: if dns_fail is still set, echo a clear
error and exit 1 so the workflow fails; ensure the outer script propagates that
non-zero exit (or use set -o pipefail/set -e) so the CI run fails when TCP DNS
cannot be resolved.
f7dbad0 to
b4d2084
Compare
b4d2084 to
b82ee36
Compare
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.