Skip to content

Conversation

@kasturinarra
Copy link

@kasturinarra kasturinarra commented Jan 9, 2026

Summary by CodeRabbit

  • Chores
    • Configure CI network MTU to 1400 for conformance runs to improve network stability.
    • Restart cluster networking services and DNS during CI, wait for rollouts and a stabilization pause.
    • Verify TCP-based DNS resolution from cluster nodes to reduce flaky network-related test failures.

✏️ Tip: You can customize this high-level summary in your review settings.

@kasturinarra kasturinarra requested a review from a team as a code owner January 9, 2026 11:42
@coderabbitai
Copy link

coderabbitai bot commented Jan 9, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Inserts a CI step in the CNCF conformance workflow that configures node networking to MTU=1400, updates interface MTUs, restarts kube-proxy and CoreDNS with rollouts and waits, pauses for stabilization, and verifies TCP-based DNS resolution from both nodes.

Changes

Cohort / File(s)
CI Networking Configure
/.github/workflows/cncf-conformance.yaml
Adds a shell step (uses set -euo pipefail) inserted before the "Configure hostname resolution for cluster nodes" step that: updates 10-kindnet.conflist CNI MTU to 1400 on both nodes, adjusts non-loopback interface MTUs >1400 to 1400, restarts and waits for rollout of kube-proxy and coredns, pauses for stabilization, and validates TCP DNS resolution via dig +tcp from both nodes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • ggiguash
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main change: adding CNI MTU configuration to fix TCP DNS issues in the CI workflow.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/cncf-conformance.yaml:
- Around line 98-116: Increase the rollout timeout to 300s and replace the final
verification command so it asserts the CNI_MTU env var is set to 1400 (don’t
just print the env array and avoid the broken jsonpath escaping). Update the
kubectl commands that target daemonset/kube-kindnet-ds and CNI_MTU via the
existing make env CMD wrapper: change the rollout status timeout from 120s to
300s and change the verification step to run a command that exits non‑zero on
failure (for example use kubectl get ... -o json piped to grep -q
"CNI_MTU.*1400" or to jq -e to select the
.spec.template.spec.containers[0].env[] | select(.name=="CNI_MTU") | .value ==
"1400"), ensuring correct shell quoting so the make env CMD receives valid
syntax.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1045975 and c8010c9.

📒 Files selected for processing (1)
  • .github/workflows/cncf-conformance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
.github/workflows/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

CI/CD workflows should validate builds with builders.yaml, test quickstart scripts with installers.yaml, support manual release workflow via release.yaml, and run daily OKD ARM builds via release-okd.yaml

Files:

  • .github/workflows/cncf-conformance.yaml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: quick-rpm-and-clean (ubuntu-24.04)
  • GitHub Check: isolated-network (kindnet, ubuntu-24.04, 0)
  • GitHub Check: fedora-bootc (ubuntu-24.04-arm)
  • GitHub Check: isolated-network (ovnk, ubuntu-24.04, 1)
  • GitHub Check: fedora-bootc (ubuntu-24.04)
  • GitHub Check: ubuntu-rpm2deb
  • GitHub Check: centos10-bootc
  • GitHub Check: centos9-bootc
  • GitHub Check: quick-start-and-clean (ubuntu-24.04)
  • GitHub Check: quick-start-and-clean (ubuntu-24.04-arm)
  • GitHub Check: quick-rpm-and-clean (ubuntu-24.04-arm)

@kasturinarra kasturinarra marked this pull request as draft January 9, 2026 14:14
@kasturinarra kasturinarra force-pushed the fix_issue_186 branch 2 times, most recently from bd32a1c to f7dbad0 Compare January 9, 2026 18:54
Comment on lines 114 to 125
if [ -f "$CNI_CONFIG" ]; then
# Replace any existing mtu value with 1400, or add it if not present
if grep -q "\"mtu\"" "$CNI_CONFIG"; then
sed -i "s/\"mtu\": *[0-9]*/\"mtu\": 1400/g" "$CNI_CONFIG"
else
# Add mtu to the ptp plugin config
sed -i "s/\"type\": *\"ptp\"/\"type\": \"ptp\", \"mtu\": 1400/g" "$CNI_CONFIG"
fi
echo " Updated CNI config:"
grep -o "\"mtu\": *[0-9]*" "$CNI_CONFIG" || echo " (mtu not found in config)"
else
echo " Warning: CNI config not found at $CNI_CONFIG"
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file has JSON format. We must use jq for querying and updating it.

for iface in $(ip -o link show | grep -E "cni|kindnet|bridge" | awk -F": " "{print \$2}" | cut -d@ -f1); do
ip link set dev "$iface" mtu 1400 2>/dev/null && echo " Set MTU=1400 on $iface" || true
done
' || true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we allow this to fail?

# causing TCP packets (larger than UDP) to be fragmented/dropped.
# Kindnet auto-detects MTU but may get it wrong in nested container environments.
echo "Patching CNI config with MTU=1400 on all nodes..."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code is complex enough that it "deserves" to be moved into a script

@ggiguash
Copy link
Contributor

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Jan 10, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @.github/workflows/cncf-conformance.yaml:
- Around line 147-161: The Step 6 TCP DNS check currently only logs per-attempt
success/failure; change it to track failures and exit non-zero if any node fails
all attempts. Inside the podman exec block for nodes microshift-okd-1 and
microshift-okd-2, introduce a dns_fail flag (e.g., dns_fail=1 then set
dns_fail=0 on any successful result) that is checked after the 3 attempts: if
dns_fail is still set, echo a clear error and exit 1 so the workflow fails;
ensure the outer script propagates that non-zero exit (or use set -o
pipefail/set -e) so the CI run fails when TCP DNS cannot be resolved.
🧹 Nitpick comments (2)
.github/workflows/cncf-conformance.yaml (2)

107-121: Consider using jq for JSON manipulation.

The sed-based approach with double-quote patterns works but is fragile. Using jq would be more robust and idiomatic for modifying JSON configuration.

♻️ Proposed refactor using jq
           echo "=== Step 1: Configure MTU on CNI config ==="
           for node in microshift-okd-1 microshift-okd-2; do
             echo "  - Configuring MTU on ${node}"
             sudo podman exec "${node}" bash -c '
               CNI_CONFIG="/etc/cni/net.d/10-kindnet.conflist"
               if [ -f "$CNI_CONFIG" ]; then
-                if grep -q "\"mtu\"" "$CNI_CONFIG"; then
-                  sed -i "s/\"mtu\": *[0-9]*/\"mtu\": 1400/g" "$CNI_CONFIG"
-                else
-                  sed -i "s/\"type\": *\"ptp\"/\"type\": \"ptp\", \"mtu\": 1400/g" "$CNI_CONFIG"
-                fi
-                grep -o "\"mtu\": *[0-9]*" "$CNI_CONFIG" || echo "    (mtu not in config)"
+                jq ".plugins[0].mtu = 1400" "$CNI_CONFIG" > "${CNI_CONFIG}.tmp" && mv "${CNI_CONFIG}.tmp" "$CNI_CONFIG"
+                jq -r ".plugins[0].mtu // \"(mtu not set)\"" "$CNI_CONFIG"
               fi
             '
           done

153-153: Redundant empty-string check.

The condition checks non-empty twice: [ -n "$result" ] and [ "$result" != "" ] are equivalent.

♻️ Simplified condition
-                if [ -n "$result" ] && [ "$result" != "" ]; then
+                if [ -n "$result" ]; then
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c656aad and f7dbad0.

📒 Files selected for processing (1)
  • .github/workflows/cncf-conformance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
.github/workflows/*.yaml

📄 CodeRabbit inference engine (CLAUDE.md)

CI/CD workflows should validate builds with builders.yaml, test quickstart scripts with installers.yaml, support manual release workflow via release.yaml, and run daily OKD ARM builds via release-okd.yaml

Files:

  • .github/workflows/cncf-conformance.yaml

Comment on lines +147 to +179
echo "=== Step 6: Verify TCP DNS works ==="
for node in microshift-okd-1 microshift-okd-2; do
echo " Testing TCP DNS from ${node}..."
sudo podman exec "${node}" bash -c '
for i in 1 2 3; do
result=$(dig +tcp +short kubernetes.default.svc.cluster.local @10.43.0.10 2>&1)
if [ -n "$result" ] && [ "$result" != "" ]; then
echo " Attempt $i: OK ($result)"
else
echo " Attempt $i: FAILED"
fi
sleep 1
done
'
done
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

DNS verification doesn't fail on errors.

Step 6 validates TCP DNS but only logs results without failing the workflow if DNS doesn't work. Since this PR specifically addresses TCP DNS issues, the verification should fail the workflow if DNS resolution fails after all the applied fixes.

🔒 Proposed fix to fail on DNS errors
           echo "=== Step 6: Verify TCP DNS works ==="
           for node in microshift-okd-1 microshift-okd-2; do
             echo "  Testing TCP DNS from ${node}..."
             sudo podman exec "${node}" bash -c '
+              failed=0
               for i in 1 2 3; do
                 result=$(dig +tcp +short kubernetes.default.svc.cluster.local @10.43.0.10 2>&1)
                 if [ -n "$result" ]; then
                   echo "    Attempt $i: OK ($result)"
+                  exit 0
                 else
                   echo "    Attempt $i: FAILED"
+                  failed=1
                 fi
                 sleep 1
               done
+              if [ $failed -eq 1 ]; then
+                echo "    ERROR: TCP DNS verification failed after 3 attempts"
+                exit 1
+              fi
             '
           done
🤖 Prompt for AI Agents
In @.github/workflows/cncf-conformance.yaml around lines 147 - 161, The Step 6
TCP DNS check currently only logs per-attempt success/failure; change it to
track failures and exit non-zero if any node fails all attempts. Inside the
podman exec block for nodes microshift-okd-1 and microshift-okd-2, introduce a
dns_fail flag (e.g., dns_fail=1 then set dns_fail=0 on any successful result)
that is checked after the 3 attempts: if dns_fail is still set, echo a clear
error and exit 1 so the workflow fails; ensure the outer script propagates that
non-zero exit (or use set -o pipefail/set -e) so the CI run fails when TCP DNS
cannot be resolved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants