chore: optimize localdns provisioning polling and iptables setup#8401
chore: optimize localdns provisioning polling and iptables setup#8401jingwenw15 wants to merge 9 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR targets node provisioning latency in the Linux localdns bootstrap path by tightening readiness polling and batching iptables rule installation to reduce xtables lock contention.
Changes:
- Reduced polling sleep interval in
start_localdns()andwait_for_localdns_ready()from 1s to 0.1s, adjusting attempt counts accordingly. - Switched pod-traffic conntrack-skip rule installation to a single
iptables-restorecall instead of multipleiptablesinvocations. - Removed the previously used
IPTABLEScommand wrapper variable.
| saved_rules=$(sudo iptables -w -t raw -S | grep "localdns: skip conntrack") | ||
| sudo iptables -w -t raw -S | grep "localdns: skip conntrack" | while IFS= read -r rule; do | ||
| # Convert -A to -D to delete the rule | ||
| sudo iptables -w -t raw $(echo "$rule" | sed 's/^-A/-D/') 2>/dev/null || true | ||
| done | ||
|
|
||
| # Flush any leftover conntrack entries before the negative test | ||
| for ip in 169.254.10.10 169.254.10.11; do | ||
| sudo conntrack -D -d "$ip" -p udp --dport 53 2>/dev/null || true | ||
| done | ||
|
|
||
| # Do a DNS lookup without NOTRACK — this SHOULD create conntrack entries | ||
| dig bing.com @169.254.10.10 +short +timeout=2 +tries=1 > /dev/null 2>&1 || true | ||
|
|
||
| ct_dns_neg=$(sudo conntrack -L -d 169.254.10.10 -p udp --dport 53 2>/dev/null | wc -l) | ||
| echo "Conntrack entries for 169.254.10.10:53 without NOTRACK: $ct_dns_neg" | ||
|
|
||
| # Restore NOTRACK rules | ||
| echo "$saved_rules" | while IFS= read -r rule; do | ||
| sudo iptables -w -t raw $rule 2>/dev/null || true | ||
| done |
There was a problem hiding this comment.
The negative-test restore path builds iptables commands by expanding the iptables -S output into $rule unquoted (sudo iptables ... $rule). Since iptables -S prints the comment as --comment "localdns: skip conntrack", unquoted expansion will split that into multiple argv tokens (quotes become literal characters), so both deletion and restoration are likely to fail and make this validator flaky (and may leave rules removed). Consider deleting by rule number (using iptables -L --line-numbers like localdns.sh cleanup does) and restoring rules via a known-safe mechanism (e.g., iptables-restore payload or re-adding rules with properly quoted args), and don’t ignore restore errors.
There was a problem hiding this comment.
No longer applicable — the negative test was removed entirely in ceade3b. The validator now only checks rule presence.
| restore_input="${restore_input} | ||
| -A ${chain} -m comment --comment \"localdns: skip conntrack\" ${rule_rest}" | ||
| done | ||
| restore_input="${restore_input} | ||
| COMMIT" |
There was a problem hiding this comment.
restore_input is constructed with a leading space before -A and COMMIT (note the newline literal followed by a space on the next line). That whitespace becomes part of the iptables-restore payload, which is non-standard and makes the script depend on iptables-restore being permissive about leading spaces. Consider building the payload without leading whitespace (e.g., via printf/$'\n' concatenation) so the generated restore script matches iptables-save format exactly.
| restore_input="${restore_input} | |
| -A ${chain} -m comment --comment \"localdns: skip conntrack\" ${rule_rest}" | |
| done | |
| restore_input="${restore_input} | |
| COMMIT" | |
| printf -v restore_input '%s\n-A %s -m comment --comment "localdns: skip conntrack" %s' "${restore_input}" "${chain}" "${rule_rest}" | |
| done | |
| printf -v restore_input '%s\nCOMMIT' "${restore_input}" |
There was a problem hiding this comment.
Verified — no leading whitespace in the generated payload. Bash multiline string assignment (restore_input="${restore_input} -A ...") does not inherit shell indentation. The -A and COMMIT lines start at column 1 as required by iptables-restore. Confirmed by inspecting actual output on prod VMs via e2e tests.
| # Verify NOTRACK rules exist in both chains for both protocols with the comment tag | ||
| for chain in OUTPUT PREROUTING; do | ||
| for proto in tcp udp; do | ||
| if ! echo "$rules" | grep -q -- "-A ${chain}.*-p ${proto}.*localdns: skip conntrack.*--dport 53.*NOTRACK"; then | ||
| echo "FAIL: missing NOTRACK rule for $proto in $chain chain" | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
The new validation can produce false positives because it doesn’t assert the destination IPs for the NOTRACK rules (it only checks chain/proto/comment/port). Since localdns programs rules for both 169.254.10.10 and 169.254.10.11, consider aligning this check with the existing raw-table pattern used by getIPTablesRulesCompatibleWithEBPFHostRouting (includes -d 169.254.10.(10|11)/32 and the expected comment/module ordering) so the test fails if rules are applied to the wrong IPs or in an unexpected shape.
| # Verify NOTRACK rules exist in both chains for both protocols with the comment tag | |
| for chain in OUTPUT PREROUTING; do | |
| for proto in tcp udp; do | |
| if ! echo "$rules" | grep -q -- "-A ${chain}.*-p ${proto}.*localdns: skip conntrack.*--dport 53.*NOTRACK"; then | |
| echo "FAIL: missing NOTRACK rule for $proto in $chain chain" | |
| exit 1 | |
| fi | |
| # Verify NOTRACK rules exist in both chains, for both protocols, and for both localdns listener IPs. | |
| # Match a stricter rule shape so the test fails if rules are attached to the wrong destination IP | |
| # or emitted in an unexpected form. | |
| for chain in OUTPUT PREROUTING; do | |
| for proto in tcp udp; do | |
| for ip in 169.254.10.10 169.254.10.11; do | |
| pattern="^-A ${chain} .* -d ${ip}/32 .* -p ${proto} .* -m comment --comment \\\"localdns: skip conntrack\\\" .* -m ${proto} --dport 53 .* (NOTRACK|CT --notrack)$" | |
| if ! echo "$rules" | grep -Eq -- "$pattern"; then | |
| echo "FAIL: missing localdns NOTRACK rule for chain=${chain} proto=${proto} ip=${ip}" | |
| exit 1 | |
| fi | |
| done |
There was a problem hiding this comment.
The full rule shape including destination IPs and match extension ordering is already validated by ValidateIPTablesCompatibleWithCiliumEBPF in the same test suite (regex: ^-A (PREROUTING|OUTPUT) -d 169\.254\.10\.(10|11)\/32 -p (tcp|udp) -m comment --comment "localdns: skip conntrack" -m (tcp|udp) --dport 53 -j NOTRACK$). Adding the same checks here would be redundant.
Reduce polling intervals from 1s to 0.1s in start_localdns() and wait_for_localdns_ready() since CoreDNS typically starts in <100ms. Batch iptables rules using iptables-restore instead of individual calls to avoid repeated xtables lock acquisition. Estimated total savings: ~1.3–2.7s during node provisioning. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update add_iptable_rules_to_skip_conntrack_from_pods tests to mock iptables-restore via PATH instead of the old IPTABLES variable, and assert on the new iptables-restore input format (*raw, -A rules, COMMIT). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
iptables requires -j (jump target) to be last in the rule. The comment match module must come before it, otherwise iptables-restore rejects the rule as invalid syntax. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Place -m comment immediately after chain name in iptables-restore input so that iptables -S displays comment before protocol match, matching the Cilium eBPF host routing regex. The previous ordering placed comment after --dport which caused nft backend to display it after the protocol match extension. Add ValidateLocalDNSIptablesRules e2e validator that checks: - localdns.sh uses iptables-restore (batched rules) - NOTRACK rules exist in both OUTPUT and PREROUTING chains - Comment tag is present for cleanup logic - NOTRACK is functional (no conntrack entries for localdns DNS traffic) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop NOTRACK rules temporarily, do a DNS lookup, and verify conntrack entries appear. This proves the conntrack check is actually capable of detecting entries and isn't silently passing. Rules are restored after. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use unfiltered conntrack lookup (no --dport) for the negative test since conntrack entry direction fields don't always match the query direction. TCP queries with retries and sleep for reliable detection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Strip down to core validation: verify NOTRACK rules exist in both OUTPUT and PREROUTING chains for tcp and udp with the localdns comment tag. Remove iptables-restore file check and conntrack enforcement tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SSH/bastion transport can inject \r characters into command output, causing grep to fail matching even when rules are visibly present. Remove 2>&1 (stderr warnings can corrupt rule output) and pipe through tr -d '\r' to normalize line endings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
23dfc20 to
64622f9
Compare
| setup() { | ||
| Include "./parts/linux/cloud-init/artifacts/localdns.sh" | ||
| LOCALDNS_NODE_LISTENER_IP="10.0.0.1" | ||
| LOCALDNS_CLUSTER_LISTENER_IP="10.0.0.2" | ||
| IPTABLES_RULES=("raw -t raw -p udp --dport 53 -j NOTRACK" "raw -t raw -p tcp --dport 53 -j NOTRACK") |
There was a problem hiding this comment.
This spec uses IPTABLES_RULES entries that start with raw -t raw ..., but add_iptable_rules_to_skip_conntrack_from_pods now parses each entry by treating the first token as the chain name. With the current test data the generated iptables-restore payload contains -A raw ... -t raw ... (table name used as a chain), so the test isn’t exercising the parsing logic against the production rule shape (OUTPUT/PREROUTING + -d <ip>). Consider generating IPTABLES_RULES via build_localdns_iptable_rules (or at least using OUTPUT/PREROUTING rules) in this Describe block so the test can catch regressions in the new batching/parsing logic.
| When call add_iptable_rules_to_skip_conntrack_from_pods | ||
| The output should include "Adding iptables rules to skip conntrack for queries to localdns." | ||
| The output should include "iptables -A raw -t raw -p udp --dport 53 -j NOTRACK" | ||
| The output should include "iptables -A raw -t raw -p tcp --dport 53 -j NOTRACK" | ||
| The output should include "*raw" | ||
| The output should include "-A raw -m comment --comment" | ||
| The output should include "-p udp" | ||
| The output should include "-p tcp" | ||
| The output should include "-j NOTRACK" | ||
| The output should include "COMMIT" |
There was a problem hiding this comment.
The iptables-restore mock prints the invoked args, but the test doesn’t assert that add_iptable_rules_to_skip_conntrack_from_pods uses --noflush (and -w). Since omitting --noflush would flush the raw table and be a serious regression, it’s worth adding an assertion that the output includes the expected iptables-restore called with args: ... --noflush string.
Avoid storing iptables output in a shell variable and grepping from echo, which caused intermittent failures. Instead pipe directly from iptables -S per chain, matching the pattern used by other e2e validators. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
start_localdns()andwait_for_localdns_ready()since CoreDNS typically starts in <100msiptables-restoreinstead of individual calls to avoid repeated xtables lock acquisitionIPTABLESvariableEstimated total savings: ~1.3–2.7s during node provisioning.
Telescope perf runs:
