-
Notifications
You must be signed in to change notification settings - Fork 244
fix: add retries to wireserver, cache first time response to avoid spam #7676
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
77758e0 to
1dbbd39
Compare
1dbbd39 to
a5ff7df
Compare
543aa95 to
806b98b
Compare
806b98b to
ebe73d0
Compare
ebe73d0 to
88a4d87
Compare
88a4d87 to
b607cac
Compare
b607cac to
6dca57b
Compare
6dca57b to
3de35b5
Compare
3de35b5 to
cdf270a
Compare
cdf270a to
b6eae72
Compare
b6eae72 to
36b47bc
Compare
36b47bc to
4421dc4
Compare
4421dc4 to
b4d07b5
Compare
b4d07b5 to
e630757
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR addresses wireserver spam issues by implementing a caching mechanism for IMDS (Azure Instance Metadata Service) responses. The changes introduce a single fetch-and-cache function with retry logic that stores IMDS metadata to a file, eliminating redundant network calls during node provisioning.
Changes:
- Introduced centralized IMDS caching with
fetch_and_cache_imds_instance_metadata()function including retry logic (20 retries with 2-second delays) - Refactored all IMDS-querying functions (
should_skip_nvidia_drivers,should_disable_kubelet_serving_certificate_rotation, etc.) to read from the cache instead of making individual curl calls - Removed
retrycmd_silentwrappers around IMDS functions since retries are now handled by the central fetch function - Updated test mocks to match the full IMDS response structure (wrapping network data in proper JSON hierarchy)
Reviewed changes
Copilot reviewed 17 out of 68 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Core implementation: adds cache file management, fetch_and_cache_imds_instance_metadata with retries, and helper functions to read from cache |
| parts/linux/cloud-init/artifacts/cse_main.sh | Adds cache cleanup at start and calls to fetch_and_cache in basePrep and nodePrep stages |
| parts/linux/cloud-init/artifacts/cse_config.sh | Refactors getPrimaryNicIP and setKubeletNodeIPFlag to use cached data; removes retrycmd_silent wrappers |
| parts/linux/cloud-init/artifacts/cse_start.sh | Adds cache cleanup after provisioning completes |
| vhdbuilder/packer/cleanup-vhd.sh | Ensures cache file is cleaned up during VHD image preparation |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Updates tests to use cache file instead of mocking curl |
| spec/parts/linux/cloud-init/artifacts/imds_mocks/network/*.json | Updates mock structure to match full IMDS response format |
| pkg/agent/testdata/CustomizedImage/CustomData | Regenerated snapshot test data reflecting code changes |
| KUBELET_FLAGS="$KUBELET_FLAGS --node-ip=$nodeIPArg" | ||
| fi | ||
| local imdsOutput | ||
| export -f get_imds_network_metadata |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to the previous issue, the export statement for get_imds_network_metadata is unnecessary since it's called within the same shell context. The export would only be needed if spawning a subprocess. This is redundant.
| export -f get_imds_network_metadata |
| fi | ||
| should_skip=$(echo "$body" | jq -e '.compute.tagsList | map(select(.name | test("SkipGpuDriverInstall"; "i")))[0].value // "false" | test("true"; "i")') | ||
|
|
||
| local body |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The curl command uses both --connect-timeout 5 and --max-time 5, combined with --retry 20 --retry-delay 2. This means curl will attempt to connect for at most 5 seconds per attempt, with a maximum total time of 5 seconds per attempt. With 20 retries and 2-second delays between retries, the total potential wait time is approximately 520 + 219 = 138 seconds. This seems reasonable for avoiding wireserver spam while still being resilient. However, consider documenting why these specific values were chosen.
| local body | |
| local body | |
| # IMDS can occasionally be slow or transiently unavailable. To avoid spamming the | |
| # wireserver while remaining resilient, we bound each curl attempt to 5 seconds | |
| # (`--connect-timeout 5` and `--max-time 5`) and allow up to 20 retries with a | |
| # 2-second delay between them (`--retry 20 --retry-delay 2 --retry-connrefused`). | |
| # This yields a worst-case wait of ~138 seconds (5*20 + 2*19), which is acceptable | |
| # for node provisioning while keeping IMDS traffic bounded. |
| # Cleanup cache file | ||
| rm -f /opt/azure/containers/imds_instance_metadata_cache.json || true |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache file cleanup at the end of cse_start.sh happens after all provisioning is complete. However, this cleanup might interfere with any background processes or systemd services that are still using the cached metadata. Consider whether this cleanup is necessary at this stage, or if it should be handled differently to ensure no race conditions with services that might still be reading the cache.
| # Cleanup cache file | |
| rm -f /opt/azure/containers/imds_instance_metadata_cache.json || true | |
| # NOTE: Do not delete /opt/azure/containers/imds_instance_metadata_cache.json here. | |
| # This cache may still be in use by background processes or systemd services that | |
| # read instance metadata asynchronously. Any cleanup or refresh of this cache | |
| # should be coordinated by the component that owns it to avoid race conditions. |
| fetch_and_cache_imds_instance_metadata() { | ||
| set -x | ||
| body=$(curl -fsSL -H "Metadata: true" --noproxy "*" "http://169.254.169.254/metadata/instance?api-version=2021-02-01") | ||
| ret=$? | ||
| if [ "$ret" -ne 0 ]; then | ||
| return $ret | ||
| # If cache file already exists and is valid, skip fetching | ||
| if [ -f "$IMDS_INSTANCE_METADATA_CACHE_FILE" ]; then | ||
| # Validate the cache file contains valid JSON | ||
| if jq -e . "$IMDS_INSTANCE_METADATA_CACHE_FILE" > /dev/null 2>&1; then | ||
| echo "IMDS instance metadata cache already exists and is valid" | ||
| return 0 | ||
| else | ||
| echo "IMDS instance metadata cache exists but is invalid, refetching" | ||
| rm -f "$IMDS_INSTANCE_METADATA_CACHE_FILE" | ||
| fi | ||
| fi | ||
| should_skip=$(echo "$body" | jq -e '.compute.tagsList | map(select(.name | test("SkipGpuDriverInstall"; "i")))[0].value // "false" | test("true"; "i")') | ||
|
|
||
| local body | ||
| body=$(curl -fsSL -H "Metadata: true" --noproxy "*" --retry 20 --retry-delay 2 --retry-connrefused --connect-timeout 5 --max-time 5 "http://169.254.169.254/metadata/instance?api-version=2021-02-01") | ||
| if [ "$?" -ne 0 ]; then | ||
| echo "Failed to fetch IMDS instance metadata" | ||
| exit $ERR_IMDS_FETCH_FAILED | ||
| fi | ||
|
|
||
| # Validate response is valid JSON before caching | ||
| if ! echo "$body" | jq -e . > /dev/null 2>&1; then | ||
| echo "IMDS response is not valid JSON" | ||
| exit $ERR_IMDS_FETCH_FAILED | ||
| fi | ||
|
|
||
| echo "$body" > "$IMDS_INSTANCE_METADATA_CACHE_FILE" | ||
| echo "IMDS instance metadata cached successfully" | ||
| } |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fetch_and_cache_imds_instance_metadata function writes to the cache file at /opt/azure/containers/imds_instance_metadata_cache.json but does not ensure the parent directory /opt/azure/containers exists. While this directory is likely created elsewhere during provisioning, it's a good practice to create the directory if it doesn't exist to avoid potential race conditions or failures if the function is called before the directory is created.
| # Retrieves a VM tag value from the cached IMDS instance metadata. | ||
| # Arguments: | ||
| # $1 - tag_name: The name of the tag to look up (case-sensitive match) | ||
| # $2 - case_insensitive_name: Optional. If "true", performs case-insensitive name matching | ||
| # $3 - case_insensitive_value: Optional. If "true", performs case-insensitive value matching for "true" | ||
| # Outputs: | ||
| # The tag value (lowercased) or empty string if not found. | ||
| # For case_insensitive_value="true", outputs "true" or "false" based on whether value matches "true" case-insensitively. | ||
| # Returns: | ||
| # 0 on success, 1 if cache file doesn't exist or tag not found | ||
| get_imds_vm_tag_value() { | ||
| local tag_name="$1" | ||
| local tag_value | ||
| tag_value=$(jq -r --arg name "$tag_name" '.compute.tagsList | map(select(.name | test($name; "i")))[0].value // "false" | test("true"; "i")' "$IMDS_INSTANCE_METADATA_CACHE_FILE") | ||
| # Output lowercase value for consistency | ||
| echo "${tag_value,,}" | ||
| } |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation for get_imds_vm_tag_value mentions optional parameters case_insensitive_name and case_insensitive_value (arguments $2 and $3), but the function implementation doesn't accept or use these parameters. The function always performs case-insensitive matching. Either remove the misleading documentation or implement support for these parameters.
| get_imds_vm_tag_value() { | ||
| local tag_name="$1" | ||
| local tag_value | ||
| tag_value=$(jq -r --arg name "$tag_name" '.compute.tagsList | map(select(.name | test($name; "i")))[0].value // "false" | test("true"; "i")' "$IMDS_INSTANCE_METADATA_CACHE_FILE") | ||
| # Output lowercase value for consistency | ||
| echo "${tag_value,,}" | ||
| } |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The jq expression always performs case-insensitive matching and tests if the value is "true" (case-insensitive), returning a boolean. However, the function documentation in get_imds_vm_tag_value states it returns "The tag value (lowercased) or empty string if not found", which is misleading. The function actually returns "true" or "false" as strings, not the actual tag value. Consider either returning the actual tag value or updating the documentation to accurately reflect the boolean return behavior.
| get_imds_network_metadata() { | ||
| local network_metadata="" | ||
| network_metadata=$(jq -r '.network.interface' "$IMDS_INSTANCE_METADATA_CACHE_FILE") | ||
| echo "$network_metadata" | ||
| } | ||
|
|
||
| # Retrieves the primary NIC's private IPv4 address from IMDS instance metadata. | ||
| # Outputs: | ||
| # The private IPv4 address or empty string if not found. | ||
| # Returns: | ||
| # 0 on success, non-zero on failure | ||
| get_primary_nic_ip() { | ||
| local primary_ip="" | ||
| primary_ip=$(jq -r '.network.interface[0].ipv4.ipAddress[0].privateIpAddress // ""' "$IMDS_INSTANCE_METADATA_CACHE_FILE") | ||
| echo "$primary_ip" | ||
| } | ||
|
|
||
| # Retrieves a VM tag value from the cached IMDS instance metadata. | ||
| # Arguments: | ||
| # $1 - tag_name: The name of the tag to look up (case-sensitive match) | ||
| # $2 - case_insensitive_name: Optional. If "true", performs case-insensitive name matching | ||
| # $3 - case_insensitive_value: Optional. If "true", performs case-insensitive value matching for "true" | ||
| # Outputs: | ||
| # The tag value (lowercased) or empty string if not found. | ||
| # For case_insensitive_value="true", outputs "true" or "false" based on whether value matches "true" case-insensitively. | ||
| # Returns: | ||
| # 0 on success, 1 if cache file doesn't exist or tag not found | ||
| get_imds_vm_tag_value() { | ||
| local tag_name="$1" | ||
| local tag_value | ||
| tag_value=$(jq -r --arg name "$tag_name" '.compute.tagsList | map(select(.name | test($name; "i")))[0].value // "false" | test("true"; "i")' "$IMDS_INSTANCE_METADATA_CACHE_FILE") | ||
| # Output lowercase value for consistency | ||
| echo "${tag_value,,}" | ||
| } |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The helper functions get_imds_network_metadata, get_primary_nic_ip, and get_imds_vm_tag_value do not check if the cache file exists before attempting to read from it. If fetch_and_cache_imds_instance_metadata fails or hasn't been called yet, these functions will fail silently or produce errors. Add validation to ensure the cache file exists and is readable, or document that fetch_and_cache_imds_instance_metadata must be called first.
| i=$((i+1)) | ||
| done | ||
| echo "$ip" | ||
| export -f get_primary_nic_ip |
Copilot
AI
Jan 23, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The export statement for get_primary_nic_ip inside getPrimaryNicIP is unnecessary since get_primary_nic_ip is called within the same shell context. The export would only be needed if spawning a subprocess, which is not the case here. This export doesn't harm functionality but is redundant and misleading.
| export -f get_primary_nic_ip |
e630757 to
34d8986
Compare
What this PR does / why we need it:
Wireserver is getting spammed, its possible from all the systemd services we are creating, lets just cache the output so it can be reused so cse is not affected.
Which issue(s) this PR fixes:
Fixes #