Skip to content

Conversation

@awesomenix
Copy link
Contributor

What this PR does / why we need it:

Wireserver is getting spammed, its possible from all the systemd services we are creating, lets just cache the output so it can be reused so cse is not affected.

Which issue(s) this PR fixes:
Fixes #

@awesomenix awesomenix force-pushed the nishp/robust/wireserver branch from 88a4d87 to b607cac Compare January 21, 2026 22:33
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses wireserver spam issues by implementing a caching mechanism for IMDS (Azure Instance Metadata Service) responses. The changes introduce a single fetch-and-cache function with retry logic that stores IMDS metadata to a file, eliminating redundant network calls during node provisioning.

Changes:

  • Introduced centralized IMDS caching with fetch_and_cache_imds_instance_metadata() function including retry logic (20 retries with 2-second delays)
  • Refactored all IMDS-querying functions (should_skip_nvidia_drivers, should_disable_kubelet_serving_certificate_rotation, etc.) to read from the cache instead of making individual curl calls
  • Removed retrycmd_silent wrappers around IMDS functions since retries are now handled by the central fetch function
  • Updated test mocks to match the full IMDS response structure (wrapping network data in proper JSON hierarchy)

Reviewed changes

Copilot reviewed 17 out of 68 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
parts/linux/cloud-init/artifacts/cse_helpers.sh Core implementation: adds cache file management, fetch_and_cache_imds_instance_metadata with retries, and helper functions to read from cache
parts/linux/cloud-init/artifacts/cse_main.sh Adds cache cleanup at start and calls to fetch_and_cache in basePrep and nodePrep stages
parts/linux/cloud-init/artifacts/cse_config.sh Refactors getPrimaryNicIP and setKubeletNodeIPFlag to use cached data; removes retrycmd_silent wrappers
parts/linux/cloud-init/artifacts/cse_start.sh Adds cache cleanup after provisioning completes
vhdbuilder/packer/cleanup-vhd.sh Ensures cache file is cleaned up during VHD image preparation
spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh Updates tests to use cache file instead of mocking curl
spec/parts/linux/cloud-init/artifacts/imds_mocks/network/*.json Updates mock structure to match full IMDS response format
pkg/agent/testdata/CustomizedImage/CustomData Regenerated snapshot test data reflecting code changes

KUBELET_FLAGS="$KUBELET_FLAGS --node-ip=$nodeIPArg"
fi
local imdsOutput
export -f get_imds_network_metadata
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the previous issue, the export statement for get_imds_network_metadata is unnecessary since it's called within the same shell context. The export would only be needed if spawning a subprocess. This is redundant.

Suggested change
export -f get_imds_network_metadata

Copilot uses AI. Check for mistakes.
fi
should_skip=$(echo "$body" | jq -e '.compute.tagsList | map(select(.name | test("SkipGpuDriverInstall"; "i")))[0].value // "false" | test("true"; "i")')

local body
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The curl command uses both --connect-timeout 5 and --max-time 5, combined with --retry 20 --retry-delay 2. This means curl will attempt to connect for at most 5 seconds per attempt, with a maximum total time of 5 seconds per attempt. With 20 retries and 2-second delays between retries, the total potential wait time is approximately 520 + 219 = 138 seconds. This seems reasonable for avoiding wireserver spam while still being resilient. However, consider documenting why these specific values were chosen.

Suggested change
local body
local body
# IMDS can occasionally be slow or transiently unavailable. To avoid spamming the
# wireserver while remaining resilient, we bound each curl attempt to 5 seconds
# (`--connect-timeout 5` and `--max-time 5`) and allow up to 20 retries with a
# 2-second delay between them (`--retry 20 --retry-delay 2 --retry-connrefused`).
# This yields a worst-case wait of ~138 seconds (5*20 + 2*19), which is acceptable
# for node provisioning while keeping IMDS traffic bounded.

Copilot uses AI. Check for mistakes.
Comment on lines +52 to +53
# Cleanup cache file
rm -f /opt/azure/containers/imds_instance_metadata_cache.json || true
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache file cleanup at the end of cse_start.sh happens after all provisioning is complete. However, this cleanup might interfere with any background processes or systemd services that are still using the cached metadata. Consider whether this cleanup is necessary at this stage, or if it should be handled differently to ensure no race conditions with services that might still be reading the cache.

Suggested change
# Cleanup cache file
rm -f /opt/azure/containers/imds_instance_metadata_cache.json || true
# NOTE: Do not delete /opt/azure/containers/imds_instance_metadata_cache.json here.
# This cache may still be in use by background processes or systemd services that
# read instance metadata asynchronously. Any cleanup or refresh of this cache
# should be coordinated by the component that owns it to avoid race conditions.

Copilot uses AI. Check for mistakes.
Comment on lines 664 to 693
fetch_and_cache_imds_instance_metadata() {
set -x
body=$(curl -fsSL -H "Metadata: true" --noproxy "*" "http://169.254.169.254/metadata/instance?api-version=2021-02-01")
ret=$?
if [ "$ret" -ne 0 ]; then
return $ret
# If cache file already exists and is valid, skip fetching
if [ -f "$IMDS_INSTANCE_METADATA_CACHE_FILE" ]; then
# Validate the cache file contains valid JSON
if jq -e . "$IMDS_INSTANCE_METADATA_CACHE_FILE" > /dev/null 2>&1; then
echo "IMDS instance metadata cache already exists and is valid"
return 0
else
echo "IMDS instance metadata cache exists but is invalid, refetching"
rm -f "$IMDS_INSTANCE_METADATA_CACHE_FILE"
fi
fi
should_skip=$(echo "$body" | jq -e '.compute.tagsList | map(select(.name | test("SkipGpuDriverInstall"; "i")))[0].value // "false" | test("true"; "i")')

local body
body=$(curl -fsSL -H "Metadata: true" --noproxy "*" --retry 20 --retry-delay 2 --retry-connrefused --connect-timeout 5 --max-time 5 "http://169.254.169.254/metadata/instance?api-version=2021-02-01")
if [ "$?" -ne 0 ]; then
echo "Failed to fetch IMDS instance metadata"
exit $ERR_IMDS_FETCH_FAILED
fi

# Validate response is valid JSON before caching
if ! echo "$body" | jq -e . > /dev/null 2>&1; then
echo "IMDS response is not valid JSON"
exit $ERR_IMDS_FETCH_FAILED
fi

echo "$body" > "$IMDS_INSTANCE_METADATA_CACHE_FILE"
echo "IMDS instance metadata cached successfully"
}
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fetch_and_cache_imds_instance_metadata function writes to the cache file at /opt/azure/containers/imds_instance_metadata_cache.json but does not ensure the parent directory /opt/azure/containers exists. While this directory is likely created elsewhere during provisioning, it's a good practice to create the directory if it doesn't exist to avoid potential race conditions or failures if the function is called before the directory is created.

Copilot uses AI. Check for mistakes.
Comment on lines +719 to +735
# Retrieves a VM tag value from the cached IMDS instance metadata.
# Arguments:
# $1 - tag_name: The name of the tag to look up (case-sensitive match)
# $2 - case_insensitive_name: Optional. If "true", performs case-insensitive name matching
# $3 - case_insensitive_value: Optional. If "true", performs case-insensitive value matching for "true"
# Outputs:
# The tag value (lowercased) or empty string if not found.
# For case_insensitive_value="true", outputs "true" or "false" based on whether value matches "true" case-insensitively.
# Returns:
# 0 on success, 1 if cache file doesn't exist or tag not found
get_imds_vm_tag_value() {
local tag_name="$1"
local tag_value
tag_value=$(jq -r --arg name "$tag_name" '.compute.tagsList | map(select(.name | test($name; "i")))[0].value // "false" | test("true"; "i")' "$IMDS_INSTANCE_METADATA_CACHE_FILE")
# Output lowercase value for consistency
echo "${tag_value,,}"
}
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation for get_imds_vm_tag_value mentions optional parameters case_insensitive_name and case_insensitive_value (arguments $2 and $3), but the function implementation doesn't accept or use these parameters. The function always performs case-insensitive matching. Either remove the misleading documentation or implement support for these parameters.

Copilot uses AI. Check for mistakes.
Comment on lines +729 to +735
get_imds_vm_tag_value() {
local tag_name="$1"
local tag_value
tag_value=$(jq -r --arg name "$tag_name" '.compute.tagsList | map(select(.name | test($name; "i")))[0].value // "false" | test("true"; "i")' "$IMDS_INSTANCE_METADATA_CACHE_FILE")
# Output lowercase value for consistency
echo "${tag_value,,}"
}
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The jq expression always performs case-insensitive matching and tests if the value is "true" (case-insensitive), returning a boolean. However, the function documentation in get_imds_vm_tag_value states it returns "The tag value (lowercased) or empty string if not found", which is misleading. The function actually returns "true" or "false" as strings, not the actual tag value. Consider either returning the actual tag value or updating the documentation to accurately reflect the boolean return behavior.

Copilot uses AI. Check for mistakes.
Comment on lines +702 to +735
get_imds_network_metadata() {
local network_metadata=""
network_metadata=$(jq -r '.network.interface' "$IMDS_INSTANCE_METADATA_CACHE_FILE")
echo "$network_metadata"
}

# Retrieves the primary NIC's private IPv4 address from IMDS instance metadata.
# Outputs:
# The private IPv4 address or empty string if not found.
# Returns:
# 0 on success, non-zero on failure
get_primary_nic_ip() {
local primary_ip=""
primary_ip=$(jq -r '.network.interface[0].ipv4.ipAddress[0].privateIpAddress // ""' "$IMDS_INSTANCE_METADATA_CACHE_FILE")
echo "$primary_ip"
}

# Retrieves a VM tag value from the cached IMDS instance metadata.
# Arguments:
# $1 - tag_name: The name of the tag to look up (case-sensitive match)
# $2 - case_insensitive_name: Optional. If "true", performs case-insensitive name matching
# $3 - case_insensitive_value: Optional. If "true", performs case-insensitive value matching for "true"
# Outputs:
# The tag value (lowercased) or empty string if not found.
# For case_insensitive_value="true", outputs "true" or "false" based on whether value matches "true" case-insensitively.
# Returns:
# 0 on success, 1 if cache file doesn't exist or tag not found
get_imds_vm_tag_value() {
local tag_name="$1"
local tag_value
tag_value=$(jq -r --arg name "$tag_name" '.compute.tagsList | map(select(.name | test($name; "i")))[0].value // "false" | test("true"; "i")' "$IMDS_INSTANCE_METADATA_CACHE_FILE")
# Output lowercase value for consistency
echo "${tag_value,,}"
}
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The helper functions get_imds_network_metadata, get_primary_nic_ip, and get_imds_vm_tag_value do not check if the cache file exists before attempting to read from it. If fetch_and_cache_imds_instance_metadata fails or hasn't been called yet, these functions will fail silently or produce errors. Add validation to ensure the cache file exists and is readable, or document that fetch_and_cache_imds_instance_metadata must be called first.

Copilot uses AI. Check for mistakes.
i=$((i+1))
done
echo "$ip"
export -f get_primary_nic_ip
Copy link

Copilot AI Jan 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The export statement for get_primary_nic_ip inside getPrimaryNicIP is unnecessary since get_primary_nic_ip is called within the same shell context. The export would only be needed if spawning a subprocess, which is not the case here. This export doesn't harm functionality but is redundant and misleading.

Suggested change
export -f get_primary_nic_ip

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants