Skip to content

Conversation

@08volt
Copy link
Member

@08volt 08volt commented Nov 21, 2025

Objective

Improves the observability of L4 Load Balancer provisioning by propagating GCP resources in Service.status.conditions.

Context & Motivation

Currently, the Service status for Load Balancers primarily reflects the ingress IP but lacks visibility into the state of dependent cloud resources.

This change introduces a mechanism to report the successful allocation of specific GCP resources, enabling better introspection.

Design & Implementation

Feature Gating:

Introduced a new controller-manager flag --enable-l4lb-svc-conditions (default: false) to gate this behavior, ensuring backward compatibility and safe rollout.

Added enableLBServiceConditions to the gce.Cloud struct configuration.

API Changes:

Refactored ensureInternalLoadBalancer to return *v1.ServiceStatus instead of *v1.LoadBalancerStatus. This allows the function to propagate both the Ingress IP and the new Conditions.

Defined standard condition types in gce_conditions.go:

  • ServiceLoadBalancerBackendService
  • ServiceLoadBalancerForwardingRule
  • ServiceLoadBalancerHealthCheck
  • ServiceLoadBalancerFirewallRule
  • ServiceLoadBalancerTargetPool

State Management (Server-Side Apply):

Implemented patchServiceStatusConditions using Server-Side Apply (SSA) to safely update the status.conditions field without overwriting other status fields managed by different controllers.

Added ConditionsEqual helper to prevent unnecessary API calls if the state has not converged or changed.

The feature is disabled by default via the feature flag.

Verification

Unit Tests: Updated gce_loadbalancer_internal_test.go and gce_loadbalancer_external_test.go to verify that the correct conditions are generated for various provisioning scenarios (success, partial failure).

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Nov 21, 2025
@k8s-ci-robot k8s-ci-robot requested a review from bowei November 21, 2025 11:11
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If the repository mantainers determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot requested a review from jpbetz November 21, 2025 11:11
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 21, 2025
@08volt
Copy link
Member Author

08volt commented Nov 21, 2025

/retest

1 similar comment
@08volt
Copy link
Member Author

08volt commented Nov 21, 2025

/retest

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Nov 21, 2025
@08volt
Copy link
Member Author

08volt commented Nov 23, 2025

/retest

@08volt
Copy link
Member Author

08volt commented Nov 24, 2025

/assign @mmamczur

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: 08volt
Once this PR has been reviewed and has the lgtm label, please ask for approval from mmamczur. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 24, 2025
}

func (g *Cloud) ensureTargetPoolAndHealthCheck(tpExists, tpNeedsRecreation bool, svc *v1.Service, loadBalancerName, clusterID, ipAddressToUse string, hosts []*gceInstance, hcToCreate, hcToDelete *compute.HttpHealthCheck) error {
func (g *Cloud) ensureTargetPoolAndHealthCheck(tpExists, tpNeedsRecreation bool, svc *v1.Service, loadBalancerName, clusterID, ipAddressToUse string, hosts []*gceInstance, hcToCreate, hcToDelete *compute.HttpHealthCheck, conditions *[]metav1.Condition) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this arg a pointer to slice? do we have to add nil checks?

Copy link
Member Author

@08volt 08volt Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • passing only the slice is not 100% safe ( later there are "append" operations and if the lenght exceeds the pre-allocated size it would repalce the original array )

  • nil check is present directly in the function SetStatusCondition that I am using to update this slice

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

passing only the slice is not 100% safe ( later there are "append" operations and if the lenght exceeds the pre-allocated size it would repalce the original array )

But isn't the slice already a reference type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes but it has the reference to the allocated memory + the size of the allocated memory

If the allocated size is not enough for the append it replace the allocated memory and size. In this case after the function you don't have the reference to the new allocated memory but still the old one without the new elements.

- Refactor ensureInternalLoadBalancer to return ServiceStatus and update error handling
- Changed the return type of ensureInternalLoadBalancer from (*v1.LoadBalancerStatus, error) to (*v1.ServiceStatus, error).
- Introduced expectedConditions to manage service status conditions effectively.
- Added patch to update conditions in the service
@bowei
Copy link
Member

bowei commented Dec 1, 2025

This seems like an abuse of the Conditions from a semantics perspective? Would like to understand more if you have any comments from API review about this usage.

@bowei
Copy link
Member

bowei commented Dec 1, 2025

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 1, 2025
@08volt
Copy link
Member Author

08volt commented Dec 2, 2025

Hi @thockin , could you please confirm this change was discussed with you?

@aojea
Copy link
Member

aojea commented Dec 2, 2025

@danwinship already proposed this feature in kubernetes/enhancements#4631 but it requires people to drive it, so this can be a good opportunity to contribute this feature to SIG network

@08volt
Copy link
Member Author

08volt commented Dec 2, 2025

@danwinship already proposed this feature in kubernetes/enhancements#4631 but it requires people to drive it, so this can be a good opportunity to contribute this feature to SIG network

Thanks @aojea. I didn't know about this proposal, but it sounds a bit different. This feature would allow listing the resource names used in GCP, while the proposal refers to more generic Conditions that allow for a better understanding of the load balancer provisioning/serving status.

@danwinship
Copy link
Contributor

Well, you said:

This change introduces a mechanism to report the successful allocation of specific GCP resources, enabling better introspection.

And that sounds like part of what KEP-4631 was trying to do: add conditions to allow the backend to explicitly report when it's having problems allocating a load balancer.

@08volt
Copy link
Member Author

08volt commented Dec 2, 2025

Well, you said:

This change introduces a mechanism to report the successful allocation of specific GCP resources, enabling better introspection.

And that sounds like part of what KEP-4631 was trying to do: add conditions to allow the backend to explicitly report when it's having problems allocating a load balancer.

Exactly, that's not the same thing

  • KEP-4631: Report problems and generic load balancer state with conditions such as LoadBalancerProvisioning/LoadBalancerReady/LoadBalancerDegraded ...
  • This change: Report specific allocated GCP resources: ServiceLoadBalancerForwardingRule/ServiceLoadBalancerHealthCheck ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants