[slice] Evict if Slice is DEGRADED (and not tolerated)#1123

Open
pajakd wants to merge 5 commits into AI-Hypercomputer:slice-main from pajakd:evict_if_not_tolerated
Collaborator

@pajakd pajakd commented Mar 13, 2026

Description

Evict the workload when a slice enters the DEGRADED state and the workload requested only HEALTHY slices.

Issue

Testing

pajakd force-pushed the evict_if_not_tolerated branch from 6c45282 to 73057ad on March 13, 2026 14:58
Comment on lines +144 to +155
if expr.Operator == corev1.NodeSelectorOpIn {
	valueAllowedInExpr := false
	for _, val := range expr.Values {
		if val == value {
			valueAllowedInExpr = true
			break
		}
	}
	if !valueAllowedInExpr {
		termAllowsValue = false
	}
}

We should also handle Operator: corev1.NodeSelectorOpNotIn, Values: ["Degraded"].

  • Isn't there some predefined method for checking this?

} else {
ac.Message = "Waiting for Slices to be created"
}
message := fmt.Sprintf("Slices are in states: %s", strings.Join(stateMessages, ", "))

The PR removes the else branch that previously handled cases where stateMessages is empty. If no slices have states yet, this will now set the message to "Slices are in states: ", which looks incomplete.
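
A minimal sketch of keeping the empty case, assuming the message is built from a plain string slice. The helper name slicesStateMessage and the example state strings are hypothetical, not taken from the PR; only the two message texts come from the diff above.

```go
package main

import (
	"fmt"
	"strings"
)

// slicesStateMessage is a hypothetical helper (not from the PR). It falls
// back to the original "waiting" text when no slice has reported a state,
// so the message never ends in a dangling "Slices are in states: ".
func slicesStateMessage(stateMessages []string) string {
	if len(stateMessages) == 0 {
		return "Waiting for Slices to be created"
	}
	return fmt.Sprintf("Slices are in states: %s", strings.Join(stateMessages, ", "))
}

func main() {
	fmt.Println(slicesStateMessage(nil))
	// The state strings below are illustrative only.
	fmt.Println(slicesStateMessage([]string{"2 ACTIVE", "1 ACTIVE_DEGRADED"}))
}
```

The result would still go through api.TruncateConditionMessage, as in the diff.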

termAllowsValue := true
for _, expr := range term.MatchExpressions {
if expr.Key == key {
if expr.Operator == corev1.NodeSelectorOpIn {

This logic only evaluates corev1.NodeSelectorOpIn. If a user explicitly specifies NotIn: [Degraded], this function will skip evaluating that expression, and it will
erroneously return true (meaning Degraded is allowed). We should add support for corev1.NodeSelectorOpNotIn to ensure inverse conditions are respected.
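
A rough sketch of what handling both operators could look like. To stay self-contained it uses simplified stand-ins (Operator, Requirement) for corev1.NodeSelectorOperator and corev1.NodeSelectorRequirement; the function name termAllowsValue and the label key example.com/slice-health are assumptions made for illustration, not the PR's code.

```go
package main

import "fmt"

// Operator and Requirement are simplified stand-ins for the corev1 types.
type Operator string

const (
	OpIn    Operator = "In"    // mirrors corev1.NodeSelectorOpIn
	OpNotIn Operator = "NotIn" // mirrors corev1.NodeSelectorOpNotIn
)

type Requirement struct {
	Key      string
	Operator Operator
	Values   []string
}

func contains(values []string, v string) bool {
	for _, val := range values {
		if val == v {
			return true
		}
	}
	return false
}

// termAllowsValue reports whether a single term, whose expressions are
// ANDed together, still permits the label key=value.
func termAllowsValue(exprs []Requirement, key, value string) bool {
	for _, expr := range exprs {
		if expr.Key != key {
			continue
		}
		switch expr.Operator {
		case OpIn:
			if !contains(expr.Values, value) {
				return false // value is not in the allowed set
			}
		case OpNotIn:
			if contains(expr.Values, value) {
				return false // value is explicitly excluded
			}
		}
	}
	return true
}

func main() {
	key := "example.com/slice-health" // hypothetical label key
	notDegraded := []Requirement{{Key: key, Operator: OpNotIn, Values: []string{"Degraded"}}}
	fmt.Println(termAllowsValue(notDegraded, key, "Degraded"))
	fmt.Println(termAllowsValue(notDegraded, key, "Healthy"))
}
```

Other operators (Exists, DoesNotExist, Gt, Lt) are left unhandled here and would need their own decision.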

return api.TruncateConditionMessage(message)
}

func workloadRequestedOnlyHealthySlices(wl *kueue.Workload) bool {

The function name implies it checks if the entire workload requested only healthy slices. However, the implementation returns true if at least one PodSet requested
healthy slices.
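
To make the mismatch concrete, here is a small sketch contrasting the two possible semantics. PodSet and its RequiresHealthy field are simplified stand-ins for the real Workload types, and both function names are hypothetical; either the name or the implementation could change to match the other.

```go
package main

import "fmt"

// PodSet is a simplified stand-in; RequiresHealthy represents the outcome
// of the node-affinity check for that PodSet.
type PodSet struct {
	Name            string
	RequiresHealthy bool
}

// anyPodSetRequiresHealthy matches the behaviour described in the review:
// true as soon as one PodSet asked for healthy slices only.
func anyPodSetRequiresHealthy(podSets []PodSet) bool {
	for _, ps := range podSets {
		if ps.RequiresHealthy {
			return true
		}
	}
	return false
}

// allPodSetsRequireHealthy matches what the current name suggests: every
// PodSet asked for healthy slices only. Returning false for an empty list
// is an assumption of this sketch.
func allPodSetsRequireHealthy(podSets []PodSet) bool {
	if len(podSets) == 0 {
		return false
	}
	for _, ps := range podSets {
		if !ps.RequiresHealthy {
			return false
		}
	}
	return true
}

func main() {
	mixed := []PodSet{{Name: "workers", RequiresHealthy: true}, {Name: "drivers", RequiresHealthy: false}}
	fmt.Println(anyPodSetRequiresHealthy(mixed), allPodSetsRequireHealthy(mixed))
}
```

The two functions disagree exactly on mixed workloads, which is the case the current name leaves ambiguous.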

if features.Enabled(features.FailOnUntoleratedDegradedSlice) {
for _, slice := range slicesByState[core.SliceStateActiveDegraded] {
psName := slice.Annotations[core.OwnerPodSetNameAnnotation]
if (psName != "" && !podSetRequiresHealthy[psName]) || (psName == "" && !workloadRequestedOnlyHealthySlices(wl)) {

These if statements repeat the ones above at line 819. Can we put them in helper functions and name them accordingly?
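
One possible shape for such a helper, as a sketch only. It assumes the condition decides whether a DEGRADED slice is tolerated by its owner; the name degradedSliceTolerated is hypothetical, and the workload-level check is passed in as a plain bool so the example stays self-contained.

```go
package main

import "fmt"

// degradedSliceTolerated is a hypothetical helper equivalent to
//
//	(psName != "" && !podSetRequiresHealthy[psName]) ||
//		(psName == "" && !workloadRequiresHealthy)
//
// where workloadRequiresHealthy stands for the result of
// workloadRequestedOnlyHealthySlices(wl).
func degradedSliceTolerated(psName string, podSetRequiresHealthy map[string]bool, workloadRequiresHealthy bool) bool {
	if psName != "" {
		// The slice is owned by a specific PodSet: use that PodSet's request.
		return !podSetRequiresHealthy[psName]
	}
	// No owner annotation: fall back to the workload-level check.
	return !workloadRequiresHealthy
}

func main() {
	requires := map[string]bool{"workers": true, "drivers": false}
	fmt.Println(degradedSliceTolerated("workers", requires, false))
	fmt.Println(degradedSliceTolerated("drivers", requires, true))
	fmt.Println(degradedSliceTolerated("", requires, true))
}
```

Both call sites could then share the one named predicate instead of repeating the compound condition.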

return false
}
for _, term := range spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms {
if len(term.MatchExpressions) > 0 {

What about MatchFields?

3 participants