[slice] Evict if Slice is DEGRADED (and not tolerated)#1123
[slice] Evict if Slice is DEGRADED (and not tolerated)#1123pajakd wants to merge 5 commits intoAI-Hypercomputer:slice-mainfrom
Conversation
6c45282 to
73057ad
Compare
| if expr.Operator == corev1.NodeSelectorOpIn { | ||
| valueAllowedInExpr := false | ||
| for _, val := range expr.Values { | ||
| if val == value { | ||
| valueAllowedInExpr = true | ||
| break | ||
| } | ||
| } | ||
| if !valueAllowedInExpr { | ||
| termAllowsValue = false | ||
| } | ||
| } |
There was a problem hiding this comment.
we should also handle Operator: corev1.NodeSelectorOpNotIn, Values: ["Degraded"]
- isn't there some predefined method for checking this?
| } else { | ||
| ac.Message = "Waiting for Slices to be created" | ||
| } | ||
| message := fmt.Sprintf("Slices are in states: %s", strings.Join(stateMessages, ", ")) |
There was a problem hiding this comment.
The PR removes the else branch that previously handled cases where stateMessages is empty. If no slices have states yet, this will now set the message to "Slices are in states: ", which looks incomplete.
| termAllowsValue := true | ||
| for _, expr := range term.MatchExpressions { | ||
| if expr.Key == key { | ||
| if expr.Operator == corev1.NodeSelectorOpIn { |
There was a problem hiding this comment.
This logic only evaluates corev1.NodeSelectorOpIn. If a user explicitly specifies NotIn: [Degraded], this function will skip evaluating that expression, and it will
erroneously return true (meaning Degraded is allowed). We should add support for corev1.NodeSelectorOpNotIn to ensure inverse conditions are respected.
| return api.TruncateConditionMessage(message) | ||
| } | ||
|
|
||
| func workloadRequestedOnlyHealthySlices(wl *kueue.Workload) bool { |
There was a problem hiding this comment.
The function name implies it checks if the entire workload requested only healthy slices. However, the implementation returns true if at least one PodSet requested
healthy slices.
| if features.Enabled(features.FailOnUntoleratedDegradedSlice) { | ||
| for _, slice := range slicesByState[core.SliceStateActiveDegraded] { | ||
| psName := slice.Annotations[core.OwnerPodSetNameAnnotation] | ||
| if (psName != "" && !podSetRequiresHealthy[psName]) || (psName == "" && !workloadRequestedOnlyHealthySlices(wl)) { |
There was a problem hiding this comment.
These if statements repeat above in line 819. Can we put them in helper functions and name them accordingly.
| return false | ||
| } | ||
| for _, term := range spec.Affinity.NodeAffinity.RequiredDuringSchedulingIgnoredDuringExecution.NodeSelectorTerms { | ||
| if len(term.MatchExpressions) > 0 { |
Description
Evict the workload, when a slice enters DEGRADED state, while the workload requested only HEALTHY slices.
Issue
Testing