
Extend CacheRuntime phase 2.4: add dataset related labels to nodes and support app pod affinity#5836

Open
xliuqq wants to merge 2 commits into fluid-cloudnative:master from xliuqq:cache_affinity

Conversation

@xliuqq (Collaborator) commented May 7, 2026

mark thin runtime reference as not supporting cache runtime

Ⅰ. Describe what this PR does

Add dataset-related labels to nodes and support app pod affinity.

Ⅱ. Does this pull request fix one issue?

part of #5412

Ⅲ. List the added test cases (unit test/integration test) if any, please explain if no tests are needed.

Ⅳ. Describe how to verify it

Ⅴ. Special notes for reviews

mark thin runtime reference not support cache runtime

Signed-off-by: xliuqq <xlzq1992@gmail.com>
fluid-e2e-bot (bot) commented May 7, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors runtime status management by introducing the RuntimeStatusAccessor interface and unifies worker pod retrieval via the GetWorkerPods method. It also implements node affinity support for cache components and updates the cache engine's synchronization and shutdown processes. Feedback highlights a critical bug where an error from e.transform is ignored, resulting in potential panics. Additionally, the reviewer noted code duplication in the thin runtime package that should be resolved by exporting helper functions from the base package, and pointed out the need to restore test coverage for the refactored status retrieval logic.

I am having trouble creating individual review comments. My feedback is listed below.
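For orientation, the accessor abstraction referenced in this review could look roughly like the sketch below; only the type names RuntimeStatusAccessor, DDCRuntimeStatusAccessor, and CacheRuntimeStatusAccessor come from the review itself, while the method set shown is an assumption.

// Sketch only; the interface in the PR may carry more methods.
type RuntimeStatusAccessor interface {
	// GetRuntimeStatus returns the shared status object of the backing runtime.
	GetRuntimeStatus() (*datav1alpha1.RuntimeStatus, error)
}

// DDCRuntimeStatusAccessor and CacheRuntimeStatusAccessor would implement this
// interface for DDC-based runtimes and for CacheRuntime respectively.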

pkg/ddc/cache/engine/sync.go (54-56)

high

The error returned by e.transform is ignored because the err variable is immediately overwritten by the result of e.CheckAndUpdateRuntimeStatus. If transform fails, runtimeValue might be invalid, which could lead to panics or incorrect status updates in the subsequent call.

	runtimeValue, err := e.transform(dataset, runtime, runtimeClass)
	if err != nil {
		return err
	}
	// TODO: use different struct for input parameter to avoid fully transform
	_, err = e.CheckAndUpdateRuntimeStatus(runtimeValue)

pkg/ddc/base/runtime.go (646-647)

medium

Consider exporting this function as GetDDCRuntimeStatus. This would allow other packages, such as pkg/ddc/thin/referencedataset, to reuse this logic instead of duplicating the entire switch statement for DDC-based runtimes.

// GetDDCRuntimeStatus retrieves the runtime object based on runtime type for DDC-based runtimes
func GetDDCRuntimeStatus(client client.Client, runtimeType, name, namespace string) (*datav1alpha1.RuntimeStatus, error) {
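For context, the exported helper would presumably just wrap the existing per-runtime switch; the body below is an assumption about its shape, with utils.GetAlluxioRuntime standing in for the per-runtime getters.

func GetDDCRuntimeStatus(client client.Client, runtimeType, name, namespace string) (*datav1alpha1.RuntimeStatus, error) {
	switch runtimeType {
	case common.AlluxioRuntime:
		runtime, err := utils.GetAlluxioRuntime(client, name, namespace)
		if err != nil {
			return nil, err
		}
		return &runtime.Status, nil
	// ... the remaining DDC runtimes (Jindo, GooseFS, JuiceFS, Thin) follow the same pattern
	default:
		return nil, fmt.Errorf("unsupported DDC runtime type: %s", runtimeType)
	}
}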

pkg/ddc/base/runtime.go (630)

medium

Update the call to use the exported function name if the suggestion to export getDDCRuntime is accepted.

		status, err := GetDDCRuntimeStatus(client, runtimeType, name, namespace)

pkg/ddc/thin/referencedataset/runtime.go (49)

medium

Instead of using a local duplicated version of the runtime status retrieval logic, use the (suggested) exported function from the base package. This improves maintainability by centralizing the mapping of runtime types to their status objects.

	return base.GetDDCRuntimeStatus(e.Client, physicalRuntimeInfo.GetRuntimeType(),

pkg/ddc/thin/referencedataset/runtime.go (56-104)

medium

This function is a duplicate of the logic in pkg/ddc/base/runtime.go. It should be removed in favor of using an exported helper from the base package to avoid code duplication and maintenance overhead when new runtimes are added.

pkg/ddc/base/runtime_test.go (1163-1382)

medium

The tests for GetRuntimeStatus have been removed. While the function itself was refactored into GetRuntimeStatusAccessor, the test coverage should be maintained. Please update these tests to verify the new accessor and its implementations (DDCRuntimeStatusAccessor and CacheRuntimeStatusAccessor) instead of deleting them.

pkg/ddc/base/runtime.go (642)

medium

The error message mentions runtimeInfo, but the function GetRuntimeStatusAccessor is intended to return a status accessor. Updating the message to 'fail to get runtime status accessor' would be more accurate.

		return nil, fmt.Errorf("fail to get runtime status accessor for runtime type: %s", runtimeType)

codecov (bot) commented May 7, 2026

Codecov Report

❌ Patch coverage is 27.64228% with 89 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.90%. Comparing base (fc9a6d9) to head (de28086).
⚠️ Report is 7 commits behind head on master.

Files with missing lines | Patch % | Lines
pkg/ddc/base/runtime.go | 48.78% | 20 Missing and 1 partial ⚠️
pkg/ddc/cache/engine/sync.go | 0.00% | 21 Missing ⚠️
pkg/ddc/cache/engine/runtime.go | 0.00% | 15 Missing ⚠️
pkg/ddc/cache/component/daemonset_manager.go | 0.00% | 7 Missing ⚠️
pkg/ddc/cache/component/statefulset_manager.go | 30.00% | 7 Missing ⚠️
pkg/ddc/cache/engine/shutdown.go | 0.00% | 6 Missing ⚠️
pkg/ddc/cache/engine/status.go | 0.00% | 6 Missing ⚠️
.../nodeaffinitywithcache/node_affinity_with_cache.go | 40.00% | 4 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5836      +/-   ##
==========================================
- Coverage   59.10%   58.90%   -0.20%     
==========================================
  Files         480      480              
  Lines       32511    32599      +88     
==========================================
- Hits        19215    19202      -13     
- Misses      11746    11844      +98     
- Partials     1550     1553       +3     

☔ View full report in Codecov by Sentry.

Signed-off-by: xliuqq <xlzq1992@gmail.com>
@xliuqq force-pushed the cache_affinity branch from 7c9fa7e to de28086 on May 7, 2026 12:17
sonarqubecloud (bot) commented May 7, 2026

@xliuqq xliuqq marked this pull request as ready for review May 8, 2026 11:14
@xliuqq xliuqq requested a review from cheyang May 8, 2026 11:15
@cheyang (Collaborator) commented May 12, 2026

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a unified interface for accessing runtime status and worker pods across various Fluid runtime types. Key changes include the addition of the RuntimeStatusAccessor interface, the implementation of GetWorkerPods for retrieving worker pod information, and the integration of node affinity tracking for CacheRuntime. The feedback suggests correcting the documentation for GetWorkerPods, handling "not found" errors gracefully when fetching Kubernetes resources to prevent reconciliation loops from failing prematurely, and ensuring RuntimeClassName is validated before use.

Comment thread pkg/ddc/base/runtime.go
type RuntimeInfoInterface interface {

// GetWorkerPods returns the worker object and selector for runtime workers.

medium

The comment for GetWorkerPods is inaccurate as it mentions returning a "worker object and selector", but the method signature returns a slice of pods.

Suggested change
// GetWorkerPods returns the worker object and selector for runtime workers.
// GetWorkerPods returns the pods for runtime workers.

Comment thread pkg/ddc/base/runtime.go
Comment on lines +382 to +385
workers, err := kubeclient.GetStatefulSet(client, info.GetWorkerStatefulsetName(), info.GetNamespace())
if err != nil {
return nil, err
}

medium

If the worker StatefulSet is not found, GetWorkerPods should return an empty list and no error to avoid failing the reconciliation loop during initial setup. Using utils.IgnoreNotFound is a clean way to handle this.

Suggested change
workers, err := kubeclient.GetStatefulSet(client, info.GetWorkerStatefulsetName(), info.GetNamespace())
if err != nil {
return nil, err
}
workers, err := kubeclient.GetStatefulSet(client, info.GetWorkerStatefulsetName(), info.GetNamespace())
if err != nil {
return nil, utils.IgnoreNotFound(err)
}
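For reference, such a helper is typically a thin wrapper around apierrors.IsNotFound (controller-runtime also ships an equivalent client.IgnoreNotFound). The sketch below assumes Fluid's utils package follows the same pattern; it is not necessarily its exact implementation.

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// IgnoreNotFound returns nil when err is a Kubernetes NotFound error and err
// unchanged otherwise, so a missing object is treated as "nothing to do yet"
// rather than a reconcile failure.
func IgnoreNotFound(err error) error {
	if apierrors.IsNotFound(err) {
		return nil
	}
	return err
}

Note that with the suggested change the StatefulSet may be nil on return, so the caller still needs to short-circuit before reading its selector or replica counts.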

Comment on lines +53 to +56
ds, err := kubeclient.GetDaemonset(s.client, component.Name, component.Namespace)
if err != nil {
return nil, err
}

medium

Handle the case where the DaemonSet is not found gracefully by returning nil, nil instead of an error. This prevents the status update from failing when the component hasn't been created yet.

Suggested change
ds, err := kubeclient.GetDaemonset(s.client, component.Name, component.Namespace)
if err != nil {
return nil, err
}
ds, err := kubeclient.GetDaemonset(s.client, component.Name, component.Namespace)
if err != nil {
return nil, utils.IgnoreNotFound(err)
}

Comment on lines +53 to +56
sts, err := kubeclient.GetStatefulSet(s.client, component.Name, component.Namespace)
if err != nil {
return nil, err
}

medium

Handle the case where the StatefulSet is not found gracefully by returning nil, nil instead of an error to avoid unnecessary reconciliation failures.

Suggested change
sts, err := kubeclient.GetStatefulSet(s.client, component.Name, component.Namespace)
if err != nil {
return nil, err
}
sts, err := kubeclient.GetStatefulSet(s.client, component.Name, component.Namespace)
if err != nil {
return nil, utils.IgnoreNotFound(err)
}

Comment on lines +41 to +44
workers, err := kubeclient.GetStatefulSet(client, workerName, info.GetNamespace())
if err != nil {
return nil, err
}

medium

Handle the case where the worker StatefulSet is not found gracefully to avoid reconciliation errors during the initial setup phase.

Suggested change
workers, err := kubeclient.GetStatefulSet(client, workerName, info.GetNamespace())
if err != nil {
return nil, err
}
workers, err := kubeclient.GetStatefulSet(client, workerName, info.GetNamespace())
if err != nil {
return nil, utils.IgnoreNotFound(err)
}

@@ -18,13 +18,15 @@ package engine

import (
"context"

medium

Add fmt to imports for error formatting in the Sync method.

Suggested change
"context"
"context"
"fmt"

return err
}
dataset := ctx.Dataset
runtimeClass, err := e.getRuntimeClass(runtime.Spec.RuntimeClassName)

medium

It's safer to check if RuntimeClassName is set before attempting to fetch the CacheRuntimeClass to provide a clearer error message if it's missing.

Suggested change
runtimeClass, err := e.getRuntimeClass(runtime.Spec.RuntimeClassName)
if runtime.Spec.RuntimeClassName == "" {
return fmt.Errorf("runtimeClassName is not set in CacheRuntime %s/%s", runtime.Namespace, runtime.Name)
}
runtimeClass, err := e.getRuntimeClass(runtime.Spec.RuntimeClassName)

@cheyang (Collaborator) commented May 12, 2026

Review of PR #5836 — CacheRuntime phase 2.4: add dataset-related labels to nodes and support app pod affinity.

The PR makes several well-structured changes: (1) RuntimeStatusAccessor abstraction to unify DDC and Cache runtime status access for the affinity webhook, (2) GetWorkerPods interface to decouple node labeling from direct StatefulSet access, (3) CacheRuntime status sync + SyncScheduleInfoToCacheNodes integration, (4) proper shutdown cleanup of dataset labels, (5) ThinRuntime reference now explicitly excludes CacheRuntime.

One blocking concern: CacheRuntimeInfo.GetWorkerPods duplicates the naming logic that already exists in base.RuntimeInfo.GetWorkerPods, using GetComponentName vs GetWorkerStatefulsetName. These two paths may produce different worker names depending on runtime type suffix conventions, creating a maintenance risk when conventions evolve.

Non-blocking: ctx.Dataset nil-safety in Sync, GetNodeAffinity API call overhead, full-transform-as-input pattern, ThinRuntime exclusion lacks user-facing validation/docs, and Shutdown TearDownWorkers error handling.

corev1 "k8s.io/api/core/v1"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
"sigs.k8s.io/controller-runtime/pkg/client"

CacheRuntimeInfo.GetWorkerPods computes the worker StatefulSet name as GetComponentName(info.GetName(), common.ComponentTypeWorker), while base.RuntimeInfo.GetWorkerPods uses runtimeInfo.GetWorkerStatefulsetName(). These may produce different names depending on the runtime type suffix convention.

For CacheRuntime, the worker StatefulSet name is <name>-cache-worker (derived from the cache engine's naming convention via GetComponentName), but GetWorkerStatefulsetName() on the embedded RuntimeInfo may return a different suffix depending on how BuildRuntimeInfo was configured.

Since CacheRuntimeInfo embeds the base RuntimeInfoInterface and delegates all other methods, it's unclear which naming convention is authoritative. If the embedded RuntimeInfo was built with common.CacheRuntime as runtimeType, GetWorkerStatefulsetName() would also compute the suffix — but the two code paths use completely different logic to derive the name, creating a maintenance risk.

Recommendation: remove the duplicate GetWorkerPods implementation from CacheRuntimeInfo and let it delegate to the embedded base.RuntimeInfo.GetWorkerPods(), or at minimum ensure both paths produce the same name.
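A sketch of the recommended delegation, assuming CacheRuntimeInfo embeds base.RuntimeInfoInterface as described above; the GetWorkerPods signature and field layout below are illustrative and not taken from the PR diff.

// If CacheRuntimeInfo embeds base.RuntimeInfoInterface, deleting its own
// GetWorkerPods lets the promoted base method be used directly; an explicit
// delegation (signature assumed) would look like this:
func (c *CacheRuntimeInfo) GetWorkerPods(client client.Client) ([]corev1.Pod, error) {
	// Reuse the base implementation so the worker StatefulSet name always comes
	// from GetWorkerStatefulsetName(), keeping a single naming convention.
	return c.RuntimeInfoInterface.GetWorkerPods(client)
}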

@@ -33,6 +35,11 @@ func (e *CacheEngine) Sync(ctx cruntime.ReconcileRequestContext) (err error) {
if err != nil {
return err

Sync now uses ctx.Dataset directly to call e.transform(dataset, runtime, runtimeClass). If the ReconcileRequestContext is constructed without a Dataset (e.g., before the dataset is bound), this will panic or produce incorrect behavior. Other DDC engines typically check for nil dataset or handle the pre-binding case explicitly.

Consider adding a nil check for ctx.Dataset before proceeding with the transform and status sync, similar to how other runtime engines handle the unbound case.
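A minimal sketch of the suggested guard as a fragment inside Sync; the error message and exact placement are assumptions:

	dataset := ctx.Dataset
	if dataset == nil {
		// Dataset is not bound yet; fail the reconcile explicitly instead of
		// letting e.transform dereference a nil dataset.
		return fmt.Errorf("dataset is not bound to CacheRuntime %s/%s yet", runtime.Namespace, runtime.Name)
	}
	runtimeValue, err := e.transform(dataset, runtime, runtimeClass)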

@xliuqq (Collaborator, author) replied:

If this is not blocking, it will be fixed together with the issue "use different struct for input parameter to avoid fully transform". #5836 (comment)

}

func (s *StatefulSetManager) GetNodeAffinity(component *common.CacheRuntimeComponentValue) (*corev1.NodeAffinity, error) {
sts, err := kubeclient.GetStatefulSet(s.client, component.Name, component.Namespace)

GetNodeAffinity calls kubeclient.GetStatefulSet for every status update cycle. Since the StatefulSet spec (nodeSelector + affinity) rarely changes, this adds unnecessary API server load. Consider caching the affinity or reading it from the runtime value instead of re-fetching on every reconcile.

@xliuqq (Collaborator, author) replied:

I will add an issue to track it. Does the cache runtime support changing the worker affinity? If so, the affinity should be generated from the runtime and dataset spec.
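If the worker affinity turns out to be immutable once created, the caching idea could be as small as the sketch below; the cachedAffinity field is hypothetical, and the approach would not be valid if the affinity can change at runtime.

func (s *StatefulSetManager) GetNodeAffinity(component *common.CacheRuntimeComponentValue) (*corev1.NodeAffinity, error) {
	// Hypothetical cache field; only safe while the worker affinity never changes.
	if s.cachedAffinity != nil {
		return s.cachedAffinity, nil
	}
	sts, err := kubeclient.GetStatefulSet(s.client, component.Name, component.Namespace)
	if err != nil {
		return nil, err
	}
	if sts.Spec.Template.Spec.Affinity != nil {
		s.cachedAffinity = sts.Spec.Template.Spec.Affinity.NodeAffinity
	}
	return s.cachedAffinity, nil
}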

_, err = e.CheckAndUpdateRuntimeStatus(runtimeValue)
if err != nil {
return err
}

The TODO comment "use different struct for input parameter to avoid fully transform" is correct — calling e.transform() + CheckAndUpdateRuntimeStatus() on every Sync cycle forces a full value computation even when only the status needs updating. This is wasteful. While not blocking, it should be tracked with an issue link rather than just a TODO.

@xliuqq (Collaborator, author) replied:

Will add an issue to track it.

)

// getPhysicalDatasetRuntimeStatus get the runtime status of the physical dataset
// Note: This function only supports DDC-based runtimes (Alluxio, Jindo, etc.)

The comment "CacheRuntime is not supported because its status structure is incompatible with ThinRuntime" makes it clear that CacheRuntime cannot be used as a physical dataset runtime for ThinRuntime reference. However, there is no user-facing documentation or validation to prevent users from attempting this configuration. Without a webhook validation or clear docs, users may configure a ThinRuntime pointing to a CacheRuntime-backed Dataset and get a cryptic runtime error.

Consider adding: (1) a validation in the ThinRuntime webhook or controller that rejects CacheRuntime as physicalRuntimeType, or (2) documentation that explicitly lists supported runtime types for ThinRuntime reference.
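As one illustration of option (1), an early check in the ThinRuntime reference path could reject the configuration with a clear message; the variables physicalRuntimeType, namespace, and name below are hypothetical:

	// Reject CacheRuntime-backed datasets explicitly so users get a clear error
	// instead of a cryptic failure later in getPhysicalDatasetRuntimeStatus.
	if physicalRuntimeType == common.CacheRuntime {
		return fmt.Errorf("ThinRuntime dataset reference does not support CacheRuntime-backed dataset %s/%s",
			namespace, name)
	}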

@xliuqq (Collaborator, author) replied:

Will add an issue to track it.

@cheyang (Collaborator) left a comment

/lgtm
/approve

The non-blocking naming concern noted in the review is acceptable to address in a separate follow-up.

fluid-e2e-bot (bot) commented May 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cheyang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
