[Feature] Custom cluster metrics #56

@albertompe

Proposed Custom Business Metrics

To bridge the gap between operator health and product health, the following custom metrics are proposed. They are registered against controller-runtime's global Prometheus registry (sigs.k8s.io/controller-runtime/pkg/metrics.Registry), so they are published automatically on the same /metrics endpoint as the metrics controller-runtime already exposes.

Metric Definitions

redkeycluster_reconcile_stage_errors_total

| Property | Value |
| --- | --- |
| Type | CounterVec |
| Labels | stage |
| Help | Total number of RedkeyCluster reconcile failures partitioned by stage. |
| Label values | get_cluster, ensure_rbac, ensure_robin_deployment, list_configs, create_new_config, aggregate_status, cleanup_superseded_configs |
| Instrumentation point | In redkeycluster_controller.go, before each return ..., err in the reconcile loop. |

redkeycluster_status_transitions_total

| Property | Value |
| --- | --- |
| Type | CounterVec |
| Labels | reconciled, status |
| Help | Total number of aggregated RedkeyCluster status transitions. |
| Label values | reconciled: values from RedkeyCluster.Status.Reconciled enum; status: values from RedkeyClusterConfig.Status.Status enum (empty normalized to Unknown). |
| Instrumentation point | In redkeycluster_config.go, when the aggregated status changes, before the Status().Update(...) call. |
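
The change detection and the "empty normalized to Unknown" rule could look roughly like the sketch below. The statusLabels type and the transition function are hypothetical, plain strings stand in for the API enum types, and the value "False" for the reconciled label is an assumption (only "True", "Ready" and "Unknown" appear in this proposal).

```go
package main

import "fmt"

// statusLabels (hypothetical type) holds the pair used as label values.
type statusLabels struct {
	Reconciled string
	Status     string
}

// normalize maps an empty status to "Unknown", as described in the table.
func normalize(l statusLabels) statusLabels {
	if l.Status == "" {
		l.Status = "Unknown"
	}
	return l
}

// transition returns the normalized new labels and reports whether they
// differ from the previous aggregated status.
func transition(prev, next statusLabels) (statusLabels, bool) {
	p, n := normalize(prev), normalize(next)
	return n, p != n
}

func main() {
	prev := statusLabels{Reconciled: "False", Status: ""}
	next := statusLabels{Reconciled: "True", Status: "Ready"}

	if labels, changed := transition(prev, next); changed {
		// In the operator this is where the counter would be incremented:
		// statusTransitionsTotal.WithLabelValues(labels.Reconciled, labels.Status).Inc()
		fmt.Println("transition to", labels.Reconciled, labels.Status)
	}
}
```

Normalizing both sides before comparing means a change from an empty status to an explicit Unknown is not counted as a transition.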

redkeycluster_time_to_ready_seconds

| Property | Value |
| --- | --- |
| Type | Histogram |
| Labels | (none) |
| Help | Time from RedkeyClusterConfig creation to aggregated cluster readiness. |
| Buckets | 5, 10, 20, 30, 60, 120, 300, 600, 900, 1800 |
| Instrumentation point | In redkeycluster_config.go, when the aggregated status transitions from "not ready" to Reconciled=True and Status=Ready. Uses highestConfig.CreationTimestamp. |

redkeycluster_config_creations_total

| Property | Value |
| --- | --- |
| Type | CounterVec |
| Labels | reason |
| Help | Total number of RedkeyClusterConfig objects created by the operator. |
| Label values | initial, generation_change |
| Instrumentation point | In redkeycluster_config.go, after a successful Create call. |

redkeycluster_cleanup_deleted_configs_total

| Property | Value |
| --- | --- |
| Type | Counter |
| Labels | (none) |
| Help | Total number of superseded RedkeyClusterConfig objects deleted by cleanup. |
| Instrumentation point | In redkeycluster_config.go, inside the deletion loop, after each successful Delete. |

redkeycluster_robin_deployment_changes_total

| Property | Value |
| --- | --- |
| Type | CounterVec |
| Labels | action |
| Help | Total number of Robin Deployment create and patch operations. |
| Label values | create, patch |
| Instrumentation point | In redkeycluster_robin.go, in the IsNotFound → create branch and the drift → patch branch. |
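
The two branches map directly onto the two label values, and the steady state (Deployment exists, no drift) should not increment anything. A minimal sketch of that decision, using a hypothetical deploymentAction function and booleans in place of the real IsNotFound and drift checks:

```go
package main

import "fmt"

// deploymentAction (hypothetical helper) returns the action label value for a
// reconcile pass, or "" when the Deployment exists and has not drifted.
func deploymentAction(exists, drifted bool) string {
	switch {
	case !exists:
		return "create" // the IsNotFound branch
	case drifted:
		return "patch" // the drift branch
	default:
		return ""
	}
}

func main() {
	cases := []struct{ exists, drifted bool }{
		{exists: false, drifted: false},
		{exists: true, drifted: true},
		{exists: true, drifted: false},
	}
	for _, c := range cases {
		if action := deploymentAction(c.exists, c.drifted); action != "" {
			// In the operator, inside the corresponding branch:
			// robinDeploymentChangesTotal.WithLabelValues(action).Inc()
			fmt.Println("count", action)
		}
	}
}
```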

Label Cardinality Guidelines

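Do not use cluster, config, node, ip, or name as labels. Their values are unbounded, so each new object would create a new time series, causing high cardinality and unbounded memory growth. In this first version, namespace is also excluded; if per-tenant observability is needed later, that would be the only additional label to consider.

All proposed labels take their values from small fixed sets, either enums defined in the API types (reconciled, status) or the constant lists given above (stage, reason, action), which keeps cardinality predictable.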
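
As a rough check of what "predictable" means here, the upper bound on time series per metric is the product of the number of allowed values per label. The sketch below works this out for the metrics whose value sets are fully listed in this proposal; the function names are illustrative, and the status transitions metric is left out because the enum sizes are not stated here.

```go
package main

import "fmt"

// maxSeries returns the upper bound on series for a counter: the product of
// the number of allowed values of each label. With no labels the result is 1.
func maxSeries(valuesPerLabel ...int) int {
	n := 1
	for _, v := range valuesPerLabel {
		n *= v
	}
	return n
}

// histogramSeries returns the number of series an unlabeled Prometheus
// histogram exposes: one per configured bucket, plus the +Inf bucket,
// plus _sum and _count.
func histogramSeries(buckets int) int {
	return buckets + 3
}

func main() {
	fmt.Println("reconcile_stage_errors_total:", maxSeries(7))    // 7 stage values
	fmt.Println("config_creations_total:", maxSeries(2))          // initial, generation_change
	fmt.Println("robin_deployment_changes_total:", maxSeries(2))  // create, patch
	fmt.Println("cleanup_deleted_configs_total:", maxSeries())    // no labels
	fmt.Println("time_to_ready_seconds:", histogramSeries(10))    // 10 configured buckets
}
```

These are upper bounds: a CounterVec only exposes a series for a label combination once that combination has been used.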

Priority

If only three custom metrics can be implemented initially, prioritize:

  1. redkeycluster_reconcile_stage_errors_total — pinpoints which reconcile stage is failing.
  2. redkeycluster_status_transitions_total — tracks cluster lifecycle progression.
  3. redkeycluster_time_to_ready_seconds — measures end-to-end business latency.

These three cover error diagnosis, progress tracking, and business latency respectively.

Implementation Approach

Custom metrics are defined in a dedicated file within the internal/controller package and registered in an init() function against sigs.k8s.io/controller-runtime/pkg/metrics.Registry. Since the manager already configures the metrics server, any collector registered in this global registry is automatically served on the /metrics endpoint with no additional wiring.

package controller

import (
    "github.com/prometheus/client_golang/prometheus"
    crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
    reconcileStageErrorsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "redkeycluster_reconcile_stage_errors_total",
            Help: "Total number of RedkeyCluster reconcile failures partitioned by stage.",
        },
        []string{"stage"},
    )

    statusTransitionsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "redkeycluster_status_transitions_total",
            Help: "Total number of aggregated RedkeyCluster status transitions.",
        },
        []string{"reconciled", "status"},
    )

    timeToReadySeconds = prometheus.NewHistogram(
        prometheus.HistogramOpts{
            Name:    "redkeycluster_time_to_ready_seconds",
            Help:    "Time from RedkeyClusterConfig creation to aggregated cluster readiness.",
            Buckets: []float64{5, 10, 20, 30, 60, 120, 300, 600, 900, 1800},
        },
    )

    configCreationsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "redkeycluster_config_creations_total",
            Help: "Total number of RedkeyClusterConfig objects created by the operator.",
        },
        []string{"reason"},
    )

    cleanupDeletedConfigsTotal = prometheus.NewCounter(
        prometheus.CounterOpts{
            Name: "redkeycluster_cleanup_deleted_configs_total",
            Help: "Total number of superseded RedkeyClusterConfig objects deleted by cleanup.",
        },
    )

    robinDeploymentChangesTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "redkeycluster_robin_deployment_changes_total",
            Help: "Total number of Robin Deployment create and patch operations.",
        },
        []string{"action"},
    )
)

func init() {
    crmetrics.Registry.MustRegister(
        reconcileStageErrorsTotal,
        statusTransitionsTotal,
        timeToReadySeconds,
        configCreationsTotal,
        cleanupDeletedConfigsTotal,
        robinDeploymentChangesTotal,
    )
}

Instrumentation at call sites follows the standard pattern:

// On reconcile stage error
reconcileStageErrorsTotal.WithLabelValues("ensure_rbac").Inc()

// On config creation
configCreationsTotal.WithLabelValues("generation_change").Inc()

// On status transition
statusTransitionsTotal.WithLabelValues(string(newReconciled), string(newStatus)).Inc()

// On readiness (highestConfig as named in the metric definition; needs the "time" import)
timeToReadySeconds.Observe(time.Since(highestConfig.CreationTimestamp.Time).Seconds())

// On cleanup
cleanupDeletedConfigsTotal.Inc()

// On Robin deployment change
robinDeploymentChangesTotal.WithLabelValues("create").Inc()

Labels: enhancement (New feature or request)
