infinite or negative infinite error budget #749

@veerendra2

Some of our SLOs are showing infinite or negative infinite error budgets. More details are provided below.

Setup

  • Sloth Version: v0.12.0

    args:
      - kubernetes-controller
      - --resync-interval=5m
      - --workers=5
      - --default-slo-period=28d
      - --logger=json
  • Kubernetes Version: v1.33.1

  • vmalert Version: v1.125.0 (using VictoriaMetrics)

Negative Infinite Error Budget

We observed a negative infinite error budget on the dashboards right after changing the SLO's target objective. Over the past few days the issue resolved itself, and the dashboards now show percentage values as expected.

Below is the SLO spec for the affected service:

---
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: vmcluster
  namespace: monitoring
  labels:
    team: diablo
spec:
  service: "vmcluster"
  labels:
    team: "diablo"
    namespace: "monitoring"
  slos:
    - name: "scrape-success"
      objective: 95.0
      description: "VictoriaMetrics SLI is the percentage of successful scrapes"
      sli:
        events:
          errorQuery: |
            sum(rate(vm_promscrape_scrapes_failed_total[{{.window}}]))
          totalQuery: |
            sum(rate(vm_promscrape_scrapes_total[{{.window}}]))
      alerting:
        name: SLOVMClusterScrapeFailure
        labels:
          team: diablo
        annotations:
          summary: "VictoriaMetrics scrapes are failing"
        pageAlert:
          labels:
            team: diablo
        ticketAlert:
          labels:
            team: diablo

[Attached screen recording: Screen.Recording.2025-12-02.at.14.53.10.mov]

Infinite Error Budget

We are unsure why this happens. The SLO for another service has shown an infinite error budget for the past two weeks, although it previously displayed a numeric value.

Image

I have run all of the underlying recording rules as PromQL queries to inspect their results, but I still can't pinpoint where things go wrong.

Can you share some insight into why this happens and how to prevent it in the future?
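
To make the question more concrete, here is a minimal Python sketch of one way a budget calculation can end up at infinity. It assumes the remaining budget is derived as 1 - (error ratio / allowed error ratio), which is a simplification for illustration and not necessarily the exact chain of rules Sloth generates. The step names (`error_ratio`, `allowed_error_ratio`, `burn_rate`, `budget_remaining`) and the helper functions are hypothetical. The only behavior relied on is that Prometheus-style float64 arithmetic returns +Inf, -Inf, or NaN on division by zero instead of raising an error. The objective of 100 in the second call is only there to force a zero denominator; the affected SLO above uses 95.

```python
import math


def ieee_div(num: float, den: float) -> float:
    """Divide like IEEE 754 floats: x/0 gives +/-Inf, 0/0 gives NaN."""
    if den != 0:
        return num / den
    if num == 0 or math.isnan(num):
        return math.nan
    # The result sign combines the signs of the numerator and the signed zero.
    sign = math.copysign(1.0, num) * math.copysign(1.0, den)
    return sign * math.inf


def budget_chain(error_ratio: float, objective_percent: float) -> dict:
    """Hypothetical chain of intermediate values, in evaluation order."""
    allowed = 1 - objective_percent / 100
    burn = ieee_div(error_ratio, allowed)
    return {
        "error_ratio": error_ratio,
        "allowed_error_ratio": allowed,
        "burn_rate": burn,
        "budget_remaining": 1 - burn,
    }


def first_non_finite(chain: dict):
    """Return the name of the first step whose value is Inf or NaN, else None."""
    for name, value in chain.items():
        if not math.isfinite(value):
            return name
    return None


healthy = budget_chain(error_ratio=0.01, objective_percent=95.0)
broken = budget_chain(error_ratio=0.01, objective_percent=100.0)
print(first_non_finite(healthy), healthy["budget_remaining"])
print(first_non_finite(broken), broken["budget_remaining"])
```

With a 95% objective every step stays finite. With a zero allowed error ratio the division is the first step to go non-finite, and the remaining budget becomes negative infinity. Walking the real recording rules in the same order and noting the first one that returns Inf or NaN might help narrow down where our setup diverges.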
