
Feature Request: Configurable TTL for ClickHouse system log tables via values.yaml #191

@alehostert

Description

There is currently no way to configure TTL for ClickHouse internal system log tables (e.g. trace_log, query_log, metric_log) through the Helm chart's values.yaml. This causes those tables to grow unbounded, which can fill up the data PVC and corrupt user data tables.

Problem

ClickHouse system log tables such as trace_log, query_log, metric_log, and others do not have a TTL configured by default. In a production deployment, these tables accumulate data indefinitely.

In our case, system.trace_log grew to ~97 GB over roughly 6 weeks, completely filling the /var/lib/clickhouse PVC (108 GB). This caused ClickHouse to fail writing new data parts, which resulted in 125+ broken parts across user tables (otel_traces, otel_traces_trace_id_ts), preventing those tables from loading entirely and making HyperDX show no data at all.

The affected system tables and their observed sizes at the time of the incident:

Table                     Size
trace_log                 ~97 GB (dominant)
query_log                 2.84 GB
part_log                  1.67 GB
processors_profile_log    972 MB
metric_log                652 MB
asynchronous_metric_log   248 MB
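For reference, the per-table sizes above can be reproduced with a standard query against `system.parts` (a sketch; run it in clickhouse-client):

```sql
-- Disk usage of ClickHouse system log tables, largest first
SELECT
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE database = 'system' AND active
GROUP BY table
ORDER BY sum(bytes_on_disk) DESC;
```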

Expected Behavior

The Helm chart should expose a configuration option to set TTL for system log tables, so that operators can define retention policies without having to manually run ALTER TABLE commands after every deployment or upgrade.

Suggested Implementation

Add a systemLogs block under clickhouse.config in values.yaml:

clickhouse:
  config:
    systemLogs:
      ttlDays: 7  # Set to 0 to disable TTL

And reflect it in the clickhouse-configmap.yaml template:

<trace_log>
  <database>system</database>
  <table>trace_log</table>
  <partition_by>toYYYYMM(event_date)</partition_by>
  <flush_interval_milliseconds>7500</flush_interval_milliseconds>
  {{- if gt (int .Values.clickhouse.config.systemLogs.ttlDays) 0 }}
  <ttl>event_date + INTERVAL {{ .Values.clickhouse.config.systemLogs.ttlDays }} DAY DELETE</ttl>
  {{- end }}
</trace_log>

This pattern would apply to all system log tables already declared in the configmap (query_log, metric_log, asynchronous_metric_log, part_log, processors_profile_log, trace_log, query_thread_log, query_views_log, opentelemetry_span_log).
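Rather than repeating the `<ttl>` block for each table, the configmap template could iterate over the table names. A sketch (the `range` loop and list literal are assumptions about how the chart's templates are structured, not existing chart code; `opentelemetry_span_log` would need special-casing since its TTL column is `finish_date` rather than `event_date`):

```
{{- $ttl := int .Values.clickhouse.config.systemLogs.ttlDays }}
{{- range $t := list "query_log" "metric_log" "asynchronous_metric_log" "part_log" "processors_profile_log" "trace_log" "query_thread_log" "query_views_log" }}
<{{ $t }}>
  <database>system</database>
  <table>{{ $t }}</table>
  <partition_by>toYYYYMM(event_date)</partition_by>
  <flush_interval_milliseconds>7500</flush_interval_milliseconds>
  {{- if gt $ttl 0 }}
  <ttl>event_date + INTERVAL {{ $ttl }} DAY DELETE</ttl>
  {{- end }}
</{{ $t }}>
{{- end }}
```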

Workaround

Until this is supported, TTL must be applied manually via clickhouse-client after each deployment:

ALTER TABLE system.trace_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
ALTER TABLE system.query_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
ALTER TABLE system.part_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
ALTER TABLE system.processors_profile_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
ALTER TABLE system.metric_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
ALTER TABLE system.asynchronous_metric_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
ALTER TABLE system.query_views_log MODIFY TTL event_date + INTERVAL 7 DAY DELETE;
ALTER TABLE system.opentelemetry_span_log MODIFY TTL finish_date + INTERVAL 7 DAY DELETE;

Note that these changes do not persist if the pod is replaced with a fresh PVC, making automation through the chart the only reliable solution.
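Until then, the statements above can at least be generated rather than typed by hand. A sketch of a shell loop (the table list mirrors the one in this issue; `opentelemetry_span_log` is emitted separately because its TTL column is `finish_date`):

```shell
# Generate the ALTER TABLE workaround statements into $ttl_sql,
# then pipe them into clickhouse-client to apply them.
ttl_days=7
ttl_sql=$(
  for t in trace_log query_log part_log processors_profile_log \
           metric_log asynchronous_metric_log query_views_log; do
    echo "ALTER TABLE system.${t} MODIFY TTL event_date + INTERVAL ${ttl_days} DAY DELETE;"
  done
  # opentelemetry_span_log keys its TTL on finish_date, not event_date
  echo "ALTER TABLE system.opentelemetry_span_log MODIFY TTL finish_date + INTERVAL ${ttl_days} DAY DELETE;"
)
echo "$ttl_sql"
```

For example, inside the ClickHouse pod: `echo "$ttl_sql" | clickhouse-client -n`.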

Environment

  • Chart version: clickstack-1.1.2
  • App version: 2.19.0
  • ClickHouse image: clickhouse/clickhouse-server:25.7-alpine
  • Kubernetes: OKE (Oracle Kubernetes Engine)
