feat: highly available clickhouse deployment #27
Conversation
The previous ClickHouse deployment was set up to run as a single-node cluster, preventing us from having a highly available deployment. The database schema has been adjusted to create a replicated database so that ClickHouse will automatically replicate schema changes and data between replicas.

Important: This schema change is a breaking change, so the environments will be re-created.

I've also adjusted the ordering of audit logs to use the timestamp as a secondary sort column so we can maintain strict ordering of audit logs. There are still benefits to maintaining the hourly bucketing, since ClickHouse can skip over entire hours of data through its indexes. The apiserver has been updated to use the new ordering.

The migration script will now wait for all replicas in the cluster to come up before executing migrations.

Relates to datum-cloud/enhancements#536
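As a rough sketch of what the replicated schema might look like (the database, table, and column names here are assumptions, and the table engine reflects the follow-up fix later in this thread):

```sql
-- Hypothetical sketch only; the real names and settings live in the migration scripts.
-- The Replicated database engine propagates DDL to every replica in the cluster.
CREATE DATABASE audit
ENGINE = Replicated('/clickhouse/databases/audit', '{shard}', '{replica}');

-- Hour bucket first so the primary index can skip entire hours of data,
-- timestamp second so audit logs keep a strict ordering within each hour.
CREATE TABLE audit.audit_logs
(
    timestamp DateTime64(3),
    actor     String,
    action    String,
    payload   String
)
ENGINE = ReplicatedReplacingMergeTree
ORDER BY (toStartOfHour(timestamp), timestamp);
```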
Need operational visibility into ClickHouse Keeper so we can monitor the component and understand replication delay between cluster replicas.
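One option for the replica-lag side (a sketch, assuming the database is named `audit`) is to scrape or alert on `system.replicas` from each ClickHouse replica, alongside whatever Keeper monitoring we add:

```sql
-- Hypothetical health check, run against each replica.
SELECT
    database,
    table,
    is_readonly,       -- replica cannot accept writes, e.g. after losing its Keeper session
    absolute_delay,    -- seconds this replica is behind the most up-to-date replica
    queue_size,        -- pending replication queue entries
    inserts_in_queue
FROM system.replicas
WHERE database = 'audit';
```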
We're using a replicated database now. Need to account for multiple replicas reporting the same metrics.
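A sketch of the double-counting problem, using the hypothetical `audit.audit_logs` table and a cluster named `default`: summing per-replica counts roughly doubles the real number, so the dashboard should take a single replica's figure (or the max across replicas) instead.

```sql
-- Per-replica row counts: every replica reports (roughly) the same data.
SELECT hostName() AS replica, count() AS rows
FROM clusterAllReplicas('default', audit.audit_logs)
GROUP BY replica;

-- One number for the dashboard: take the max across replicas rather than the sum.
SELECT max(rows) AS rows
FROM
(
    SELECT hostName() AS replica, count() AS rows
    FROM clusterAllReplicas('default', audit.audit_logs)
    GROUP BY replica
);
```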
These pod disruption budgets ensure the ClickHouse system can maintain quorum by allowing only a single replica of the database and a single Keeper instance to be offline at any given time.
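Roughly what those budgets look like (a sketch; resource names and label selectors are assumptions rather than the actual manifests):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-pdb
spec:
  maxUnavailable: 1                 # at most one database replica down at a time
  selector:
    matchLabels:
      app.kubernetes.io/name: clickhouse
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clickhouse-keeper-pdb
spec:
  maxUnavailable: 1                 # preserves Keeper quorum during voluntary disruptions
  selector:
    matchLabels:
      app.kubernetes.io/name: clickhouse-keeper
```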
We had an incident in production over the weekend after we deployed this change. To help mitigate the same issue in the future, I added resource requirements, topology spread constraints, and pod disruption budgets for the workloads to ensure they're hardened for production. I also had to switch the table engine to ReplicatedReplacingMergeTree so that data is replicated across replicas; I originally misunderstood the ClickHouse documentation and thought that using a replicated database engine was enough to configure replication of both DDL schema changes and the underlying data. I've also adjusted the audit log pipeline dashboard to account for replication now being enabled.
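The hardening amounts to something like the following pod template fragment (illustrative values and labels only; the real requests, limits, and selectors are in the manifests):

```yaml
# Hypothetical fragment of the ClickHouse pod template.
resources:
  requests:
    cpu: "2"
    memory: 8Gi
  limits:
    memory: 8Gi
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone    # spread replicas across zones
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: clickhouse
```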