Skip to content

Set telemetry retention policy asynchronously#10416

Open
bnaecker wants to merge 2 commits intomainfrom
ben/set-retention-policy-take2
Open

Set telemetry retention policy asynchronously#10416
bnaecker wants to merge 2 commits intomainfrom
ben/set-retention-policy-take2

Conversation

@bnaecker
Copy link
Copy Markdown
Collaborator

@bnaecker bnaecker commented May 8, 2026

  • Don't wait for a new TTL to actually take effect, only for the server to ack it. These can take a very long time on large tables, 100s of seconds, and waiting isn't useful in this context.
  • Update the clickhouse-admin API to return the configured retention policy for every table rather that assuming they're all the same
  • Recreate the clickhouse client if a TTL request fails, to ensure we can continue to set the TTL on the rest of the tables.
  • Slightly better filtering of table names in the usage output
  • Fixes Setting the telemetry TTL can take a while, and that should be OK #10415

@bnaecker bnaecker force-pushed the ben/set-retention-policy-take2 branch from c1c4d50 to d4fc219 Compare May 8, 2026 17:02
- Don't wait for a new TTL to actually take effect, only for the server
  to ack it. These can take a very long time on large tables, 100s of
  seconds, and waiting isn't useful in this context.
- Update the clickhouse-admin API to return the configured retention
  policy for every table rather that assuming they're all the same
- Recreate the clickhouse client if a TTL request fails, to ensure we
  can continue to set the TTL on the rest of the tables.
- Slightly better filtering of table names in the usage output
- Fixes #10415
@bnaecker bnaecker force-pushed the ben/set-retention-policy-take2 branch from d4fc219 to c6ebea4 Compare May 8, 2026 18:11
@bnaecker
Copy link
Copy Markdown
Collaborator Author

bnaecker commented May 8, 2026

Ok, I've tested this on berlin a few times. This exposes every table's retention policy separately, so that we can at least see via omdb whether all the tables are at the right version. It also makes the requests async, so it should not time out on huge tables. Here are the queries:

root@oxz_clickhouse_b9d79485:~# tail -F $(svcs -L clickhouse) | grep "ALTER TABLE"
2026.05.08 19:34:19.674015 [ 3 ] {cd342690-193d-40cf-8caf-c3104c172aae} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_bool MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.679861 [ 3 ] {455fc873-10a2-4621-a5e7-d5875679e9bd} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_i16 MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.690816 [ 3 ] {52fb2234-b6e7-4af9-99fd-729b4ab5b253} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_i32 MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.693464 [ 3 ] {a09c4b3b-8203-44c6-9eae-eab36c72352a} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_i64 MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.695813 [ 3 ] {9ba85cbf-ddbb-42c1-ad9b-cae11fad2be4} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_i8 MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.698314 [ 3 ] {cfb72fc6-f4aa-4869-b671-5b9082fb5be5} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_ipaddr MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.705600 [ 3 ] {db3d9a67-ed3e-45ff-832d-d55d3accd5b2} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_string MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.735129 [ 3 ] {a21cb4bb-0134-4085-ac36-c65be2900967} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_u16 MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.742227 [ 3 ] {6b3478e7-9664-4c2c-a6fb-d5399919e1d3} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_u32 MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.752926 [ 3 ] {6c7b4da8-f8db-47fc-af92-abc8aeaa6324} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_u64 MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.759102 [ 3 ] {120a470a-fe35-4746-b815-3b2b7973d177} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_u8 MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.766817 [ 3 ] {7e5ce318-1d97-42ac-b5a4-813cadcee5b0} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.fields_uuid MODIFY TTL last_updated_at + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.779876 [ 3 ] {dd467502-2a76-430b-b257-c7a82ed0e8fc} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_bool MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.786619 [ 3 ] {55f20113-1e88-45c3-b99e-50c0585559a7} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_bytes MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.788771 [ 3 ] {4b54da27-04ed-48c1-bf3c-3e29ae102966} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_cumulativef32 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.791156 [ 3 ] {1d201b1b-7e1d-4d44-9c80-6f0927864c99} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_cumulativef64 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.793326 [ 3 ] {3490bbe2-cb33-4f90-8932-3ccad88f8e2d} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_cumulativei64 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.795643 [ 3 ] {aab70624-dfbc-48bd-a5cf-6576b639dfdd} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_cumulativeu64 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:19.948835 [ 3 ] {c5dd09f1-a41e-486a-aa7b-fc0483650fdd} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_f32 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.071347 [ 3 ] {0f6c54ad-7076-40ac-b16d-1e87c99529fe} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_f64 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.076044 [ 3 ] {80c16db4-c6d9-40f1-b326-94576636c6b0} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_histogramf32 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.078689 [ 3 ] {87348187-b44c-424b-b015-c5e32cf4c341} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_histogramf64 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.081485 [ 3 ] {e6994afe-6e57-46eb-85c4-f78cd7491ad5} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_histogrami16 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.084427 [ 3 ] {74527fdb-f07f-4adc-80d5-9d3fe4e339ad} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_histogrami32 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.087360 [ 3 ] {5b8babbd-525f-4b7e-8156-4bc3586a2d9f} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_histogrami64 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.089911 [ 3 ] {ac0ed560-a10a-431d-ad6a-29850189ba73} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_histogrami8 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.092460 [ 3 ] {4da73e2f-ddac-4693-9783-c1e0cc497553} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_histogramu16 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.094890 [ 3 ] {4138f4b0-2516-44db-aee2-f4492c266e76} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_histogramu32 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.097458 [ 3 ] {02b55332-9f5e-4102-bb5b-7b9808fa4592} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_histogramu64 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.113752 [ 3 ] {84818397-0e84-4c24-8e28-5cdd1670b48c} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_histogramu8 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.116383 [ 3 ] {1063d9b9-6712-451e-890b-1ed3fe0cd3cf} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_i16 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.118782 [ 3 ] {f95a96f9-8875-4732-b533-7fd94643ef1d} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_i32 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.121115 [ 3 ] {627ca295-aa84-4d82-a607-bd84730ce59a} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_i64 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.127000 [ 3 ] {b0436254-f977-495b-b951-20cdb515ca29} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_i8 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.129273 [ 3 ] {9bda7eb8-a962-44ce-b85f-bd325e4605fd} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_string MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.131585 [ 3 ] {1fdaade8-4220-48e0-b0f2-9ccc25eb86f3} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_u16 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.133817 [ 3 ] {2b2bab01-adc3-43e7-a4fa-68697027306a} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_u32 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.135979 [ 3 ] {312235d3-3380-4b54-ac9d-535fb95bb610} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_u64 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.158775 [ 3 ] {7debd5be-7b9a-46b8-b637-dedb43f8609e} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE oximeter.measurements_u8 MODIFY TTL toDateTime(timestamp) + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)
2026.05.08 19:34:20.161077 [ 3 ] {5e09aad2-a0bb-4a57-9dd2-cb323d70ec1e} <Debug> executeQuery: (from [fd7e:2f1:4d74:102::5]:58666) ALTER TABLE system.query_log MODIFY TTL event_time + toIntervalDay(5) SETTINGS mutations_sync=0; (stage: Complete)

And the commands as issued from omdb:

root@oxz_switch1:~# omdb clickhouse-admin retention-policy
note: clickhouse-admin URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using clickhouse-admin URL http://[fd7e:2f1:4d74:102::5]:8888

Retention policy
TABLE_NAME                 DAYS
fields_bool                30
fields_i16                 30
fields_i32                 30
fields_i64                 30
fields_i8                  30
fields_ipaddr              30
fields_string              30
fields_u16                 30
fields_u32                 30
fields_u64                 30
fields_u8                  30
fields_uuid                30
measurements_bool          30
measurements_bytes         30
measurements_cumulativef32 30
measurements_cumulativef64 30
measurements_cumulativei64 30
measurements_cumulativeu64 30
measurements_f32           30
measurements_f64           30
measurements_histogramf32  30
measurements_histogramf64  30
measurements_histogrami16  30
measurements_histogrami32  30
measurements_histogrami64  30
measurements_histogrami8   30
measurements_histogramu16  30
measurements_histogramu32  30
measurements_histogramu64  30
measurements_histogramu8   30
measurements_i16           30
measurements_i32           30
measurements_i64           30
measurements_i8            30
measurements_string        30
measurements_u16           30
measurements_u32           30
measurements_u64           30
measurements_u8            30
root@oxz_switch1:~# omdb -w clickhouse-admin set-retention-policy --days 5
note: clickhouse-admin URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using clickhouse-admin URL http://[fd7e:2f1:4d74:102::5]:8888
root@oxz_switch1:~# omdb clickhouse-admin retention-policy
note: clickhouse-admin URL not specified.  Will pick one from DNS.
note: using DNS from system config (typically /etc/resolv.conf)
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using clickhouse-admin URL http://[fd7e:2f1:4d74:102::5]:8888

Retention policy
TABLE_NAME                 DAYS
fields_bool                5
fields_i16                 5
fields_i32                 5
fields_i64                 5
fields_i8                  5
fields_ipaddr              5
fields_string              5
fields_u16                 5
fields_u32                 5
fields_u64                 5
fields_u8                  5
fields_uuid                5
measurements_bool          5
measurements_bytes         5
measurements_cumulativef32 5
measurements_cumulativef64 5
measurements_cumulativei64 5
measurements_cumulativeu64 5
measurements_f32           5
measurements_f64           5
measurements_histogramf32  5
measurements_histogramf64  5
measurements_histogrami16  5
measurements_histogrami32  5
measurements_histogrami64  5
measurements_histogrami8   5
measurements_histogramu16  5
measurements_histogramu32  5
measurements_histogramu64  5
measurements_histogramu8   5
measurements_i16           5
measurements_i32           5
measurements_i64           5
measurements_i8            5
measurements_string        5
measurements_u16           5
measurements_u32           5
measurements_u64           5
measurements_u8            5

@bnaecker bnaecker requested review from jgallagher and smklein May 8, 2026 19:36
Comment thread clickhouse-admin/api/src/lib.rs
Comment thread clickhouse-admin/api/src/lib.rs
Comment thread oximeter/db/src/client/mod.rs
LIMIT 1;";
WHERE (\
database = 'oximeter' \
AND multiSearchAny(name, ['measurements', 'fields']) \
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to be "startsWith", but is now a substring match - do we think we're unlikely to add a table with one of these words used as a non-prefix?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we could add an integration test here. We should always know exactly which tables will exist in clickhouse, so an integration test should catch any regressions.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's mostly because there's no method that I could find for checking whether a string starts with a number of prefixes. So this is searching anywhere within the string. I can change this to a condition like "starts with A or starts with B" if that's preferable.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about adding an integration test like this one? I didn't catch this, but claude found that we panic on reading the retention policy from system.query_log if it exists. I think we need at least the fix in that branch, or something like it, to merge.

Comment thread oximeter/db/src/client/mod.rs
policy: TypedBody<latest::retention::RetentionPolicyRequest>,
) -> Result<HttpResponseUpdatedNoContent, HttpError>;

/// Set the retention policy for timeseries data.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it would be worth a note that this update is eventually consistent. IIUC, a user could set the retention policy, immediately read it back, and be surprised that it's not (yet) the expected value.

And a question: with SETTINGS mutations_sync=0, is it possible to queue an update that never succeeds? Speaking of which, have we tested what happens here when the disk is full, which is the issue that triggered this feature in the first place? I don't want to delay merging on this question, but we might want to figure it out sometime if we haven't already.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea about the note.

Yes, I think it's possible for a background mutation to fail. The description for the system.mutations table lists columns like last_fail_time, for example. I can't find on that page or any other whether mutations are retried. The phrasing "last fail time" makes me think they are, but that's not clear.

I'm not sure what happens if the disk is full, we haven't attempted it and it is hard to reproduce. From this page it appears that whole data parts (chunks of tables) are rewritten according to the mutation, and then the original is dropped. Depending on how the TTL is implemented, that could certainly lead to write amplification. It's a valid concern, and we should flag it for testing. I'm...fairly confident... that it will not cause more pain than we've already had because we're expecting to reduce the retention time which would lead to all or most of a data part being dropped, not rewritten with say, a different order. Or maybe I'm just hopeful.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually there's this:

Unlike merges, mutations can't be rolled back once submitted and will continue to execute even after server restarts unless explicitly cancelled

From this page ominously named "Avoid mutations". I agree it's a risk, but this is the only way to actually set a TTL other than managing it ourselves, which I think is out of scope.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note added in ec2050b

Comment thread oximeter/db/src/client/mod.rs Outdated
Comment thread clickhouse-admin/api/src/lib.rs
@bnaecker
Copy link
Copy Markdown
Collaborator Author

bnaecker commented May 8, 2026

@smklein I can't respond directly to your question, but no I don't believe the default was to to exec these asynchronously. See the logs from #10415. We're clearly waiting for >530s for the actual mutation to finish before returning to the client. I'm not sure how to square that with the docs though.

@bnaecker
Copy link
Copy Markdown
Collaborator Author

bnaecker commented May 8, 2026

@smklein See https://clickhouse.com/docs/sql-reference/statements/alter#synchronicity-of-alter-queries.

For non-replicated tables, all ALTER queries are performed synchronously. For replicated tables, the query just adds instructions for the appropriate actions to ZooKeeper, and the actions themselves are performed as soon as possible. However, the query can wait for these actions to be completed on all the replicas.

@bnaecker
Copy link
Copy Markdown
Collaborator Author

bnaecker commented May 8, 2026

Thanks @smklein and @jmcarp for the helpful suggestions. I've addressed most of the comments in ec2050b. LMK if there's anything else.

@bnaecker bnaecker requested review from jmcarp and smklein May 8, 2026 22:24
Copy link
Copy Markdown
Collaborator

@smklein smklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though @jmcarp seems to have a much better grasp of this code than me. you might want to wait for his approval as well

@jmcarp
Copy link
Copy Markdown
Contributor

jmcarp commented May 9, 2026

I was, and maybe still am, confused about the behavior of mutations_sync for ALTER TABLE...MODIFY TTL. The linked docs look promising, but they're a bit hedge-y (queries "including, but not limited to"), and I don't know if they apply to the older ch version we run on the rack. I had claude write me a little script to benchmark ALTER TABLE...MODIFY TTL on a not-tiny table under different settings on a helios vm using our actual ch binary, and here's what I got:

┌───────────────────────────────────────────┬─────────────┬────────────┬─────────────────────────────────────┬────────────────┐
  │                  Variant                  │ Server time │ Wall clock │          Mutation queued?           │ Rows >7d after │
  ├───────────────────────────────────────────┼─────────────┼────────────┼─────────────────────────────────────┼────────────────┤
  │ defaults                                  │ 3.035s      │ 3130ms     │ yes, finished before ALTER returned │ 0              │
  ├───────────────────────────────────────────┼─────────────┼────────────┼─────────────────────────────────────┼────────────────┤
  │ SETTINGS mutations_sync = 0               │ 2.830s      │ 2920ms     │ yes, finished before ALTER returned │ 0              │
  ├───────────────────────────────────────────┼─────────────┼────────────┼─────────────────────────────────────┼────────────────┤
  │ SETTINGS alter_sync = 0                   │ 4.592s      │ 4693ms     │ yes, finished before ALTER returned │ 0              │
  ├───────────────────────────────────────────┼─────────────┼────────────┼─────────────────────────────────────┼────────────────┤
  │ SETTINGS materialize_ttl_after_modify = 0 │ 0.009s      │ 102ms      │ no mutation at all                  │ 38,358,497     │
  └───────────────────────────────────────────┴─────────────┴────────────┴─────────────────────────────────────┴────────────────┘

Maybe the script is bad, but it looks like ch v23.8.x doesn't respect either mutations_sync or alter_sync the way we would expect from the (current) docs. Note: I'm getting sleepy and haven't read the script carefully, and it's written for my specific helios vm. I'll take a closer look Monday, or maybe earlier if my kids behave. But for now, I'm confused by why this would have worked on berlin—either my test is wrong/unrealistic, or something else is going on.

Even if the mutations_sync=0 is a no-op (which...I want to say it should be?), I think this patch is an improvement once we fix the edge case around system.query_log, but I'm still unsure how about whether/why it affects the MODIFY TTL queries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Setting the telemetry TTL can take a while, and that should be OK

3 participants