Conversation

Contributor

@danhhz danhhz commented Oct 12, 2023

We've seen evidence in production of this operator often going more than 5s without yielding, especially during rehydration and large unindexed SELECTs. This affects interactivity as well as the time between when a timely-driven Future is woken and when it is scheduled again.

Attempt to fix this with two tweaks to our yield heuristics. First, decrease the fuel between yields from 1_000_000 to 100_000. Second, also count each mfp evaluation against our fuel, in case it contains a restrictive filter.

The new 100_000 constant comes from eyeballing a replica in prod that was non-interactive. At the time, it was decoding ~300k rows per second, so 100k doesn't seem terribly low in comparison.
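
For anyone who hasn't seen the fueling pattern in these operators, here's a minimal sketch of the idea using the 100_000 figure above (hypothetical names and stand-in decode/MFP steps, not the actual persist_source code):

// A minimal, hypothetical sketch of fuel-based yielding. The operator
// processes input until its fuel budget is spent, then yields back to the
// timely scheduler and resumes on the next activation.
const YIELD_FUEL: usize = 100_000;

/// Returns how many encoded rows were handled before the operator would yield.
fn decode_and_mfp(input: impl Iterator<Item = Vec<u8>>) -> usize {
    let mut work = 0;
    let mut processed = 0;
    for encoded in input {
        processed += 1;
        // Stand-in for the expensive key/val decode.
        let decoded_len = encoded.len();
        // New in this PR: the mfp evaluation itself counts as work, so a
        // restrictive filter that emits nothing still burns fuel.
        work += 1;
        // Stand-in for MFP evaluation: a filter that drops most rows.
        let outputs = usize::from(decoded_len % 10 == 0);
        work += outputs;
        if work >= YIELD_FUEL {
            break; // yield; the rest of the input waits for the next activation
        }
    }
    processed
}

fn main() {
    let rows = (0..1_000_000usize).map(|i| vec![0u8; i % 100]);
    println!("processed {} rows before yielding", decode_and_mfp(rows));
}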

Touches MaterializeInc/database-issues#6497

Motivation

  • This PR fixes a recognized bug.

Tips for reviewer

Gonna sanity check any performance regressions from this with a nightlies run, in particular the feature benchmark.

For context, one of the affected replicas is spending about 3.4 cores out of 6 to decode a little under 300k rows per second (~11 μs each).

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@danhhz danhhz requested review from a team, bkirwi and teskje October 12, 2023 17:59
// work. However, in practice, we've seen decode_and_mfp be a source of
// interactivity loss during rehydration, so we now also count each mfp
// evaluation against our fuel.
*work += 1;
Contributor

wondering if we want to move this up even further... both FetchedPart::next and until.less_equal(&time) can filter out an arbitrary number of tuples before producing a candidate for mfp to evaluate

Contributor Author

Good point!

Not sure what we could do about the filtering inside FetchedPart::next but maybe that's okay because it's only looking at the timestamp (we always return after decoding key and val, which is presumably the expensive part).

The until.less_equal(&time) is probably not a terribly selective filter in the rehydration case, but might be worth it in the spirit of being defensive. We'd probably be double counting work then in the common case, no idea if we would feel bad about that. Maybe we should change it to be 1 work per decoded key/val plus `mfp_output.len().saturating_sub(2)` (the latter basically being how many "extra" outputs are generated).
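
Spelled out, that accounting could look roughly like this (illustrative only, not code from the PR):

// One reading of the suggestion above: one unit of work per decoded key/val,
// plus the number of "extra" outputs the MFP produced beyond the first couple.
fn work_for_row(mfp_output_len: usize) -> usize {
    1 + mfp_output_len.saturating_sub(2)
}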

Contributor

Double counting seems fine to me, it's all an approximation right now. I think I'd feel better about that if the yield threshold were configurable

Contributor

other thought: we could add a debug! that outputs the time spent before yielding and # of inputs and # of outputs (or create additional metrics for those) to guide any future tuning

Contributor Author

Given that there's a decent amount of hesitation around this, I've decided to keep the change as minimal as possible. Adding something for each mfp eval is pretty critical, so I think we have to live with that change, but I think moving the filtering is a bit speculative.

If possible, I'd also love to punt on any additional metrics or introspection in this PR because I don't really have a ton of time to devote to it atm, but I really think we want the knob in prod.

Contributor

@teskje teskje left a comment

I'm not qualified to speak to persist_source specifics, so just some general thoughts:

  • How much is flow control a concern here? In compute operators, yielding more often means letting upstream operators push in more data, leading to potentially higher memory usage. Is this true for the persist_source as well? There are no upstream operators to a persist_source, but it does consist of multiple operators and it looks like the data fetching is not fueled.
  • Have you considered using time-based yielding? That might provide more consistent results given that it's hard to know how expensive the decode-and-MFP step is for any given data schema and MFP.
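
A minimal sketch of what time-based yielding could look like, assuming an arbitrary 10ms wall-clock budget (nothing here is from the PR):

use std::time::{Duration, Instant};

// Rough sketch of time-based yielding: instead of a fuel counter, check how
// long the operator has been running and yield once the time budget is spent.
const YIELD_AFTER: Duration = Duration::from_millis(10);

fn run_until_yield(mut step: impl FnMut() -> bool) {
    let start = Instant::now();
    // `step` does one unit of decode+mfp work; it returns false when the
    // input is exhausted.
    while step() {
        if start.elapsed() >= YIELD_AFTER {
            break; // yield back to the scheduler; resume on the next activation
        }
    }
}

fn main() {
    let mut remaining = 5_000_000u64;
    run_until_yield(|| {
        remaining -= 1;
        remaining > 0
    });
    println!("{remaining} units of work left for the next activation");
}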

Anyway, this seems like a good incremental change, provided we don't see performance regressions in tests.

Contributor

bkirwi commented Oct 13, 2023

How much is flow control a concern here?

My feeling is that, if flow control is a problem, we have other tools to solve it in the source. So personally I'm most interested in what the answer would be assuming flow control is not a problem.

Have you considered using time-based yielding?

I think the original hope of exposing this knob was that the caller (ie. you) would consider time-based yielding... hence why this is configured in eg. compute/src/render/. But in practice it seems like neither storage nor compute are particularly comfortable tuning this knob. :/

To put the question back to you: how often should an operator yield?

Contributor

bkirwi commented Oct 13, 2023

(That said... I think this tuning is probably an improvement since the current one seems empirically bad! But I'm not sure what's up with those CI failures...)

Contributor

teskje commented Oct 14, 2023

But in practice it seems like neither storage nor compute are particularly comfortable tuning this knob. :/

Yeah, changing this setting is pretty scary, because of the unknown consequences for flow control and throughput. I don't know if we have a good way to estimate these consequences before we release to prod.

To put the question back to you: how often should an operator yield?

Good question! I think from a user's perspective, waiting for more than a second for your query to return becomes annoying. A dataflow can have a bunch of expensive operators that get scheduled in sequence, so ideally each operator would yield after well under 1 second. I would naively say 10ms sounds good, but who knows what impact that has on throughput...

For joins I want to try an approach where the yielding behavior is configurable through an LD flag (#22391). I hope this will let us try different values quickly and with little risk. We could consider that for persist_source as well?

We've seen evidence in production of this operator often going more than
5s without yielding, especially during rehydration and large unindexed
SELECTs. This affects interactivity as well as the time between when a
timely-driven Future is woken and when it is scheduled again.

Attempt to fix this with two tweaks to our yield heuristics. First, make
the fuel between yields configurable through LD. Second, also count each
mfp evaluation against our fuel, in case it contains a restrictive
filter.
@danhhz danhhz requested a review from a team as a code owner October 18, 2023 20:48
Contributor Author

danhhz commented Oct 18, 2023

Okay, in an effort to get something that we can land for the release cut tomorrow, I've switched this to be a knob in Launch Darkly, but otherwise left it as minimal as possible (including using the current value as the default for the config). This means that without an override in LD, the PR is a no-op except that we've started counting an mfp evaluation as a unit of work equivalent to an output. If we think that's still too much of a risk, we could make it something like max(1, mfp outputs) (where main is currently mfp outputs and the PR is currently 1 + mfp_outputs). PTAL!
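
To spell out the accounting options being compared here (illustrative helpers, not the actual operator code):

// Work charged per decoded key/val pair, given how many rows the MFP emitted.
fn work_on_main(mfp_outputs: usize) -> usize {
    mfp_outputs // before this PR: only outputs count, so a restrictive filter is "free"
}

fn work_in_this_pr(mfp_outputs: usize) -> usize {
    1 + mfp_outputs // this PR: the mfp evaluation itself also counts
}

fn work_fallback(mfp_outputs: usize) -> usize {
    std::cmp::max(1, mfp_outputs) // the less aggressive option floated above
}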

The test failures from before were spooky, so I think we should get in another nightlies run with the config overridden to 100_000 before using it in prod. However, I'm going to start by kicking off a nightlies run with the PR exactly as it is to sanity check that the +1 doesn't break anything.

Contributor Author

danhhz commented Oct 19, 2023

Looks like these nightly failures are known flakes on main

Contributor

@bkirwi bkirwi left a comment

Looks good! If the new definition of work causes issues we can probably tune our way past them with the knob.

Contributor

@teskje teskje left a comment

LGTM!

Contributor Author

danhhz commented Oct 19, 2023

TFTRs!

@danhhz danhhz merged commit 8279683 into MaterializeInc:main Oct 19, 2023
@danhhz danhhz deleted the persist_yield branch October 19, 2023 16:05