Conversation

Contributor

@danhhz danhhz commented Oct 12, 2023

We've seen evidence in production of this operator often going more than 5s without yielding, especially during rehydration and large unindexed SELECTs. This affects interactivity as well as the time between when a timely-driven Future is woken and when it is scheduled again.

Attempt to fix this with two tweaks to our yield heuristics. First, decrease the fuel between yields from 1_000_000 to 100_000. Second, also count each mfp evaluation against our fuel, in case it contains a restrictive filter.

The new 100_000 constant comes from eyeballing a replica in prod that was non-interactive. At the time, it was decoding ~300k rows per second, so 100k doesn't seem terribly low in comparison.
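
For anyone who hasn't seen the fueling pattern in these operators, here's a minimal sketch of the idea using the 100_000 figure above (hypothetical names and stand-in decode/MFP steps, not the actual persist_source code):

// A minimal, hypothetical sketch of fuel-based yielding. The operator
// processes input until its fuel budget is spent, then yields back to the
// timely scheduler and resumes on the next activation.
const YIELD_FUEL: usize = 100_000;

/// Returns how many encoded rows were handled before the operator would yield.
fn decode_and_mfp(input: impl Iterator<Item = Vec<u8>>) -> usize {
    let mut work = 0;
    let mut processed = 0;
    for encoded in input {
        processed += 1;
        // Stand-in for the expensive key/val decode.
        let decoded_len = encoded.len();
        // New in this PR: the mfp evaluation itself counts as work, so a
        // restrictive filter that emits nothing still burns fuel.
        work += 1;
        // Stand-in for MFP evaluation: a filter that drops most rows.
        let outputs = usize::from(decoded_len % 10 == 0);
        work += outputs;
        if work >= YIELD_FUEL {
            break; // yield; the rest of the input waits for the next activation
        }
    }
    processed
}

fn main() {
    let rows = (0..1_000_000usize).map(|i| vec![0u8; i % 100]);
    println!("processed {} rows before yielding", decode_and_mfp(rows));
}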

Touches MaterializeInc/database-issues#6497

Motivation

  • This PR fixes a recognized bug.

Tips for reviewer

Gonna sanity check any performance regressions from this with a nightlies run, in particular the feature benchmark.

For context, one of the affected replicas is spending about 3.4 cores out of 6 to decode a little under 300k rows per second (~11 μs each).

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@danhhz danhhz requested review from a team, bkirwi and teskje October 12, 2023 17:59
// work. However, in practice, we've seen decode_and_mfp be a source of
// interactivity loss during rehydration, so we now also count each mfp
// evaluation against our fuel.
*work += 1;
Contributor

wondering if we want to move this up even further... both FetchedPart::next and until.less_equal(&time) can filter out an arbitrary number of tuples before producing a candidate for mfp to evaluate

Contributor Author

Good point!

Not sure what we could do about the filtering inside FetchedPart::next but maybe that's okay because it's only looking at the timestamp (we always return after decoding key and val, which is presumably the expensive part).

The until.less_equal(&time) is probably not a terribly selective filter in the rehydration case, but might be worth it in the spirit of being defensive. We'd probably be double counting work then in the common case, no idea if we would feel bad about that. Maybe we should change it to be 1 work per decoded key/val plus `mfp_output.len().saturating_sub(2)` (the latter basically being how many "extra" outputs are generated).
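
Spelled out, that accounting could look roughly like this (illustrative only, not code from the PR):

// One reading of the suggestion above: one unit of work per decoded key/val,
// plus the number of "extra" outputs the MFP produced beyond the first couple.
fn work_for_row(mfp_output_len: usize) -> usize {
    1 + mfp_output_len.saturating_sub(2)
}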

Contributor

Double counting seems fine to me, it's all an approximation right now. I think I'd feel better about that if the yield threshold were configurable

Contributor

other thought: we could add a debug! that outputs the time spent before yielding and # of inputs and # of outputs (or create additional metrics for those) to guide any future tuning

Contributor Author

Given that there's a decent amount of hesitation around this, I've decided to keep the change as minimal as possible. Adding something for each mfp eval is pretty critical, so I think we have to live with that change, but I think moving the filtering is a bit speculative.

If possible, I'd also love to punt on any additional metrics or introspection in this PR because I don't really have a ton of time to devote to it atm, but I really think we want the knob in prod.

Contributor

@teskje teskje left a comment

I'm not qualified to speak to persist_source specifics, so just some general thoughts:

  • How much is flow control a concern here? In compute operators, yielding more often means letting upstream operators push in more data, leading to potentially higher memory usage. Is this true for the persist_source as well? There are no upstream operators to a persist_source, but it does consist of multiple operators and it looks like the data fetching is not fueled.
  • Have you considered using time-based yielding? That might provide more consistent results given that it's hard to know how expensive the decode-and-MFP step is for any given data schema and MFP.
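
A minimal sketch of what time-based yielding could look like, assuming an arbitrary 10ms wall-clock budget (nothing here is from the PR):

use std::time::{Duration, Instant};

// Rough sketch of time-based yielding: instead of a fuel counter, check how
// long the operator has been running and yield once the time budget is spent.
const YIELD_AFTER: Duration = Duration::from_millis(10);

fn run_until_yield(mut step: impl FnMut() -> bool) {
    let start = Instant::now();
    // `step` does one unit of decode+mfp work; it returns false when the
    // input is exhausted.
    while step() {
        if start.elapsed() >= YIELD_AFTER {
            break; // yield back to the scheduler; resume on the next activation
        }
    }
}

fn main() {
    let mut remaining = 5_000_000u64;
    run_until_yield(|| {
        remaining -= 1;
        remaining > 0
    });
    println!("{remaining} units of work left for the next activation");
}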

Anyway, this seems like a good incremental change, provided we don't see performance regressions in tests.

Contributor

bkirwi commented Oct 13, 2023

How much is flow control a concern here?

My feeling is that, if flow control is a problem, we have other tools to solve it in the source. So personally I'm most interested in what the answer would be assuming flow control is not a problem.

Have you considered using time-based yielding?

I think the original hope of exposing this knob was that the caller (ie. you) would consider time-based yielding... hence why this is configured in eg. compute/src/render/. But in practice it seems like neither storage nor compute are particularly comfortable tuning this knob. :/

To put the question back to you: how often should an operator yield?

Contributor

bkirwi commented Oct 13, 2023

(That said... I think this tuning is probably an improvement since the current one seems empirically bad! But I'm not sure what's up with those CI failures...)

Contributor

teskje commented Oct 14, 2023

But in practice it seems like neither storage nor compute are particularly comfortable tuning this knob. :/

Yeah, changing this setting is pretty scary, because of the unknown consequences for flow control and throughput. I don't know if we have a good way to estimate these consequences before we release to prod.

To put the question back to you: how often should an operator yield?

Good question! I think from a user's perspective, waiting for more than a second for your query to return becomes annoying. A dataflow can have a bunch of expensive operators that get scheduled in sequence, so ideally each operator would yield after well under 1 second. I would naively say 10ms sounds good, but who knows what impact that has on throughput...

For joins I want to try an approach where the yielding behavior is configurable through an LD flag (#22391). I hope this will let us try different values quickly and with little risk. We could consider that for persist_source as well?

We've seen evidence in production of this operator often going more than
5s without yielding, especially during rehydration and large unindexed
SELECTs. This affects interactivity as well as the time between when a
timely-driven Future is woken and when it is scheduled again.

Attempt to fix this with two tweaks to our yield heuristics. First, make
the fuel between yields configurable through LD. Second, also count each
mfp evaluation against our fuel, in case it contains a restrictive
filter.
@danhhz danhhz requested a review from a team as a code owner October 18, 2023 20:48
Contributor Author

danhhz commented Oct 18, 2023

Okay, in an effort to get something that we can land for the release cut tomorrow, I've switched this to be a knob in Launch Darkly, but otherwise left it as minimal as possible (including using the current value as the default for the config). This means that without an override in LD, the PR is a no-op except that we've started counting an mfp evaluation as a unit of work equivalent to an output. If we think that's still too much of a risk, we could make it something like max(1, mfp outputs) (where main is currently mfp outputs and the PR is currently 1 + mfp_outputs). PTAL!
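
To spell out the accounting options being compared here (illustrative helpers, not the actual operator code):

// Work charged per decoded key/val pair, given how many rows the MFP emitted.
fn work_on_main(mfp_outputs: usize) -> usize {
    mfp_outputs // before this PR: only outputs count, so a restrictive filter is "free"
}

fn work_in_this_pr(mfp_outputs: usize) -> usize {
    1 + mfp_outputs // this PR: the mfp evaluation itself also counts
}

fn work_fallback(mfp_outputs: usize) -> usize {
    std::cmp::max(1, mfp_outputs) // the less aggressive option floated above
}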

The test failures from before were spooky, so I think we should get in another nightlies run with the config overridden to 100_000 before using it in prod. However, I'm going to start by kicking off a nightlies run with the PR exactly as it is to sanity check that the +1 doesn't break anything.

Contributor Author

danhhz commented Oct 19, 2023

Looks like these nightly failures are known flakes on main

Contributor

@bkirwi bkirwi left a comment

Looks good! If the new definition of work causes issues we can probably tune our way past them with the knob.

Contributor

@teskje teskje left a comment

LGTM!

Contributor Author

danhhz commented Oct 19, 2023

TFTRs!

@danhhz danhhz merged commit 8279683 into MaterializeInc:main Oct 19, 2023
@danhhz danhhz deleted the persist_yield branch October 19, 2023 16:05