Bug 1944375 - Improve data cycling/deletion for performance datum replicates #9136
Conversation
gmierz left a comment:
Looks great @junngo! I tested it out locally and didn't see any issues.
What do you think about splitting this PR a bit though? I'd rather we land one change at a time (from less risky to more risky) to reduce the chances of something unexpected happening. We could start by landing the changes to the try data removal strategy and expand from there if there are no issues. If something bad happens with try data, it isn't as big of an issue versus something happening to our autoland/mozilla-central data.
```python
).delete()
using.rowcount = deleted
using.execute(
    """
```
Could we use f-strings for these instead of %s? I think that would make the raw queries a bit more robust.
I agree that f-strings can look a bit cleaner and more readable :)
I avoided f-strings here and used parameter binding instead.
It helps prevent SQL injection, lets PostgreSQL reuse query plans (cached queries) across chunks, and keeps type handling stable. Because of these benefits, I prefer this approach here.
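As a hedged illustration of the point above (demonstrated with sqlite3 so it runs standalone; Treeherder's Postgres cursor uses the `%s` paramstyle while sqlite3 uses `?`, and the table and column names here are made up, not the real schema):

```python
import sqlite3

# Standalone sketch of parameter binding: the SQL text stays constant
# across chunks while the driver binds the values, which prevents
# injection and lets the database reuse the same query plan.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE performance_datum (id INTEGER PRIMARY KEY, push_ts INTEGER)")
cur.executemany(
    "INSERT INTO performance_datum (id, push_ts) VALUES (?, ?)",
    [(i, i * 10) for i in range(1, 11)],
)

cutoff, chunk_size = 60, 3
# cutoff and chunk_size are bound parameters, never interpolated into
# the query string with f-strings or %-formatting.
cur.execute(
    """
    DELETE FROM performance_datum
    WHERE id IN (
        SELECT id FROM performance_datum
        WHERE push_ts < ?
        LIMIT ?
    )
    """,
    (cutoff, chunk_size),
)
deleted = cur.rowcount  # rows removed in this chunk
```

Because the statement text never changes between chunks, only the bound values do, the database can keep reusing one prepared plan.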
Oh excellent point! I missed that we're not using a % to add the string options so this approach makes sense to keep.
```python
).values_list("id")[:chunk_size]
).delete()
using.rowcount = deleted
using.execute(
```
Can you add comments somewhere around the execute call to provide an explanation about what each part of the query does? We should do the same for all the other queries.
Sure, I’ll add comments around the execute call to explain what each part of the query does, and mirror that for the other queries as well.
Thanks for the review @gmierz
Hi there :)
I worked on an improvement for the cycling data job. (https://bugzilla.mozilla.org/show_bug.cgi?id=1944375)
Issue & Background:
The cycling data job currently takes about 4-12 hours to complete. The main cause is related to the `PerformanceDatumReplicate` table. `PerformanceDatumReplicate` has a foreign key relationship to the `PerformanceDatum` table [0]. When `PerformanceDatum` rows are deleted, the corresponding `PerformanceDatumReplicate` rows are automatically removed via Django ORM cascade behavior.

During the cycling job:
- Rows are deleted from the main table (`PerformanceDatum`) per chunk.
- The corresponding rows in `PerformanceDatumReplicate` are resolved as part of the deletion process.
- `PerformanceDatumReplicate` currently contains over 500 million rows.
- Each cascade deletion triggers a sequential (full) table scan on `PerformanceDatumReplicate`. Even when executed only once per chunk, it can take 3-4 minutes.
Since the cycling job runs this logic repeatedly across many chunks, the total execution time grows to several hours.
This PR focuses on improving that behavior by avoiding repeated full table scans and reducing the overall runtime of the cycling data job.
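The chunked deletion described above can be sketched roughly as follows (illustrative only, using sqlite3 so the example is self-contained; the real job runs against Postgres with a different schema and chunk size):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE performance_datum (id INTEGER PRIMARY KEY, push_ts INTEGER)")
cur.executemany(
    "INSERT INTO performance_datum VALUES (?, 0)", [(i,) for i in range(1, 26)]
)

CHUNK_SIZE = 10  # assumed chunk size, not the cycler's real setting
chunks = 0
while True:
    # Delete at most CHUNK_SIZE expired rows per pass; the same bound
    # statement repeats once per chunk, so any per-chunk cost (such as a
    # full table scan on a dependent table) multiplies across the run.
    cur.execute(
        "DELETE FROM performance_datum WHERE id IN "
        "(SELECT id FROM performance_datum WHERE push_ts < ? LIMIT ?)",
        (1, CHUNK_SIZE),
    )
    if cur.rowcount == 0:
        break
    chunks += 1

remaining = cur.execute("SELECT COUNT(*) FROM performance_datum").fetchone()[0]
```

With 25 expired rows and a chunk size of 10, the loop runs three deleting passes, which is why shaving minutes off each pass matters so much at 500M+ rows.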
Notes
- `PerformanceDatumReplicate` rows are automatically removed via Django ORM cascade behavior when `PerformanceDatum` rows are deleted.
- Same filters in the `del_replicate` CTE: `target_datum` already enforces these conditions, but keeping the same filters in `del_replicate` helps the Postgres planner choose a more stable execution plan for `performance_datum_replicate`.
- `EXISTS` in the `del_replicate` CTE: the `EXISTS` clause is used to check whether replicate rows exist before touching `performance_datum_replicate`.
- `ORDER BY` in the `target_datum` CTE of the MainRemovalStrategy: the `ORDER BY` allows the planner to more consistently leverage index scans.

Query Plan Comparison
Before
After
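For readers skimming the notes, here is a hedged, non-authoritative sketch of the query shape they describe (PostgreSQL syntax; the CTE names `target_datum` and `del_replicate` come from the notes, but every column, filter, and join key below is an assumption, not the PR's actual query):

```python
# Not executed here: Postgres data-modifying CTEs do not port to SQLite.
# Identifiers other than target_datum / del_replicate are guesses.
query = """
WITH target_datum AS (
    -- One chunk of expired datum ids; ORDER BY nudges the planner
    -- toward a consistent index scan.
    SELECT id
    FROM performance_datum
    WHERE push_timestamp < %s
    ORDER BY id
    LIMIT %s
),
del_replicate AS (
    -- EXISTS guards the touch on performance_datum_replicate, and the
    -- chunk filter mirrors target_datum to keep the plan stable.
    DELETE FROM performance_datum_replicate r
    WHERE EXISTS (
        SELECT 1 FROM target_datum t
        WHERE t.id = r.performance_datum_id
    )
)
DELETE FROM performance_datum
WHERE id IN (SELECT id FROM target_datum)
"""
```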
[0] treeherder/treeherder/perf/models.py, line 276 in a82c683
[1] Only as a reference: #9107