
Conversation

@junngo (Contributor) commented Dec 26, 2025

Hi there :)
I worked on an improvement for the cycling data job. (https://bugzilla.mozilla.org/show_bug.cgi?id=1944375)

Issue & Background:

The cycling data job currently takes about 4-12 hours to complete. The main cause is the PerformanceDatumReplicate table, which has a foreign key relationship to the PerformanceDatum table [0].
When PerformanceDatum rows are deleted, the corresponding PerformanceDatumReplicate rows are automatically removed via Django ORM cascade behavior.

During the cycling job:

  • The job selects up to 10,000 rows from the leading table (PerformanceDatum) per chunk.
  • For each chunk, the related rows in PerformanceDatumReplicate are deleted as part of the cascade:
    DELETE FROM performance_datum_replicate
    WHERE performance_datum_id IN ($1,$2,$3,$4,$5, ... ,$10000)
    
  • PerformanceDatumReplicate currently contains over 500 million rows.
  • In certain query plans, this results in a sequential (full) table scan on PerformanceDatumReplicate.

Even when executed only once per chunk, it can take 3-4 minutes.
Since the cycling job runs this logic repeatedly across many chunks, the total execution time grows to several hours.

This PR focuses on improving that behavior by avoiding repeated full table scans and reducing the overall runtime of the cycling data job.
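
For orientation, the Django side of the per-chunk deletion looks roughly like the sketch below. This is a simplified illustration, not the job's exact code: the push_timestamp filter, cutoff_date, and chunk_size are placeholders, and the model import assumes PerformanceDatum lives in treeherder.perf.models.

    # Simplified sketch of the current per-chunk deletion (placeholders, not the real job code).
    from treeherder.perf.models import PerformanceDatum

    deleted, _ = PerformanceDatum.objects.filter(
        id__in=PerformanceDatum.objects.filter(
            push_timestamp__lt=cutoff_date  # placeholder condition
        ).values_list("id")[:chunk_size]
    ).delete()
    # The ORM-level delete cascades into PerformanceDatumReplicate, which is what
    # emits the large "WHERE performance_datum_id IN (...)" DELETE shown above.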

Notes

  1. Reason for using a raw query instead of the Django ORM
    • PerformanceDatumReplicate rows are automatically removed via Django ORM cascade behavior when PerformanceDatum rows are deleted.
    • In order to control the deletion process and influence the query plan at the SQL level, this logic needs to be implemented using a raw query rather than the ORM.
  2. Reason for keeping duplicate condition checks in the del_replicate CTE
    • Although target_datum already enforces these conditions, keeping the same filters in del_replicate helps the Postgres planner choose a more stable execution plan and consistently use index-based access paths for performance_datum_replicate (a rough sketch of the overall query shape follows after this list).
  3. Reason for using EXISTS in the del_replicate CTE
    • The EXISTS clause is used to check whether replicate rows exist before touching performance_datum_replicate.
    • This avoids unnecessary access to the replicate table and is generally cheaper than performing a join or scan.
  4. Reason for using ORDER BY in the target_datum CTE of the MainRemovalStrategy
    • Using ORDER BY allows the planner to more consistently leverage index scans.
  5. Test results
    • The query plan was verified in Redash using EXPLAIN.
    • The resulting plans consistently used index scans instead of sequential table scans.
    • Execution time was observed to be approximately 0-10 seconds per chunk.
    • Actual performance may vary depending on the current state of the database.
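
To make the notes above concrete, here is a rough sketch of the query shape they describe. The CTE names target_datum and del_replicate come from the notes, while the filters, parameters, and exact structure are illustrative assumptions rather than the PR's actual SQL.

    # Rough sketch only; the PR's real query differs in filters and structure.
    using.execute(
        """
        WITH target_datum AS (
            -- Note 4: one chunk of expired performance_datum ids; ORDER BY helps
            -- the planner settle on an index scan.
            SELECT id
            FROM performance_datum
            WHERE push_timestamp < %s      -- placeholder cutoff condition
            ORDER BY id
            LIMIT %s
        ),
        del_replicate AS (
            -- Notes 2-3: EXISTS keeps access to performance_datum_replicate on its
            -- performance_datum_id index; the real query also repeats the
            -- target_datum filters here to stabilize the plan.
            DELETE FROM performance_datum_replicate pdr
            WHERE EXISTS (
                SELECT 1 FROM target_datum td
                WHERE td.id = pdr.performance_datum_id
            )
            RETURNING 1
        )
        SELECT count(*) FROM del_replicate
        """,
        [cutoff_date, chunk_size],
    )
    # The performance_datum rows for the chunk are removed in a separate step.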

Query Plan Comparison

Before

  • PostgreSQL scanned the entire performance_datum_replicate table (over 500M rows) for each chunk
  • This was triggered by IN (10k ids) and resulted in a Seq Scan
  • Runtime was about 3 minutes per chunk (most of the time spent scanning performance_datum_replicate)

After

  • PostgreSQL accesses performance_datum_replicate only via its index on performance_datum_id
  • The query first checks whether related rows exist (EXISTS) and limits the target set using a CTE
  • Runtime dropped to ~100 ms per chunk in EXPLAIN ANALYZE (see the verification sketch after this list)
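
For reference, one way to reproduce this kind of measurement outside Redash is to run EXPLAIN ANALYZE inside a transaction that is rolled back, since EXPLAIN ANALYZE actually executes the DELETE. This is only an illustrative sketch with a placeholder query and cutoff value, not part of the PR:

    # Illustrative only: EXPLAIN ANALYZE executes the statement, so roll back
    # afterwards to avoid actually deleting data while inspecting the plan.
    from django.db import connection, transaction

    with transaction.atomic():
        with connection.cursor() as cursor:
            cursor.execute(
                "EXPLAIN (ANALYZE, BUFFERS) "
                "DELETE FROM performance_datum_replicate "
                "WHERE performance_datum_id IN "
                "(SELECT id FROM performance_datum WHERE push_timestamp < %s LIMIT 10000)",
                [cutoff_date],  # placeholder cutoff value
            )
            plan = "\n".join(row[0] for row in cursor.fetchall())
        transaction.set_rollback(True)  # undo the DELETE performed by EXPLAIN ANALYZE
    print(plan)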

[0]

performance_datum = models.ForeignKey(PerformanceDatum, on_delete=models.CASCADE)

[1] Only as a reference: #9107

@junngo junngo marked this pull request as ready for review December 30, 2025 08:25

@gmierz (Collaborator) left a comment

Looks great @junngo! I tested it out locally and didn't see any issues.

What do you think about splitting this PR a bit though? I'd rather we land one change at a time (from less risky to more risky) to reduce the chances of something unexpected happening. We could start by landing the changes to the try data removal strategy and expand from there if there are no issues. If something bad happens with try data, it isn't as big of an issue as something happening to our autoland/mozilla-central data.

).delete()
using.rowcount = deleted
using.execute(
"""

@gmierz (Collaborator) commented:

Could we use f-strings for these instead of %s? I think that would make the raw queries a bit more robust.

@junngo (Contributor, author) replied:

I agree that f-strings can look a bit cleaner and more readable :)
I avoid f-strings here and use parameter binding instead.
It helps prevent SQL injection, lets PostgreSQL reuse query plans across chunks (plan caching), and keeps type handling stable. Because of these benefits, I prefer this approach here.
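
As a small illustration of the difference (the variables datum_ids and ids_csv are hypothetical, not the PR's actual code):

    # Parameter binding: values travel separately from the SQL text, so the
    # statement stays constant across chunks and PostgreSQL can reuse its plan.
    using.execute(
        "DELETE FROM performance_datum_replicate WHERE performance_datum_id = ANY(%s)",
        [datum_ids],  # psycopg adapts a Python list to a PostgreSQL array
    )

    # f-string interpolation: the ids become part of the SQL text itself, so every
    # chunk is a brand-new statement and unescaped input reaches the query directly.
    using.execute(
        f"DELETE FROM performance_datum_replicate WHERE performance_datum_id IN ({ids_csv})"
    )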

@gmierz (Collaborator) replied:

Oh excellent point! I missed that we're not using a % to add the string options so this approach makes sense to keep.

).values_list("id")[:chunk_size]
).delete()
using.rowcount = deleted
using.execute(

@gmierz (Collaborator) commented:

Can you add comments somewhere around the execute call to provide an explanation about what each part of the query does? We should do the same for all the other queries.

@junngo (Contributor, author) replied:

Sure, I’ll add comments around the execute call to explain what each part of the query does, and mirror that for the other queries as well.

@junngo (Contributor, author) commented Jan 12, 2026

Thanks for the review @gmierz
That’s a good suggestion :) I’ll create a new patch for try first, and then we can land the changes gradually.
