
Conversation

@junngo (Contributor) commented Dec 26, 2025

Hi there :)
I worked on an improvement for the cycling data job. (https://bugzilla.mozilla.org/show_bug.cgi?id=1944375)

Issue & Background:

The cycling data job currently takes about 4-12 hours to complete. The main cause is the PerformanceDatumReplicate table, which has a foreign key relationship to the PerformanceDatum table [0].
When PerformanceDatum rows are deleted, the corresponding PerformanceDatumReplicate rows are automatically removed via Django ORM cascade behavior.

During the cycling job:

  • The job selects up to 10,000 rows from the leading table (PerformanceDatum) per chunk.
  • For each chunk, the related rows in PerformanceDatumReplicate are deleted as part of the cascade:
    DELETE FROM performance_datum_replicate
    WHERE performance_datum_id IN ($1,$2,$3,$4,$5, ... ,$10000)
    
  • PerformanceDatumReplicate currently contains over 500 million rows.
  • In certain query plans, this results in a sequential (full) table scan on PerformanceDatumReplicate.

Even when executed only once per chunk, it can take 3-4 minutes.
Since the cycling job runs this logic repeatedly across many chunks, the total execution time grows to several hours.

This PR focuses on improving that behavior by avoiding repeated full table scans and reducing the overall runtime of the cycling data job.
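
For orientation, the Django side of the per-chunk deletion looks roughly like the sketch below. This is a simplified illustration, not the job's exact code: the push_timestamp filter, cutoff_date, and chunk_size are placeholders, and the model import assumes PerformanceDatum lives in treeherder.perf.models.

    # Simplified sketch of the current per-chunk deletion (placeholders, not the real job code).
    from treeherder.perf.models import PerformanceDatum

    deleted, _ = PerformanceDatum.objects.filter(
        id__in=PerformanceDatum.objects.filter(
            push_timestamp__lt=cutoff_date  # placeholder condition
        ).values_list("id")[:chunk_size]
    ).delete()
    # The ORM-level delete cascades into PerformanceDatumReplicate, which is what
    # emits the large "WHERE performance_datum_id IN (...)" DELETE shown above.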

Notes

  1. Reason for using a raw query instead of the Django ORM
    • PerformanceDatumReplicate rows are automatically removed via Django ORM cascade behavior when PerformanceDatum rows are deleted.
    • In order to control the deletion process and influence the query plan at the SQL level, this logic needs to be implemented using a raw query rather than the ORM.
  2. Reason for keeping duplicate condition checks in the del_replicate CTE
    • Although target_datum already enforces these conditions, keeping the same filters in del_replicate helps the Postgres planner choose a more stable execution plan and consistently use index-based access paths for performance_datum_replicate (a rough sketch of the overall query shape follows after this list).
  3. Reason for using EXISTS in the del_replicate CTE
    • The EXISTS clause is used to check whether replicate rows exist before touching performance_datum_replicate.
    • This avoids unnecessary access to the replicate table and is generally cheaper than performing a join or scan.
  4. Reason for using ORDER BY in the target_datum CTE of the MainRemovalStrategy
    • Using ORDER BY allows the planner to more consistently leverage index scans.
  5. Test results
    • The query plan was verified in Redash using EXPLAIN.
    • The resulting plans consistently used index scans instead of sequential table scans.
    • Execution time was observed to be approximately 0-10 seconds per chunk.
    • Actual performance may vary depending on the current state of the database.
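
To make the notes above concrete, here is a rough sketch of the query shape they describe. The CTE names target_datum and del_replicate come from the notes, while the filters, parameters, and exact structure are illustrative assumptions rather than the PR's actual SQL.

    # Rough sketch only; the PR's real query differs in filters and structure.
    using.execute(
        """
        WITH target_datum AS (
            -- Note 4: one chunk of expired performance_datum ids; ORDER BY helps
            -- the planner settle on an index scan.
            SELECT id
            FROM performance_datum
            WHERE push_timestamp < %s      -- placeholder cutoff condition
            ORDER BY id
            LIMIT %s
        ),
        del_replicate AS (
            -- Notes 2-3: EXISTS keeps access to performance_datum_replicate on its
            -- performance_datum_id index; the real query also repeats the
            -- target_datum filters here to stabilize the plan.
            DELETE FROM performance_datum_replicate pdr
            WHERE EXISTS (
                SELECT 1 FROM target_datum td
                WHERE td.id = pdr.performance_datum_id
            )
            RETURNING 1
        )
        SELECT count(*) FROM del_replicate
        """,
        [cutoff_date, chunk_size],
    )
    # The performance_datum rows for the chunk are removed in a separate step.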

Query Plan Comparison

Before

  • PostgreSQL scanned the entire performance_datum_replicate table (over 500M rows) for each chunk
  • This was triggered by IN (10k ids) and resulted in a Seq Scan
  • Runtime was about 3 minutes per chunk (most of the time spent scanning performance_datum_replicate)

After

  • PostgreSQL accesses performance_datum_replicate only via its index on performance_datum_id
  • The query first checks whether related rows exist (EXISTS) and limits the target set using a CTE
  • Runtime dropped to ~100 ms per chunk in EXPLAIN ANALYZE (see the verification sketch after this list)
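
For reference, one way to reproduce this kind of measurement outside Redash is to run EXPLAIN ANALYZE inside a transaction that is rolled back, since EXPLAIN ANALYZE actually executes the DELETE. This is only an illustrative sketch with a placeholder query and cutoff value, not part of the PR:

    # Illustrative only: EXPLAIN ANALYZE executes the statement, so roll back
    # afterwards to avoid actually deleting data while inspecting the plan.
    from django.db import connection, transaction

    with transaction.atomic():
        with connection.cursor() as cursor:
            cursor.execute(
                "EXPLAIN (ANALYZE, BUFFERS) "
                "DELETE FROM performance_datum_replicate "
                "WHERE performance_datum_id IN "
                "(SELECT id FROM performance_datum WHERE push_timestamp < %s LIMIT 10000)",
                [cutoff_date],  # placeholder cutoff value
            )
            plan = "\n".join(row[0] for row in cursor.fetchall())
        transaction.set_rollback(True)  # undo the DELETE performed by EXPLAIN ANALYZE
    print(plan)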

[0]

performance_datum = models.ForeignKey(PerformanceDatum, on_delete=models.CASCADE)

[1] Only as a reference: #9107

@junngo junngo marked this pull request as ready for review December 30, 2025 08:25

@gmierz (Collaborator) left a comment

Looks great @junngo! I tested it out locally and didn't see any issues.

What do you think about splitting this PR a bit though? I'd rather we land one change at a time (from less risky to more risky) to reduce the chances of something unexpected happening. We could start by landing the changes to the try data removal strategy and expand from there if there are no issues. If something bad happens with try data, it isn't as big of an issue as something happening to our autoland/mozilla-central data.

).delete()
using.rowcount = deleted
using.execute(
"""

@gmierz (Collaborator) commented:

Could we use f-strings for these instead of %s? I think that would make the raw queries a bit more robust.

@junngo (Contributor, author) replied:

I agree that f-strings can look a bit cleaner and more readable :)
I avoid f-strings here and use parameter binding instead.
It helps prevent SQL injection, lets PostgreSQL reuse query plans across chunks (plan caching), and keeps type handling stable. Because of these benefits, I prefer this approach here.
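
As a small illustration of the difference (the variables datum_ids and ids_csv are hypothetical, not the PR's actual code):

    # Parameter binding: values travel separately from the SQL text, so the
    # statement stays constant across chunks and PostgreSQL can reuse its plan.
    using.execute(
        "DELETE FROM performance_datum_replicate WHERE performance_datum_id = ANY(%s)",
        [datum_ids],  # psycopg adapts a Python list to a PostgreSQL array
    )

    # f-string interpolation: the ids become part of the SQL text itself, so every
    # chunk is a brand-new statement and unescaped input reaches the query directly.
    using.execute(
        f"DELETE FROM performance_datum_replicate WHERE performance_datum_id IN ({ids_csv})"
    )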

@gmierz (Collaborator) replied:

Oh excellent point! I missed that we're not using a % to add the string options so this approach makes sense to keep.

).values_list("id")[:chunk_size]
).delete()
using.rowcount = deleted
using.execute(

@gmierz (Collaborator) commented:

Can you add comments somewhere around the execute call to provide an explanation about what each part of the query does? We should do the same for all the other queries.

@junngo (Contributor, author) replied:

Sure, I’ll add comments around the execute call to explain what each part of the query does, and mirror that for the other queries as well.

@junngo (Contributor, author) commented Jan 12, 2026

Thanks for the review @gmierz
That’s a good suggestion :) I’ll create a new patch for try first, and then we can land the changes gradually.
