Conversation

@JPrevost JPrevost commented Sep 22, 2025

NOTE: the retry logic was confirmed with an inactive dev1 lambda and worked as expected.

Why are these changes being introduced:

  • The APT lambda returns 502 errors when it hasn't been called for a few days, because AWS marks the lambda as inactive
  • This logic retries the jobs after 5 minutes, which should give the lambda enough time to become active again. It retries up to 10 times.

Relevant ticket(s):

  • https://mitlibraries.atlassian.net/browse/ETD-677

How does this address that need:

  • Adds a custom error class, APTBadGatewayError
  • Configures ActiveJob to retry on that error class up to 10 times, with a wait of 5 minutes between retries (see the sketch below)
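
For context, a minimal sketch of the shape of this change (the job class name and the lambda-calling helper below are placeholders, not the actual names in this codebase; the error class and retry_on settings are the ones described above):

# Raised when the APT lambda responds with a 502 (Bad Gateway), which happens
# when AWS has marked the lambda inactive after a period of disuse.
class APTBadGatewayError < StandardError; end

class ExampleAptSubmissionJob < ApplicationJob # placeholder name
  # Fixed 5-minute wait (not exponential backoff) so the lambda has time to
  # become active again before the next attempt; give up after 10 attempts.
  retry_on APTBadGatewayError, wait: 5.minutes, attempts: 10

  def perform(thesis)
    response = call_apt_lambda(thesis) # placeholder for the code that invokes the APT lambda
    # Raise the custom error when the lambda comes back with a 502 so retry_on can handle it.
    raise APTBadGatewayError, 'APT lambda returned 502' if response.status_code == 502
  end
end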

Document any side effects to this change:

  • While this will reduce the number of failed jobs and the manual cleanup they require, it will create additional files in S3 and additional records in the ETD database, because that logic is part of the job being retried. This is a tradeoff we are willing to make at this time: it is much better than manual cleanup after a failure, and we are hopeful Infra can soon add a cron-like job that keeps the lambda from deprovisioning into the inactive state. This should therefore be a temporary workaround, but it is still worth having in our codebase.
  • We should manually confirm the Lambda is not inactive before running our jobs, at least until we better understand how much idle time passes before it becomes inactive and how long it takes to become active again. This code is more of a safety net for when we forget.

Developer

  • All new ENV is documented in README
  • All new ENV has been added to Heroku Pipeline, Staging and Prod
  • ANDI or Wave has been run in accordance with our guide and all issues
    introduced by these changes have been resolved or opened as new issues
    (link to those issues in the Pull Request details above)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Requires database migrations?

NO

Includes new or updated dependencies?

NO

@mitlib mitlib temporarily deployed to thesis-submit-pr-1473 September 22, 2025 20:57 Inactive
@coveralls

coveralls commented Sep 22, 2025

Coverage Status

coverage: 98.339% (+0.005%) from 98.334% when pulling 30f7a03 on etd-677-retry-apt into 1a3ca0f on main.

Why are these changes being introduced:

* Two tests were misleading as to what they were testing
* Documentation reflecting that it is best to manually check the lambda state
  will help us avoid confusion in the future

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/ETD-677

How does this address that need:

* Updates docs to include how to run DSS
* Updates docs to include note to check lambda state manually
* Rewords two test names to reflect what they are actually testing

Document any side effects to this change:

* There are additional error states from APT we should detect and handle,
  but that is out of scope for this ticket
@JPrevost JPrevost temporarily deployed to thesis-submit-pr-1473 September 23, 2025 14:19 Inactive
@JPrevost JPrevost requested a review from jazairi September 23, 2025 14:20
@jazairi jazairi self-assigned this Sep 23, 2025
@JPrevost JPrevost temporarily deployed to thesis-submit-pr-1473 September 23, 2025 14:49 Inactive
@jazairi jazairi (Contributor) left a comment

I added some non-blocking comments. I'd feel a little better if we could confirm Sentry catches the 502 errors, so we can be aware of retries. That might be easier to test in staging, though, so I'm happy to merge this as is.

Thank you for the collaborative review process on this! The pairings helped me understand the problem space and risks involved.
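
One way to make retries visible in Sentry, if the default integration doesn't already report them: Rails emits an enqueue_retry.active_job notification each time retry_on re-enqueues a job, so a small initializer could forward those to Sentry. This is only a sketch assuming Rails 6+ and the sentry-ruby SDK, not code from this PR:

# config/initializers/active_job_retry_reporting.rb (sketch, not part of this PR)
# Forward each ActiveJob retry to Sentry so 502-driven retries are visible.
ActiveSupport::Notifications.subscribe('enqueue_retry.active_job') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job   = event.payload[:job]
  error = event.payload[:error]

  Sentry.capture_message(
    "#{job.class.name} retry ##{job.executions} scheduled after #{error&.class}",
    level: :warning
  )
end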

-3. DSS runs (as of this writing that is a manual process documented in the
-   [DSS repo](https://github.com/MITLibraries/dspace-submission-service#run-stage))
+3. DSS runs (as of this writing that is no longer documented in the
+   [DSS repo](https://github.com/MITLibraries/dspace-submission-service))

Really nice addition to the readme here! Even if the process were still documented in DSS, I like having the overview within the ETD readme.

# We are using a fixed wait time of 5 minutes between retries to give the lambda time to warm up, rather than
# retrying immediately with exponential backoff: we expect the first few retries to fail in that scenario, so the
# longer wait is more effective.
retry_on APTBadGatewayError, wait: 5.minutes, attempts: 10
@jazairi jazairi (Contributor) commented Sep 23, 2025

As I mentioned in our pairing, I am curious about how the mailer will behave under multiple attempts. I understand that retries do not enqueue emails because of where the error is raised in the control flow, but I'm not sure whether that's optimal.

-rescue StandardError, Aws::Errors => e
+rescue StandardError => e
+  # Explicitly re-raise the 502-specific error so ActiveJob can retry it.
+  raise e if e.is_a?(APTBadGatewayError)

As alluded to above and discussed in Slack, this seems to prevent the job from sending mail on retries. I'm still on the fence about whether this is good or not. For now, I feel like it's probably good enough, since Sentry should let us know, and long-term it should rarely happen.
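
To make the control flow here concrete, a rough sketch (the helper and mailer names are illustrative, not the actual ones in this codebase): because the 502-specific error is re-raised at the top of the rescue block, retry_on handles it and nothing after that line, including the failure email, runs on those attempts.

def perform(thesis)
  submit_to_apt(thesis) # illustrative helper; may raise APTBadGatewayError on a 502
rescue StandardError => e
  # Re-raised errors propagate up to retry_on, so the mailer call below is
  # skipped for 502s that will be retried.
  raise e if e.is_a?(APTBadGatewayError)

  ExampleFailureMailer.job_failed(thesis, e).deliver_later # illustrative mailer name
end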

@JPrevost JPrevost temporarily deployed to thesis-submit-pr-1473 September 23, 2025 20:14 Inactive
@JPrevost JPrevost merged commit 69cb24d into main Sep 24, 2025
2 checks passed
@JPrevost JPrevost deleted the etd-677-retry-apt branch September 24, 2025 12:20