Adds retry logic to PreservationSubmissionJob #1473
Conversation
Why are these changes being introduced:
* The APT lambda returns 502 errors when it hasn't been called for a few days, because AWS marks the lambda `inactive`.
* This logic retries the job after 5 minutes, to give the lambda enough time to become active again. It retries up to 10 times.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/ETD-677

How does this address that need:
* Adds a custom error class, APTBadGatewayError, for 502 responses from APT.
* Configures ActiveJob to retry on that error class up to 10 times, with a wait of 5 minutes between retries (a sketch of the resulting job configuration follows the commit descriptions below).

Document any side effects to this change:
* While this will reduce the number of failed jobs and manual cleanup, it will create additional files in S3 and additional records in the ETD database, because that logic is included in the job being retried. This is a tradeoff we are willing to make at this time, as it is much better than manual cleanup after a failure. We are hopeful Infra will soon be able to add a cron-like job to keep the lambda from fully deprovisioning into the `inactive` state, so this should be a temporary workaround, but it is still worth having in our codebase.
* We should manually ensure the lambda is not `inactive` before running our jobs, at least until we better understand how long it takes to become `inactive` and how long it takes to become active again. This code is more of a safety net for when we forget.
Why are these changes being introduced:
* Two tests were misleading as to what they were testing.
* Documentation noting that it is best to manually check the lambda state will help us avoid confusion in the future.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/ETD-677

How does this address that need:
* Updates docs to include how to run DSS.
* Updates docs to include a note to check the lambda state manually.
* Rewords two test names to reflect what they are actually testing.

Document any side effects to this change:
* There are additional error states from APT that we should detect and handle, but that is out of scope for this ticket.
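For readers skimming the conversation, here is a minimal sketch of the shape of the retry change. The error class name and the `retry_on` call match the diff further down; the base class, where the error class is defined, and the body of `perform` are assumptions.

```ruby
class PreservationSubmissionJob < ApplicationJob
  queue_as :default

  # Raised when the APT lambda responds with a 502, which happens after AWS
  # has deprovisioned an idle lambda into the `inactive` state.
  class APTBadGatewayError < StandardError; end

  # A fixed 5-minute wait (rather than exponential backoff) gives the lambda
  # time to warm up before the next attempt; up to 10 attempts in total.
  retry_on APTBadGatewayError, wait: 5.minutes, attempts: 10

  def perform(thesis)
    # Submission logic elided; it raises APTBadGatewayError when APT
    # returns a 502.
  end
end
```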
jazairi left a comment
I added some non-blocking comments. I'd feel a little better if we could confirm Sentry catches the 502 errors, so we can be aware of retries. That might be easier to test in staging, though, so I'm happy to merge this as is.
Thank you for the collaborative review process on this! The pairings helped me understand the problem space and risks involved.
- 3. DSS runs (as of this writing that is a manual process documented in the
-    [DSS repo](https://github.com/MITLibraries/dspace-submission-service#run-stage))
+ 3. DSS runs (as of this writing that is no longer documented in the
+    [DSS repo](https://github.com/MITLibraries/dspace-submission-service))
Really nice addition to the readme here! Even if the process were still documented in DSS, I like having the overview within the ETD readme.
+ # We are using a fixed wait time of 5 minutes between retries to give the lambda time to warm up, rather than
+ # retrying immediately with exponential backoff as we expect the first few retries to fail in that scenario so this
+ # longer time is more effective.
+ retry_on APTBadGatewayError, wait: 5.minutes, attempts: 10
As I mentioned in our pairing, I am curious about how the mailer will behave under multiple attempts. I understand that retries do not enqueue emails because of where the error is raised in the control flow, but I'm not sure whether that's optimal.
- rescue StandardError, Aws::Errors => e
+ rescue StandardError => e
+   # Explicitly re-raise the 502-specific error so ActiveJob can retry it.
+   raise e if e.is_a?(APTBadGatewayError)
As alluded to above and discussed in Slack, this seems to prevent the job from sending mail on retries. I'm still on the fence about whether this is good or not. For now, I feel like it's probably good enough, since Sentry should let us know, and long-term it should rarely happen.
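To make the control flow under discussion concrete, here is a rough sketch of the relevant part of `perform`. Only the rescue and re-raise lines come from the diff; `report_failure` is a hypothetical stand-in for whatever step enqueues the failure email.

```ruby
def perform(thesis)
  # ...call out to APT; a 502 response raises APTBadGatewayError...
rescue StandardError => e
  # Explicitly re-raise the 502-specific error so ActiveJob can retry it.
  # Because the raise happens before any mail is enqueued, retried attempts
  # exit here and never reach the failure notification below.
  raise e if e.is_a?(APTBadGatewayError)

  # Hypothetical stand-in for the existing failure handling / mailer step;
  # only non-502 errors reach it.
  report_failure(thesis, e)
end
```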
NOTE: the retry logic was confirmed with an inactive dev1 lambda and worked as expected.
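As a follow-up to the manual-check note above, one way to confirm the lambda's state before running jobs is via the aws-sdk-lambda gem; this is a minimal sketch, and the function name below is a placeholder rather than the real APT lambda name.

```ruby
require 'aws-sdk-lambda'

# Placeholder: substitute the APT lambda's real name for the environment
# being checked (e.g. dev1, stage, prod).
function_name = 'apt-lambda-function-name'

client = Aws::Lambda::Client.new
config = client.get_function_configuration(function_name: function_name)

# `state` is one of "Pending", "Active", "Inactive", or "Failed". Invoking
# an Inactive function starts reactivation, but calls may fail until the
# state returns to Active.
puts "APT lambda state: #{config.state}"
```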
Requires database migrations?
NO
Includes new or updated dependencies?
NO