Adds retry logic to PreservationSubmissionJob #1473
```diff
@@ -4,6 +4,16 @@ class PreservationSubmissionJob < ActiveJob::Base
   queue_as :default
 
+  # Custom error class for 502 Bad Gateway responses from APT.
+  class APTBadGatewayError < StandardError; end
+
+  # Retry up to 10 times on transient 502 Bad Gateway responses. These 502s are generally caused by the
+  # lambda entering an `Inactive` state after a long period of inactivity, from which it can take several
+  # minutes to recover. We use a fixed wait of 5 minutes between retries to give the lambda time to warm
+  # up; retrying immediately with exponential backoff would likely burn the first few attempts while the
+  # lambda is still inactive, so the longer fixed wait is more effective.
+  retry_on APTBadGatewayError, wait: 5.minutes, attempts: 10
```
|
Contributor: As I mentioned in our pairing, I am curious about how the mailer will behave under multiple attempts. I understand that retries do not enqueue emails because of where the error is raised in the control flow, but I'm not sure whether that's optimal.
```diff
+
   def perform(theses)
     Rails.logger.info("Preparing to send #{theses.count} theses to preservation")
     results = { total: theses.count, processed: 0, errors: [] }
```
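For readers unfamiliar with `retry_on`, here is a minimal sketch of its semantics under the fixed-wait strategy chosen above (illustrative only; `ExampleJob` and `upstream_cold?` are hypothetical and not part of this PR):

```ruby
class ExampleJob < ActiveJob::Base
  class TransientError < StandardError; end

  # Fixed wait: each retry is re-enqueued roughly 5 minutes out, up to 10
  # attempts. After the final attempt the error is re-raised to the job's
  # normal error handling (e.g. Sentry). A backoff option such as
  # `wait: :exponentially_longer` would retry quickly at first, which is
  # exactly the window in which the upstream is still warming up.
  retry_on TransientError, wait: 5.minutes, attempts: 10

  def perform
    raise TransientError if upstream_cold? # hypothetical predicate
  end
end
```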
```diff
@@ -13,7 +23,10 @@ def perform(theses)
       preserve_payload(payload)
       Rails.logger.info("Thesis #{thesis.id} has been sent to preservation")
       results[:processed] += 1
-    rescue StandardError, Aws::Errors => e
+    rescue StandardError => e
+      # Explicitly re-raise the 502-specific error so ActiveJob can retry it.
+      raise e if e.is_a?(APTBadGatewayError)
```
|
Contributor: As alluded to above and discussed in Slack, this seems to prevent the job from sending mail on retries. I'm still on the fence about whether this is good or not. For now, I feel like it's probably good enough, since Sentry should let us know, and long-term it should rarely happen.
```diff
+
       preservation_error = "Thesis #{thesis.id} could not be preserved: #{e}"
       Rails.logger.info(preservation_error)
       results[:errors] << preservation_error
```
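To make the control flow behind the mailer question concrete, here is a hedged sketch of the overall shape of `perform`. The `ReportMailer` call is hypothetical (the reporting step is not shown in this diff), but it illustrates why re-raising inside the rescue skips any post-loop step:

```ruby
def perform(theses)
  results = { total: theses.count, processed: 0, errors: [] }
  theses.each do |thesis|
    preserve(thesis) # stand-in for the payload preparation and POST
    results[:processed] += 1
  rescue StandardError => e
    # Unwinds out of perform entirely: ActiveJob schedules the retry and
    # the reporting step below never runs for this attempt.
    raise e if e.is_a?(APTBadGatewayError)

    results[:errors] << "Thesis #{thesis.id} could not be preserved: #{e}"
  end
  # Hypothetical: only reached when no retryable error was raised.
  ReportMailer.preservation_results(results).deliver_later
end
```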
```diff
@@ -40,6 +53,12 @@ def post_payload(payload)
       http.request(request)
     end
 
+    # If the remote endpoint returns 502, raise an APTBadGatewayError so ActiveJob can retry.
+    if response.code.to_s == '502'
+      Rails.logger.warn("Received 502 from APT for payload #{payload.id}; raising for retry")
+      raise APTBadGatewayError, 'APT returned 502 Bad Gateway'
+    end
+
     validate_response(response)
   end
```
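A side note on the guard: `Net::HTTP` exposes the status code as a String, which is why the comparison reads `response.code.to_s == '502'`. A type-based check is an equivalent alternative; a small illustration (the URL is a placeholder):

```ruby
require 'net/http'

response = Net::HTTP.get_response(URI('https://example.com/')) # placeholder URL
response.code                        # => e.g. "200" (always a String)
response.is_a?(Net::HTTPBadGateway)  # => true only for a 502 response
```

Either form works; the explicit code comparison arguably keeps the intent closer to the log message.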
```diff
@@ -49,8 +68,8 @@ def validate_response(response)
     end
 
     result = JSON.parse(response.body)
-    unless result['success'] == true
-      raise "APT failed to create a bag: #{response.body}"
-    end
+    return if result['success'] == true
+
+    raise "APT failed to create a bag: #{response.body}"
   end
 end
```
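One implication worth noting here (an observation, not part of the PR): `JSON.parse` raises `JSON::ParserError` on a non-JSON body, and because that is a `StandardError` subclass, the rescue in `perform` records it as a preservation error rather than retrying:

```ruby
require 'json'

begin
  JSON.parse('<html>Internal Server Error</html>') # e.g. an HTML error page body
rescue JSON::ParserError => e
  # JSON::ParserError descends from StandardError, so perform's rescue
  # treats this as a non-retryable preservation error, not a 502 retry.
  e.class.ancestors.include?(StandardError) # => true
end
```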
Contributor: Really nice addition to the readme here! Even if the process were still documented in DSS, I like having the overview within the ETD readme.