Conversation

@JPrevost JPrevost commented Sep 22, 2025

NOTE: the retry logic was confirmed with an inactive dev1 lambda and worked as expected.

Why are these changes being introduced:

  • The APT lambda returns 502 errors when it hasn't been called for a few days, because AWS marks the lambda as inactive
  • This logic retries the jobs after 5 minutes, which should give the lambda enough time to become active again. It retries up to 10 times.

Relevant ticket(s):

  • https://mitlibraries.atlassian.net/browse/ETD-677

How does this address that need:

  • Adds a custom error class, APTBadGatewayError
  • Configures ActiveJob to retry on that error class up to 10 times, with a wait of 5 minutes between retries (see the sketch below)
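
For context, a minimal sketch of the shape of this change (the job class name and the lambda-calling helper below are placeholders, not the actual names in this codebase; the error class and retry_on settings are the ones described above):

# Raised when the APT lambda responds with a 502 (Bad Gateway), which happens
# when AWS has marked the lambda inactive after a period of disuse.
class APTBadGatewayError < StandardError; end

class ExampleAptSubmissionJob < ApplicationJob # placeholder name
  # Fixed 5-minute wait (not exponential backoff) so the lambda has time to
  # become active again before the next attempt; give up after 10 attempts.
  retry_on APTBadGatewayError, wait: 5.minutes, attempts: 10

  def perform(thesis)
    response = call_apt_lambda(thesis) # placeholder for the code that invokes the APT lambda
    # Raise the custom error when the lambda comes back with a 502 so retry_on can handle it.
    raise APTBadGatewayError, 'APT lambda returned 502' if response.status_code == 502
  end
end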

Document any side effects to this change:

  • While this will reduce the number of failed jobs and the manual cleanup they require, it will create additional files in S3 and additional records in the ETD database, because that logic is part of the job being retried. This is a tradeoff we are willing to make at this time: it is much better than manual cleanup after a failure, and we are hopeful Infra can soon add a cron-like job that keeps the lambda from deprovisioning into the inactive state. This should therefore be a temporary workaround, but it is still worth having in our codebase.
  • We should manually confirm the Lambda is not inactive before running our jobs, at least until we better understand how much idle time passes before it becomes inactive and how long it takes to become active again. This code is more of a safety net for when we forget.

Developer

  • All new ENV is documented in README
  • All new ENV has been added to Heroku Pipeline, Staging and Prod
  • ANDI or Wave has been run in accordance with our guide and all issues
    introduced by these changes have been resolved or opened as new issues
    (link to those issues in the Pull Request details above)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer

  • The commit message is clear and follows our guidelines
    (not just this pull request message)
  • There are appropriate tests covering any new functionality
  • The documentation has been updated or is unnecessary
  • The changes have been verified
  • New dependencies are appropriate or there were no changes

Requires database migrations?

NO

Includes new or updated dependencies?

NO

@mitlib mitlib temporarily deployed to thesis-submit-pr-1473 September 22, 2025 20:57 Inactive
@coveralls

coveralls commented Sep 22, 2025

Coverage Status

coverage: 98.339% (+0.005%) from 98.334% when pulling 30f7a03 on etd-677-retry-apt into 1a3ca0f on main.

Why are these changes being introduced:

* Two tests were misleading as to what they were testing
* Documentation reflecting that it is best to manually check the lambda state
  will help us avoid confusion in the future

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/ETD-677

How does this address that need:

* Updates docs to include how to run DSS
* Updates docs to include note to check lambda state manually
* Rewords two test names to reflect what they are actually testing

Document any side effects to this change:

* There are additional error states from APT we should detect and handle,
  but that is out of scope for this ticket
@JPrevost JPrevost temporarily deployed to thesis-submit-pr-1473 September 23, 2025 14:19 Inactive
@JPrevost JPrevost requested a review from jazairi September 23, 2025 14:20
@jazairi jazairi self-assigned this Sep 23, 2025
@JPrevost JPrevost temporarily deployed to thesis-submit-pr-1473 September 23, 2025 14:49 Inactive
@jazairi jazairi (Contributor) left a comment

I added some non-blocking comments. I'd feel a little better if we could confirm Sentry catches the 502 errors, so we can be aware of retries. That might be easier to test in staging, though, so I'm happy to merge this as is.

Thank you for the collaborative review process on this! The pairings helped me understand the problem space and risks involved.
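
One way to make retries visible in Sentry, if the default integration doesn't already report them: Rails emits an enqueue_retry.active_job notification each time retry_on re-enqueues a job, so a small initializer could forward those to Sentry. This is only a sketch assuming Rails 6+ and the sentry-ruby SDK, not code from this PR:

# config/initializers/active_job_retry_reporting.rb (sketch, not part of this PR)
# Forward each ActiveJob retry to Sentry so 502-driven retries are visible.
ActiveSupport::Notifications.subscribe('enqueue_retry.active_job') do |*args|
  event = ActiveSupport::Notifications::Event.new(*args)
  job   = event.payload[:job]
  error = event.payload[:error]

  Sentry.capture_message(
    "#{job.class.name} retry ##{job.executions} scheduled after #{error&.class}",
    level: :warning
  )
end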

-3. DSS runs (as of this writing that is a manual process documented in the
-   [DSS repo](https://github.com/MITLibraries/dspace-submission-service#run-stage))
+3. DSS runs (as of this writing that is no longer documented in the
+   [DSS repo](https://github.com/MITLibraries/dspace-submission-service))

Really nice addition to the readme here! Even if the process were still documented in DSS, I like having the overview within the ETD readme.

# We are using a fixed wait time of 5 minutes between retries to give the lambda time to warm up, rather than
# retrying immediately with exponential backoff: we expect the first few retries to fail in that scenario, so the
# longer wait is more effective.
retry_on APTBadGatewayError, wait: 5.minutes, attempts: 10
@jazairi jazairi (Contributor) commented Sep 23, 2025

As I mentioned in our pairing, I am curious about how the mailer will behave under multiple attempts. I understand that retries do not enqueue emails because of where the error is raised in the control flow, but I'm not sure whether that's optimal.

-rescue StandardError, Aws::Errors => e
+rescue StandardError => e
+  # Explicitly re-raise the 502-specific error so ActiveJob can retry it.
+  raise e if e.is_a?(APTBadGatewayError)

As alluded to above and discussed in Slack, this seems to prevent the job from sending mail on retries. I'm still on the fence about whether this is good or not. For now, I feel like it's probably good enough, since Sentry should let us know, and long-term it should rarely happen.
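
To make the control flow here concrete, a rough sketch (the helper and mailer names are illustrative, not the actual ones in this codebase): because the 502-specific error is re-raised at the top of the rescue block, retry_on handles it and nothing after that line, including the failure email, runs on those attempts.

def perform(thesis)
  submit_to_apt(thesis) # illustrative helper; may raise APTBadGatewayError on a 502
rescue StandardError => e
  # Re-raised errors propagate up to retry_on, so the mailer call below is
  # skipped for 502s that will be retried.
  raise e if e.is_a?(APTBadGatewayError)

  ExampleFailureMailer.job_failed(thesis, e).deliver_later # illustrative mailer name
end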

@JPrevost JPrevost temporarily deployed to thesis-submit-pr-1473 September 23, 2025 20:14 Inactive
@JPrevost JPrevost merged commit 69cb24d into main Sep 24, 2025
2 checks passed
@JPrevost JPrevost deleted the etd-677-retry-apt branch September 24, 2025 12:20