Skip to content

[CELEBORN-1577][FOLLOWUP] Fix backward compatiblity issue with interrupt shuffle#3675

Open
s0nskar wants to merge 1 commit into
apache:mainfrom
s0nskar:fix_interrupt
Open

[CELEBORN-1577][FOLLOWUP] Fix backward compatiblity issue with interrupt shuffle#3675
s0nskar wants to merge 1 commit into
apache:mainfrom
s0nskar:fix_interrupt

Conversation

@s0nskar
Copy link
Copy Markdown
Contributor

@s0nskar s0nskar commented May 6, 2026

What changes were proposed in this pull request?

Fix backward compatibility issue with interrupt shuffle by checking if the reason is nonEmpty.

Why are the changes needed?

If someone uses a new client with old server which is not sending CheckQuotaResponse in HeartbeatFromApplicationResponse then proto uses default value to build CheckQuotaResponse with isAvailable=false and reason="". In this case the job will always fails without breaching the quota, we should not fail the job if the reason is empty to make it backward compatible.

Does this PR resolve a correctness bug?

No

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested in local setup.

@RexXiong
Copy link
Copy Markdown
Contributor

RexXiong commented May 7, 2026

Using reason.nonEmpty as a proxy for "server supports quota check" is fragile — reason is a business description field, not a compatibility signal. If a legitimate quota-exceeded response happens to have an empty reason, it would be silently ignored; conversely, any future change to the reason format could break this logic.

Since quotaInterruptShuffleEnabled defaults to false, users must explicitly opt in. In a new-client/old-server scenario, users should simply not enable this config. This is an explicit and semantically clear approach, rather than relying on an implicit convention around the reason field.

If we do need robust backward compatibility, a better approach would be to add an explicit capability field (e.g., quotaCheckSupported) to the heartbeat response. But that's a larger change and can be addressed separately.

Reviewed by Claude Code

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the client-side quota interrupt handling in ApplicationHeartbeater to be backward compatible with older masters that don’t populate checkQuotaResponse in HeartbeatFromApplicationResponse, preventing accidental job cancellation due to protobuf default values.

Changes:

  • Gate shuffle interruption on CheckQuotaResponse.reason.nonEmpty in addition to !isAvailable.
  • Avoid treating protobuf default isAvailable=false, reason="" (from missing message field) as a real quota-exceeded signal.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


private def checkQuotaExceeds(response: CheckQuotaResponse): Unit = {
if (conf.quotaInterruptShuffleEnabled && !response.isAvailable) {
if (conf.quotaInterruptShuffleEnabled && !response.isAvailable && response.reason.nonEmpty) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to introduce new field like quotaCheckSupported to fix backward compatiblity.

@SteNicholas
Copy link
Copy Markdown
Member

@s0nskar, please rebase the latest main branch for Grafana Dashboard CI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants