Skip to content

Fix: Make the janitor best effort#5802

Merged
themisvaltinos merged 3 commits into
mainfrom
themis/janitorcleanup
May 21, 2026
Merged

Fix: Make the janitor best effort#5802
themisvaltinos merged 3 commits into
mainfrom
themis/janitorcleanup

Conversation

@themisvaltinos
Copy link
Copy Markdown
Collaborator

@themisvaltinos themisvaltinos commented May 21, 2026

This updates the janitor instead of stopping on the first drop failure, to attempt every cleanup operation and aggregates all failures into a single error raised at the end. The warn_on_delete_failure is also respected so in that case it is set to true it would warn rather than raise.

It also introduces the force delete option. If the janitor fails to drop a view, schema or snapshot table it retains the corresponding state records so the next run can retry. This is correct for transient failures, but if the failure is permanent the janitor will retry forever and the affected environments and snapshots will never be cleaned up.

In that case, with the --force-delete flag:

sqlmesh janitor --force-delete

purges the state records for expired environments and snapshots even when their physical objects could not be dropped. But any objects that were not successfully dropped becomes orphaned and must be removed manually from the data warehouse.

Signed-off-by: Themis Valtinos <73662635+themisvaltinos@users.noreply.github.com>
Comment thread sqlmesh/core/janitor.py
Comment thread sqlmesh/core/janitor.py
num_expired_snapshots += len(batch.expired_snapshot_ids)
except Exception as e:
message = f"Failed to clean up an expired snapshots batch: {e}"
logger.warning(message)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto.

Comment thread sqlmesh/core/context.py
self.state_sync.delete_expired_environments(current_ts=current_ts)
# we want to retry on the next janitor pass if drops failed
if not failures:
self.state_sync.delete_expired_environments(current_ts=current_ts)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where a schema was permanently deleted by the user, prior to the janitor running, won't this cause the environment to never expire? I feel like we need an escape hatch.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this pr unblocks the janitor so it doesn't accumulate stale snapshots in the case of one failure and it can continue with the rest of the operations

this guard to not delete the expired environments in this case of a failure is matching the current behaviour we have of retrying the next time that the janitor runs rather than deleting the expired environments and leaving orphans in the db. the test: test_janitor_cleanup_order is an example of this behaviour that we currently want

I see the value of an escape hatch for the cases where drops persistently fail, but this would require a more deliberate mechanism. I can't think of a simple way to do it as part of this pr without changing the janitor design substantially unless you have a suggestion of a simple way

Signed-off-by: Themis Valtinos <73662635+themisvaltinos@users.noreply.github.com>
Signed-off-by: Themis Valtinos <73662635+themisvaltinos@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates SQLMesh’s janitor workflow to be “best effort”: it attempts all cleanup operations, collects failures, and surfaces them together at the end (with optional warning-only behavior and a --force-delete mode to purge state even when physical drops fail).

Changes:

  • Make environment view/schema/catalog cleanup return a list of failures instead of raising immediately.
  • Make expired snapshot deletion best-effort, optionally deleting snapshot state even when physical cleanup fails (force_delete).
  • Aggregate and surface janitor failures in Context._run_janitor, add CLI plumbing and expand test coverage.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/core/test_snapshot_evaluator.py Updates expectations for snapshot cleanup failures now surfaced as SQLMeshError.
tests/core/test_janitor.py Adjusts janitor unit tests for best-effort cleanup and failure collection.
tests/core/integration/test_aux_commands.py Adds/updates integration tests for aggregated failures, warn-only mode, and force_delete.
sqlmesh/core/snapshot/evaluator.py Switches snapshot cleanup to collect concurrent failures and raise a combined error.
sqlmesh/core/janitor.py Returns failure lists for cleanup helpers; adds force_delete support for snapshot state deletion on failure.
sqlmesh/core/context.py Aggregates janitor failures and applies warn/raise behavior; adds force_delete parameter plumbing.
sqlmesh/cli/main.py Adds --force-delete option and passes it through to Context.run_janitor.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread sqlmesh/core/context.py
Comment on lines +2922 to +2923
failure_string = "\n - ".join(failures)
summary = f"Janitor completed with failures:\n {failure_string}"
Comment on lines +562 to +564
if errors:
errored_snapshots = "\n".join(f" {e.node.name}: {e.__cause__}" for e in errors)
raise SQLMeshError(f"\n{errored_snapshots}")
Comment thread sqlmesh/core/janitor.py
Comment on lines 87 to +89
message = f"Failed to drop the expired environment view '{expired_view}': {e}"
if warn_on_delete_failure:
logger.warning(message)
else:
raise SQLMeshError(message) from e
logger.warning(message)
failures.append(message)
Comment thread sqlmesh/core/janitor.py
Comment on lines 101 to +104
except Exception as e:
message = f"Failed to drop the expired environment schema '{schema}': {e}"
if warn_on_delete_failure:
logger.warning(message)
else:
raise SQLMeshError(message) from e
logger.warning(message)
failures.append(message)
Comment thread sqlmesh/core/janitor.py
Comment on lines 114 to +117
except Exception as e:
message = f"Failed to drop the expired environment catalog '{catalog}': {e}"
if warn_on_delete_failure:
logger.warning(message)
else:
raise SQLMeshError(message) from e
logger.warning(message)
failures.append(message)
@themisvaltinos themisvaltinos merged commit c5dbce5 into main May 21, 2026
32 checks passed
@themisvaltinos themisvaltinos deleted the themis/janitorcleanup branch May 21, 2026 22:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants