RFC: zero downtime version upgrades #5

Open
PatrickKoss wants to merge 4 commits into mlflow:main from PatrickKoss:rfc/zero-downtime-version-upgrades

Conversation

@PatrickKoss

Add tiered schema verification and an AST-based migration classifier so that MLflow tracking servers can start against an older database schema when all pending migrations are purely additive. This eliminates mandatory downtime for the majority of MLflow releases while preserving the safety guarantees of the existing strict check for destructive migrations.

@PatrickKoss
Author

@B-Step62 @harupy can you please check?

Contributor

@B-Step62 left a comment


Thank you for the proposal, @PatrickKoss. I believe allowing flexible and selective migration is directionally correct, but I feel that the current risk assessment needs more scrutiny. I left some detailed comments below.


**Key design decisions:**

- **AST parsing over string matching**: Using Python's `ast` module provides structured analysis of the migration code rather than fragile regex matching. It correctly handles `op.X`, `batch_op.X`, and nested calls.
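The quoted design decision can be illustrated with a minimal sketch. This is not MLflow code: `collect_alembic_ops` is a hypothetical helper showing how Python's `ast` module picks up `op.X` and `batch_op.X` calls structurally, where a regex would trip over nesting and context managers.

```python
import ast

def collect_alembic_ops(source: str) -> list[str]:
    """Return Alembic operation names invoked as op.X or batch_op.X."""
    ops = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            target = node.func.value
            if isinstance(target, ast.Name) and target.id in ("op", "batch_op"):
                if node.func.attr != "batch_alter_table":  # context manager, not an op
                    ops.append(node.func.attr)
    return ops

migration = """
def upgrade():
    op.add_column("runs", sa.Column("note", sa.String(500), nullable=True))
    with op.batch_alter_table("metrics") as batch_op:
        batch_op.create_index("idx_metrics_key", ["key"])
"""
print(collect_alembic_ops(migration))  # ['add_column', 'create_index']
```

Note that `sa.Column(...)` and `sa.String(...)` calls are ignored because their receiver is neither `op` nor `batch_op`.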
Contributor


The AST-based categorization might not always be accurate. Incorrectly classifying a migration as SAFE can cause pretty bad outcomes, like data corruption. I think we can rely on it at most as a helper and should always require human sign-off. E.g., when a change author generates a new migration, the parser script suggests the category.

Author


I want to push back on this one, because I think the AST is doing more reliable work than this gives it credit for, and I think the proposed fix (require humans to annotate every migration) is the wrong direction.

First, the empirical case: the workspace migration cited in another comment was written, reviewed, and merged by humans, and it still ships the failure mode that comment is worried about. Adding more human review to the loop wouldn't have caught it — the people who reviewed it were sharp engineers and they still missed the implications. So "rely on humans" isn't actually safer than "rely on a narrow mechanical check"; it's just a different way to be wrong, with worse failure-mode properties (decay into boilerplate, infrequent review, low context). Rails's safety_assured! from the strong_migrations gem is the canonical cautionary tale here — it's routinely sprinkled in just to silence warnings, not because anyone checked.

Second, the technical case: I went through the operations the current RFC marks as ambiguous and tried to construct cases where the AST genuinely can't decide. Almost all of them collapse:

  • alter_column (type change) — Alembic declares existing_type and type_ explicitly; the AST can decide widening vs. narrowing.
  • drop_constraint — strictly more permissive; safe by definition.
  • drop_index — same family as create_index; safe.
  • create_foreign_key — defaults to breaking unless author annotates that old code already enforces it application-side. No inference needed.
  • add_column (nullable=True, server_default=non-null) — the AST can detect this exact shape and classify it as the borderline case it is. Default-to-breaking, author can promote with annotation.

The one operation I couldn't classify mechanically was op.execute(...). The string is opaque, SQL parsing is brittle, and even with perfect parsing the AST can't tell a planned backfill from a foot-gun. That's the one place where the concern is genuinely true, and that's the one place where the classifier should fail-closed: op.execute() defaults to BREAKING, and the migration author can override only with an explicit annotation that CI validates.

So the proposal is:

  1. AST authoritative for everything except op.execute(). The classifier reads the operation and its arguments and classifies. No human input required for the 95%+ of migrations that don't use raw SQL.
  2. op.execute() defaults to BREAKING, with a narrow author-annotation escape hatch validated in CI. The annotation has to include a reason, has to be machine-parseable, and dev/schema_diff.py blocks the merge if it's missing or malformed.
  3. No general "annotate every migration" requirement. Contributors writing normal Alembic ops never see the annotation system. Only the small fraction of migrations that need raw SQL or fall into the borderline add_column shape pay the annotation cost.

This addresses the underlying concern (AST being wrong about something it can't see) without the cost of decaying manual annotations on every migration.
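A rough sketch of what the `op.execute()` escape hatch could look like. The `zero-downtime-safe:` annotation format and the validator are hypothetical, as is the idea of running it from `dev/schema_diff.py`; the point is only that the annotation is machine-parseable and must carry a non-empty reason.

```python
import re

# Hypothetical annotation format: a comment anywhere in the migration file,
# e.g.  "# zero-downtime-safe: backfill only touches rows old code never reads"
ANNOTATION = re.compile(r"#\s*zero-downtime-safe:\s*(?P<reason>\S.*)$", re.MULTILINE)

def execute_is_annotated(migration_source: str) -> bool:
    """True if the file either has no op.execute() or carries a valid annotation."""
    if "op.execute(" not in migration_source:
        return True  # no raw SQL, nothing to annotate
    match = ANNOTATION.search(migration_source)
    return match is not None and len(match.group("reason").strip()) > 0

good = '''
# zero-downtime-safe: backfill only touches rows old code never reads
def upgrade():
    op.execute("UPDATE runs SET note = '' WHERE note IS NULL")
'''
bad = '''
def upgrade():
    op.execute("DROP TABLE legacy_tags")
'''
print(execute_is_annotated(good), execute_is_annotated(bad))  # True False
```

In CI, a failure of this check would block the merge, so unannotated raw SQL defaults to BREAKING exactly as point 2 describes.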

Comment on lines +80 to +82
| Category | Description | Operations |
| --- | --- | --- |
| **SAFE** | Purely additive; old code is unaffected | `create_table`, `create_index`, `add_column` (nullable), `alter_column` (set nullable=True), `alter_column` (server_default change) |
| **CAUTIOUS** | Likely safe but requires human review | `add_column` (non-nullable), `alter_column` (type change), `drop_constraint`, `create_foreign_key`, `drop_index`, `execute` (raw SQL) |
| **BREAKING** | Destructive; old code will fail | `drop_table`, `drop_column`, `rename_table`, `alter_column` (rename), ORM/data migrations (detected via `session.query` / `op.get_bind`) |
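The table above, together with the RFC's "worst op wins" rule for migrations that contain multiple operations, can be encoded as a small lookup. This is an illustrative sketch, not MLflow code; the `CATEGORY` map is partial and deliberately fails closed to BREAKING for operations it does not know.

```python
SEVERITY = {"SAFE": 0, "CAUTIOUS": 1, "BREAKING": 2}

# Partial encoding of the table rows; unknown operations fail closed.
CATEGORY = {
    "create_table": "SAFE", "create_index": "SAFE",
    "add_column": "CAUTIOUS",  # real classifier inspects nullable/server_default
    "drop_constraint": "CAUTIOUS", "create_foreign_key": "CAUTIOUS",
    "drop_index": "CAUTIOUS", "execute": "CAUTIOUS",
    "drop_table": "BREAKING", "drop_column": "BREAKING", "rename_table": "BREAKING",
}

def classify_migration(ops: list[str]) -> str:
    """Worst op wins: one BREAKING op makes the whole migration BREAKING."""
    return max((CATEGORY.get(op, "BREAKING") for op in ops), key=SEVERITY.__getitem__)

print(classify_migration(["create_table", "create_index"]))  # SAFE
print(classify_migration(["create_index", "drop_column"]))   # BREAKING
```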
Contributor


How do admins judge whether a CAUTIOUS change is safe or not?

Author


Honest answer: I don't think CAUTIOUS should exist as a runtime category at all. It's the bucket we put migrations in when we couldn't decide, and asking the operator to decide is asking the person with the least context to make the call. I tried to construct a case where a third bucket genuinely earns its keep, and I couldn't find one.

Walking through what's currently CAUTIOUS in the table (line 81):

  • add_column (nullable=False) → BREAKING. This is the workspace failure mode from comment 1. There's no "it depends".
  • alter_column (type change) → SAFE for in-family widening (Alembic declares existing_type and type_; the AST reads both), BREAKING for narrowing or cross-family changes.
  • drop_constraint → SAFE. Strictly more permissive.
  • create_foreign_key → BREAKING by default. The author can opt into SAFE with an explicit annotation if they're sure old code already enforces the constraint application-side.
  • drop_index → SAFE. Same family as create_index.
  • execute (raw SQL) → BREAKING by default, with the annotation escape hatch from comment 2.

Every operation lands cleanly in SAFE or BREAKING. The middle bucket is empty.
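The widening-vs-narrowing decision for alter_column can be sketched mechanically. This is hypothetical illustration code: types are modeled as (family, size) tuples for simplicity, whereas a real classifier would map the SQLAlchemy type objects Alembic records in existing_type and type_ into this form.

```python
def classify_type_change(existing: tuple, new: tuple) -> str:
    """Decide SAFE/BREAKING for an alter_column type change.

    existing/new are (family, size) pairs, e.g. ("string", 250);
    size=None means unbounded (e.g. TEXT).
    """
    (old_family, old_size), (new_family, new_size) = existing, new
    if old_family != new_family:
        return "BREAKING"  # cross-family change: fail closed
    if new_size is None or (old_size is not None and new_size >= old_size):
        return "SAFE"  # in-family widening, e.g. VARCHAR(250) to VARCHAR(500)
    return "BREAKING"  # narrowing can truncate existing data

print(classify_type_change(("string", 250), ("string", 500)))    # SAFE
print(classify_type_change(("string", 500), ("string", 250)))    # BREAKING
print(classify_type_change(("string", 250), ("integer", None)))  # BREAKING
```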

The cases I worried might force a third bucket all collapsed:

  • add_column (nullable=True, server_default=non-null) — borderline shape, mechanically detectable, defaults to BREAKING, author opts in with annotation.
  • alter_column (server_default change) where new code depends on the new default — migration is SAFE, the hazard is in the code change in the same PR, caught by the comment-1 PR-diff check.
  • create_index on a huge table causing lock contention — operational concern about running the migration, not about the new server booting against the old DB. Out of scope for Tier 3 auto-compat; deserves its own treatment in a separate RFC.
  • Mixed-op migrations — already handled by the "worst op wins" rule on line 84.

If you can construct a case where CAUTIOUS genuinely earns its keep, I'd love to see it — that would change my mind. But absent one, I'd rather collapse the runtime taxonomy to two states (SAFE / BREAKING) so the operator's decision is always actionable. "SAFE, do a rolling deploy" or "BREAKING, scale down and run the migration" — no third state to interpret, no slow walk to the override flag, no false-confidence accumulation from ignored CAUTIOUS warnings.

The annotation system from another comment is the only place humans intervene, and it's narrow: only op.execute() and the borderline add_column cases require an annotation, and CI validates it. Most migration authors never see it. Most operators never see CAUTIOUS in check-upgrade output because it doesn't exist anymore.


- **Maintenance burden of the classifier**: The AST-based classifier must correctly handle all Alembic operation patterns used in MLflow migrations. New operation patterns may require updates to the classifier. The manual overrides dictionary must be maintained as new edge-case migrations are added.

- **False sense of safety**: The classifier makes static judgments about migration safety. It cannot account for all runtime behaviors. For example, a `create_index` on a very large table could cause lock contention in some databases, which is operationally unsafe even though it's structurally additive. The documentation should clearly state that "SAFE" means structurally additive, not guaranteed zero-impact.
Contributor


Documentation helps but is easily overlooked. I would prefer a more prominent warning where operators must explicitly confirm: "yes, I know this can cause unavailability".

Author

@PatrickKoss Apr 6, 2026


Genuinely good question, and the answer changes after the rework. The original RFC had a SAFE category that included grey-area migrations whose safety depended on what the server code did. In that world, "operator must explicitly acknowledge" was a fair safeguard — there were real cases where the operator was the last line of defense.

After tightening SAFE (only truly nullable additive columns), narrowing the AST classifier (fail-closed on op.execute()), and collapsing CAUTIOUS into SAFE/BREAKING, the SAFE bucket is now provably safe — not "we think it's probably fine", but "the AST has verified the operation is structurally additive AND the CI PR-diff check has verified the server code in the same release is backward-compatible with the old schema". There's no risk left for the operator to acknowledge in the way you originally meant.

That said, I still think Tier 3 should be opt-in at the server level, just for a different reason: changing operator-visible behavior on upgrade without consent is a bad pattern, regardless of whether the new behavior is safe. Concretely:

  • Some operators have wrapper scripts or health checks that depend on mlflow server failing fast on schema mismatch. Flipping Tier 3 on by default would silently break those.
  • "This server is participating in zero-downtime upgrades" should be a grep-able fact in the startup args or env config, not an implicit behavior.
  • Platform teams shouldn't get a deploy-semantics change for free when they bump MLflow versions. They should read the release notes and opt in.

So the proposal is mlflow server --allow-zero-downtime-upgrades (or MLFLOW_ALLOW_ZERO_DOWNTIME_UPGRADES=true), off by default. When the flag is off, behavior is identical to today. When it's on and the schema is out of date with all-SAFE pending migrations, the server logs at INFO level: "Schema mismatch detected, all N pending migrations classified as SAFE, starting normally" — informational, not alarmist, because SAFE actually means safe.
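The proposed startup behavior can be sketched as follows. The flag and environment variable names come from the proposal above, but `verify_schema` and its arguments are illustrative, not MLflow internals: with the flag off, behavior is identical to today's strict check; with it on and all pending migrations SAFE, the server starts and logs at INFO.

```python
import logging
import os

log = logging.getLogger("mlflow.server")

def verify_schema(pending: list[str], classify) -> None:
    """pending: unapplied revision ids; classify: revision id -> category."""
    if not pending:
        return  # schema already up to date
    allow = (
        os.environ.get("MLFLOW_ALLOW_ZERO_DOWNTIME_UPGRADES", "false").lower()
        == "true"
    )
    if allow and all(classify(rev) == "SAFE" for rev in pending):
        log.info(
            "Schema mismatch detected, all %d pending migrations classified "
            "as SAFE, starting normally", len(pending),
        )
        return
    # Flag off, or at least one pending migration is not SAFE: fail fast as today.
    raise RuntimeError(
        f"Database schema is out of date; run `mlflow db upgrade`. Pending: {pending}"
    )
```

Wrapper scripts that rely on fail-fast behavior keep working unless the operator has opted in, which is the grep-able consent the bullet list asks for.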

I want to be careful not to bake in a "you have been warned" pattern that contradicts the claim above that we've actually closed the failure modes.


- **Complexity in the startup path**: The tiered verification adds branching logic to server startup. If the classifier raises an unexpected exception, the fallback to Tier 4 ensures safety, but the added code paths increase the surface area for bugs.

- **Environment variable as escape hatch**: `MLFLOW_ALLOW_SCHEMA_MISMATCH=true` bypasses all safety checks. If operators set this permanently and forget about it, they could run into data corruption from genuinely breaking migrations. The documentation should emphasize this is a temporary override, not a permanent setting.
Contributor


This risk requires more mitigation than documentation. Instead of a flag, shall we provide a more explicit configuration (e.g. a max revision or an allowlist)?

Author


Agreed the boolean override is too blunt for the original RFC's design. The reason it felt blunt was that it was trying to do two jobs at once: "approve a CAUTIOUS migration I've reviewed" AND "bypass a hard mismatch in an emergency". Once CAUTIOUS goes away (reworked), those two jobs separate cleanly, and I think the right answer is one bypass mechanism, not an allowlist plus a boolean.

Concretely:

  • Routine zero-downtime upgrade. Operator runs mlflow db check-upgrade, sees all-SAFE, deploys with --allow-zero-downtime-upgrades. No bypass involved. Tier 3 just works because the classifier has proven the migrations are safe.
  • Emergency bypass. Production is down, operator needs to roll forward against an old DB, at least one pending migration is BREAKING. They set MLFLOW_BYPASS_SCHEMA_CHECK_DANGEROUS=true (the name is intentionally uncomfortable to type). The server logs an ERROR at startup listing every migration being skipped, and an ERROR on every incoming request, so leaving it on accidentally creates a paper trail nobody can ignore.

I considered the allowlist mechanism, and I think it's solving a problem that disappears once CAUTIOUS is gone. Its purpose was "let the operator selectively approve specific revisions whose safety is judgment-dependent". After the rework, there are no judgment-dependent revisions — each pending migration is either provably SAFE or provably BREAKING (only crossable via the emergency bypass). The allowlist would have nothing to selectively allow.

The one case I can imagine wanting an allowlist for is "this migration is BREAKING per the classifier but I, the operator, know it's safe for my deployment because I don't use the affected feature". That's the feature-granularity case from the other comment, and per the analysis there, MLflow doesn't have the infrastructure (feature flags, conditional router mounting) to make that judgment reliable today (again correct me if I am wrong). If the operator wants to override anyway, the emergency bypass is the right tool — one loud escape hatch, not a routine-looking selective-approval mechanism.

I want the bypass to feel like a fire extinguisher: visible, ugly, present for emergencies, painful to leave engaged.

2. Run `mlflow db upgrade`
3. Scale back up

For teams running MLflow as a shared tracking server in production (e.g., on Kubernetes), this means scheduled downtime on every release. The problem is that the vast majority of MLflow migrations are purely additive: creating new tables, adding nullable columns, or building indexes. These operations are fully compatible with the old code still running. A server at revision N can safely serve traffic while the database is at revision N or N+K, as long as the pending migrations only add things the old code doesn't reference.
Contributor


Additive change does not always mean safe. For example, the upgraded server code often includes updated SQL statements that insert rows with the new column, which fail until the DB revision catches up.

For example, revision 1b5f0d9ad7c1 adds a workspace column to many tables like experiment. Every experiment-creation request now includes the workspace column with the default value and will fail without the migration. Some other operations listed under the SAFE category have a similar failure mode (e.g. skipping a nullable=True change causes server logic that inserts a NULL value to fail).

Whether a migration is safe cannot be determined solely from the database schema; it largely depends on the server logic. I think this significantly narrows the chance of migrations being categorized as "SAFE" or "CAUTIOUS".

Generally, skipping a revision will break some features (otherwise we wouldn't need the revision at all, index changes aside). The trade-off operators then make is whether the affected feature is critical or not. For example, if the organization doesn't use AI Gateway, it is fine to skip migrations for the related tables. I think this flexibility is useful for many organizations, but we need to rethink the design if we go in this direction. The criterion is not safe vs. unsafe, but whether breaking particular features is acceptable.

Author


You're right that the workspace migration is broken, but I'd push back on using it to disqualify the broader "additive" category. The migration is broken for a specific reason: it adds a nullable=False column with a hardcoded server_default='default' and the new server code immediately writes to it on every INSERT. That's not an additive change in the API-versioning sense — it's a required new field with a polite default. Compare to a REST API: adding X-Workspace: required to every POST is a breaking change; adding X-Workspace: optional, defaults to null isn't. The migration is the first kind, not the second.

Under the current RFC table, add_column (nullable=False) is already CAUTIOUS, not SAFE (line 81), so the literal example wouldn't auto-pass Tier 3 anyway. But the example does highlight that we should tighten what add_column (nullable=True) means. A genuinely additive column looks like this:

  • nullable=True
  • No server_default, or a default that produces NULL
  • New code tolerates NULL on read paths (because that's what existing rows will have)
  • New code doesn't unconditionally require the column on write paths against existing tables

When you do it this way, it's safe in both directions. Old server reading new schema: the column is just there, ignored. New server writing to old schema: as long as the new code doesn't require the column in INSERTs that target rows from the old DB, the migration is structurally compatible.

So I'd like to:

  1. Tighten the SAFE classification of add_column to require nullable=True and no non-null server_default. The presence of a non-null default is a smell that the new code expects a value, which is the API-versioning red flag your example actually demonstrates.
  2. Add a CI check in dev/schema_diff.py that fails any PR adding a migration and server-side code that unconditionally writes the new column. The migration alone can't tell us whether the code is backward-compatible — the code change in the same PR can. This is the missing piece that makes "additive" actually mean "additive in both directions".
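The tightened rule from point 1 is mechanically checkable with the same `ast` approach the RFC already uses. This sketch is illustrative, not the real classifier: it inspects a single `op.add_column(...)` expression and returns True only for the genuinely additive shape (nullable=True, no non-null server_default).

```python
import ast

def add_column_is_safe(src: str) -> bool:
    """True only for nullable=True with no non-null server_default."""
    call = ast.parse(src, mode="eval").body  # the op.add_column(...) Call node
    kwargs = {kw.arg: kw.value for kw in call.keywords}
    # Column kwargs usually live on the sa.Column(...) positional argument.
    for arg in call.args:
        if isinstance(arg, ast.Call):
            kwargs.update({kw.arg: kw.value for kw in arg.keywords})
    nullable = kwargs.get("nullable")
    default = kwargs.get("server_default")
    nullable_true = isinstance(nullable, ast.Constant) and nullable.value is True
    no_default = default is None or (
        isinstance(default, ast.Constant) and default.value is None
    )
    return nullable_true and no_default

print(add_column_is_safe(
    "op.add_column('runs', sa.Column('note', sa.String(), nullable=True))"))  # True
print(add_column_is_safe(
    "op.add_column('e', sa.Column('workspace', sa.String(), nullable=False, "
    "server_default='default'))"))  # False
```

Under this rule the workspace migration fails the check on both counts (nullable=False and a non-null server_default), which is exactly the red flag described above.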

With those two changes, the workspace migration would be caught at PR time (because it's nullable=False and the same PR adds the workspace column to required INSERT positions), and the existing safely-additive migrations in MLflow's history (the many add_column nullable=True cases) would still be classified as SAFE. We don't have to throw out the category — we just have to draw the line in the right place.

The category I want to defend is real: the bulk of MLflow's migration history is genuinely backward-compatible additive changes, and forcing every one of them through a downtime window because of one badly-written migration would be the wrong lesson to take from this.

On the per-feature granularity framing: Please correct me if I am wrong, I might be. I spent some time looking at whether that's tractable in MLflow as it exists today, and I think the honest answer is "not without a much larger refactor than this RFC". MLflow has exactly one clean feature toggle that gates a database-touching subsystem (MLFLOW_ENABLE_WORKSPACES). Everything else — AI Gateway, scorers, online scoring, assessments, traces, evaluation datasets, jobs, webhooks — has no server-side enable/disable flag and exists by virtue of being in the code.

The pip install 'mlflow[gateway]' extras pattern looks like it could be a shortcut, but it isn't: the [gateway] and [genai] extras only install runtime provider dependencies (boto3, tiktoken, slowapi, etc.); they don't gate the gateway code itself. mlflow/server/fastapi_app.py does an unconditional from mlflow.server.gateway_api import gateway_router; fastapi_app.include_router(gateway_router) — there is no try/except ImportError and no if MLFLOW_ENABLE_GATEWAY guard. The gateway router is mounted on every MLflow tracking server regardless of which extras are installed, and the migrations live in the core package.

That means the operator-facing decision the per-feature framing wants to enable ("skip migrations for features I don't use") has no reliable input signal: an operator running plain pip install mlflow still has the gateway router exposed, and skipping the gateway migration would leave it broken for any caller who hit it. To make per-feature granularity work, MLflow would need conditional router mounting, conditional ORM model registration, graceful 404s on disabled features, and an explicit feature registry — a backward-incompatible architectural change to the server, not an addition.

I'd rather solve the downtime problem with the RFC's narrower approach, which doesn't depend on any of that. The two designs compose cleanly, and if MLflow grows a real feature-flag layer in a future RFC, mlflow db check-upgrade can be extended to consume it without rewriting any of this RFC's machinery. I've added a paragraph to the Alternatives section of the RFC explaining why per-feature granularity is out of scope for now and noting it as a compatible future direction.
