91 changes: 91 additions & 0 deletions databricks-skills/databricks-mlflow-ml/SKILL.md
@@ -0,0 +1,91 @@
---
name: databricks-mlflow-ml
description: "Classic ML model lifecycle on Databricks with MLflow and Unity Catalog. Use when training scikit-learn / XGBoost / PyTorch models with MLflow tracking, registering models to Unity Catalog (three-level names, @champion / @challenger aliases), setting mlflow.set_registry_uri('databricks-uc'), logging experiments with UC volume artifact_location, loading registered models via mlflow.pyfunc.load_model or mlflow.pyfunc.spark_udf, and running batch inference (notebook or Lakeflow SDP pipeline). Not for GenAI agent evaluation — use databricks-mlflow-evaluation for that. Not for Model Serving endpoints — use databricks-model-serving for that."
---

# MLflow + Unity Catalog — Classic ML

Read this file fully; consult `references/gotchas.md` before writing UC code; consult `references/recipes.md` only for the alias-swap and `spark_udf` patterns.

If you're tempted to read `patterns-training.md`, `patterns-experiment-setup.md`, `patterns-uc-registration.md`, or `patterns-batch-inference.md` to figure out basic sklearn training, stop — you don't need them. This skill is only about the Databricks / Unity Catalog parts that are easy to miss.

## Why This Skill Exists

Three skills in the AI Dev Kit touch MLflow; this one owns **classic ML training + UC registration + batch inference**.

| Skill | Scope | MLflow API Surface |
|-------|-------|--------------------|
| `databricks-mlflow-evaluation` | GenAI agent evaluation | `mlflow.genai.evaluate()`, scorers, judges, traces |
| `databricks-model-serving` | Real-time serving endpoints | Deployment APIs, endpoint management, `ai_query` |
| `databricks-mlflow-ml` *(this skill)* | Classic ML + UC registration + batch inference | `mlflow.sklearn.log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf` |

Use this skill when training forecasting / classification / regression models, registering them to Unity Catalog, and scoring them in a notebook or Lakeflow pipeline. Do not use it for GenAI evaluation or Model Serving endpoint management.

## Hard Rules

1. Call `mlflow.set_registry_uri("databricks-uc")` before registering or loading UC models.
2. UC model names are always three-level: `catalog.schema.model_name`.
3. Load by alias, not version: `models:/catalog.schema.model@champion`, not `models:/catalog.schema.model/3`.
4. In UC-enforced workspaces, experiments need a UC volume artifact location, e.g. `artifact_location="dbfs:/Volumes/<catalog>/<schema>/<volume>/<path>"`, set at creation time via `mlflow.create_experiment`.
5. `register_model` creates a version; it does **not** set `@champion` or `@challenger`.
6. Use aliases for lifecycle. Legacy stages like `Production` / `Staging` are deprecated for UC models.

## Quick Start

Minimum viable path from trained model object to UC-registered, notebook-scored model:

```python
import mlflow
import mlflow.sklearn
from mlflow import MlflowClient
from mlflow.models import infer_signature

CATALOG = "my_catalog"
SCHEMA = "my_schema"
MODEL_NAME = f"{CATALOG}.{SCHEMA}.my_model"

# 1. Configure UC registry + UC volume-backed experiment.
#    artifact_location can only be set when the experiment is created;
#    set_experiment alone does not accept it.
mlflow.set_registry_uri("databricks-uc")

EXPERIMENT_NAME = "/Users/me@company.com/forecasting"
if mlflow.get_experiment_by_name(EXPERIMENT_NAME) is None:
    mlflow.create_experiment(
        name=EXPERIMENT_NAME,
        artifact_location=f"dbfs:/Volumes/{CATALOG}/{SCHEMA}/mlflow_artifacts/forecasting",
    )
mlflow.set_experiment(EXPERIMENT_NAME)

# 2. Train + log. Use name="model" in MLflow 3.x; artifact_path="model" only for older code.
with mlflow.start_run() as run:
    model.fit(X_train, y_train)
    signature = infer_signature(X_train, model.predict(X_train[:5]))

    mlflow.sklearn.log_model(
        sk_model=model,  # log the full Pipeline if preprocessing exists
        name="model",
        signature=signature,
        input_example=X_train.iloc[:5],
    )

# 3. Register + set alias. register_model returns a ModelVersion; alias is a separate call.
result = mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name=MODEL_NAME,
)
MlflowClient().set_registered_model_alias(MODEL_NAME, "champion", result.version)

# 4. Load by alias, never by hard-coded version.
loaded = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion")
predictions = loaded.predict(X_test)
```

## Decision Table

| Situation | Do this |
|-----------|---------|
| Starting a first UC-registered classic ML model | Quick Start, then `recipes.md` §1–2; check `gotchas.md` #1, #2, #4, #7 |
| Model registered but missing from Catalog Explorer | Diagnose `set_registry_uri` and three-level names in `gotchas.md` #1–2 |
| Need notebook batch scoring | Use `mlflow.pyfunc.load_model("models:/catalog.schema.model@champion")`; keep the alias rule above |
| Need scheduled / distributed batch scoring in Lakeflow SDP | Use `recipes.md` §3 and `gotchas.md` #11; construct `spark_udf` at module scope |
| Retrained a challenger and need promotion | Use `recipes.md` §4 exactly; delete old `@champion` before setting new `@champion` |
| Load or predict behaves oddly | Use `recipes.md` §5 for `get_model_info` / signature checks (see the sketch below), then `gotchas.md` for UC-specific failures |
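
A minimal sketch of the signature check meant in the last row (the model name is a placeholder; `recipes.md` §5 has the full version):

```python
import mlflow
from mlflow.models import get_model_info

mlflow.set_registry_uri("databricks-uc")

uri = "models:/my_catalog.my_schema.my_model@champion"  # hypothetical UC model

info = get_model_info(uri)        # metadata only, no full model load
print(info.signature)             # expected input / output schema
print(list(info.flavors))         # e.g. ['python_function', 'sklearn']

loaded = mlflow.pyfunc.load_model(uri)
print(loaded.metadata.signature)  # same check on an already-loaded model
```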

## Runtime Compatibility

MLflow 3.x prefers `name=` in `log_model`; MLflow 2.x examples often use `artifact_path=`, which works but warns in newer versions. UC model stages are deprecated across modern Databricks runtimes; use aliases.
161 changes: 161 additions & 0 deletions databricks-skills/databricks-mlflow-ml/references/gotchas.md
@@ -0,0 +1,161 @@
# Databricks / Unity Catalog Gotchas

Only the Databricks + Unity Catalog-specific failures are here. Generic MLflow, sklearn, and modeling advice intentionally lives elsewhere.

## Runtime Gotcha Matrix

| Area | MLflow 2.x | MLflow 3.x / newer Databricks guidance |
|------|------------|-----------------------------------------|
| Model artifact argument | `artifact_path="model"` is common | Prefer `name="model"`; `artifact_path` warns and may disappear later |
| UC lifecycle | Stages already deprecated for UC | Use aliases only: `@champion`, `@challenger`, custom aliases |
| Registry target | Workspace registry remains default unless changed | Still call `mlflow.set_registry_uri("databricks-uc")` explicitly |

---

## 1. Missing `mlflow.set_registry_uri("databricks-uc")`

**How it fails:** Silent. `register_model` succeeds, but the model lands in the legacy workspace registry, not Unity Catalog; Catalog Explorer cannot find it.

**Fix:** call this before any register or load:

```python
mlflow.set_registry_uri("databricks-uc")
assert mlflow.get_registry_uri() == "databricks-uc"
```

**Why:** MLflow keeps workspace-registry defaults for backward compatibility, so the API call can succeed in the wrong registry.

---

## 2. Not using a three-level UC model name

**How it fails:** Loud with UC registry (`INVALID_PARAMETER_VALUE`), but silent-wrong if you also forgot `set_registry_uri`: two-level names can register to the workspace registry.

**Fix:** always use `catalog.schema.model_name`.

```python
# Wrong
"my_model"
"my_schema.my_model"

# Correct
"my_catalog.my_schema.my_model"
```

**Why:** Unity Catalog models are securable objects under a catalog and schema; workspace-registry names are not.

---

## 3. Experiment artifact location is not a UC volume

**How it fails:** Usually loud later, not at setup: `log_model` or artifact upload fails with storage / permission errors. In older patterns, artifacts may silently land in DBFS root, which breaks UC governance expectations.

**Fix:** set a UC volume-backed artifact location when creating the experiment.

```python
experiment_name = "/Users/me@company.com/forecasting"
if mlflow.get_experiment_by_name(experiment_name) is None:
    # artifact_location can only be set when the experiment is created.
    mlflow.create_experiment(
        name=experiment_name,
        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
    )
mlflow.set_experiment(experiment_name)
```

**Why:** UC-enforced workspaces reject unmanaged DBFS-root artifact writes; UC volumes keep model artifacts governed and loadable.

---

## 4. Using legacy `Production` / `Staging` stages

**How it fails:** Silent or misleading. Stage APIs such as `transition_model_version_stage()` are deprecated / ineffective for UC models; aliases named `"Production"` may exist as labels but are not treated as lifecycle stages.

**Fix:** use UC aliases by convention:

```python
MlflowClient().set_registered_model_alias(name, "champion", version)
MlflowClient().set_registered_model_alias(name, "challenger", version)
```

**Why:** Unity Catalog model lifecycle moved from stages to free-form aliases; downstream loaders should use `models:/name@champion`.

---

## 5. Missing `CREATE MODEL ON SCHEMA`

**How it fails:** Loud. `register_model` raises `PERMISSION_DENIED: User ... does not have CREATE MODEL permission`.

**Fix:** ask the schema owner for the schema-level model-creation grant.

```sql
GRANT CREATE MODEL ON SCHEMA my_catalog.my_schema TO `user@company.com`;
SHOW GRANTS ON SCHEMA my_catalog.my_schema;
```

**Why:** `USE CATALOG` and `USE SCHEMA` are not enough; model creation is a separate UC privilege.

---

## 6. Assuming `ai_query` is batch inference for custom UC models

**How it fails:** Loud or wrong-primitive. `ai_query` calls serving endpoints; a UC-registered custom model is not automatically a serving endpoint.

**Fix:** for batch inference, use:

```python
mlflow.pyfunc.load_model("models:/catalog.schema.model@champion")  # notebook / pandas path
mlflow.pyfunc.spark_udf(spark, "models:/catalog.schema.model@champion", result_type="double")  # Spark DataFrame path
```

**Why:** registration and serving are separate. `ai_query` belongs to Model Serving / Foundation Model endpoint workflows, not ordinary UC batch scoring.
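
Applied end to end, a minimal notebook batch-scoring sketch (the table and feature column names are hypothetical):

```python
import mlflow

mlflow.set_registry_uri("databricks-uc")

predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    "models:/catalog.schema.model@champion",
    result_type="double",
)

feature_cols = ["feature_a", "feature_b", "feature_c"]   # hypothetical features
scored = (
    spark.table("catalog.schema.features")               # hypothetical source table
    .withColumn("prediction", predict_udf(*feature_cols))
)
scored.write.mode("overwrite").saveAsTable("catalog.schema.predictions")
```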

---

## 7. Constructing `spark_udf` inside a Lakeflow SDP function

**How it fails:** Often loud and slow: repeated model deserialization, serialization errors, or pipeline refreshes that hang / retry. Sometimes just silently expensive.

**Fix:** construct the UDF once at module scope and call it inside `@dp.table` / `@dp.materialized_view`.

```python
mlflow.set_registry_uri("databricks-uc")
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    "models:/catalog.schema.model@champion",
    result_type="double",
)
```

**Why:** Lakeflow SDP can evaluate dataset functions repeatedly; model loading belongs at module import time, not inside the dataset function body.
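
A sketch of the resulting module shape, assuming the Lakeflow SDP Python import is `from pyspark import pipelines as dp` (adjust to your pipeline's import; table and column names are hypothetical):

```python
from pyspark import pipelines as dp  # assumption: Lakeflow SDP declarative-pipelines import
import mlflow

# Module scope: registry URI and UDF are constructed once at import, not per refresh.
mlflow.set_registry_uri("databricks-uc")
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    "models:/catalog.schema.model@champion",
    result_type="double",
)

@dp.table(name="scored_features")
def scored_features():
    # The dataset function only references the already-built UDF.
    return (
        spark.read.table("catalog.schema.features")  # hypothetical upstream table
        .withColumn("prediction", predict_udf("feature_a", "feature_b"))
    )
```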

---

## 8. Missing `mlflow[databricks]` extras outside Databricks compute

**How it fails:** Loud. Local laptop / CI / non-Databricks jobs may train and log, then fail on UC registration with missing cloud SDK imports such as `azure`, `boto3`, or `google.cloud`.

**Fix:**

```bash
pip install 'mlflow[databricks]'
# or
pip install 'mlflow-skinny[databricks]'
```

**Why:** UC registration stages artifacts through cloud-managed storage; the Databricks extras include the provider SDKs that plain `mlflow` may omit.
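
Outside Databricks compute you also have to point MLflow at the workspace explicitly; a minimal sketch, assuming `DATABRICKS_HOST` / `DATABRICKS_TOKEN` (or a configured profile) are already set for authentication:

```python
import mlflow

mlflow.set_tracking_uri("databricks")     # tracking server = the Databricks workspace
mlflow.set_registry_uri("databricks-uc")  # model registry = Unity Catalog

# After that, register / load behave as they do on Databricks compute, e.g.:
# mlflow.register_model(f"runs:/{run_id}/model", "my_catalog.my_schema.my_model")
```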

---

## 9. Using deprecated `artifact_path=` instead of `name=`

**How it fails:** Noisy now, possibly loud later. Newer MLflow warns that `artifact_path` is deprecated; future major versions may remove it.

**Fix:** prefer:

```python
mlflow.sklearn.log_model(
    sk_model=model,
    name="model",
    signature=signature,
    input_example=input_example,
)
```

**Why:** MLflow renamed the within-run model artifact argument; the value still becomes the path used by `runs:/<run_id>/model`.