91 changes: 91 additions & 0 deletions databricks-skills/databricks-mlflow-ml/SKILL.md
@@ -0,0 +1,91 @@
---
name: databricks-mlflow-ml
description: "Classic ML model lifecycle on Databricks with MLflow and Unity Catalog. Use when training scikit-learn / XGBoost / PyTorch models with MLflow tracking, registering models to Unity Catalog (three-level names, @champion / @challenger aliases), setting mlflow.set_registry_uri('databricks-uc'), logging experiments with UC volume artifact_location, loading registered models via mlflow.pyfunc.load_model or mlflow.pyfunc.spark_udf, and running batch inference (notebook or Lakeflow SDP pipeline). Not for GenAI agent evaluation — use databricks-mlflow-evaluation for that. Not for Model Serving endpoints — use databricks-model-serving for that."
---

# MLflow + Unity Catalog — Classic ML

Read this file fully; consult `references/gotchas.md` before writing UC code; consult `references/recipes.md` only for the alias-swap and `spark_udf` patterns.

If you're tempted to read `patterns-training.md`, `patterns-experiment-setup.md`, `patterns-uc-registration.md`, or `patterns-batch-inference.md` to figure out basic sklearn training, stop — you don't need them. This skill is only about the Databricks / Unity Catalog parts that are easy to miss.

## Why This Skill Exists

Three skills in the AI Dev Kit touch MLflow; this one owns **classic ML training + UC registration + batch inference**.

| Skill | Scope | MLflow API Surface |
|-------|-------|--------------------|
| `databricks-mlflow-evaluation` | GenAI agent evaluation | `mlflow.genai.evaluate()`, scorers, judges, traces |
| `databricks-model-serving` | Real-time serving endpoints | Deployment APIs, endpoint management, `ai_query` |
| `databricks-mlflow-ml` *(this skill)* | Classic ML + UC registration + batch inference | `mlflow.sklearn.log_model`, `register_model`, `set_registered_model_alias`, `pyfunc.load_model`, `pyfunc.spark_udf` |

Use this skill when training forecasting / classification / regression models, registering them to Unity Catalog, and scoring them in a notebook or Lakeflow pipeline. Do not use it for GenAI evaluation or Model Serving endpoint management.

## Hard Rules

1. Call `mlflow.set_registry_uri("databricks-uc")` before registering or loading UC models.
2. UC model names are always three-level: `catalog.schema.model_name`.
3. Load by alias, not version: `models:/catalog.schema.model@champion`, not `models:/catalog.schema.model/3`.
4. In UC-enforced workspaces, experiments need a UC volume artifact location, e.g. `artifact_location="dbfs:/Volumes/<catalog>/<schema>/<volume>/<path>"`, set at creation time via `mlflow.create_experiment`.
5. `register_model` creates a version; it does **not** set `@champion` or `@challenger`.
6. Use aliases for lifecycle. Legacy stages like `Production` / `Staging` are deprecated for UC models.

## Quick Start

Minimum viable path from trained model object to UC-registered, notebook-scored model:

```python
import mlflow
import mlflow.sklearn
from mlflow import MlflowClient
from mlflow.models import infer_signature

CATALOG = "my_catalog"
SCHEMA = "my_schema"
MODEL_NAME = f"{CATALOG}.{SCHEMA}.my_model"

# 1. Configure UC registry + UC volume-backed experiment.
#    artifact_location can only be set when the experiment is created;
#    set_experiment alone does not accept it.
mlflow.set_registry_uri("databricks-uc")

EXPERIMENT_NAME = "/Users/me@company.com/forecasting"
if mlflow.get_experiment_by_name(EXPERIMENT_NAME) is None:
    mlflow.create_experiment(
        name=EXPERIMENT_NAME,
        artifact_location=f"dbfs:/Volumes/{CATALOG}/{SCHEMA}/mlflow_artifacts/forecasting",
    )
mlflow.set_experiment(EXPERIMENT_NAME)

# 2. Train + log. Use name="model" in MLflow 3.x; artifact_path="model" only for older code.
with mlflow.start_run() as run:
    model.fit(X_train, y_train)
    signature = infer_signature(X_train, model.predict(X_train[:5]))

    mlflow.sklearn.log_model(
        sk_model=model,  # log the full Pipeline if preprocessing exists
        name="model",
        signature=signature,
        input_example=X_train.iloc[:5],
    )

# 3. Register + set alias. register_model returns a ModelVersion; alias is a separate call.
result = mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name=MODEL_NAME,
)
MlflowClient().set_registered_model_alias(MODEL_NAME, "champion", result.version)

# 4. Load by alias, never by hard-coded version.
loaded = mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}@champion")
predictions = loaded.predict(X_test)
```

## Decision Table

| Situation | Do this |
|-----------|---------|
| Starting a first UC-registered classic ML model | Quick Start, then `recipes.md` §1–2; check `gotchas.md` #1, #2, #4, #7 |
| Model registered but missing from Catalog Explorer | Diagnose `set_registry_uri` and three-level names in `gotchas.md` #1–2 |
| Need notebook batch scoring | Use `mlflow.pyfunc.load_model("models:/catalog.schema.model@champion")`; keep the alias rule above |
| Need scheduled / distributed batch scoring in Lakeflow SDP | Use `recipes.md` §3 and `gotchas.md` #11; construct `spark_udf` at module scope |
| Retrained a challenger and need promotion | Use `recipes.md` §4 exactly; delete old `@champion` before setting new `@champion` |
| Load or predict behaves oddly | Use `recipes.md` §5 for `get_model_info` / signature checks (see the sketch below), then `gotchas.md` for UC-specific failures |
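
A minimal sketch of the signature check meant in the last row (the model name is a placeholder; `recipes.md` §5 has the full version):

```python
import mlflow
from mlflow.models import get_model_info

mlflow.set_registry_uri("databricks-uc")

uri = "models:/my_catalog.my_schema.my_model@champion"  # hypothetical UC model

info = get_model_info(uri)        # metadata only, no full model load
print(info.signature)             # expected input / output schema
print(list(info.flavors))         # e.g. ['python_function', 'sklearn']

loaded = mlflow.pyfunc.load_model(uri)
print(loaded.metadata.signature)  # same check on an already-loaded model
```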

## Runtime Compatibility

MLflow 3.x prefers `name=` in `log_model`; MLflow 2.x examples often use `artifact_path=`, which works but warns in newer versions. UC model stages are deprecated across modern Databricks runtimes; use aliases.
161 changes: 161 additions & 0 deletions databricks-skills/databricks-mlflow-ml/references/gotchas.md
@@ -0,0 +1,161 @@
# Databricks / Unity Catalog Gotchas

Only the Databricks + Unity Catalog-specific failures are here. Generic MLflow, sklearn, and modeling advice intentionally lives elsewhere.

## Runtime Gotcha Matrix

| Area | MLflow 2.x | MLflow 3.x / newer Databricks guidance |
|------|------------|-----------------------------------------|
| Model artifact argument | `artifact_path="model"` is common | Prefer `name="model"`; `artifact_path` warns and may disappear later |
| UC lifecycle | Stages already deprecated for UC | Use aliases only: `@champion`, `@challenger`, custom aliases |
| Registry target | Workspace registry remains default unless changed | Still call `mlflow.set_registry_uri("databricks-uc")` explicitly |

---

## 1. Missing `mlflow.set_registry_uri("databricks-uc")`

**How it fails:** Silent. `register_model` succeeds, but the model lands in the legacy workspace registry, not Unity Catalog; Catalog Explorer cannot find it.

**Fix:** call this before any register or load:

```python
mlflow.set_registry_uri("databricks-uc")
assert mlflow.get_registry_uri() == "databricks-uc"
```

**Why:** MLflow keeps workspace-registry defaults for backward compatibility, so the API call can succeed in the wrong registry.

---

## 2. Not using a three-level UC model name

**How it fails:** Loud with UC registry (`INVALID_PARAMETER_VALUE`), but silent-wrong if you also forgot `set_registry_uri`: two-level names can register to the workspace registry.

**Fix:** always use `catalog.schema.model_name`.

```python
# Wrong
"my_model"
"my_schema.my_model"

# Correct
"my_catalog.my_schema.my_model"
```

**Why:** Unity Catalog models are securable objects under a catalog and schema; workspace-registry names are not.

---

## 3. Experiment artifact location is not a UC volume

**How it fails:** Usually loud later, not at setup: `log_model` or artifact upload fails with storage / permission errors. In older patterns, artifacts may silently land in DBFS root, which breaks UC governance expectations.

**Fix:** set a UC volume-backed artifact location when creating the experiment.

```python
experiment_name = "/Users/me@company.com/forecasting"
if mlflow.get_experiment_by_name(experiment_name) is None:
    # artifact_location can only be set when the experiment is created.
    mlflow.create_experiment(
        name=experiment_name,
        artifact_location="dbfs:/Volumes/my_catalog/my_schema/mlflow_artifacts/forecasting",
    )
mlflow.set_experiment(experiment_name)
```

**Why:** UC-enforced workspaces reject unmanaged DBFS-root artifact writes; UC volumes keep model artifacts governed and loadable.

---

## 4. Using legacy `Production` / `Staging` stages

**How it fails:** Silent or misleading. Stage APIs such as `transition_model_version_stage()` are deprecated / ineffective for UC models; aliases named `"Production"` may exist as labels but are not treated as lifecycle stages.

**Fix:** use UC aliases by convention:

```python
MlflowClient().set_registered_model_alias(name, "champion", version)
MlflowClient().set_registered_model_alias(name, "challenger", version)
```

**Why:** Unity Catalog model lifecycle moved from stages to free-form aliases; downstream loaders should use `models:/name@champion`.

---

## 5. Missing `CREATE MODEL ON SCHEMA`

**How it fails:** Loud. `register_model` raises `PERMISSION_DENIED: User ... does not have CREATE MODEL permission`.

**Fix:** ask the schema owner for the schema-level model-creation grant.

```sql
GRANT CREATE MODEL ON SCHEMA my_catalog.my_schema TO `user@company.com`;
SHOW GRANTS ON SCHEMA my_catalog.my_schema;
```

**Why:** `USE CATALOG` and `USE SCHEMA` are not enough; model creation is a separate UC privilege.

---

## 6. Assuming `ai_query` is batch inference for custom UC models

**How it fails:** Loud or wrong-primitive. `ai_query` calls serving endpoints; a UC-registered custom model is not automatically a serving endpoint.

**Fix:** for batch inference, use:

```python
mlflow.pyfunc.load_model("models:/catalog.schema.model@champion")  # notebook / pandas path
mlflow.pyfunc.spark_udf(spark, "models:/catalog.schema.model@champion", result_type="double")  # Spark DataFrame path
```

**Why:** registration and serving are separate. `ai_query` belongs to Model Serving / Foundation Model endpoint workflows, not ordinary UC batch scoring.
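
Applied end to end, a minimal notebook batch-scoring sketch (the table and feature column names are hypothetical):

```python
import mlflow

mlflow.set_registry_uri("databricks-uc")

predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    "models:/catalog.schema.model@champion",
    result_type="double",
)

feature_cols = ["feature_a", "feature_b", "feature_c"]   # hypothetical features
scored = (
    spark.table("catalog.schema.features")               # hypothetical source table
    .withColumn("prediction", predict_udf(*feature_cols))
)
scored.write.mode("overwrite").saveAsTable("catalog.schema.predictions")
```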

---

## 7. Constructing `spark_udf` inside a Lakeflow SDP function

**How it fails:** Often loud and slow: repeated model deserialization, serialization errors, or pipeline refreshes that hang / retry. Sometimes just silently expensive.

**Fix:** construct the UDF once at module scope and call it inside `@dp.table` / `@dp.materialized_view`.

```python
mlflow.set_registry_uri("databricks-uc")
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    "models:/catalog.schema.model@champion",
    result_type="double",
)
```

**Why:** Lakeflow SDP can evaluate dataset functions repeatedly; model loading belongs at module import time, not inside the dataset function body.
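
A sketch of the resulting module shape, assuming the Lakeflow SDP Python import is `from pyspark import pipelines as dp` (adjust to your pipeline's import; table and column names are hypothetical):

```python
from pyspark import pipelines as dp  # assumption: Lakeflow SDP declarative-pipelines import
import mlflow

# Module scope: registry URI and UDF are constructed once at import, not per refresh.
mlflow.set_registry_uri("databricks-uc")
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    "models:/catalog.schema.model@champion",
    result_type="double",
)

@dp.table(name="scored_features")
def scored_features():
    # The dataset function only references the already-built UDF.
    return (
        spark.read.table("catalog.schema.features")  # hypothetical upstream table
        .withColumn("prediction", predict_udf("feature_a", "feature_b"))
    )
```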

---

## 8. Missing `mlflow[databricks]` extras outside Databricks compute

**How it fails:** Loud. Local laptop / CI / non-Databricks jobs may train and log, then fail on UC registration with missing cloud SDK imports such as `azure`, `boto3`, or `google.cloud`.

**Fix:**

```bash
pip install 'mlflow[databricks]'
# or
pip install 'mlflow-skinny[databricks]'
```

**Why:** UC registration stages artifacts through cloud-managed storage; the Databricks extras include the provider SDKs that plain `mlflow` may omit.
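
Outside Databricks compute you also have to point MLflow at the workspace explicitly; a minimal sketch, assuming `DATABRICKS_HOST` / `DATABRICKS_TOKEN` (or a configured profile) are already set for authentication:

```python
import mlflow

mlflow.set_tracking_uri("databricks")     # tracking server = the Databricks workspace
mlflow.set_registry_uri("databricks-uc")  # model registry = Unity Catalog

# After that, register / load behave as they do on Databricks compute, e.g.:
# mlflow.register_model(f"runs:/{run_id}/model", "my_catalog.my_schema.my_model")
```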

---

## 9. Using deprecated `artifact_path=` instead of `name=`

**How it fails:** Noisy now, possibly loud later. Newer MLflow warns that `artifact_path` is deprecated; future major versions may remove it.

**Fix:** prefer:

```python
mlflow.sklearn.log_model(
    sk_model=model,
    name="model",
    signature=signature,
    input_example=input_example,
)
```

**Why:** MLflow renamed the within-run model artifact argument; the value still becomes the path used by `runs:/<run_id>/model`.