feat(databricks-skills): add databricks-mlflow-ml skill for classic ML #474
dgokeeffe wants to merge 5 commits into databricks-solutions:main from
Conversation
Do the mlflow official skills we install not cover this gap? cc: @jacksandom
@dustinvannoy-db I checked them. The UC-specific stuff is what this PR covers: UC-enforced workspaces rejecting DBFS artifact roots, the legacy stage transition API silently no-oping on UC models, …
Fills the gap between databricks-mlflow-evaluation (GenAI agent eval) and databricks-model-serving (real-time endpoints). Covers:

- Classic ML model training with MLflow tracking (sklearn / XGBoost / PyTorch)
- Experiment creation with UC volume artifact_location (required in UC-enforced workspaces)
- Unity Catalog model registration with three-level names
- @champion / @challenger alias management
- Batch inference via mlflow.pyfunc.load_model (notebook, up to ~10k rows)
- Distributed batch via mlflow.pyfunc.spark_udf in Lakeflow SDP pipelines

Structure mirrors databricks-mlflow-evaluation:

- SKILL.md: workflows + trigger description + quick start
- references/GOTCHAS.md: 12 common mistakes with symptoms + fixes
- references/CRITICAL-interfaces.md: exact API signatures + models:/ URI format
- references/patterns-experiment-setup.md: UC volume artifact_location setup
- references/patterns-training.md: logging with signature + input_example
- references/patterns-uc-registration.md: register + alias + verify + A/B
- references/patterns-batch-inference.md: pyfunc.load_model + spark_udf + ai_query anti-pattern
- references/user-journeys.md: 7 end-to-end workflows including debugging

Key gotchas covered that other MLflow guides miss:

- Experiment creation now requires a UC volume artifact_location in UC-enforced workspaces (DBFS root writes are rejected)
- mlflow.set_registry_uri('databricks-uc') is required; silent workspace registry fallback is the #1 support question
- ai_query does NOT work on custom UC-registered models unless they're deployed to a serving endpoint; use pyfunc.load_model or spark_udf instead
- UC aliases (@champion/@challenger) replace deprecated stage transitions (transition_model_version_stage is a no-op on UC models)
- mlflow.pyfunc.spark_udf must be constructed at module scope in Lakeflow SDP pipelines, not inside the function body

Tested against MLflow 2.16+ on Databricks Runtime 15.4 LTS. Content battle-tested in the Coles Vibe Workshop (classic-ML track running in an airgapped environment where online MLflow docs aren't reachable).
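To make the artifact_location requirement concrete, here is a minimal sketch of the experiment-setup pattern, assuming hypothetical catalog/schema/volume names (`ml`, `forecasting`, `mlflow_artifacts`):

```python
import mlflow

# Route registry calls to Unity Catalog up front; without this,
# register_model silently falls back to the workspace registry.
mlflow.set_registry_uri("databricks-uc")

# In a UC-enforced workspace the experiment must point at a UC volume,
# not the DBFS root. The volume path below is a hypothetical example.
experiment_id = mlflow.create_experiment(
    name="/Shared/demand-forecast",
    artifact_location="dbfs:/Volumes/ml/forecasting/mlflow_artifacts/demand-forecast",
)
mlflow.set_experiment(experiment_id=experiment_id)
```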
Field-tested the skill end-to-end from a local Python environment against a live Databricks workspace. Surfaced two gotchas not in the original set:

#12 mlflow[databricks] extras missing when running outside Databricks: plain `pip install mlflow` omits the azure-core / boto3 / google.cloud SDKs that UC registration needs to stage artifacts. Training + log_model work; register_model fails with an opaque "No module named 'azure'". Databricks clusters ship the extras pre-installed, so this only bites laptops / CI.

#13 artifact_path= deprecated in favour of name= (MLflow 2.16+): emits a warning on every log_model call. Non-blocking, but worth flagging since most online tutorials + training courses still use the old param.

Both verified against the workshop's test run — skill workflow 1 now completes cleanly with these fixes documented.
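A hedged sketch of the local-environment fix described above; the run URI placeholder and the three-level model name are hypothetical:

```python
# Outside Databricks, install the extras first so UC registration can
# stage artifacts to cloud storage:
#   pip install 'mlflow[databricks]'
import mlflow

mlflow.set_tracking_uri("databricks")
mlflow.set_registry_uri("databricks-uc")

# With plain `pip install mlflow`, everything up to log_model succeeds;
# this call is what fails with "No module named 'azure'" on Azure:
mlflow.register_model(
    "runs:/<run_id>/model",         # placeholder run URI
    "ml.forecasting.demand_model",  # hypothetical catalog.schema.model name
)
```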
Original SKILL.md didn't state a runtime target. Adds a "Runtime compatibility" section anchored on what the skill was actually tested against — MLflow 3.11 on Lakeflow SDP serverless compute v5 — with a compat note for MLflow 2.16+ (classic DBR 15.4 LTS still ships 2.x). Points at GOTCHAS.md for the 3.x-vs-2.x divergence (artifact_path deprecation, etc.).
Force-pushed from bf84ee5 to cf21195
here's what Claude suggests:
Quentin posted a Claude-generated audit on PR #474 specifying the restructure. Ran gpt-5.5 in logfood with the audit as the spec.

Changes: 8 files / 1,666 lines → 3 files / 485 lines (71% reduction).

Structure:

- SKILL.md (91 lines) — frontmatter, 3-skill comparison table, hard rules, Quick Start, decision table for situation→recipe routing, read-order instruction at top, negative list ("don't read X-pattern.md for sklearn 101").
- references/gotchas.md (161 lines) — only Databricks/UC-specific failures: silently-wrong workspace registry, three-level UC names, artifact_location UC volume in UC-enforced workspaces, alias-on-stage no-op, CREATE MODEL ON SCHEMA grant, ai_query vs custom-model batch, spark_udf module scope in Lakeflow SDP, mlflow[databricks] extras, artifact_path→name deprecation. Each entry: symptom + silent/loud + fix + one-sentence why.
- references/recipes.md (233 lines) — UC-specific code shapes only: experiment + UC volume setup, log→register→alias canonical pattern, Lakeflow SDP spark_udf module scope, A/B alias swap order, verification one-liners.

Deleted (per Quentin's audit):

- references/CRITICAL-interfaces.md (90% plain MLflow API)
- references/GOTCHAS.md (replaced by lowercase gotchas.md, dropping the generic entries: alias-not-version, verify-after-register, signature basics, version reuse, Pipeline preprocessing — all generic MLflow / sklearn knowledge)
- references/user-journeys.md (pure pointer-shuffling)
- references/patterns-experiment-setup.md
- references/patterns-training.md
- references/patterns-uc-registration.md
- references/patterns-batch-inference.md

Workflow tables in SKILL.md replaced by a 6-row decision table. Common Issues table consolidated into gotchas.md. Reference Files list dropped — Claude can ls.

Co-authored-by: Isaac
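The "A/B alias swap order" recipe called out above, as a hedged sketch; the model name and `new_version` are hypothetical, and the alias calls use the standard MlflowClient API:

```python
import mlflow
from mlflow import MlflowClient

mlflow.set_registry_uri("databricks-uc")

client = MlflowClient()
NAME = "ml.forecasting.demand_model"  # hypothetical three-level UC name
new_version = 7                       # hypothetical newly registered version

# 1. Point @challenger at the candidate and validate it first.
client.set_registered_model_alias(NAME, "challenger", new_version)

# ... champion-vs-challenger validation runs here ...

# 2. Only after validation passes, reassign @champion. Setting an alias
#    moves it in a single call, so consumers never see a missing alias.
client.set_registered_model_alias(NAME, "champion", new_version)
```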
The macOS case-insensitive filesystem hid this from the previous commit. The content was already lowercased in references; this commit makes the git index match. Co-authored-by: Isaac
Why
The existing MLflow-related skills leave a gap for classic ML practitioners:
- `databricks-mlflow-evaluation` — GenAI agent evaluation (`mlflow.genai.evaluate`, scorers, judges)
- `databricks-model-serving` — real-time serving endpoints
- `databricks-unity-catalog` — UC governance and grants
- `databricks-mlflow-ml` (this PR) — classic ML training, UC registration, batch inference

A data scientist training a forecasting model, registering it to Unity Catalog, and scoring predictions in a notebook or Lakeflow pipeline has no skill to trigger on. This PR fills that gap.
What's in the skill
SKILL.md — workflow index (Train → Register → Score, Retrain + Promote A/B, Debugging), quick-start, runtime compatibility note, and trigger description.
7 reference files:
- `GOTCHAS.md` — 14 common mistakes with symptoms + fixes
- `CRITICAL-interfaces.md` — exact API signatures + the `models:/catalog.schema.model@alias` URI format
- `patterns-experiment-setup.md` — UC volume `artifact_location` (required in UC-enforced workspaces)
- `patterns-training.md` — logging with `signature` + `input_example`, `sklearn.Pipeline` wrapping, autologging
- `patterns-uc-registration.md` — three-level names, `@champion` / `@challenger` aliases, verification via `DESCRIBE MODEL`, A/B promotion
- `patterns-batch-inference.md` — notebook `pyfunc.load_model` (Tier 1), Lakeflow SDP `pyfunc.spark_udf` (Tier 2), champion-vs-challenger validation, explicit warning against `ai_query` on custom UC models (see the sketch after this list)
- `user-journeys.md` — 7 end-to-end workflows including debugging scenarios
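A minimal sketch of the two batch-inference tiers, assuming a hypothetical UC model `ml.forecasting.demand_model` with a `@champion` alias and a hypothetical features table; the Lakeflow SDP import is also an assumption:

```python
import mlflow
from pyspark.sql import functions as F

mlflow.set_registry_uri("databricks-uc")

MODEL_URI = "models:/ml.forecasting.demand_model@champion"  # hypothetical name

# Tier 1: notebook scoring on a pandas DataFrame (fine up to ~10k rows).
# features_pdf is assumed to be a pandas DataFrame of model inputs.
model = mlflow.pyfunc.load_model(MODEL_URI)
predictions = model.predict(features_pdf)

# Tier 2: distributed scoring in a Lakeflow SDP pipeline.
# Construct the UDF at module scope, NOT inside the view function,
# so the model is deserialized once rather than on every evaluation.
# `spark` is the ambient SparkSession on Databricks.
predict_udf = mlflow.pyfunc.spark_udf(spark, MODEL_URI)

from pyspark import pipelines as dp  # assumed Lakeflow SDP import

@dp.materialized_view
def scored_features():
    return (
        spark.read.table("ml.forecasting.features")  # hypothetical table
        .withColumn("prediction", predict_udf(F.struct("*")))
    )
```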
Key gotchas this skill teaches that other guides miss

- `artifact_location` on experiment creation — DBFS root is rejected in UC-enforced workspaces. Every `log_model` call fails with opaque errors until `artifact_location` points at a UC volume.
- `mlflow.set_registry_uri('databricks-uc')` — without this, `register_model` silently routes to the legacy workspace registry. The #1 "my model isn't showing up in Catalog Explorer" support question.
- `ai_query` on custom UC models — doesn't work. Requires a serving endpoint. The correct primitive is `mlflow.pyfunc.load_model` (notebook) or `mlflow.pyfunc.spark_udf` (Lakeflow).
- `@champion` / `@challenger` aliases — replace deprecated `transition_model_version_stage()` stages. The legacy API still exists but is a no-op on UC-registered models (no error, no effect).
- `mlflow.pyfunc.spark_udf` in Lakeflow SDP — must be constructed at module scope, not inside `@dp.materialized_view`. Otherwise deserialization repeats on every pipeline evaluation.
- `pip install 'mlflow[databricks]'` — required for UC registration outside Databricks clusters. Plain `pip install mlflow` omits the cloud-storage SDKs (azure-core / boto3 / google.cloud) MLflow needs to stage UC artifacts. Clusters ship the extras pre-installed.
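A minimal end-to-end sketch of the log → register → alias pattern these gotchas converge on, assuming hypothetical names (`ml.forecasting.demand_model`, pandas `X_train` / `y_train`); parameter spellings follow MLflow 3.x:

```python
import mlflow
from mlflow import MlflowClient
from mlflow.models import infer_signature
from sklearn.ensemble import GradientBoostingRegressor

# Route the registry to Unity Catalog, or register_model silently
# targets the legacy workspace registry.
mlflow.set_registry_uri("databricks-uc")

model = GradientBoostingRegressor().fit(X_train, y_train)
signature = infer_signature(X_train, model.predict(X_train))

with mlflow.start_run():
    info = mlflow.sklearn.log_model(
        model,
        name="model",            # MLflow 3.x; 2.x uses artifact_path=
        signature=signature,
        input_example=X_train.head(5),
    )

# Three-level UC name: catalog.schema.model (hypothetical here)
mv = mlflow.register_model(info.model_uri, "ml.forecasting.demand_model")

# Aliases replace stage transitions on UC models.
MlflowClient().set_registered_model_alias(
    "ml.forecasting.demand_model", "champion", mv.version
)
```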
Testing

Field-tested end-to-end against a live Databricks workspace:
- Trained a `GradientBoostingRegressor` and registered it to Unity Catalog
- Set the `@champion` alias — verified in Catalog Explorer UI
- Ran batch inference via `mlflow.pyfunc.load_model` — predictions within ~2% of actuals
- Surfaced two new gotchas (`mlflow[databricks]` install + `artifact_path` deprecation) and added them to GOTCHAS.md

Runtime verified: MLflow 3.11 on Lakeflow SDP serverless compute v5 (current default). Patterns compatible with MLflow 2.16+ — pairs on older classic DBRs still get correct behaviour. 2.x/3.x divergences called out in GOTCHAS.md (e.g., `artifact_path` → `name=`).
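To make that divergence concrete, a minimal hedged sketch (the toy model is illustrative only):

```python
import mlflow
from sklearn.linear_model import LinearRegression

model = LinearRegression().fit([[0.0], [1.0]], [0.0, 1.0])

with mlflow.start_run():
    # MLflow 3.x spelling
    mlflow.sklearn.log_model(model, name="model")

with mlflow.start_run():
    # MLflow 2.x spelling of the same call; on 3.x this still works
    # but emits a deprecation warning on every invocation
    mlflow.sklearn.log_model(model, artifact_path="model")
```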
Structure parity

File layout matches `databricks-mlflow-evaluation` (same `SKILL.md` + `references/` + `GOTCHAS.md` + `CRITICAL-interfaces.md` + `patterns-*.md` convention). Installable via the existing `install_skills.sh`.
Not in scope

- Real-time serving endpoints (`databricks-model-serving` covers that)
- GenAI agent evaluation (`databricks-mlflow-evaluation` covers that)
- UC governance and grants (`databricks-unity-catalog` covers those)

Deliberately narrow — classic ML + UC registration + batch inference only.
Origin
Built to fill a gap encountered during the Coles Vibe Workshop (airgapped Databricks field-engineer hackathon). DS pairs needed UC-scoped MLflow guidance that wasn't covered by any existing skill. Content battle-tested in the workshop before being contributed upstream.