workspace import-dir: default-exclude .git, .databricks, node_modules #5118
Open
jamesbroadhead wants to merge 3 commits into main
The previous walker copied every entry under the source tree into the workspace verbatim. That has two practical consequences for users deploying Databricks Apps via `databricks workspace import-dir` followed by `databricks apps deploy`:

1. The local repo's `.git/config` (often containing the template-repo origin URL, sometimes cached credentials) ends up at `/app/python/source_code/.git/` in the running app container.
2. Local bundle cache `.databricks/` overwrites whatever the bundle pipeline put in the remote workspace.

Empirically reproduced on a probe deployment (deploy04-probe-jb on e2-dogfood.staging): the running container had a full `.git/` tree including HEAD, config, objects, refs, hooks.

CoDA (github.com/datasciencemonkey/coding-agents-databricks-apps) ships an in-app `_reinit_app_git()` to scrub this on every startup, and its CLAUDE.md warns "never move .git folder to the workspace if you're running workspace import". That workaround is the bug surface this change closes. Reported as DEPLOY-04 #2 in Tushar's "Apps Gaps That Matter to EMEA Apps" doc.

Skip is name-based and applied during the walk; if a user explicitly passes `.git` (or `.databricks`) as the source root, that root is still copied. The rule only fires on entries encountered during recursion. `.gitignore` and other dot-files at the root remain copied as before.

Co-authored-by: Isaac
Also skip `node_modules`. Same rationale as `.git` and `.databricks`: it gets uploaded by accident, is large, and is re-installed in the runtime anyway.

Co-authored-by: Isaac
Summary
`databricks workspace import-dir` walks the source tree and copies every entry into the workspace verbatim; it has no awareness of `.gitignore` or default exclusions. This change adds a name-based skip for `.git`, `.databricks`, and `node_modules` directories during the walk.

`.gitignore` and other dotfiles at the root remain copied. If a user explicitly passes `.git` (or any of the others) as the source root, that root is still copied; the skip rule applies only to entries encountered during recursion.

Motivation: align `import-dir` with `sync`'s existing defaults

`databricks sync` already hard-codes skips for the same two directories that cause the most trouble:

- `libs/git/repository.go`: the code under `// Always ignore root .git directory.` adds `.git` to the default ignore rules unconditionally.
- `libs/git/view.go` (`SetupDefaults`): `// Hard code .databricks ignore pattern so that we never sync it (irrespective of .gitignore patterns).`

So `sync` and `import-dir` currently produce different workspace contents for the same source tree: `sync` skips `.git/` and `.databricks/`, while `import-dir` copies them. This PR closes that gap for `import-dir` so the two commands behave consistently.

`node_modules` is the one entry that goes beyond what `sync` does by default. For any project with a typical `.gitignore`, `sync` would already skip it via gitignore rules; `import-dir` ignores `.gitignore` entirely, so adding it to the name-based skip list keeps the behavior aligned with what users get from `sync`.

Why this matters in practice

For users who land on `import-dir` (typically via symmetry with the documented `export-dir`) and then run `databricks apps deploy --source-code-path`:

- `.git/config` (often containing the template-repo origin URL) ends up at `/app/python/source_code/.git/` in the running app container.
- `.databricks/` overwrites whatever the bundle pipeline put in the remote workspace.
- `node_modules/` comes along too: large, slow to upload, and re-installed in the runtime anyway.

The canonical answer for the apps-deploy flow is `databricks sync` (which the official Apps docs recommend). This PR is not a substitute for that; it just brings `import-dir`'s defaults into line with `sync`'s for users who reach for it anyway.

Test plan

- `.git/` skipped, nested `.git/` skipped, `.databricks/` skipped, `node_modules/` skipped, `.gitignore` file kept, explicit `.git` root copied (escape hatch).
- `go test ./cmd/workspace/workspace/`: pass
- `golangci-lint run ./cmd/workspace/workspace/`: clean
- `TestImportDir`: unchanged, no `.git` in its testdata so behavior is identical.

This pull request and its description were written by Isaac.