
Commit cd545a2

feat: zero-config auth — interactive PAT setup, auto-rotation, all CLIs (#81, #83) (#82)
* feat: add PATRotator for short-lived token auto-rotation (#81)
  Mints a new 20 PAT every 10 minutes, persists to env var, ~/.databrickscfg, then revokes the old PAT. Includes 19 tests covering rotation logic, lifecycle, and logging.
* feat: wire PATRotator into app startup (#81)
* chore: add secret resource with WRITE for PAT rotation (#81)
* docs: PAT auto-rotation implementation plan (#81)
* fix: read PAT_ROTATION_INTERVAL and PAT_TOKEN_LIFETIME from env vars (#81)
* chore: set PAT rotation interval to 2 minutes for testing (#81)
* fix: set default rotation to 2 min, remove config from app.yaml (#81)
* fix: clearer rotation log messages - INFO: starting/complete (#81)
* fix: use pat_rotator.py defaults (120s), remove env var overrides from app.py (#81)
* chore: set PAT rotation to 5 min interval, 10 min lifetime (#81)
* feat: resolve owner via SP + Apps API (app.creator), preserve SP credentials (#81)
  Owner resolution no longer depends on PAT. Uses the auto-provisioned SP to call w.apps.get().creator and matches against X-Forwarded-Email. Falls back to PAT-based resolution for backward compat.
* feat: 10-min rotation with 15-min lifetime, ensure_fresh() on session create (#81)
  - Rotation interval: 10 min (less API churn overnight)
  - Token lifetime: 15 min (5-min overlap buffer)
  - ensure_fresh() called on session creation: if token age > 8 min, rotate immediately so user never starts with a stale token
* feat: session-aware rotation - skip when no active sessions (#81)
  - Remove ensure_fresh() (unnecessary with overlap buffer)
  - Rotation only fires when active sessions exist
  - No sessions = no API churn (no pointless token minting overnight)
  - 10-min rotation interval, 15-min token lifetime (5-min overlap)
  - Pass session_count_fn to PATRotator for decoupled session awareness
* feat: interactive PAT setup - /api/pat-status + /api/configure-pat endpoints (#83)
* feat: terminal prompts for PAT on first session, remove DATABRICKS_TOKEN from app.yaml (#83)
* refactor: remove secret scope persistence from PATRotator (#83)
* chore: remove secret scope config - PAT prompt handles restarts (#83)
* fix: make setup_claude.py token-optional - install CLI without PAT (#83)
* feat: configure Claude CLI auth after interactive PAT setup (#83)
* fix: all setup scripts install CLI without token, skip config until PAT (#83)
* feat: configure all CLIs (Claude, Codex, OpenCode, Gemini, Databricks) after PAT setup (#83)
* fix: add missing lock to heartbeat test fixture
* docs: update README and deployment guide for zero-config auth (#83)
* fix: strip SP creds after owner resolution, move setup to after PAT (#83)
* fix: show PAT prompt instead of snake game when setup hasn't started
  The index route was treating "pending" (setup not started yet, waiting for PAT) the same as "running" (setup actively in progress), causing the loading/snake page to appear immediately instead of index.html with the PAT prompt. Now only shows loading.html when setup is actively running.
* chore: remove snake game loading page - setup waits inline in terminal
  With deferred setup (runs after PAT, not at boot), the wait happens inside the terminal via polling. The loading.html snake game page was unreachable dead code.
* fix: immediately mint controlled token on PAT configure
  When the user pastes a PAT, immediately rotate it into a short-lived token we own (with a known token ID). This ensures the first background rotation can revoke the old token instead of logging "no old token to revoke." The user-pasted PAT becomes unused after the initial mint.
* feat: track rotation time, fast-path expired token detection
  Add _last_rotation_time to PATRotator, set on every successful mint. pat-status now checks is_token_expired first: if the token lifetime has elapsed (no rotation while sessions were idle), it immediately returns valid: false to show the PAT prompt. Avoids a wasted API call to validate a known-dead token.
* feat: persist app state to ~/.coda/app_state.json
  Adds app_state.py, a shared JSON file at ~/.coda/app_state.json that holds app_owner (set at boot) and last_rotation_time/last_token_id (set on every rotation). Admins can inspect this file for diagnostics. The rotator loads last_rotation_time on init so is_token_expired works across app restarts.
* feat: wire app_state.json - owner at boot, rotation every 10 min
  Clean up app_state wiring:
  - Move import app_state to top-level in app.py
  - pat_rotator writes to app_state.json on every rotation (not just initial mint), keeping on-disk state current for admin inspection
  - Add /api/app-state endpoint for admin diagnostics
  - Remove duplicate write from configure_pat (rotator handles it)
  - Add 8 tests for app_state round-trip, merge, permissions, corruption
* chore: pin all package versions in requirements.txt
  Pin mlflow to 3.10.1 and all other packages to their current resolved versions for reproducible deploys.
* chore: simplify rotation log message to 'CLI updated'
* fix: bump pyasn1→0.6.3, pyjwt→2.12.1; ignore 3 unfixable CVEs
  - pyasn1 0.6.3 fixes GHSA-jr27-m4p2-rc6r (DoS via recursive decoding)
  - pyjwt 2.12.1 fixes GHSA-752w-5fwx-jx9f (crit header bypass)
  - cryptography 46.0.6, requests 2.33.0, pygments fix: not released yet, ignoring in audit until available
* fix: upgrade requests to 2.33.0 from GitHub (GHSA-gc5v-m9x4-r6x2)
  requests 2.33.0 fixes the predictable temp file extraction vuln but hasn't landed on PyPI yet. Pin to the v2.33.0 tag commit SHA from GitHub. Remove the audit ignore for this CVE.
* chore: replace mlflow[genai] with mlflow-tracing - drops pygments CVE
  We only use mlflow.claude_code.hooks (in mlflow-tracing). The [genai] extra pulled in litellm → tokenizers → huggingface-hub → typer → rich → pygments (GHSA-5239-wwwm-4pmq, no fix). Switching to mlflow-tracing drops ~40 transitive deps including pygments, scikit-learn, scipy.
* fix: eliminate cryptography CVE - google-auth 2.47.0 drops the dep
  Constrained cryptography>=46.0.6 which resolved by downgrading google-auth to 2.47.0 (no cryptography dependency). Removes the last --ignore-vuln from the audit workflow. All 5 CVEs now resolved.
* fix: upgrade mcp 1.19.0→1.26.0 (GHSA-9h52-p55h-vw2f DNS rebinding)
* fix: re-eliminate cryptography dep after mcp upgrade reintroduced it
* fix: upgrade mcp→1.26.0, ignore cryptography until 46.0.6 hits PyPI
  mcp and cryptography conflict: mcp>=1.23 needs google-auth which needs cryptography, but 46.0.6 isn't on PyPI yet. Prioritize mcp fix (DNS rebinding) over cryptography (low-impact X.509 name constraints). Weekly audit will catch when 46.0.6 lands.
* fix: upgrade cryptography to 46.0.6 from GitHub (GHSA-m959-cc7f-wv43)
  Install cryptography 46.0.6 from GitHub tag (not on PyPI proxy yet). Zero --ignore-vuln flags remaining, all 6 CVEs resolved.
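
The rotation rules described in the commit message (10-minute interval, 15-minute lifetime, no rotation without active sessions, expiry detected from the last rotation time) can be illustrated with a small sketch. This is not the repository's `PATRotator`; the class name `RotationPolicy` and both method signatures are hypothetical, and the choice to report "not expired" when no rotation time is known is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

ROTATION_INTERVAL_S = 10 * 60  # commit message: rotate every 10 minutes
TOKEN_LIFETIME_S = 15 * 60     # commit message: 15-minute lifetime (5-minute overlap)


@dataclass
class RotationPolicy:
    """Hypothetical helper that only decides *when* to rotate; it mints nothing."""

    session_count_fn: Callable[[], int]
    last_rotation_time: Optional[float] = None

    def should_rotate(self, now: float) -> bool:
        # Session-aware: with no active sessions, skip rotation entirely.
        if self.session_count_fn() == 0:
            return False
        # No rotation recorded yet: rotate right away.
        if self.last_rotation_time is None:
            return True
        return now - self.last_rotation_time >= ROTATION_INTERVAL_S

    def is_token_expired(self, now: float) -> bool:
        # Assumption: without a recorded mint time we cannot claim expiry.
        if self.last_rotation_time is None:
            return False
        return now - self.last_rotation_time >= TOKEN_LIFETIME_S
```

With these numbers a token minted at time 0 is replaced at 600 s if a session is open and is treated as dead at 900 s if no rotation happened in between, which is the case where the PAT prompt is shown again.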
1 parent 1bb60cc commit cd545a2

20 files changed

Lines changed: 1782 additions & 3262 deletions

README.md

Lines changed: 7 additions & 7 deletions
@@ -27,7 +27,7 @@

 🟢 **OpenCode** - Open-source agent with multi-provider support

-Every agent starts **pre-wired to your Databricks AI Gateway** - models, auth tokens, and base URLs are all configured at boot. No API keys to manage.
+Every agent installs at boot and connects to your **Databricks AI Gateway** - on first terminal session, paste a short-lived PAT and all CLIs are configured automatically. Token auto-rotates every 10 minutes.

 ---

@@ -63,7 +63,7 @@ This isn't just a terminal in the cloud. Running coding agents on Databricks giv
 | 🐍 **Loading Screen** | Play snake while setup steps run in parallel |
 | 🔄 **Workspace Sync** | Every `git commit` auto-syncs to `/Workspace/Users/{you}/projects/` |
 | ✏️ **Micro Editor** | Modern terminal editor, pre-installed |
-| ⚙️ **Databricks CLI** | Pre-configured with your PAT, ready to go |
+| ⚙️ **Databricks CLI** | Installed at boot, configured interactively on first session |
 | 📊 **MLflow Tracing** | Every Claude Code session is automatically traced to your Databricks MLflow experiment |

 ---
@@ -136,10 +136,10 @@ Tracing is skipped gracefully if `APP_OWNER` is not set (e.g., local dev without
 1. Click [**Use this template**](https://github.com/datasciencemonkey/coding-agents-databricks-apps/generate) to create your own repo
 2. Go to **Databricks → Apps → Create App**
 3. Choose **Custom App** and connect your new repo
-4. Add your PAT as the `DATABRICKS_TOKEN` secret in **App Resources**
-5. Deploy
+4. Deploy
+5. Open the app - paste a short-lived PAT when prompted on first terminal session

-That's it. Open the app URL and start coding.
+That's it. No secrets to configure, no pre-deployment setup.

 [→ Full deployment guide](docs/deployment.md) - environment variables, gateway config, and advanced options.

@@ -280,7 +280,7 @@ This template repo opens that vision up for every Databricks user — no IDE set

 | Variable | Required | Description |
 |----------|----------|-------------|
-| `DATABRICKS_TOKEN` | Yes | Your Personal Access Token (secret) |
+| `DATABRICKS_TOKEN` | No | Optional. If not set, the app prompts for a token on first session. Auto-rotated every 10 minutes |
 | `HOME` | Yes | Set to `/app/python/source_code` in app.yaml |
 | `ANTHROPIC_MODEL` | No | Claude model name (default: `databricks-claude-opus-4-6`) |
 | `CODEX_MODEL` | No | Codex model name (default: `databricks-gpt-5-2`) |
@@ -289,7 +289,7 @@ This template repo opens that vision up for every Databricks user — no IDE set

 ### Security Model

-Single-user app - each user deploys their own instance with their own PAT. Only the token owner can access the terminal. Everyone else sees 403.
+Single-user app - the owner is resolved via the app's service principal and Apps API (`app.creator`), with no PAT required at deploy time. Authorization checks `X-Forwarded-Email` against `app.creator`. On first terminal session, the user pastes a short-lived PAT interactively. Tokens auto-rotate every 10 minutes (15-minute lifetime), with old tokens proactively revoked. On restart, the user re-pastes (no persistence by design).

 ### Gunicorn
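
The Security Model paragraph in the README diff says authorization compares `X-Forwarded-Email` with `app.creator`. A minimal sketch of that comparison follows. The function name `is_owner_request` is hypothetical, the case-insensitive match is an assumption, and so is the choice to deny access when no owner is known (the diff's own log message suggests the app disables authorization in that case instead).

```python
from typing import Mapping, Optional


def is_owner_request(headers: Mapping[str, str], app_owner: Optional[str]) -> bool:
    """Hypothetical check: does the forwarded email match the resolved owner?"""
    if not app_owner:
        # Assumption for this sketch: fail closed when the owner is unknown.
        return False
    forwarded = headers.get("X-Forwarded-Email", "").strip().lower()
    if not forwarded:
        return False
    return forwarded == app_owner.strip().lower()
```

A request handler would return 403 whenever this function returns `False`.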

app.py

Lines changed: 183 additions & 17 deletions
@@ -18,8 +18,11 @@
 from collections import deque

 import tomllib
+import requests

+import app_state
 from utils import ensure_https
+from pat_rotator import PATRotator

 # Sanitize DATABRICKS_TOKEN early - the platform sometimes injects trailing
 # newlines / whitespace which causes auth failures. Cleaning it here prevents
@@ -45,6 +48,8 @@
 logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)

+# PAT auto-rotation - initialized after sessions dict is defined (see below)
+
 app = Flask(__name__, static_folder='static', static_url_path='/static')
 app.secret_key = os.urandom(24)
 app.config['MAX_CONTENT_LENGTH'] = 32 * 1024 * 1024  # 32 MB - aligned with Claude Code's 30 MB file limit
@@ -57,6 +62,12 @@
 sessions = {}
 sessions_lock = threading.Lock()

+# PAT auto-rotation (short-lived tokens, background refresh)
+# Only rotates while active sessions exist - stops when all sessions are reaped
+pat_rotator = PATRotator(
+    session_count_fn=lambda: len(sessions),
+)
+
 # SIGTERM graceful shutdown: notify clients before gunicorn stops the worker
 shutting_down = False

@@ -250,6 +261,68 @@ def _reinit_app_git():
     logger.info("Reinitialized app source git (template origin removed)")


+def _configure_all_cli_auth(token):
+    """Configure auth for ALL coding-agent CLIs after a PAT is provided.
+
+    Called from /api/configure-pat when a user supplies a PAT interactively.
+    Handles: Claude CLI (inline), Databricks CLI (via pat_rotator), and
+    Codex/OpenCode/Gemini CLIs (by re-running their setup scripts with token in env).
+    """
+    import json
+
+    home = os.environ.get("HOME", "/app/python/source_code")
+    if not home or home == "/":
+        home = "/app/python/source_code"
+
+    # 1. Configure Claude CLI (~/.claude/settings.json)
+    claude_dir = os.path.join(home, ".claude")
+    os.makedirs(claude_dir, exist_ok=True)
+
+    gateway_host = ensure_https(os.environ.get("DATABRICKS_GATEWAY_HOST", "").rstrip("/"))
+    databricks_host = ensure_https(os.environ.get("DATABRICKS_HOST", "").rstrip("/"))
+
+    if gateway_host:
+        anthropic_base_url = f"{gateway_host}/anthropic"
+    else:
+        anthropic_base_url = f"{databricks_host}/serving-endpoints/anthropic"
+
+    settings = {
+        "env": {
+            "ANTHROPIC_MODEL": os.environ.get("ANTHROPIC_MODEL", "databricks-claude-sonnet-4-6"),
+            "ANTHROPIC_BASE_URL": anthropic_base_url,
+            "ANTHROPIC_AUTH_TOKEN": token,
+            "ANTHROPIC_CUSTOM_HEADERS": "x-databricks-use-coding-agent-mode: true",
+        }
+    }
+
+    settings_path = os.path.join(claude_dir, "settings.json")
+    with open(settings_path, "w") as f:
+        json.dump(settings, f, indent=2)
+
+    logger.info(f"Claude CLI auth configured: {settings_path}")
+
+    # 2. Configure Databricks CLI (~/.databrickscfg) - already called by
+    # configure_pat() via pat_rotator, but explicit for clarity
+    pat_rotator._write_databrickscfg(token)
+    logger.info("Databricks CLI auth configured: ~/.databrickscfg")
+
+    # 3. Re-run Codex, OpenCode, Gemini setup scripts with token in env
+    # They are idempotent: detect CLI already installed, just write config files
+    env = {**os.environ, "DATABRICKS_TOKEN": token}
+    for script in ["setup_codex.py", "setup_opencode.py", "setup_gemini.py"]:
+        try:
+            result = subprocess.run(
+                ["uv", "run", "python", script],
+                env=env, capture_output=True, text=True, timeout=60
+            )
+            if result.returncode == 0:
+                logger.info(f"CLI config updated: {script}")
+            else:
+                logger.warning(f"CLI config failed: {script}: {result.stderr[:200]}")
+        except Exception as e:
+            logger.warning(f"CLI config error: {script}: {e}")
+
+
 def run_setup():
     with setup_lock:
         setup_state["status"] = "running"
@@ -301,9 +374,27 @@ def run_setup():


 def get_token_owner():
-    """Get the owner email from DATABRICKS_TOKEN at startup."""
+    """Get the owner email. Priority: Apps API (app.creator) > PAT (current_user.me).
+
+    Uses the auto-provisioned SP to call the Apps API - no PAT needed for
+    owner resolution. Falls back to PAT-based lookup for backward compat.
+    """
+    from databricks.sdk import WorkspaceClient
+
+    # 1. Try Apps API via SP credentials (no PAT needed)
+    app_name = os.environ.get("DATABRICKS_APP_NAME")
+    if app_name:
+        try:
+            w = WorkspaceClient()  # auto-detects SP credentials
+            app = w.apps.get(name=app_name)
+            owner = app.creator
+            logger.info(f"Owner resolved from app.creator: {owner}")
+            return owner
+        except Exception as e:
+            logger.warning(f"Could not resolve owner via Apps API: {e}")
+
+    # 2. Fallback: PAT-based resolution
     try:
-        from databricks.sdk import WorkspaceClient
         host = ensure_https(os.environ.get("DATABRICKS_HOST", ""))
         token = os.environ.get("DATABRICKS_TOKEN")
         if not host or not token:
@@ -611,7 +702,7 @@ def cleanup_stale_sessions():
 def authorize_request():
     """Check authorization before processing any request."""
     # Skip auth for health check, setup status, and Socket.IO (has own auth via connect event)
-    if request.path in ("/health", "/api/setup-status") or request.path.startswith("/socket.io"):
+    if request.path in ("/health", "/api/setup-status", "/api/pat-status", "/api/configure-pat", "/api/app-state") or request.path.startswith("/socket.io"):
         return None

     authorized, user = check_authorization()
@@ -650,10 +741,6 @@ def set_security_headers(response):

 @app.route("/")
 def index():
-    with setup_lock:
-        status = setup_state["status"]
-    if status in ("pending", "running"):
-        return send_from_directory("static", "loading.html")
     return send_from_directory("static", "index.html")


@@ -662,6 +749,12 @@ def get_setup_status():
     return jsonify(_get_setup_state_snapshot())


+@app.route("/api/app-state")
+def get_app_state():
+    """Admin endpoint: persisted app state (owner, last rotation)."""
+    return jsonify(app_state.get_state())
+
+
 @app.route("/health")
 def health():
     with sessions_lock:
@@ -682,6 +775,79 @@ def get_version():
     return jsonify({"version": APP_VERSION})


+@app.route("/api/pat-status")
+def pat_status():
+    """Check if a valid, usable PAT is configured."""
+    host = ensure_https(os.environ.get("DATABRICKS_HOST", ""))
+    token = os.environ.get("DATABRICKS_TOKEN", "").strip()
+
+    if not token or pat_rotator.is_token_expired:
+        # No token, or token lifetime exceeded (rotation stopped while no sessions)
+        return jsonify({"configured": False, "valid": False,
+                        "workspace_host": host})
+
+    # Validate with direct HTTP - avoids SDK auth fallback to SP
+    try:
+        resp = requests.get(f"{host}/api/2.0/preview/scim/v2/Me",
+                            headers={"Authorization": f"Bearer {token}"}, timeout=10)
+        if resp.status_code == 200:
+            user = resp.json().get("userName", "unknown")
+            return jsonify({"configured": True, "valid": True, "user": user})
+        return jsonify({"configured": True, "valid": False,
+                        "workspace_host": host})
+    except Exception:
+        return jsonify({"configured": True, "valid": False,
+                        "workspace_host": host})
+
+
+@app.route("/api/configure-pat", methods=["POST"])
+def configure_pat():
+    """Accept a user-provided PAT, validate it, and start rotation."""
+    data = request.json
+    token = data.get("token", "").strip()
+    if not token:
+        return jsonify({"error": "Token required"}), 400
+
+    # Validate the token - direct HTTP, no SDK fallback
+    host = ensure_https(os.environ.get("DATABRICKS_HOST", ""))
+    try:
+        resp = requests.get(f"{host}/api/2.0/preview/scim/v2/Me",
+                            headers={"Authorization": f"Bearer {token}"}, timeout=10)
+        if resp.status_code != 200:
+            return jsonify({"error": "Invalid token"}), 400
+        user = resp.json().get("userName", "unknown")
+    except Exception as e:
+        return jsonify({"error": f"Token validation failed: {e}"}), 400
+
+    # Immediately mint a controlled short-lived token from the user-pasted PAT.
+    # This gives us a token ID we own - all future rotations can revoke the old one.
+    # The user-pasted PAT becomes unused after this (expires per its own lifetime).
+    os.environ["DATABRICKS_TOKEN"] = token
+    pat_rotator._current_token = token
+    pat_rotator._current_token_id = None
+    rotated = pat_rotator._rotate_once()
+    if rotated:
+        token = pat_rotator.token  # use the newly minted token from here on
+    else:
+        # Rotation failed - fall back to user-pasted token (still valid)
+        pat_rotator._write_databrickscfg(token)
+    pat_rotator.start()
+
+    # Configure all CLI tools (Claude, Codex, OpenCode, Gemini, Databricks)
+    _configure_all_cli_auth(pat_rotator.token or token)
+
+    # Run setup now that we have a valid token (installs CLIs, configures agents)
+    # Only run if setup hasn't completed yet
+    with setup_lock:
+        if setup_state["status"] != "complete":
+            setup_thread = threading.Thread(target=run_setup, daemon=True, name="setup-thread")
+            setup_thread.start()
+            logger.info("Setup triggered after PAT configuration")
+
+    logger.info(f"PAT configured interactively by {user} - rotation started")
+    return jsonify({"status": "ok", "user": user, "message": "Token configured. Auto-rotation started."})
+
+
 @app.route("/api/session", methods=["POST"])
 def create_session():
     """Create a new terminal session."""
@@ -923,28 +1089,28 @@ def initialize_app(local_dev=False):
     if not local_dev:
         signal.signal(signal.SIGTERM, handle_sigterm)

-    # Remove OAuth credentials - force PAT auth only
-    os.environ.pop("DATABRICKS_CLIENT_ID", None)
-    os.environ.pop("DATABRICKS_CLIENT_SECRET", None)
+    # SP credentials preserved - needed for Apps API (owner resolution) and secret persistence

-    # Determine app owner from DATABRICKS_TOKEN
+    # Resolve owner: Apps API (app.creator via SP) > PAT (current_user.me)
     app_owner = get_token_owner()
     if app_owner:
-        logger.info(f"App owner (from token): {app_owner}")
+        logger.info(f"App owner: {app_owner}")
         os.environ["APP_OWNER"] = app_owner
+        app_state.set_app_owner(app_owner)
     else:
         logger.warning("Could not determine app owner - authorization disabled")

+    # Strip SP credentials - only needed for owner resolution above.
+    # Keeping them causes SDK to silently fall back to SP auth when PAT is dead.
+    os.environ.pop("DATABRICKS_CLIENT_ID", None)
+    os.environ.pop("DATABRICKS_CLIENT_SECRET", None)
+    logger.info("SP credentials stripped - PAT-only auth from this point")
+
     # Start background cleanup thread
     cleanup_thread = threading.Thread(target=cleanup_stale_sessions, daemon=True)
     cleanup_thread.start()
     logger.info(f"Started session cleanup thread (timeout={SESSION_TIMEOUT_SECONDS}s, interval={CLEANUP_INTERVAL_SECONDS}s)")

-    # Start setup in background thread - app starts immediately with loading screen
-    setup_thread = threading.Thread(target=run_setup, daemon=True, name="setup-thread")
-    setup_thread.start()
-    logger.info("Started background setup thread")
-

 if __name__ == "__main__":
     # Local dev - no SIGTERM handler (SIG_DFL), no shutting_down flag
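
The app.py diff calls `app_state.set_app_owner(...)` and `app_state.get_state()`, and the commit message describes a JSON file that is merged on write, readable by admins, and tolerant of corruption. The `app_state.py` source is not shown on this page, so the following is only a sketch of how such a module could behave. The function names `load_state` and `update_state`, the atomic write via a temporary file, and the `0o600` mode are assumptions, not the repository's code.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def load_state(path: Path) -> Dict[str, Any]:
    """Return the stored state, or {} if the file is missing or corrupt."""
    try:
        data = json.loads(path.read_text())
    except (OSError, ValueError):  # ValueError covers json.JSONDecodeError
        return {}
    return data if isinstance(data, dict) else {}


def update_state(path: Path, **fields: Any) -> Dict[str, Any]:
    """Merge fields into the stored state and rewrite the file atomically."""
    state = load_state(path)
    state.update(fields)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.chmod(tmp_name, 0o600)  # assumption: owner-only read/write
    os.replace(tmp_name, path)
    return state
```

Under this sketch, writing the owner at boot and the rotation time later would be two `update_state` calls against the same path, and the second call keeps the `app_owner` key written by the first.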

app.yaml

Lines changed: 0 additions & 3 deletions
@@ -4,15 +4,12 @@ command:
 env:
   - name: HOME
     value: /app/python/source_code
-  - name: DATABRICKS_TOKEN
-    valueFrom: DATABRICKS_TOKEN
   - name: ANTHROPIC_MODEL
     value: databricks-claude-opus-4-6
   - name: GEMINI_MODEL
     value: databricks-gemini-3-1-pro
   - name: CODEX_MODEL
     value: databricks-gpt-5-2
-  #OPTIONAL: Move to the new Databricks Gateway if you have access (recommended), otherwise it will default to the older endpoint
   - name: DATABRICKS_GATEWAY_HOST
     valueFrom: DATABRICKS_GATEWAY_HOST
   - name: CLAUDE_CODE_DISABLE_AUTO_MEMORY
