This is a minimal runnable MVP to satisfy the current cycle's GUI acceptance: a simple Flask UI that can scan a directory and display datasets with basic Python code interpretation.
- Install dependencies (prefer a virtualenv):
python3 -m venv .venv
# bash/zsh:
source .venv/bin/activate
# fish:
source .venv/bin/activate.fish
pip install -r requirements.txt
- Initialize environment variables (recommended):
- bash/zsh:
# Export variables for this shell
source scripts/init_env.sh
# Optionally write a .env file for tooling
source scripts/init_env.sh --write-dotenv
- fish shell:
# Export variables for this shell
source scripts/init_env.fish
# Optionally write a .env file for tooling
source scripts/init_env.fish --write-dotenv
- Run the server:
scidk-serve
# or
python3 -m scidk.app
- Open the UI in your browser:
- Use the "Scan Files" form to scan a directory (e.g., this repository root). Python files will be interpreted to show imports, functions, and classes.
Note: The scanner prefers NCDU for fast filesystem enumeration when available. Install NCDU via your OS package manager (e.g., brew install ncdu on macOS or sudo apt-get install ncdu on Debian/Ubuntu). If NCDU is not installed, the app falls back to Python traversal.
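The preference-then-fallback behavior can be sketched roughly like this (a minimal illustration, not the actual scanner code; `pick_enumerator` and `enumerate_python` are hypothetical names):

```python
import os
import shutil

def pick_enumerator(candidates=("ncdu", "gdu")):
    """Return the first fast enumeration tool found on PATH, or None."""
    for tool in candidates:
        if shutil.which(tool):
            return tool
    return None

def enumerate_python(root):
    """Plain-Python fallback: walk the tree and yield file paths."""
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)
```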
- Editable install error (Multiple top-level packages discovered): We ship setuptools config to include only the scidk package. If you previously had this error, pull latest and try again:
pip install -e .
- Shell errors when initializing env: Use the script matching your shell (init_env.sh for bash/zsh, init_env.fish for fish). Avoid running sh scripts/init_env.sh; instead, source it.
These tests run in a real browser using Playwright and pytest. The test suite automatically starts the Flask app on port 5001 with safe defaults and no external Neo4j connection.
Prereqs (once per machine):
- Python virtual environment activated.
- Install dev dependencies and Playwright browsers.
Commands:
# Install dev deps (if not yet installed)
pip install -e .[dev]
# Install Playwright browsers (Chromium, Firefox, WebKit)
make e2e-install-browsers # or: python -m playwright install --with-deps
# Run headless E2E tests
make e2e # or: pytest -m e2e tests/e2e -q
# Run headed with inspector (debug mode)
make e2e-headed # or: PLAYWRIGHT_HEADLESS=0 PWDEBUG=1 pytest -m e2e tests/e2e -q
# Parallel execution (if pytest-xdist is installed; falls back to serial)
make e2e-parallel
Notes:
- The E2E test fixture sets:
- SCIDK_PORT=5001
- NEO4J_AUTH=none
- SCIDK_PROVIDERS=local_fs
- SCIDK_DB_PATH=sqlite:///:memory:
- Ensure port 5001 is free before running, or adjust the fixture if needed.
- For verbose logs during a failing test, use:
make e2e-debug or pytest -m e2e -vv -s.
- Testing default: The testing Neo4j database uses password neo4jiscool. Set this in the app Settings or via environment.
- Choose your password before first start by setting NEO4J_AUTH in .env or your shell (example uses testing default):
- echo "NEO4J_AUTH=neo4j/neo4jiscool" >> .env
- docker compose -f docker-compose.neo4j.yml up -d
- Change an existing password:
- With container: scripts/neo4j_set_password.sh 'NewPass123!' --container scidk-neo4j --current 'neo4jiscool'
- Local cypher-shell: scripts/neo4j_set_password.sh 'NewPass123!' --host bolt://localhost:7687 --user neo4j --current 'neo4jiscool'
- More details in dev/ops/deployment-neo4j.md (includes direct cypher-shell commands).
- If you see "The client is unauthorized due to authentication failure" when committing:
- Ensure Settings → Neo4j has User=neo4j and Password=neo4jiscool (or your actual DB password).
- Or set env NEO4J_AUTH=neo4j/neo4jiscool before starting the app, or update .env and restart.
- Confirm Browser and app use the same DB: SHOW DATABASES; and :use neo4j.
- Retry Commit to Graph.
- Endpoints to be implemented next:
- GET /api/rocrate — Generate minimal RO-Crate JSON-LD for a selected directory (depth=1, capped)
- GET /files — Stream file bytes for viewer previews/downloads
- See dev/features/ui/feature-rocrate-viewer-embedding.md for contracts and UI integration plan.
- Feature flag: set SCIDK_PROVIDERS to a comma-separated list (default: local_fs,mounted_fs)
- GET /api/providers → [{ id, display_name, capabilities, auth }]
- GET /api/provider_roots?provider_id=local_fs → list available roots/drives for the provider
- GET /api/browse?provider_id=local_fs&root_id=/&path=/home/user → { entries: [ { id, name, type, size, mtime, provider_id } ] }
- POST /api/scan { provider_id, root_id?, path, recursive? } → starts a scan using the provided path
- For provider_id=rclone: MVP supports a metadata-only scan (records the session without local file enumeration).
- Legacy: provider_id omitted defaults to local_fs and scans the local filesystem path
- POST /api/scan {"path": "/path", "recursive": true}
- GET /api/datasets
- GET /api/datasets/&lt;id&gt;
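As a quick sanity check, the legacy scan endpoints above can be exercised with a small stdlib client. This is a sketch: the base URL/port are assumptions for a locally running app, and the helper names are illustrative.

```python
import json
import urllib.request

BASE = "http://127.0.0.1:5000"  # assumed local address; adjust to your setup

def scan_payload(path, recursive=True):
    """Body for POST /api/scan in the legacy (local_fs) form."""
    return {"path": path, "recursive": recursive}

def post_json(url, body):
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def scan_and_list(path):
    """Trigger a synchronous scan, then fetch the resulting datasets."""
    post_json(f"{BASE}/api/scan", scan_payload(path))
    with urllib.request.urlopen(f"{BASE}/api/datasets") as resp:
        return json.load(resp)
```

Call `scan_and_list(".")` against a running server to see the datasets produced by the scan.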
Rclone provider (optional):
- Enable by installing rclone and setting SCIDK_PROVIDERS=local_fs,mounted_fs,rclone (or include rclone among others).
- UI: See docs/rclone/quickstart.md (README-ready snippet: dev/features/providers/README-snippet-rclone.md).
- API:
- GET /api/providers will include { id: "rclone", ... } when enabled.
- GET /api/provider_roots?provider_id=rclone lists rclone remotes (uses rclone listremotes).
- GET /api/browse?provider_id=rclone&root_id=:&path=: lists entries via rclone lsjson.
- Optional browse flags: recursive=true|false (default false), max_depth=1..N (default 1), fast_list=true|false (default false). The provider retries automatically without fast-list if unsupported by a backend.
- POST /api/scan { provider_id: 'rclone', path: 'remote:path' } records a metadata-only scan.
- If rclone is not installed or a remote is misconfigured, API returns a clear error message with HTTP 500 and {"error": "..."}.
- Optional FUSE mount flow with safe defaults: see docs/rclone/mount-examples.md and dev/ops/rclone/systemd/rclone-mount@.service.
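For the optional browse flags above, a small helper that assembles the query string can be handy. This is a hedged sketch: `browse_url` and `list_remote` are illustrative names, and the server address is whatever your deployment uses.

```python
import json
import urllib.parse
import urllib.request

def browse_url(base, remote, path, recursive=False, max_depth=1, fast_list=False):
    """Build a GET /api/browse URL for the rclone provider with optional flags."""
    query = urllib.parse.urlencode({
        "provider_id": "rclone",
        "root_id": remote,
        "path": path,
        "recursive": str(recursive).lower(),
        "max_depth": max_depth,
        "fast_list": str(fast_list).lower(),
    })
    return f"{base}/api/browse?{query}"

def list_remote(base, remote, path, **flags):
    """Fetch entries from a running SciDK server; an HTTP 500 raises here."""
    with urllib.request.urlopen(browse_url(base, remote, path, **flags)) as resp:
        return json.load(resp)["entries"]
```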
- Enable the feature: set SCIDK_RCLONE_MOUNTS=1 (or SCIDK_FEATURE_RCLONE_MOUNTS=1). When enabled, the rclone provider is auto-enabled for remote validation even if not listed in SCIDK_PROVIDERS.
- UI: Settings → Rclone Mounts section appears. Create a mount by entering remote, optional subpath, a name, and submit (read-only by default).
- Safety: Mountpoints are restricted under ./data/mounts/&lt;name&gt;; remotes are validated against rclone listremotes output.
- Endpoints (enabled only when the feature flag is set):
  - GET /api/rclone/mounts — list managed mounts
  - POST /api/rclone/mounts with JSON { remote, subpath, name, read_only } — starts rclone mount targeting ./data/mounts/&lt;name&gt;
  - DELETE /api/rclone/mounts/&lt;id&gt; — unmounts and stops the process
  - GET /api/rclone/mounts/&lt;id&gt;/logs?tail=N — returns last N log lines
  - GET /api/rclone/mounts/&lt;id&gt;/health — checks process alive and that the path is listable
- Requirements: rclone must be installed and on PATH. Works on Linux/macOS. On Windows, use rclone cmount with WinFsp; current UI targets Linux/macOS primarily.
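A minimal client for the mounts endpoints might look like the following sketch (stdlib only; helper names are illustrative, and the server address is an assumption):

```python
import json
import urllib.request

def mount_body(remote, name, subpath="", read_only=True):
    """JSON body for POST /api/rclone/mounts; read-only by default for safety."""
    return {"remote": remote, "subpath": subpath, "name": name, "read_only": read_only}

def create_mount(base, remote, name, **kwargs):
    """Start a managed mount under ./data/mounts/<name> on a running server."""
    req = urllib.request.Request(
        f"{base}/api/rclone/mounts",
        data=json.dumps(mount_body(remote, name, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def mount_health(base, mount_id):
    """GET /api/rclone/mounts/<id>/health for a managed mount."""
    with urllib.request.urlopen(f"{base}/api/rclone/mounts/{mount_id}/health") as resp:
        return json.load(resp)
```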
- POST /api/tasks { type: 'scan', path, recursive? } → { task_id }
- GET /api/tasks → list all tasks (most recent first)
- GET /api/tasks/<task_id> → details including status, progress, and scan_id when completed
- POST /api/tasks { type: 'commit', scan_id } → start a background commit to graph (in-memory + optional Neo4j)
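The enqueue-then-poll flow for background tasks can be wrapped like this (a sketch: the non-terminal status names 'queued'/'running' are an assumption, so check your deployment's actual values):

```python
import json
import time
import urllib.request

def _get_json(url):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def wait_for_task(base, task_id, poll=1.0, timeout=300, get_json=_get_json):
    """Poll GET /api/tasks/<task_id> until the task leaves a running state.

    'queued'/'running' as the non-terminal statuses is an assumption.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        task = get_json(f"{base}/api/tasks/{task_id}")
        if task.get("status") not in ("queued", "running"):
            return task
        time.sleep(poll)
    raise TimeoutError(f"task {task_id} did not finish in {timeout}s")
```

The injectable `get_json` parameter keeps the polling logic testable without a live server.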
- Enable rclone provider: export SCIDK_PROVIDERS="local_fs,mounted_fs,rclone".
- SQLite path index is created at SCIDK_DB_PATH (default ~/.scidk/db/files.db) and uses WAL mode.
- Trigger a scan via HTTP:
- POST /api/scans with JSON {"provider_id":"rclone","root_id":"remote:","path":"remote:bucket","recursive":false,"fast_list":true}
- Check progress/status:
- GET /api/scans/&lt;scan_id&gt;/status → { status, file_count, folder_count, ingested_rows, by_ext, ... }
- Browse the scan snapshot (virtual root):
- GET /api/scans/&lt;scan_id&gt;/fs
- Notes:
  - Wrapper uses rclone lsjson with --recursive or --max-depth 1, and optional --fast-list.
  - Batch insert to SQLite in 10k rows/transaction; rows include both files and folders.
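The rclone-backed scan flow above can be sketched as a payload builder plus a status fetch (illustrative names; the per-scan status path follows the documented endpoint shape):

```python
import json
import urllib.request

def rclone_scan_body(remote, path, recursive=False, fast_list=True):
    """JSON body for POST /api/scans against an rclone remote."""
    return {
        "provider_id": "rclone",
        "root_id": remote,
        "path": path,
        "recursive": recursive,
        "fast_list": fast_list,
    }

def scan_status_url(base, scan_id):
    """URL for the per-scan status endpoint."""
    return f"{base}/api/scans/{scan_id}/status"

def fetch_status(base, scan_id):
    """Fetch progress counters from a running server."""
    with urllib.request.urlopen(scan_status_url(base, scan_id)) as resp:
        return json.load(resp)
```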
We use pytest for unit and API tests.
Pytest is included in requirements.txt; after installing dependencies, run:
python3 -m pytest -q
Notes:
- If your shell doesn't expose a global pytest command (common in fish), using python3 -m pytest is the most reliable.
- You can still run pytest -q if your PATH includes the virtualenv's bin directory.
Conventions:
- Tests live in tests/ and rely on pytest fixtures in tests/conftest.py (e.g., Flask app and client).
- Add tests alongside new features in future cycles; see dev/cycles.md for cycle protocol.
- This MVP uses an in-memory graph; data resets on restart.
- Neo4j deployment docs reside in dev/ops/deployment-neo4j.md, but Neo4j is not yet wired in the MVP code.
- Delivery cycles and planning protocol: dev/cycles.md
- RO-Crate Viewer embedding plan (Crate-O): dev/features/ui/feature-rocrate-viewer-embedding.md
- Describo integration (product vision): dev/vision/describo-integration.md
- Current options:
- Synchronous: POST /api/scan runs immediately and returns when complete.
- Background: POST /api/tasks with { type: 'scan', path, recursive } enqueues a background scan and returns { task_id }. Poll GET /api/tasks/&lt;task_id&gt; for status/progress; GET /api/tasks lists recent tasks.
- Progress: For Python traversal, progress reports files processed vs. total; percent reflects processed/total when determinable.
- Enumeration: Scanning prefers ncdu or gdu when installed; otherwise falls back to Python traversal.
- Future: When reliable streaming from ncdu/gdu is in place, percent will be computed from streamed JSON for better fidelity.
- On /map you can:
- Switch layouts (Force/breadthfirst/manual) and Save/Load positions.
- Adjust Node size, Edge width, and Label font via UI sliders; enable High-contrast labels for readability.
- Download the current schema as CSV.
- Preview and download instances for File, Folder, and Scan labels as CSV (XLSX if openpyxl is installed).
- Status: The app ships with docker-compose.neo4j.yml to run a local Neo4j, but the Flask app currently uses an in-memory graph.
- Next steps to enable Neo4j writes/reads:
- Add a GraphAdapter interface and a Neo4jAdapter implementing upsert_dataset, add_interpretation, commit_scan, schema_triples.
- Add config/feature flag (e.g., SCIDK_GRAPH_BACKEND=neo4j) to switch adapters.
- Map current in-memory structures to Neo4j schema: (:File), (:Folder), (:Scan) nodes and CONTAINS, INTERPRETED_AS, SCANNED_IN relationships.
- Use Cypher or APOC to compute schema triples for /api/graph/schema.
- Until then, data is not persisted to Neo4j. Use the CSV exports or the in-memory map for the demo.
Neo4j-backed schema (optional; in addition to the default in-memory /api/graph/schema):
- GET /api/graph/schema.neo4j — Uses Cypher to return nodes and unique relationship triples with counts.
- GET /api/graph/schema.apoc — Uses APOC (apoc.meta.data and apoc.meta.stats). Returns 502 if APOC procedures are not available.
Environment variables required for Neo4j schema endpoints:
- NEO4J_URI (e.g., bolt://localhost:7687)
- NEO4J_USER
- NEO4J_PASSWORD
- SCIDK_NEO4J_DATABASE (optional; defaults to the driver/session default)
Notes:
- If the neo4j Python driver is not installed, these endpoints return 501 with an explanatory error.
- If Neo4j or credentials are not configured, these endpoints return 501. The app’s default in-memory /api/graph/schema remains fully functional.
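A client probing the schema endpoints should handle those 501/502 responses explicitly. A stdlib sketch (helper names are illustrative):

```python
import json
import urllib.error
import urllib.request

def schema_url(base, variant=""):
    """URL for /api/graph/schema, optionally the .neo4j or .apoc variant."""
    suffix = f".{variant}" if variant else ""
    return f"{base}/api/graph/schema{suffix}"

def fetch_schema(base, variant=""):
    """Return (status, payload).

    501 means the driver or credentials are missing; 502 means APOC
    procedures are unavailable (for the .apoc variant).
    """
    try:
        with urllib.request.urlopen(schema_url(base, variant)) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        return err.code, json.loads(err.read() or b"{}")
```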
Additional Instance export formats:
- GET /api/graph/instances.pkl?label=&lt;label&gt; — Python pickle of the rows (application/octet-stream).
- GET /api/graph/instances.arrow?label=&lt;label&gt; — Arrow IPC stream (requires pyarrow; returns 501 otherwise).
- Existing:
  - GET /api/graph/instances.csv?label=&lt;label&gt;
  - GET /api/graph/instances.xlsx?label=&lt;label&gt; (requires openpyxl; returns 501 otherwise)
Map page tweak:
- The Instances selector now defaults to the Scan label for a more demo-friendly starting point.
- Commit now writes both File→SCANNED_IN→Scan and Folder→SCANNED_IN→Scan for recursive and non-recursive scans.
- Cypher simplified: MERGE Scan once; two independent subqueries for files and standalone folders; proper WITH scoping and unique return aliases.
- Post-commit verification runs automatically; Files page Background tasks shows: attempted, prepared, verify ok/fail with counts.
- Neo4j configuration UX: Settings Save no longer clears password on empty; added explicit Clear Password; supports NEO4J_AUTH=none and URI-embedded creds; backoff after repeated auth failures.
- Tests: added mocked-neo4j unit test for commit and verification; password persistence test.
The development CLI helps agents and humans navigate common workflows. It is self-describing and supports machine-friendly outputs.
Key features:
- argparse-based subcommands with consistent help
- meta-commands for discovery: menu and introspect
- global flags: --json, --explain, --dry-run
- JSON envelope for all command outputs (when --json is used)
JSON envelope schema:
{
  "status": "ok|error",
  "command": "&lt;name&gt;",
  "data": {},
  "plan": {},
  "warnings": []
}
Quick start:
- List commands (human): python3 -m dev.cli menu
- List commands (JSON): python3 -m dev.cli menu --json
- Full introspection (metadata for agents): python3 -m dev.cli introspect or python3 -m dev.cli --json introspect
Global flags (place before the subcommand unless noted):
- --json: Emit structured envelope for agents
- --explain: Show what would happen (no side-effects)
- --dry-run: Simulate without side-effects (returns a plan)
Available commands:
ready-queue
- Summary: Show ready tasks sorted by RICE (DoR true)
- Examples:
  - python -m dev.cli ready-queue
  - python -m dev.cli --json ready-queue
start [task_id]
- Summary: Validate DoR, create/switch branch, print context
- Behavior: If task_id not provided, auto-picks top Ready task
- Examples:
  - python -m dev.cli start story:foo:bar
  - python -m dev.cli --explain --json start (plan only, no side effects)
context &lt;task_id&gt;
- Summary: Emit AI context for a task
- Examples:
  - python -m dev.cli context story:foo:bar
  - python -m dev.cli --json context story:foo:bar
validate &lt;task_id&gt;
- Summary: Validate Definition of Ready (DoR)
- Output (JSON): { ok: bool, missing: [fields] }
- Examples:
  - python -m dev.cli validate story:foo:bar
  - python -m dev.cli --json validate story:foo:bar
complete &lt;task_id&gt;
- Summary: Run tests, print DoD checklist, and next steps
- Examples:
  - python -m dev.cli complete story:foo:bar
  - python -m dev.cli --explain --json complete story:foo:bar
cycle-status
- Summary: Show current cycle status from dev/cycles.md
- Examples:
  - python -m dev.cli cycle-status
  - python -m dev.cli --json cycle-status
next-cycle
- Summary: Propose next cycle using top Ready tasks
- Examples:
  - python -m dev.cli next-cycle
  - python -m dev.cli --json next-cycle
merge-safety [--base &lt;branch&gt;]
- Summary: Report potentially risky deletions vs base branch
- Examples:
  - python -m dev.cli merge-safety
  - python -m dev.cli --json merge-safety
  - python -m dev.cli --json merge-safety --base origin/main
  - Plan only: python -m dev.cli --json --dry-run merge-safety --base origin/main
dev-sync [--from &lt;branch&gt;]
- Summary: Synchronize dev/ directory from a shared/base branch into the current branch
- Behavior: Restores dev/ from SCIDK_DEV_SHARED_BRANCH, SCIDK_BASE_BRANCH, or the specified --from branch
- Examples:
  - python -m dev.cli dev-sync
  - python -m dev.cli --json dev-sync
  - python -m dev.cli --json --explain dev-sync
  - python -m dev.cli --json dev-sync --from origin/main
Environment variables for dev sync:
- SCIDK_DEV_SHARED_BRANCH: Preferred source branch to pull dev/ from (e.g., origin/main or main). If unset, the CLI falls back to SCIDK_BASE_BRANCH, then origin/main, main, etc.
Notes for agents:
- Prefer menu --json for a quick navigable overview of commands.
- Use introspect to obtain full metadata about args/options, side-effects, and conventions.
- Place --json before the subcommand to ensure the envelope applies to the whole invocation, e.g., python -m dev.cli --json ready-queue.
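An agent wrapping the CLI can invoke it via subprocess and validate the envelope shape before trusting the payload. A sketch (the helper names here are illustrative; the envelope keys match the schema documented above):

```python
import json
import subprocess

REQUIRED_KEYS = {"status", "command", "data", "plan", "warnings"}

def envelope_ok(envelope):
    """True when the documented envelope keys are present and status is ok."""
    return REQUIRED_KEYS <= set(envelope) and envelope.get("status") == "ok"

def run_cli(*args):
    """Invoke the dev CLI with --json and parse the envelope from stdout."""
    proc = subprocess.run(
        ["python3", "-m", "dev.cli", "--json", *args],
        capture_output=True, text=True, check=False,
    )
    return json.loads(proc.stdout)
```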
We ship a docker-compose file that runs Neo4j 5 and exposes the built-in HTTP service on the classic port 7474, while Bolt remains on 7687. Follow Neo4j’s Docker volume guidelines for persistence (/data, /logs, /plugins, /import).
Quick start:
# Optional: set password (default is neo4j/neo4jiscool)
export NEO4J_AUTH=neo4j/neo4jiscool
# Optional: override host directories (defaults are under ./data/neo4j)
export NEO4J_HOST_DATA_DIR=${NEO4J_HOST_DATA_DIR:-./data/neo4j/data}
export NEO4J_HOST_LOGS_DIR=${NEO4J_HOST_LOGS_DIR:-./data/neo4j/logs}
export NEO4J_HOST_PLUGINS_DIR=${NEO4J_HOST_PLUGINS_DIR:-./data/neo4j/plugins}
export NEO4J_HOST_IMPORT_DIR=${NEO4J_HOST_IMPORT_DIR:-./data/neo4j/import}
# Start Neo4j in the background
docker compose -f docker-compose.neo4j.yml up -d
# Open the UI (Browser/Workspace availability depends on the server image/version)
http://localhost:7474/
Notes:
- We do NOT mount or write to system paths like /var/lib/neo4j on the host. By default we persist under the repository at ./data/neo4j, which works without root.
- You can override the host directories per environment using NEO4J_HOST_* variables shown above (use absolute or relative paths you own).
- Ports: 7474 (HTTP), 7687 (Bolt). Adjust in docker-compose.neo4j.yml if occupied.
- Volumes per Neo4j Docker docs: bind host dirs to /data, /logs, /plugins, and /import inside the container.
- Avoid mounting anything under /var/lib/neo4j inside the container. The entrypoint changes ownership in that path and can cause permission issues. Stick to /data, /logs, /plugins, and /import.
Manage lifecycle:
# Stop containers
docker compose -f docker-compose.neo4j.yml down
# Stop and remove all data (DANGER: wipes the graph)
docker compose -f docker-compose.neo4j.yml down -v
Connect SciDK to this Neo4j:
export NEO4J_URI=bolt://localhost:7687
export NEO4J_AUTH=${NEO4J_AUTH:-neo4j/neo4jiscool}
# Optional named database
echo "SCIDK_NEO4J_DATABASE=neo4j" >> .env
# Start SciDK
scidk-serve
# or
python -m scidk.app
Our CI runs all tests under Python 3.12 in three tiers using pytest markers:
- unit: fast, pure unit tests that do not touch network/DB/browser
- integration: tests that touch DB/files/HTTP without a browser
- e2e: full-browser Playwright tests
Local commands:
- make unit → pytest -m "not integration and not e2e"
- make integration → pytest -m integration
- make e2e → pytest -m e2e tests/e2e -q
- make check → runs unit, integration, and e2e sequentially
See .github/workflows/tests.yml for the CI matrix that runs each tier.
Follow these steps to verify the full test suite and automatically capture screenshots/JSON for the demo.
- Verify CI on GitHub
- Navigate to GitHub → Actions → "Tests" workflow (defined in .github/workflows/tests.yml).
- Confirm that all three matrix jobs are green:
- tier=unit
- tier=integration
- tier=e2e (installs Playwright browsers automatically)
- Click into the latest run to see logs if any job is red.
- Run all tests locally (mirrors CI)
make check
This runs unit → integration → e2e sequentially under Python 3.12.
- Capture demo screenshots and API snapshots (automated)
- Headless (recommended for CI or quick local runs):
make demo-record
- Headed with Playwright inspector (debugging):
make demo-record-headed
Artifacts are saved under dev/test-runs/last-demo by default. Override the output directory with:
DEMO_ARTIFACTS_DIR=dev/test-runs/my-demo make demo-record
Generated artifacts include:
- 01-home.png, 02-datasets-before.png, 03-datasets-after.png, 04-map.png
- api-api-health.json, api-api-scans.json, api-api-directories.json, api-api-tasks.json
- SUMMARY.json with the artifact path and timestamp
- Tag and record (optional) You can create a tag and attach the artifact folder to a GitHub Release:
git tag -a vX.Y.Z -m "Cycle demo: SQLite persistence + selective scan cache"
git push origin vX.Y.Z
# Then, on GitHub → Releases → Draft a new release → Attach the files from dev/test-runs/...
Troubleshooting:
- First-time Playwright run locally: install browsers with make e2e-install-browsers.
- Port conflicts: ensure 127.0.0.1:5001 is free; the E2E harness auto-starts the app on that port.
- Backend toggle: default is SQLite. Override with export SCIDK_STATE_BACKEND=memory before make e2e if you need the legacy path.
The app can read registry state (scans, directories, tasks, telemetry) through SQLite or in-memory structures.
- Default: SCIDK_STATE_BACKEND=sqlite
- Fallback: SCIDK_STATE_BACKEND=memory (restores legacy in-memory reads)
Set the backend via environment before starting the app:
export SCIDK_STATE_BACKEND=sqlite # or: memory
scidk-serve
Health endpoint includes SQLite details useful during migrations and troubleshooting:
- GET /api/health → { sqlite: { path, exists, journal_mode, wal_mode, schema_version, select1, error? } }
Notes:
- Auto-migrations run on boot and /api/health reports the final schema_version.
- WAL mode is enabled by default; journal_mode and wal_mode are both reported for clarity.
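A small check over the health payload can condense the sqlite block into a pass/fail summary. This is a sketch: it assumes select1 reports 1 on a healthy database and that a missing error key means no error.

```python
import json
import urllib.request

def sqlite_summary(health):
    """Condense the sqlite block of GET /api/health into a pass/fail summary.

    Assumes select1 == 1 on a healthy database; adjust if your build differs.
    """
    s = health.get("sqlite", {})
    return {
        "ok": bool(s.get("exists")) and s.get("select1") == 1 and not s.get("error"),
        "wal": str(s.get("journal_mode", "")).lower() == "wal",
        "schema_version": s.get("schema_version"),
    }

def fetch_health(base):
    """Fetch the health document from a running server."""
    with urllib.request.urlopen(f"{base}/api/health") as resp:
        return json.load(resp)
```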