Every data science project eventually becomes a mess of hardcoded paths:
```python
# Scattered across your codebase...
data = pd.read_csv("/home/user/projects/myproject/data/raw/sales.csv")
model.save("/home/user/projects/myproject/outputs/models/best_model.pt")
writer = SummaryWriter("/home/user/projects/myproject/runs/exp_001")
results.to_csv(f"/home/user/projects/myproject/outputs/{datetime.now():%Y%m%d}_results.csv")
```

This leads to:

- Brittle code that breaks when you move directories or share with collaborators
- Inconsistent organization across experiments and team members
- Lost outputs when you forget which script created which file
- Overwritten results when you re-run without changing output paths
- Manual directory creation scattered throughout your code
projio centralizes all path management in one place:
```python
from projio import PIO

# Configure once at startup
PIO.root = "./my_project"

# Use everywhere - paths are consistent, directories auto-created
data = pd.read_csv(PIO.data_dir / "raw" / "sales.csv")
model.save(PIO.checkpoint_path("best_model", run="exp_001"))
writer = SummaryWriter(PIO.tensorboard_run(run="exp_001"))
results.to_csv(PIO.path_for("outputs", "results", ext=".csv"))  # Auto-datestamped!
```

```sh
pip install projio
```

Or install from GitHub:

```sh
pip install git+https://github.com/s01st/project-io.git
```
```python
import tempfile

from projio import ProjectIO, PIO

# Create a ProjectIO instance
tmp = tempfile.mkdtemp()
io = ProjectIO(root=tmp, use_datestamp=False)
print(f"Root: {io.root}")
print(f"Outputs: {io.outputs}")
print(f"Cache: {io.cache}")
print(f"Checkpoints: {io.checkpoints}")
```

The problem: You have input data in one location and want outputs in another, but most paths should share a common base.
The solution: Set `root` once and `iroot`/`oroot` follow automatically. Override individually when needed.
```python
# Simple case: everything under one root
io = ProjectIO(root=tmp, use_datestamp=False)
print(f"Input root: {io.iroot}")
print(f"Output root: {io.oroot}")
print(f"Both follow root: {io.iroot == io.oroot == io.root}")
```

```python
# Advanced case: separate input/output locations
data_dir = tempfile.mkdtemp()
results_dir = tempfile.mkdtemp()

io = ProjectIO(
    root=tmp,
    iroot=data_dir,      # Read data from here
    oroot=results_dir,   # Write outputs here
    use_datestamp=False
)
print(f"Data comes from: {io.inputs}")
print(f"Results go to: {io.outputs}")
print(f"Config/resources from: {io.root}")
```

When to use:

- Shared datasets on network storage with local output directories
- Read-only input mounts (e.g., in containers; see the sketch below)
- Separating raw data from generated artifacts
The problem: You run an experiment, then run it again next week. The old results are overwritten and lost forever.
The solution: Automatic date-based organization keeps every run separate.
```python
io = ProjectIO(root=tmp, use_datestamp=True, datestamp_in="dirs", auto_create=False)
io.datestamp_value = lambda ts=None: "2024_03_15"  # Mock for demo

# Paths automatically include today's date
print(f"Output: {io.path_for('outputs', 'results', ext='.csv')}")
print(f"Checkpoint: {io.checkpoint_path('model', run='baseline')}")
```

```python
# Different placement options
from pathlib import Path

for placement in ["dirs", "files", "both"]:
    io = ProjectIO(root=tmp, use_datestamp=True, datestamp_in=placement, auto_create=False)
    io.datestamp_value = lambda ts=None: "2024_03_15"
    path = io.path_for('outputs', 'results', ext='.csv')
    # Show path relative to root (resolve symlinks for macOS compatibility)
    rel = path.relative_to(Path(tmp).resolve())
    print(f"{placement:5} -> {rel}")
```

When to use:

- Long-running projects with multiple experiment runs
- When you need to compare results across days/weeks (see the sketch below)
- Audit trails for regulatory compliance
- Any time you’ve ever overwritten important results
The problem: Different parts of your codebase use different conventions for organizing files.
The solution: One consistent API for all path types with automatic directory creation.
```python
io = ProjectIO(root=tmp, use_datestamp=False, auto_create=True)

# All path types use the same pattern
print("Path types:")
print(f"  outputs: {io.path_for('outputs', 'analysis', ext='.csv')}")
print(f"  cache: {io.path_for('cache', 'preprocessed', ext='.pkl')}")
print(f"  logs: {io.path_for('logs', 'training', ext='.log')}")
```

```python
# Subdirectories are easy
path = io.path_for('outputs', 'model', subdir=['experiment_1', 'fold_3'], ext='.pt')
print(f"Nested path: {path}")
print(f"Directory created: {path.parent.exists()}")
```

When to use:

- Any project with multiple output types
- When you want directories created automatically
- Team projects needing consistent organization
The problem: PyTorch Lightning projects need checkpoint directories, TensorBoard logs, and training logs - all organized consistently.
The solution: Built-in support for Lightning artifacts with dedicated path methods and callbacks.
```python
io = ProjectIO(root=tmp, use_datestamp=False)

# Lightning-specific paths
print(f"Lightning root: {io.lightning_root}")
print(f"Checkpoints: {io.checkpoints}")
print(f"TensorBoard: {io.tensorboard}")
```

```python
# Organized by run name
print(f"\nRun-specific paths:")
print(f"  Checkpoint: {io.checkpoint_path('epoch_10', run='baseline_v2')}")
print(f"  TensorBoard: {io.tensorboard_run(run='baseline_v2')}")
print(f"  Log: {io.log_path('metrics', run='baseline_v2')}")
```

```python
# Use callbacks for seamless integration
from projio.callbacks import IOCheckpointCallback, IOLogCallback

ckpt_cb = IOCheckpointCallback(io=io, run="experiment_1")
log_cb = IOLogCallback(io=io, run="experiment_1")
print(f"Checkpoint callback dir: {ckpt_cb.checkpoint_dir}")
print(f"Log callback dir: {log_cb.log_dir}")

# In your training script:
# trainer = Trainer(callbacks=[ckpt_cb, log_cb])
```

When to use:

- Any PyTorch Lightning project (see the sketch below)
- When you need consistent checkpoint/log organization
- Multi-run experiments with TensorBoard comparison
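Wiring the callbacks into an actual run is then a one-liner on the `Trainer`. A minimal sketch, where `MyModel` and `train_dl` are hypothetical stand-ins for your LightningModule and DataLoader:

```python
import lightning.pytorch as pl  # or: import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=10,
    callbacks=[ckpt_cb, log_cb],  # projio callbacks from above
)
# trainer.fit(MyModel(), train_dl)  # hypothetical model and dataloader
```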
The problem: Some datasets consist of multiple related files (e.g.,
10X Genomics output has matrix.mtx, barcodes.tsv, features.tsv).
Managing these paths individually is tedious.
The solution: Templates define file patterns that resolve to multiple paths at once.
```python
io = ProjectIO(root=tmp, use_datestamp=False, auto_create=False)

# Built-in template for single-cell data
paths = io.template_path("filtered_matrix")
print("10X Genomics filtered matrix files:")
for name, path in paths.items():
    print(f"  {name}: {path.name}")
```

```python
# Register your own templates
from projio.funcs import TemplateSpec

# Template for a trained model package
model_template = TemplateSpec(
    name="trained_model",
    base="outputs",
    pattern={
        "weights": "model/weights.pt",
        "config": "model/config.json",
        "tokenizer": "model/tokenizer.json",
        "metrics": "model/eval_metrics.json"
    }
)
io.register_template(model_template)

paths = io.template_path("trained_model")
print("\nTrained model files:")
for name, path in paths.items():
    # Show path relative to root (resolve symlinks for macOS compatibility)
    rel = path.relative_to(Path(tmp).resolve())
    print(f"  {name}: {rel}")
```

When to use:

- Bioinformatics (10X, FASTQ pairs, BAM+BAI)
- ML model artifacts (weights, config, tokenizer; see the sketch below)
- Any multi-file data format
The problem: You find an output file but can’t remember which script created it or when.
The solution: Track which scripts produce which files for full reproducibility.
```python
from pathlib import Path

io = ProjectIO(root=tmp, use_datestamp=False)

# Track what this script produces
output_file = io.path_for('outputs', 'processed_data', ext='.parquet')
io.track_producer(
    target=output_file,
    producer=Path('preprocess.py'),
    kind='data'
)

model_file = io.path_for('outputs', 'model', ext='.pt')
io.track_producer(
    target=model_file,
    producer=Path('train.py'),
    kind='model'
)

# Later, find out what produced a file
print("Who produced the model?")
for record in io.producers_of(model_file):
    print(f"  {record.producer.name} ({record.kind})")

# Or find all outputs from a script
print("\nWhat does train.py produce?")
for record in io.outputs_of(Path('train.py')):
    print(f"  {record.target.name}")
```

When to use:

- Complex pipelines with many intermediate outputs (see the helper sketch below)
- Debugging data lineage issues
- Reproducibility requirements
The problem: You want to preview what paths will be created without actually touching the filesystem.
The solution: Dry-run mode returns paths but doesn’t create directories or write files.
```python
dry_tmp = tempfile.mkdtemp()
io = ProjectIO(root=dry_tmp, dry_run=True)

# Get paths without creating anything
checkpoint = io.checkpoint_path('model', run='test')
output = io.path_for('outputs', 'results', ext='.csv')
print(f"Would create checkpoint: {checkpoint}")
print(f"Would create output: {output}")
print(f"\nDirectories actually created: {any(Path(dry_tmp).iterdir())}")
```

When to use:

- Testing path configuration before running experiments
- CI/CD pipelines that need to validate paths (see the sketch below)
- Debugging path issues
The problem: You need to temporarily change settings (e.g., disable datestamps for a specific operation) then restore them.
The solution: The `using()` context manager handles save/restore automatically.

```python
io = ProjectIO(root=tmp, use_datestamp=True, auto_create=True)
print(f"Normal mode: datestamp={io.use_datestamp}, auto_create={io.auto_create}")

with io.using(use_datestamp=False, auto_create=False):
    print(f"Inside context: datestamp={io.use_datestamp}, auto_create={io.auto_create}")
    # Operations here use the temporary settings

print(f"After context: datestamp={io.use_datestamp}, auto_create={io.auto_create}")
```

When to use:

- Writing config files that shouldn’t be datestamped (see the sketch below)
- Temporary dry-run for validation
- Any setting override that should be scoped
The problem: You need to access paths from anywhere in your codebase without passing an `io` instance everywhere.

The solution: The `PIO` class provides singleton-style access, similar to `scanpy.settings`.

```python
from projio import PIO, ProjectIO

# Configure once at the start of your application
PIO.default = ProjectIO(root=tmp, use_datestamp=False)

# Access from anywhere without passing io around
print(f"PIO.root: {PIO.root}")
print(f"PIO.outputs: {PIO.outputs}")
print(f"PIO.checkpoints: {PIO.checkpoints}")
```

```python
# Methods work too
path = PIO.path_for('cache', 'embeddings', ext='.npy')
print(f"Cache path via PIO: {path}")
```

When to use:

- Large codebases where dependency injection is impractical (see the sketch below)
- Interactive notebook workflows
- Quick scripts where you want minimal boilerplate
The problem: You want to quickly see what directory structure has been created.
The solution: Built-in ASCII tree rendering.
```python
# Create some structure
viz_tmp = tempfile.mkdtemp()
io = ProjectIO(root=viz_tmp, use_datestamp=False, auto_create=True)

# Access paths to create directories
_ = io.outputs
_ = io.cache
_ = io.checkpoints
_ = io.tensorboard
_ = io.path_for('outputs', 'exp1', subdir='run_1', ext='.txt')
_ = io.path_for('outputs', 'exp1', subdir='run_2', ext='.txt')

# Visualize
print(io.tree(io.root, max_depth=3))
```

Here’s a realistic example combining multiple features:
```python
import tempfile
from pathlib import Path

from projio import ProjectIO
from projio.callbacks import IOCheckpointCallback, IOLogCallback

# Project setup - configure once
project_root = tempfile.mkdtemp()
io = ProjectIO(
    root=project_root,
    use_datestamp=True,
    datestamp_in="dirs",
    auto_create=True
)
io.datestamp_value = lambda ts=None: "2024_03_15"  # Mock for demo

run_name = "baseline_v1"

# Data loading
raw_data = io.inputs / "raw" / "dataset.csv"
print(f"Load data from: {raw_data}")

# Preprocessing with caching
cache_path = io.path_for('cache', 'preprocessed', ext='.pkl')
print(f"Cache preprocessed data: {cache_path}")

# Training with Lightning
ckpt_cb = IOCheckpointCallback(io=io, run=run_name)
log_cb = IOLogCallback(io=io, run=run_name)
print(f"Checkpoints: {ckpt_cb.checkpoint_dir}")
print(f"TensorBoard: {log_cb.log_dir}")

# Save final results
results_path = io.path_for('outputs', 'metrics', subdir=run_name, ext='.json')
print(f"Save results: {results_path}")

# Track what we produced
io.track_producer(results_path, Path('train.py'), kind='metrics')

# View the structure
print(f"\nProject structure:")
print(io.tree(io.root))
```
| Property | Description |
|---|---|
| `root` | Shared base path (cascades to `iroot`/`oroot`) |
| `iroot` / `inputs` | Input/data root |
| `oroot` / `outputs` | Output root |
| `cache` | Cache directory |
| `logs` | Logs directory |
| `data_dir` | Data directory under inputs |
| `downloads` | Downloads directory under inputs |
| `lightning_root` | Root for Lightning artifacts |
| `checkpoints` | Checkpoints directory |
| `tensorboard` | TensorBoard logs directory |
| `resources` | Package resources directory |
| Method | Description |
|---|---|
| `path_for(kind, name, ...)` | Build path for a given kind |
| `checkpoint_path(name, ...)` | Build checkpoint file path |
| `log_path(name, ...)` | Build log file path |
| `tensorboard_run(run, ...)` | Build TensorBoard run directory |
| `resource_path(*parts, ...)` | Get path to a resource file |
| `template_path(name, ...)` | Resolve a template to paths |
| Parameter | Default | Description |
|---|---|---|
| `root` | cwd | Base directory for all paths |
| `iroot` | `root` | Input/data root (overrides cascade) |
| `oroot` | `root` | Output root (overrides cascade) |
| `use_datestamp` | `True` | Enable automatic datestamps |
| `datestamp_format` | `%Y_%m_%d` | strftime format for dates |
| `datestamp_in` | `dirs` | Where to add datestamp: dirs/files/both/none |
| `auto_create` | `True` | Automatically create directories |
| `dry_run` | `False` | Preview mode - don’t create anything |
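Put together, a fully explicit constructor call looks like this; the paths are illustrative:

```python
from projio import ProjectIO

io = ProjectIO(
    root="/projects/demo",           # base directory for all paths
    iroot="/data/shared/datasets",   # override the input side of the cascade
    oroot="/projects/demo/outputs",  # override the output side
    use_datestamp=True,
    datestamp_format="%Y_%m_%d",
    datestamp_in="dirs",
    auto_create=True,
    dry_run=False,
)
```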
| Method | Description |
|---|---|
| `datestamp_value()` | Get formatted datestamp string |
| `parse_datestamp(text)` | Parse datestamp to datetime |
| `tree(path, ...)` | Render ASCII directory tree |
| `describe()` | Get dict of current configuration |
| `using(**overrides)` | Context manager for temp overrides |
| `track_producer(...)` | Record file provenance |
| `producers_of(path)` | Find what produced a file |
| `outputs_of(script)` | Find what a script produces |
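`describe()` is convenient for logging the active configuration at the start of a run; a quick sketch:

```python
io = ProjectIO(root=tmp, use_datestamp=True)

# Snapshot the current configuration, e.g. to store alongside experiment metadata
for key, value in io.describe().items():
    print(f"{key}: {value}")
```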
For more detailed examples, see the tutorials:
- Quick Start - Basic usage and concepts
- Datestamp Handling - Date-based organization
- Lightning Integration - PyTorch Lightning workflows
- Templates - Multi-file dataset patterns
- Advanced Features - Producer tracking, dry-run, gitignore
```sh
# Clone the repository
git clone https://github.com/s01st/project-io.git
cd project-io

# Install in development mode
pip install -e .

# Make changes under nbs/ directory
# ...

# Export and test
nbdev_prepare
```