TemporalShiftLab is a script-first benchmark for time-series anomaly detection under shift. It focuses on failure behavior (noise, missingness, drift), not just clean-data accuracy.

Reported metrics:
- point-level precision/recall/F1
- event-level precision/recall/F1
- mean detection latency across anomaly windows
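The metrics above can be sketched in plain Python. This is a minimal illustration that assumes binary per-timestep labels and predictions. The event-matching rules (any overlap counts as a detection) and the latency definition (first flagged timestep minus event start) are assumptions and may differ from the metric code in core/.

```python
from typing import List, Optional, Sequence, Tuple

Segment = Tuple[int, int]  # inclusive (start, end) indices


def _f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


def point_prf(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Point-level precision/recall/F1: every timestep is scored on its own."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if p and not t)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, _f1(precision, recall)


def segments(labels: Sequence[int]) -> List[Segment]:
    """Return each contiguous run of 1s as an inclusive (start, end) pair."""
    runs: List[Segment] = []
    start: Optional[int] = None
    for i, v in enumerate(labels):
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(labels) - 1))
    return runs


def _overlaps(a: Segment, b: Segment) -> bool:
    return a[0] <= b[1] and b[0] <= a[1]


def event_prf_latency(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float, float, Optional[float]]:
    """Event-level precision/recall/F1 plus mean detection latency in timesteps.

    Assumed matching rules (illustrative, not necessarily the repo's):
    - a true event counts as detected if any predicted segment overlaps it;
    - a predicted segment counts as correct if it overlaps any true event;
    - latency = first flagged index inside a detected event minus the event start.
    """
    true_events, pred_events = segments(y_true), segments(y_pred)
    detected = [e for e in true_events if any(_overlaps(e, p) for p in pred_events)]
    correct = [p for p in pred_events if any(_overlaps(p, e) for e in true_events)]
    precision = len(correct) / len(pred_events) if pred_events else 0.0
    recall = len(detected) / len(true_events) if true_events else 0.0
    # Overlap guarantees at least one flagged index inside each detected event.
    lags = [next(i for i in range(s, e + 1) if y_pred[i]) - s for s, e in detected]
    latency = sum(lags) / len(lags) if lags else None
    return precision, recall, _f1(precision, recall), latency


y_true = [0, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0, 0, 0]
print(point_prf(y_true, y_pred))          # about (0.667, 0.4, 0.5)
print(event_prf_latency(y_true, y_pred))  # (0.5, 0.5, 0.5, 1.0)
```

Point-level scores weight long anomalies by their length, while event-level scores count each window once. As a result, the two can diverge sharply on the same predictions.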
Repository layout:
- core/: detectors, dataset loading, stress transforms, metrics, experiment runner
- scripts/: benchmark/stress scripts + full experiment orchestrator
- configs/: experiment schema instances (experiment.yaml)
- reports/: generated outputs and run artifacts

Setup:
python -m venv .venv
# Windows PowerShell
.\.venv\Scripts\Activate.ps1
# Linux/macOS: source .venv/bin/activate
pip install -r requirements.txt
# optional detectors (autoencoder baseline)
pip install -r requirements-optional.txt

Quickstart:
# install
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt
# run full experiment
python -m scripts.run_all --config configs/experiment.yaml
python -m scripts.export_run_report
python -m scripts.plot_run_summary --dataset synthetic_demo --stress baseline --metric event_f1
# inspect outputs
Get-ChildItem reports/runs
Get-Content reports/runs/<latest_run_dir>/run_manifest.json
Get-Content reports/runs/<latest_run_dir>/EXPERIMENT_REPORT.md

Individual entry points:
python -m scripts.run_benchmark
python -m scripts.run_stress_suite
python -m scripts.export_report
python -m scripts.run_all --config configs/experiment.yaml
python -m scripts.export_run_report

Select detectors explicitly:
python -m scripts.run_benchmark --detectors rolling_zscore,rolling_mad
python -m scripts.run_stress_suite --detectors rolling_zscore,rolling_mad

scripts/run_all.py executes the full experiment matrix across:
- datasets (synthetic + real CSV)
- detectors
- stress cases
- seeds
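The matrix is a Cartesian product of the four axes. Below is a minimal sketch that assumes hypothetical config keys (datasets, detectors, stress_cases, seeds) and illustrative names such as real_sample and noise. It only enumerates run specs and does not execute any detector.

```python
import itertools
from typing import Dict, Iterator, List


def expand_matrix(config: Dict[str, List]) -> Iterator[Dict[str, object]]:
    """Yield one run spec per (dataset, detector, stress case, seed) combination.

    The config keys used here are hypothetical; configs/experiment.yaml may name
    them differently.
    """
    axes = ("datasets", "detectors", "stress_cases", "seeds")
    for dataset, detector, stress, seed in itertools.product(*(config[k] for k in axes)):
        yield {"dataset": dataset, "detector": detector, "stress": stress, "seed": seed}


config = {
    "datasets": ["synthetic_demo", "real_sample"],
    "detectors": ["rolling_zscore", "rolling_mad"],
    "stress_cases": ["baseline", "noise"],
    "seeds": [0, 1, 2],
}
runs = list(expand_matrix(config))
print(len(runs))  # 2 * 2 * 2 * 3 = 24
print(runs[0])    # {'dataset': 'synthetic_demo', 'detector': 'rolling_zscore', 'stress': 'baseline', 'seed': 0}
```

The run count grows multiplicatively, so adding one more stress case or seed scales every other axis along with it.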
scripts/run_all.py also writes run governance artifacts:
- run_manifest.json (git SHA, timestamp, seeds, detector list, config hash)
- resolved_config.yaml
- raw_runs.json
- summary_stats.json (mean/std + 95% bootstrap CI)
- significance_event_f1.json (pairwise sign-test summary)
- skipped.json
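Three of the statistics named above (the config hash, the 95% bootstrap CI and the pairwise sign test) can be sketched with the standard library. The hashing scheme (SHA-256 over sorted-key JSON), the percentile bootstrap with 2000 resamples and the exact two-sided sign test are assumptions about how such artifacts are typically produced. They do not describe the internals of run_all.py.

```python
import hashlib
import json
import math
import random
from typing import Dict, Sequence, Tuple


def config_hash(config: Dict) -> str:
    """SHA-256 of canonical JSON (sorted keys), so key order does not change the hash."""
    blob = json.dumps(config, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def bootstrap_ci(
    values: Sequence[float], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean (95% by default)."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo = means[round((alpha / 2) * (n_boot - 1))]
    hi = means[round((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


def sign_test(a: Sequence[float], b: Sequence[float]) -> Tuple[int, int, float]:
    """Two-sided exact sign test on paired scores (ties are dropped).

    Returns (wins for a, wins for b, p-value).
    """
    wins_a = sum(1 for x, y in zip(a, b) if x > y)
    wins_b = sum(1 for x, y in zip(a, b) if y > x)
    n = wins_a + wins_b
    if n == 0:
        return 0, 0, 1.0
    k = min(wins_a, wins_b)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins_a, wins_b, min(1.0, 2 * tail)


# Illustrative event_f1 values per seed, not benchmark results.
f1_a = [0.62, 0.58, 0.66, 0.61, 0.64]
f1_b = [0.55, 0.57, 0.60, 0.52, 0.59]
print(bootstrap_ci(f1_a))
print(sign_test(f1_a, f1_b))  # (5, 0, 0.0625)
```

With only five paired seeds, the smallest achievable two-sided sign-test p-value is 2 * (1/2)^5 = 0.0625. A 5-seed comparison can therefore never reach p < 0.05, and reaching that level requires more paired seeds or datasets.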
Threshold strategy is configurable in configs/experiment.yaml:
- validation (quantile search on a calibration split)
- percentile (single quantile q)
- fixed (manual value)
- robust_mad (median + k*MAD)
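The four strategies can be sketched as small functions. The sketch rests on several assumptions. Scores strictly above the threshold are flagged. robust_mad uses the raw MAD without the common 1.4826 scaling. validation maximizes point F1 over a hypothetical list of candidate quantiles. The names select_threshold, value, q and k are illustrative and are not the keys used in configs/experiment.yaml.

```python
import statistics
from typing import Optional, Sequence


def quantile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated quantile for q in [0, 1]."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def robust_mad_threshold(scores: Sequence[float], k: float = 3.0) -> float:
    """median + k * MAD (raw MAD, without the 1.4826 normal-consistency factor)."""
    med = statistics.median(scores)
    mad = statistics.median([abs(x - med) for x in scores])
    return med + k * mad


def validation_threshold(
    cal_scores: Sequence[float],
    cal_labels: Sequence[int],
    quantiles: Sequence[float] = (0.90, 0.95, 0.975, 0.99, 0.995),
) -> float:
    """Pick the candidate quantile threshold with the best point F1 on a labeled calibration split."""
    best_t, best_f1 = float("nan"), -1.0
    for q in quantiles:
        t = quantile(cal_scores, q)
        pred = [s > t for s in cal_scores]  # points strictly above t are flagged
        tp = sum(1 for p, y in zip(pred, cal_labels) if p and y)
        fp = sum(1 for p, y in zip(pred, cal_labels) if p and not y)
        fn = sum(1 for p, y in zip(pred, cal_labels) if y and not p)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # ties keep the earlier (lower-quantile) candidate
            best_t, best_f1 = t, f1
    return best_t


def select_threshold(
    strategy: str,
    scores: Sequence[float],
    *,
    value: Optional[float] = None,
    q: float = 0.99,
    k: float = 3.0,
    cal_scores: Optional[Sequence[float]] = None,
    cal_labels: Optional[Sequence[int]] = None,
) -> float:
    """Dispatch on the strategy name; `scores` are the reference anomaly scores."""
    if strategy == "fixed":
        if value is None:
            raise ValueError("fixed strategy needs a value")
        return float(value)
    if strategy == "percentile":
        return quantile(scores, q)
    if strategy == "robust_mad":
        return robust_mad_threshold(scores, k)
    if strategy == "validation":
        if cal_scores is None or cal_labels is None:
            raise ValueError("validation strategy needs a labeled calibration split")
        return validation_threshold(cal_scores, cal_labels)
    raise ValueError(f"unknown threshold strategy: {strategy!r}")
```

robust_mad is the only strategy here that needs neither labels nor a tuned quantile. It also resists a few extreme scores, which is why it is often preferred under noise stress.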
Generated files:
- reports/benchmark_result.json
- reports/stress_suite.json
- reports/REPORT.md
- reports/figures/benchmark_f1.png
- reports/figures/stress_event_f1.png
- reports/runs/<timestamp>_<experiment_name>/...
- reports/runs/<timestamp>_<experiment_name>/plot_<dataset>_<stress>_<metric>.png
This repo intentionally excludes backend/frontend/container layers to stay focused on reproducible AI evaluation logic for portfolio review.
Detectors are registered in core/detectors.py and follow a simple contract:
- fit(train_values)
- score(test_values) -> anomaly_scores
Add a new detector class there, register its name in _registry(), then pass it via --detectors.
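A detector that satisfies this contract might look like the sketch below, together with a registry lookup for the comma-separated --detectors value. RollingZScore, its window default and build_detectors are hypothetical stand-ins. Only the fit/score contract and the _registry() name come from this README.

```python
import math
from typing import Dict, List, Sequence, Type


class RollingZScore:
    """Illustrative detector following the fit(train_values) / score(test_values) contract.

    Score = |x - mean| / std over the previous `window` test points; falls back to
    training statistics when the window is too short or has zero spread.
    """

    def __init__(self, window: int = 20) -> None:
        self.window = window
        self.train_mean = 0.0
        self.train_std = 1.0

    def fit(self, train_values: Sequence[float]) -> "RollingZScore":
        n = len(train_values)
        self.train_mean = sum(train_values) / n
        var = sum((x - self.train_mean) ** 2 for x in train_values) / n
        self.train_std = math.sqrt(var) or 1.0  # guard against a constant series
        return self

    def score(self, test_values: Sequence[float]) -> List[float]:
        scores = []
        for i, x in enumerate(test_values):
            past = test_values[max(0, i - self.window):i]
            if len(past) >= 2:
                mu = sum(past) / len(past)
                sd = math.sqrt(sum((v - mu) ** 2 for v in past) / len(past)) or self.train_std
            else:
                mu, sd = self.train_mean, self.train_std
            scores.append(abs(x - mu) / sd)
        return scores


def _registry() -> Dict[str, Type]:
    """Name -> detector class; a stand-in for the real registry in core/detectors.py."""
    return {"rolling_zscore": RollingZScore}


def build_detectors(spec: str) -> Dict[str, object]:
    """Turn a --detectors value such as "rolling_zscore,rolling_mad" into instances."""
    registry = _registry()
    names = [n.strip() for n in spec.split(",") if n.strip()]
    unknown = [n for n in names if n not in registry]
    if unknown:
        raise KeyError(f"unknown detector(s): {', '.join(unknown)}")
    return {name: registry[name]() for name in names}


detectors = build_detectors("rolling_zscore")
scores = detectors["rolling_zscore"].fit([0.0, 1.0, 2.0]).score([1.0, 1.0, 1.0, 10.0])
print(scores)  # first three are 0.0; the jump to 10.0 scores about 11.0
```

Only rolling_zscore is registered in this sketch, so build_detectors("rolling_zscore,rolling_mad") raises KeyError. Failing loudly on unknown names is one reasonable choice, because silently dropping them would shrink the experiment matrix without notice.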
Current detector set includes:
- rolling_zscore
- rolling_mad
- isolation_forest
- autoencoder (optional; requires torch)
If torch is not installed, autoencoder runs are captured in skipped artifacts with a clear reason.
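One way to implement this gating is to probe for the optional module before building the matrix. The detector-to-module mapping and the record fields (detector, reason) in this sketch are illustrative, and the actual layout of skipped.json is not specified here. importlib.util.find_spec checks whether torch is available without importing it.

```python
import importlib.util
import json
from typing import Dict, List, Tuple

# Detector name -> module it needs; this mapping is an assumption for illustration.
OPTIONAL_DEPS: Dict[str, str] = {"autoencoder": "torch"}


def partition_detectors(
    names: List[str], deps: Dict[str, str] = OPTIONAL_DEPS
) -> Tuple[List[str], List[Dict[str, str]]]:
    """Split detectors into runnable names and skipped records that carry a reason."""
    runnable: List[str] = []
    skipped: List[Dict[str, str]] = []
    for name in names:
        module = deps.get(name)
        if module is not None and importlib.util.find_spec(module) is None:
            skipped.append({"detector": name,
                            "reason": f"optional dependency '{module}' is not installed"})
        else:
            runnable.append(name)
    return runnable, skipped


runnable, skipped = partition_detectors(["rolling_zscore", "autoencoder"])
print(runnable)
print(json.dumps(skipped, indent=2))  # an empty list when torch is importable
```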
Real CSV datasets must include:
- value (float)
- label (0/1)
An optional timestamp column is allowed; the runner ignores it.
See demo_data/real_sample.csv and configs/experiment.yaml.
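Here is a minimal loader for this format that uses only the standard library (the repo may well use pandas instead). It checks for the two required columns, validates that labels are 0 or 1, and ignores any other column such as timestamp. The function name is hypothetical.

```python
import csv
import io
from typing import List, TextIO, Tuple


def load_labeled_csv(handle: TextIO) -> Tuple[List[float], List[int]]:
    """Read the required value/label columns; any other column (e.g. timestamp) is ignored."""
    reader = csv.DictReader(handle)
    missing = {"value", "label"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required column(s): {sorted(missing)}")
    values: List[float] = []
    labels: List[int] = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"line {line_no}: label must be 0 or 1, got {label}")
        values.append(float(row["value"]))
        labels.append(label)
    return values, labels


sample = io.StringIO("timestamp,value,label\n2024-01-01,0.5,0\n2024-01-02,9.1,1\n")
print(load_labeled_csv(sample))  # ([0.5, 9.1], [0, 1])
```

To read a file on disk, pass an open handle, for example: with open("demo_data/real_sample.csv", newline="") as f: values, labels = load_labeled_csv(f)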
UCR-style datasets are supported as first-class config entries:
- type: ucr_style
- train_path, test_path
- optional labels_path, with either start,end windows or a label column
- for UCR 2018 class datasets, use labels_path: __from_test_first_col__ to map the class label in the test file's first column to an anomaly proxy (minimum class => normal, all other classes => anomaly)
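The __from_test_first_col__ rule can be sketched as follows. UCR 2018 files are tab-separated, with the class label in the first column and one series per row, so this produces one binary label per series. How the runner then maps per-series labels onto timesteps is not described here, and the helper names are hypothetical.

```python
from typing import List, Sequence


def parse_ucr_rows(text: str) -> List[List[float]]:
    """Parse UCR 2018 rows: tab-separated, class label first, then the series values."""
    return [[float(x) for x in line.split("\t")] for line in text.splitlines() if line.strip()]


def first_col_anomaly_labels(rows: Sequence[Sequence[float]]) -> List[int]:
    """Minimum class -> 0 (normal); every other class -> 1 (anomaly proxy)."""
    classes = [row[0] for row in rows]
    normal = min(classes)
    return [0 if c == normal else 1 for c in classes]


test_text = "1\t0.1\t0.2\n2\t0.3\t0.4\n1\t0.5\t0.6\n3\t0.0\t0.0\n"
rows = parse_ucr_rows(test_text)
print(first_col_anomaly_labels(rows))  # [0, 1, 0, 1]
```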
SMAP/MSL is scaffolded for later drop-in:
- type: smap_msl_scaffold
- root_dir containing train.csv, test.csv, labels.csv
- sample structure provided in demo_data/smap_msl_scaffold/
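Below is a small pre-flight check for the scaffold layout. It assumes only the three file names listed above, and the helper names are hypothetical. Calling resolve_scaffold("demo_data/smap_msl_scaffold") would return the three paths or raise FileNotFoundError naming whatever is missing.

```python
from pathlib import Path
from typing import Dict, List, Union

REQUIRED_FILES = ("train.csv", "test.csv", "labels.csv")


def missing_scaffold_files(root_dir: Union[str, Path]) -> List[str]:
    """List the required scaffold files that are absent from root_dir."""
    root = Path(root_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def resolve_scaffold(root_dir: Union[str, Path]) -> Dict[str, Path]:
    """Return paths to train/test/labels, failing early with a clear message."""
    missing = missing_scaffold_files(root_dir)
    if missing:
        raise FileNotFoundError(f"{root_dir} is missing: {', '.join(missing)}")
    root = Path(root_dir)
    return {name: root / name for name in REQUIRED_FILES}
```

Checking the layout before any detector runs means a misplaced SMAP/MSL drop-in fails in seconds instead of partway through the matrix.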
Current benchmark config uses the repo-root data/ folder directly:
- data/UCRArchive_2018/... (UCR representative subsets)
- data/archive/... (SMAP/MSL with official labeled_anomalies.csv)
- data/archive (2)/... + data/nab_combined_windows.json (NAB combined-windows format)
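NAB's combined-windows format maps each data file's relative path to a list of [start, end] timestamp pairs. The following sketch turns those windows into per-timestamp 0/1 labels. It assumes inclusive bounds and ISO-style timestamps, and the file key used below is hypothetical.

```python
import json
from datetime import datetime
from typing import List, Sequence


def window_labels(windows_json: str, file_key: str, timestamps: Sequence[str]) -> List[int]:
    """Label each timestamp 1 if it falls inside any [start, end] window (inclusive)."""
    windows = json.loads(windows_json).get(file_key, [])
    bounds = [(datetime.fromisoformat(a), datetime.fromisoformat(b)) for a, b in windows]
    labels = []
    for ts in timestamps:
        t = datetime.fromisoformat(ts)
        labels.append(int(any(a <= t <= b for a, b in bounds)))
    return labels


windows_json = json.dumps(
    {"realKnownCause/example.csv": [["2014-01-01 00:10:00.000000", "2014-01-01 00:20:00.000000"]]}
)
stamps = ["2014-01-01 00:05:00", "2014-01-01 00:10:00", "2014-01-01 00:15:00", "2014-01-01 00:25:00"]
print(window_labels(windows_json, "realKnownCause/example.csv", stamps))  # [0, 1, 1, 0]
```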
Implemented:
- point precision/recall/F1 + event precision/recall/F1 + latency
- multiple threshold strategies (fixed/percentile/validation/robust_mad)
- stress tests for missingness/noise/drift
- UCR-style dataset flow + SMAP/MSL scaffold path
- optional autoencoder baseline path (dependency-gated)
- configurable experiment matrix with CI/significance + run manifests
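The three stress families (missingness, noise, drift) can be sketched as simple transforms on a list of floats. All of the following are hypothetical choices for illustration, not the settings in core/: the noise strength, missingness rate, drift slope, NaN marker, forward-fill imputation, and every case name other than baseline.

```python
import math
import random
from typing import List, Sequence


def add_noise(values: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Additive Gaussian noise with standard deviation sigma."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def inject_missingness(values: Sequence[float], rate: float, rng: random.Random) -> List[float]:
    """Drop each point independently with probability `rate`, marking it as NaN."""
    return [math.nan if rng.random() < rate else v for v in values]


def add_drift(values: Sequence[float], slope: float) -> List[float]:
    """Linear level drift: point i is shifted by slope * i."""
    return [v + slope * i for i, v in enumerate(values)]


def forward_fill(values: Sequence[float]) -> List[float]:
    """Fill NaN gaps with the last observed value; leading gaps take the first observed value.

    An all-NaN input becomes all 0.0.
    """
    first = next((v for v in values if not math.isnan(v)), 0.0)
    last, out = first, []
    for v in values:
        if math.isnan(v):
            out.append(last)
        else:
            last = v
            out.append(v)
    return out


def apply_stress(values: Sequence[float], case: str, seed: int = 0) -> List[float]:
    """Hypothetical case names and strengths; the repo's stress config may differ."""
    rng = random.Random(seed)
    if case == "baseline":
        return list(values)
    if case == "noise":
        return add_noise(values, sigma=0.5, rng=rng)
    if case == "missingness":
        return forward_fill(inject_missingness(values, rate=0.2, rng=rng))
    if case == "drift":
        return add_drift(values, slope=0.01)
    raise ValueError(f"unknown stress case: {case!r}")
```

Seeding a fresh random.Random per call keeps each (stress case, seed) cell reproducible, which is what lets per-seed results be paired across detectors for the sign test.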
Not implemented in this script-only repo:
- frontend threshold-lab UI interactions
- backend API routes/pages specified in the original full-stack PRD