Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,9 @@ dev = [
"pytest>=6.0",
"ruff>=0.1.0",
]
workflow = [
"snakemake>=8.0.0",
]

[project.urls]
Homepage = "https://github.com/mvdoc/hyperface-data-paper"
Expand Down
284 changes: 284 additions & 0 deletions workflow/Snakefile
Original file line number Diff line number Diff line change
@@ -0,0 +1,284 @@
"""Snakemake workflow for QA pipeline plots.

This workflow runs the QA pipeline scripts to generate:
- Motion plots (directly from fMRIprep confounds)
- tSNR plots (requires tSNR computation first)
- ISC plots (requires ISC computation first)

Usage:
# Run all QA pipelines (from project root)
snakemake --snakefile workflow/Snakefile --cores 1

# Run specific pipeline
snakemake --snakefile workflow/Snakefile motion_plots --cores 1
snakemake --snakefile workflow/Snakefile tsnr_plots --cores 1
snakemake --snakefile workflow/Snakefile isc_plots --cores 1

# Dry run to see what would be executed
snakemake --snakefile workflow/Snakefile -n

# Process specific subjects (motion/tsnr only)
snakemake --snakefile workflow/Snakefile motion_plots --cores 1 --config subjects="sub-001 sub-002"

# Force rerun of specific target
snakemake --snakefile workflow/Snakefile tsnr_plots --cores 1 --forcerun
"""

import shutil
import sys
from pathlib import Path

# Get project root (parent of workflow directory)
PROJECT_ROOT = Path(workflow.basedir).parent

# Configuration - use path relative to workflow directory
configfile: workflow.source_path("config/config.yaml")

# Python interpreter - prefer active interpreter, fallback to venv
VENV_PYTHON = PROJECT_ROOT / ".venv" / "bin" / "python"
if Path(sys.executable).is_file():
PYTHON = str(sys.executable)
elif VENV_PYTHON.is_file():
PYTHON = str(VENV_PYTHON)
else:
# Fallback to system python
PYTHON = shutil.which("python3") or shutil.which("python") or "python"

# Scripts directory
SCRIPTS_DIR = str(PROJECT_ROOT / "scripts" / "qa")

# Data directories (from qa_config.yaml defaults)
DATA_DIR = str(PROJECT_ROOT / "data")
DERIVATIVES_DIR = f"{DATA_DIR}/derivatives"
FMRIPREP_DIR = f"{DERIVATIVES_DIR}/fmriprep"
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FMRIPREP_DIR is defined but never used in this Snakefile, which can be confusing when maintaining the workflow. Either remove it or use it when wiring rule inputs (e.g., as part of declared inputs for motion/tSNR rules).

Suggested change
FMRIPREP_DIR = f"{DERIVATIVES_DIR}/fmriprep"

Copilot uses AI. Check for mistakes.
QA_DIR = f"{DERIVATIVES_DIR}/qa"
TSNR_DIR = f"{QA_DIR}/tsnr"
MOTION_DIR = f"{QA_DIR}/motion"
ISC_DIR = f"{QA_DIR}/isc"
Comment on lines +50 to +57
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The workflow hard-codes QA output locations under data/derivatives/qa instead of deriving them from the Hyperface QA config (which can be overridden via --config, --data-dir, or HYPERFACE_QA_CONFIG). This can lead to logs/cleanup targeting a different directory than the scripts actually write to. Consider loading hyperface.qa.get_config() (or parsing the same YAML) in the Snakefile and building QA_DIR/TSNR_DIR/MOTION_DIR/ISC_DIR from config.paths, and/or exposing qa_config/data_dir as snakemake config entries passed through to the scripts.

Copilot uses AI. Check for mistakes.


def get_subjects_arg():
"""Get --subjects argument if configured."""
subjects = config.get("subjects", "")
if subjects:
return f"--subjects {subjects}"
Comment on lines +60 to +64
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subjects is interpolated into the shell: command as a raw string. Because it comes from snakemake config, this makes the command fragile (subjects containing whitespace/shell metacharacters) and enables shell injection via --config subjects=.... Prefer passing subjects as a list (e.g., store parsed tokens and use snakemake.utils.quote/shlex.join for safe rendering) or move execution to a run: block using subprocess.run([...]) to avoid shell parsing.

Copilot uses AI. Check for mistakes.
return ""


# Declare rules that don't create output files (run locally, not on cluster)
localrules: all, clean_motion, clean_tsnr, clean_isc, clean_all


# Default target: run all QA pipelines
rule all:
input:
f"{MOTION_DIR}/.motion_plots.done" if config.get("run_motion", True) else [],
f"{TSNR_DIR}/.tsnr_plots.done" if config.get("run_tsnr", True) else [],
f"{ISC_DIR}/.isc_plots.done" if config.get("run_isc", True) else [],


# =============================================================================
# Motion Plots Pipeline
# =============================================================================
# Motion plots read directly from fMRIprep confounds files (no pre-computation)

rule motion_plots:
"""Generate motion QA plots from fMRIprep confounds."""
input:
script=f"{SCRIPTS_DIR}/qa-plot-motion.py",
output:
done=touch(f"{MOTION_DIR}/.motion_plots.done"),
log:
f"{MOTION_DIR}/logs/motion_plots.log",
params:
subjects=get_subjects_arg(),
shell:
"""
mkdir -p {MOTION_DIR}/logs
{PYTHON} {input.script} {params.subjects} 2>&1 | tee {log}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Untrusted params.subjects is interpolated directly into this shell command, so shell metacharacters in the subjects config (e.g., passed via --config subjects=...) can break out of the intended --subjects argument and execute arbitrary commands. Because Snakemake formats this into /bin/sh without additional quoting, an attacker who can influence subjects could run code with the privileges of the Snakemake user. To mitigate this, ensure the subjects list is passed as properly quoted arguments or via a non-shell interface (for example, using Snakemake's safer argument passing mechanisms instead of raw string interpolation).

Copilot uses AI. Check for mistakes.
"""


# =============================================================================
# tSNR Pipeline
# =============================================================================
# tSNR plots require pre-computed tSNR volumes

rule compute_tsnr:
"""Compute tSNR volumes from fMRIprep BOLD data."""
input:
script=f"{SCRIPTS_DIR}/qa-save-tsnr-volume.py",
output:
done=touch(f"{TSNR_DIR}/.tsnr_computed.done"),
log:
f"{TSNR_DIR}/logs/compute_tsnr.log",
params:
subjects=get_subjects_arg(),
shell:
"""
mkdir -p {TSNR_DIR}/logs
{PYTHON} {input.script} {params.subjects} 2>&1 | tee {log}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Untrusted params.subjects is interpolated directly into this shell command, so shell metacharacters in the subjects config (e.g., via --config subjects=...) can escape the intended --subjects argument and execute arbitrary commands when this rule runs. Since Snakemake passes this formatted string to the shell without quoting, an attacker controlling subjects could achieve code execution under the Snakemake user account. To remediate, ensure the subjects value is treated as data only (e.g., passed as a properly quoted argument or through a non-shell execution path) rather than concatenated into a shell command.

Copilot uses AI. Check for mistakes.
"""


rule tsnr_plots:
"""Generate tSNR QA plots from pre-computed tSNR volumes."""
input:
script=f"{SCRIPTS_DIR}/qa-plot-tsnr.py",
tsnr_computed=f"{TSNR_DIR}/.tsnr_computed.done",
output:
done=touch(f"{TSNR_DIR}/.tsnr_plots.done"),
log:
f"{TSNR_DIR}/logs/tsnr_plots.log",
params:
subjects=get_subjects_arg(),
shell:
"""
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tsnr_plots rule writes its log to {TSNR_DIR}/logs/... but does not ensure that directory exists. If compute_tsnr was run previously (so .tsnr_computed.done exists) but the logs directory is missing, this rule will fail when tee tries to open the log file. Add a mkdir -p for the logs directory (and, if needed, the output directory) in this rule’s shell command.

Suggested change
"""
"""
mkdir -p {TSNR_DIR}/logs

Copilot uses AI. Check for mistakes.
mkdir -p {TSNR_DIR}/logs
{PYTHON} {input.script} {params.subjects} 2>&1 | tee {log}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, params.subjects is again interpolated unescaped into a shell command, which allows shell metacharacters in the subjects config to terminate the Python invocation and run arbitrary additional commands. Because this rule may be invoked with user-specified --config subjects=..., a malicious value can lead to code execution as the Snakemake user. This should be fixed by avoiding raw string concatenation of subjects into the shell command and instead passing it as a properly quoted argument or through a safer execution mechanism.

Copilot uses AI. Check for mistakes.
"""


# =============================================================================
# ISC Pipeline
# =============================================================================
# ISC plots require pre-computed ISC data

rule compute_isc:
"""Compute inter-subject correlation for visualmemory task."""
input:
script=f"{SCRIPTS_DIR}/qa-save-isc.py",
output:
done=touch(f"{ISC_DIR}/.isc_computed.done"),
log:
f"{ISC_DIR}/logs/compute_isc.log",
threads:
workflow.cores
shell:
"""
mkdir -p {ISC_DIR}/logs
{PYTHON} {input.script} --n-jobs {threads} 2>&1 | tee {log}
"""


rule isc_plots:
"""Generate ISC visualization plots."""
input:
script=f"{SCRIPTS_DIR}/qa-plot-isc.py",
isc_computed=f"{ISC_DIR}/.isc_computed.done",
output:
done=touch(f"{ISC_DIR}/.isc_plots.done"),
log:
f"{ISC_DIR}/logs/isc_plots.log",
shell:
"""
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The isc_plots rule writes its log to {ISC_DIR}/logs/... but does not create that directory. If the logs directory is missing, tee will fail. Create {ISC_DIR}/logs in the shell command (and/or ensure it exists via a dedicated directory rule).

Suggested change
"""
"""
mkdir -p {ISC_DIR}/logs

Copilot uses AI. Check for mistakes.
mkdir -p {ISC_DIR}/logs
{PYTHON} {input.script} 2>&1 | tee {log}
"""


# =============================================================================
# HTML Reports (optional, after plots)
# =============================================================================

rule motion_report:
"""Generate HTML reports for motion QA."""
input:
script=f"{SCRIPTS_DIR}/qa-generate-html-reports-motion.py",
plots_done=f"{MOTION_DIR}/.motion_plots.done",
output:
done=touch(f"{MOTION_DIR}/.motion_reports.done"),
log:
f"{MOTION_DIR}/logs/motion_reports.log",
params:
subjects=get_subjects_arg(),
shell:
"""
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motion_report rule writes its log to {MOTION_DIR}/logs/... but does not ensure that directory exists. Add a mkdir -p for the logs directory in this rule to avoid failures when running the rule in isolation (e.g., after cleaning logs but keeping .motion_plots.done).

Suggested change
"""
"""
mkdir -p {MOTION_DIR}/logs

Copilot uses AI. Check for mistakes.
mkdir -p {MOTION_DIR}/logs
{PYTHON} {input.script} {params.subjects} 2>&1 | tee {log}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The params.subjects value is used directly in this shell command for the motion HTML reports, so a crafted subjects config string containing shell metacharacters can inject arbitrary commands. Since Snakemake does not automatically escape this interpolation, any component that passes attacker-controlled subjects (for example via --config) could be abused for code execution. To address this, ensure subjects is passed as a data argument rather than concatenated into the shell command string, using proper quoting or non-shell execution.

Copilot uses AI. Check for mistakes.
"""


rule tsnr_report:
"""Generate HTML reports for tSNR QA."""
input:
script=f"{SCRIPTS_DIR}/qa-generate-html-reports-tsnr.py",
plots_done=f"{TSNR_DIR}/.tsnr_plots.done",
output:
done=touch(f"{TSNR_DIR}/.tsnr_reports.done"),
log:
f"{TSNR_DIR}/logs/tsnr_reports.log",
params:
subjects=get_subjects_arg(),
shell:
"""
mkdir -p {TSNR_DIR}/logs
{PYTHON} {input.script} {params.subjects} 2>&1 | tee {log}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this tSNR HTML report rule, params.subjects is interpolated directly into the shell command, enabling shell injection if subjects contains characters like ;, &&, or backticks. An attacker who can influence subjects via configuration or CLI could execute arbitrary commands with the Snakemake user's privileges. This should be hardened by treating subjects purely as an argument (properly quoted or passed via a safer interface) instead of splicing it into the shell command string.

Copilot uses AI. Check for mistakes.
"""


# =============================================================================
# Summary Statistics (optional)
# =============================================================================

rule tsnr_summary:
"""Print tSNR summary statistics."""
input:
script=f"{SCRIPTS_DIR}/print-tsnr-summary.py",
tsnr_computed=f"{TSNR_DIR}/.tsnr_computed.done",
output:
summary=f"{TSNR_DIR}/tsnr_summary.txt",
log:
f"{TSNR_DIR}/logs/tsnr_summary.log",
params:
subjects=get_subjects_arg(),
shell:
"""
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tsnr_summary rule logs to {TSNR_DIR}/logs/... but doesn’t ensure the logs directory exists. If the directory is absent, tee will fail and the rule won’t run. Add mkdir -p {TSNR_DIR}/logs (and ensure {TSNR_DIR} exists) before invoking the script.

Suggested change
"""
"""
mkdir -p {TSNR_DIR}/logs

Copilot uses AI. Check for mistakes.
mkdir -p {TSNR_DIR}/logs
{PYTHON} {input.script} {params.subjects} 2>&1 | tee {log}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tSNR summary rule also interpolates untrusted params.subjects into a shell command, which can be exploited for shell injection if subjects is supplied with embedded shell syntax. Because Snakemake formats this string into the shell without additional escaping, a malicious subjects value can break out of the intended argument and run arbitrary commands. To mitigate, avoid embedding subjects directly in the shell string and instead pass it as a safely quoted parameter or via a non-shell invocation path.

Copilot uses AI. Check for mistakes.
"""


rule motion_summary:
"""Print motion summary statistics."""
input:
script=f"{SCRIPTS_DIR}/print-motion-summary.py",
output:
summary=f"{MOTION_DIR}/motion_summary.txt",
log:
f"{MOTION_DIR}/logs/motion_summary.log",
params:
subjects=get_subjects_arg(),
shell:
"""
mkdir -p {MOTION_DIR}/logs
{PYTHON} {input.script} {params.subjects} 2>&1 | tee {log}
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motion summary rule uses params.subjects directly in the shell command, so shell metacharacters in the subjects config can terminate the Python command and execute arbitrary additional commands. Since this value may be controlled via --config subjects=..., an attacker could gain code execution under the Snakemake user account. This should be corrected by ensuring subjects is passed as a properly escaped argument or through a safer execution mechanism instead of raw shell string interpolation.

Copilot uses AI. Check for mistakes.
"""


# =============================================================================
# Cleanup Rules
# =============================================================================

rule clean_motion:
"""Remove motion QA outputs (regenerate with snakemake motion_plots)."""
shell:
"rm -rf {MOTION_DIR}"


rule clean_tsnr:
"""Remove tSNR QA outputs (regenerate with snakemake tsnr_plots)."""
shell:
"rm -rf {TSNR_DIR}"


rule clean_isc:
"""Remove ISC QA outputs (regenerate with snakemake isc_plots)."""
shell:
"rm -rf {ISC_DIR}"


rule clean_all:
"""Remove all QA outputs."""
shell:
"rm -rf {QA_DIR}"
14 changes: 14 additions & 0 deletions workflow/config/config.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Snakemake configuration for QA pipelines
#
# This config controls which pipelines run and their parameters.
# The actual data paths are read from the hyperface QA config
# (src/hyperface/assets/qa_config.yaml).
Comment on lines +4 to +5
Copy link

Copilot AI Jan 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This config file says the workflow’s data paths are read from src/hyperface/assets/qa_config.yaml, but the Snakefile currently hardcodes the paths instead. Either update this comment to match reality or update the Snakefile to derive its directories from the QA config to avoid confusing users.

Suggested change
# The actual data paths are read from the hyperface QA config
# (src/hyperface/assets/qa_config.yaml).
# Note: data paths (e.g., input/output directories) are currently
# configured directly in the Snakefile, not via src/hyperface/assets/qa_config.yaml.

Copilot uses AI. Check for mistakes.

# Which pipelines to run with the default 'all' target
run_motion: true
run_tsnr: true
run_isc: true

# Subject filtering (space-separated list, leave empty for all subjects)
# Example: "sub-001 sub-002 sub-003"
subjects: ""