pyprideap

pyprideap (Python PRIDE Affinity Proteomics) is a library for reading, validating, and analyzing affinity proteomics datasets from the PRIDE Affinity Archive (PAD).

Supports Olink (Explore, Explore HT, Target, Reveal) and SomaScan platforms.

Installation

Install pyprideap directly from PyPI:

pip install pyprideap

Or from source:

pip install "pyprideap[all] @ git+https://github.com/PRIDE-Archive/pyprideap.git"

With plotting and QC report support:

pip install "pyprideap[plots]"

With all optional extras, including statistical testing:

pip install "pyprideap[all]"

Quick Start

Read a dataset

import pyprideap as pp

# Auto-detect format from file extension and content
dataset = pp.read("olink_npx.csv")
dataset = pp.read("raw_data.adat")
dataset = pp.read("data.parquet")

# Force platform when auto-detection is ambiguous
dataset = pp.read("ambiguous.csv", platform="olink")
dataset = pp.read("ambiguous.csv", platform="somascan")

Generate a QC report

dataset = pp.read("olink_npx.csv")
pp.qc_report(dataset, "my_report.html")

The report includes a dataset summary table with traffic-light quality indicators and interactive plots: expression distributions, PCA/t-SNE, LOD analysis, sample correlation, completeness, CV distributions, and more. All plots are rendered with Plotly and include help tooltips explaining how to interpret each visualization.

Generate individual plot files for embedding:

pp.qc_report_split(dataset, "plots_dir/")

Validate against PRIDE-AP guidelines

results = pp.validate(dataset)

for r in results:
    print(f"[{r.level.value}] {r.rule}: {r.message}")

Compute statistics

stats = pp.compute_stats(dataset)
print(stats.summary())

Fetch data from PRIDE Archive

client = pp.PrideClient()
project = client.get_project("PAD000001")
files = client.list_files("PAD000001")
urls = client.get_download_urls("PAD000001")

Command-Line Interface

pyprideap includes a CLI (powered by Click) for generating QC reports:

# From a local file (format auto-detected)
pyprideap report data.npx.csv
pyprideap report data.parquet -o my_report.html

# Force platform type
pyprideap report data.csv -p olink
pyprideap report data.adat -p somascan

# From a PRIDE accession (downloads data automatically)
pyprideap report -a PAD000001

# Generate individual plot files instead of a single report
pyprideap report data.npx.csv --split -o plots_dir/

# Include SDRF metadata for volcano plots
pyprideap report data.npx.csv --sdrf samples.sdrf.tsv

# Enable verbose logging (shows format detection, LOD method, PCA variance, etc.)
pyprideap report data.npx.csv -v

# List proteins above LOD from a local file
pyprideap proteins-above-lod data.npx.csv
pyprideap proteins-above-lod data.npx.csv -t 80 -o proteins.txt

# List proteins above LOD from a PRIDE accession
pyprideap proteins-above-lod -a PAD000001

Or via python -m:

python -m pyprideap report data.npx.csv

Verbose mode

Use -v / --verbose to enable detailed debug logging. This shows progress through each processing stage:

Reading olink_npx.csv...
08:12:01 [DEBUG] pyprideap.io.readers.registry: Format detected: olink_csv
08:12:01 [DEBUG] pyprideap.io.readers.olink_csv: Sample key selected: SampleID
08:12:01 [DEBUG] pyprideap.io.readers.olink_csv: Pivot shape: 20 samples x 1470 features
  20 samples, 1470 features (olink_explore)
08:12:01 [DEBUG] pyprideap.processing.lod: LOD method selected: REPORTED
08:12:02 [DEBUG] pyprideap.viz.qc.compute: Computing PCA...
08:12:02 [DEBUG] pyprideap.viz.qc.compute: PCA: variance explained=[0.42, 0.18]
...
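
The logger names in the output above (pyprideap.io.readers.registry, pyprideap.processing.lod, ...) follow Python's hierarchical logging convention. Assuming the library logs through the standard logging module under a pyprideap parent logger, a script could surface the same messages roughly as sketched below. The helper name enable_debug_logging is hypothetical and not part of pyprideap, and the format string is only chosen to resemble the CLI output.

```python
import io
import logging


def enable_debug_logging(stream=None, root_name="pyprideap"):
    """Attach a DEBUG-level handler to the `root_name` logger hierarchy.

    Hypothetical helper, not part of pyprideap: it assumes the library logs
    through standard `logging` loggers named after its modules.
    """
    handler = logging.StreamHandler(stream)  # writes to stderr when stream is None
    handler.setFormatter(
        logging.Formatter(
            fmt="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
            datefmt="%H:%M:%S",
        )
    )
    logger = logging.getLogger(root_name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger


# Demonstration: a child logger stands in for one of the library's modules.
buffer = io.StringIO()
enable_debug_logging(stream=buffer)
logging.getLogger("pyprideap.io.readers.registry").debug("Format detected: olink_csv")
captured = buffer.getvalue()
print(captured, end="")
```

Child loggers inherit the DEBUG level from the parent and propagate their records to its handler, which is why configuring the single parent logger is enough in this sketch.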

QC Report

The HTML report is a self-contained, interactive document with a sidebar table of contents. It includes:

Section	Plots
Quality Overview	LOD source comparison, QC x LOD stacked bar
Signal & Distribution	Per-sample expression histograms, protein detectability
Sample Completeness	Per-sample above/below LOD stacked bars
Missing Frequency Distribution	Per-protein missing rate histogram with 30% threshold
Sample Relationships	PCA / t-SNE (toggle switch), sample correlation heatmap, clustered expression heatmap
Normalization QC	Hybridization control scale (SomaScan)
Variability	CV distribution, intra/inter-plate CV
Assay QC	IQR/Median outlier detection, UniProt duplicate mapping (Olink)
SomaScan QC	ColCheck pass/flag summary (SomaScan)
Differential Expression	Volcano plots per variable (requires SDRF metadata)

Each plot has a ? help button with guidance on interpretation.
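
As an illustration of the Missing Frequency Distribution row, the per-protein missing rate can be sketched in plain Python. This is not pyprideap's implementation: it assumes that a value counts as missing when it is absent or below the protein's LOD, and that proteins are flagged when the rate is strictly above 30%. The function name and the toy values are made up for the example.

```python
def missing_frequency(values_by_protein, lod_by_protein):
    """Return {protein: fraction of samples that are absent or below LOD}."""
    rates = {}
    for protein, values in values_by_protein.items():
        lod = lod_by_protein[protein]
        # Assumption: None (absent) and values below LOD both count as missing.
        missing = sum(1 for v in values if v is None or v < lod)
        rates[protein] = missing / len(values)
    return rates


# Toy NPX values for four samples per protein (illustrative numbers only).
npx = {
    "P1": [2.1, 3.0, 2.8, 3.3],
    "P2": [0.2, 0.1, 1.9, None],
    "P3": [1.0, 0.4, 1.2, 1.5],
}
lod = {"P1": 1.0, "P2": 0.5, "P3": 0.5}

rates = missing_frequency(npx, lod)
flagged = sorted(p for p, r in rates.items() if r > 0.30)
print(rates)    # {'P1': 0.0, 'P2': 0.75, 'P3': 0.25}
print(flagged)  # ['P2']
```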

Embedding reports in web pages

Reports automatically detect when loaded inside an <iframe> and switch to an embedded mode that hides the header, sidebar, and footer:

<iframe
  src="my_report.html"
  style="width: 100%; border: none; min-height: 600px;"
  id="qc-report">
</iframe>

<script>
// Auto-resize iframe to fit content
window.addEventListener('message', function(e) {
  if (e.data && e.data.type === 'pride-qc-resize') {
    document.getElementById('qc-report').style.height = e.data.height + 'px';
  }
});
</script>

The embedded report posts pride-qc-resize messages with the document height, allowing the parent page to resize the iframe automatically. The CSS class pride-embedded is added to the body, which:

Removes the sidebar navigation, header, and footer
Makes the background transparent
Removes card shadows for a seamless look

SDRF Integration

pyprideap can read SDRF (Sample and Data Relationship Format) files and merge sample metadata into datasets:

from pyprideap.io.readers.sdrf import read_sdrf, merge_sdrf, get_grouping_columns

# Read and parse an SDRF file
sdrf = read_sdrf("samples.sdrf.tsv")

# Merge SDRF metadata into an existing dataset
dataset = pp.read("olink_npx.csv")
dataset = merge_sdrf(dataset, sdrf)

# Identify columns suitable for differential expression grouping
group_cols = get_grouping_columns(sdrf)
# e.g. ["disease", "sex", "treatment"]

Column names are automatically shortened from the full SDRF syntax (e.g. characteristics[disease] becomes disease). Duplicate column names are disambiguated with numeric suffixes.
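
The shortening rule described above can be sketched as follows. This is a stand-alone illustration rather than the code behind read_sdrf: the function name is hypothetical, and the exact suffix format for duplicates (here _1, _2, ...) is an assumption, since the README only says that numeric suffixes are used.

```python
import re

# Matches "prefix[inner]" and captures the text inside the brackets.
_BRACKET = re.compile(r"^[^\[\]]+\[(.+)\]$")


def shorten_sdrf_columns(columns):
    """Shorten 'characteristics[disease]' to 'disease' and de-duplicate names."""
    seen = {}
    result = []
    for col in columns:
        match = _BRACKET.match(col.strip())
        short = match.group(1).strip() if match else col.strip()
        count = seen.get(short, 0)
        seen[short] = count + 1
        # First occurrence keeps the bare name; repeats get _1, _2, ... (assumed format).
        result.append(short if count == 0 else f"{short}_{count}")
    return result


columns = [
    "source name",
    "characteristics[disease]",
    "characteristics[sex]",
    "factor value[disease]",
    "comment[data file]",
]
shortened = shorten_sdrf_columns(columns)
print(shortened)
# ['source name', 'disease', 'sex', 'disease_1', 'data file']
```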

Supported File Formats

Format	Platform	Function
`.npx.csv`	Olink Explore / Target	`pp.read()`
`.parquet`	Olink Explore HT	`pp.read()`
`.xlsx`	Olink	`pp.read()`
`.adat`	SomaScan	`pp.read()`
`.csv` (SomaScan)	SomaScan	`pp.read()`
`.sdrf.tsv`	Any	`read_sdrf()`

All readers produce an AffinityDataset with a unified structure regardless of input format.

Data Model

@dataclass
class AffinityDataset:
    platform: Platform          # OLINK_EXPLORE, OLINK_EXPLORE_HT, SOMASCAN, etc.
    samples: pd.DataFrame       # Sample metadata (SampleID, SampleType, QC flags, ...)
    features: pd.DataFrame      # Protein/aptamer annotations (OlinkID, UniProt, Panel, ...)
    expression: pd.DataFrame    # Quantification matrix (NPX or RFU)
    metadata: dict              # Platform-specific extras

LOD (Limit of Detection)

pyprideap supports multiple LOD sources with automatic fallback:

Reported LOD — from the LOD column in the data file
NCLOD — computed from negative control samples (requires >= 10 controls)
FixedLOD — pre-computed Olink reference values (bundled for Explore, Explore HT, Reveal)
eLOD — estimated from buffer samples using MAD formula (SomaScan)
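
To make the eLOD entry concrete, here is a minimal sketch of a MAD-based estimate for a single analyte. It assumes the commonly used form median + 3.3 × 1.4826 × MAD over buffer-sample RFU values. The README does not state which constants pyprideap uses, so k and mad_scale are assumptions exposed as parameters, and the function name is hypothetical.

```python
from statistics import median


def estimate_elod(buffer_rfu, k=3.3, mad_scale=1.4826):
    """Estimate an LOD from buffer-sample RFU values for one analyte.

    Assumed formula: median + k * mad_scale * MAD, where MAD is the median
    absolute deviation from the median.
    """
    center = median(buffer_rfu)
    mad = median(abs(x - center) for x in buffer_rfu)
    return center + k * mad_scale * mad


# Toy buffer measurements (illustrative numbers only).
buffer_rfu = [10.0, 12.0, 11.0, 13.0, 9.0]
elod = estimate_elod(buffer_rfu)
print(round(elod, 5))  # 15.89258
```

The factor 1.4826 rescales the MAD so that it is comparable to a standard deviation for normally distributed data, which makes the estimate robust to a few outlying buffer wells.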

Statistical Analysis

Statistical tests require the stats extra (pip install "pyprideap[stats]"):

# Per-protein t-test between groups
results = pp.ttest(dataset, group_var="Treatment")

# Wilcoxon rank-sum test
results = pp.wilcoxon(dataset, group_var="Treatment")

# ANOVA with covariates
results = pp.anova(dataset, group_var="Treatment", covariates=["Age", "Sex"])

# Post-hoc pairwise comparisons
posthoc = pp.anova_posthoc(dataset, group_var="Treatment")

Normalization

# Bridge normalization (combining two runs with shared samples)
normalized = pp.bridge_normalize(dataset1, dataset2, bridge_samples=["S1", "S2"])

# Subset normalization using reference proteins
normalized = pp.subset_normalize(dataset1, dataset2, reference_proteins=["P1", "P2"])

# Reference median normalization
normalized = pp.reference_median_normalize(dataset, reference_medians=medians)

# Select optimal bridge samples
bridges = pp.select_bridge_samples(dataset, n=8)

# Assess bridgeability between product versions
report = pp.assess_bridgeability(dataset1, dataset2)
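
The idea behind bridge normalization can be illustrated without the library. The sketch below assumes log-scale values (such as NPX) and a per-protein offset equal to the median difference between the two runs across the shared bridge samples. Whether pp.bridge_normalize computes its adjustment exactly this way is not stated here, and all names in the block are hypothetical.

```python
from statistics import median


def bridge_adjustment(run1, run2, bridge_samples):
    """Per-protein offset: median of (run1 - run2) over the shared bridge samples."""
    proteins = set(run1[bridge_samples[0]]) & set(run2[bridge_samples[0]])
    return {
        p: median(run1[s][p] - run2[s][p] for s in bridge_samples)
        for p in proteins
    }


def apply_adjustment(run, offsets):
    """Add each protein's offset to every sample; proteins without an offset are unchanged."""
    return {
        sample: {p: v + offsets.get(p, 0.0) for p, v in values.items()}
        for sample, values in run.items()
    }


# Toy data: {sample: {protein: log-scale value}}; S1 and S2 were measured in both runs.
run1 = {"S1": {"P1": 5.0, "P2": 3.0}, "S2": {"P1": 6.0, "P2": 4.0}}
run2 = {
    "S1": {"P1": 4.0, "P2": 3.5},
    "S2": {"P1": 5.5, "P2": 4.5},
    "X9": {"P1": 2.0, "P2": 1.0},
}

offsets = bridge_adjustment(run1, run2, ["S1", "S2"])
adjusted = apply_adjustment(run2, offsets)
print(offsets["P1"], offsets["P2"])  # 0.75 -0.5
print(adjusted["X9"])                # {'P1': 2.75, 'P2': 0.5}
```

Using the median rather than the mean keeps a single discordant bridge sample from dominating the offset.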

Additional normalization methods are available via direct import:

from pyprideap.processing.normalization import (
    lift_somascan,                # Cross-version SomaScan calibration (5k ↔ 7k ↔ 11k)
    quantile_smooth_normalize,    # Quantile normalization with smoothing
    scale_analytes,               # Per-analyte multiplicative scaling
    normalize_n,                  # Multi-step normalization pipeline
)

Preprocessing Pipelines

Platform-specific preprocessing pipelines bundle common QC and filtering steps:

from pyprideap.processing.olink import preprocess_olink
from pyprideap.processing.somascan import preprocess_somascan

# Olink: filter controls, detect outliers, LOD filtering, UniProt dedup
dataset, report = preprocess_olink(
    dataset,
    filter_controls=True,
    filter_qc_outliers=True,
    filter_lod=False,
)

# SomaScan: filter features/controls, RowCheck QC, outlier detection
dataset, report = preprocess_somascan(
    dataset,
    filter_features=True,
    filter_controls=True,
    filter_rowcheck=True,
)

print(report.summary())

Experimental Design

# Randomize samples to plates
plate_assignment = pp.randomize_plates(
    samples=sample_df,
    n_plates=4,
    keep_paired="SubjectID",  # keep longitudinal samples on same plate
    seed=42,
)
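
The constraint expressed by keep_paired can be sketched independently of the library: subjects, not individual samples, are shuffled and then placed on the least-filled plate, so all samples of one subject stay together. This is an illustrative stand-in with hypothetical names, not the algorithm used by pp.randomize_plates, and it does not enforce a plate capacity.

```python
import random
from collections import defaultdict


def randomize_to_plates(samples, n_plates, seed=None):
    """Assign (sample_id, subject_id) pairs to plates, keeping each subject together."""
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for sample_id, subject_id in samples:
        by_subject[subject_id].append(sample_id)

    subjects = sorted(by_subject)  # fixed starting order so the seeded shuffle is reproducible
    rng.shuffle(subjects)

    plate_sizes = [0] * n_plates
    assignment = {}
    for subject in subjects:
        # Place the whole subject on the currently least-filled plate (ties: lowest index).
        plate = plate_sizes.index(min(plate_sizes))
        for sample_id in by_subject[subject]:
            assignment[sample_id] = plate + 1  # 1-based plate numbers
        plate_sizes[plate] += len(by_subject[subject])
    return assignment


# Six subjects with two time points each (toy identifiers).
samples = [(f"{subj}_t{t}", subj) for subj in "ABCDEF" for t in (0, 1)]
assignment = randomize_to_plates(samples, n_plates=3, seed=42)
for sample_id in sorted(assignment):
    print(sample_id, "-> plate", assignment[sample_id])
```

Calling the function again with the same seed returns the same layout, which is useful when the plate map has to be regenerated later.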

License

Apache License 2.0
