pyprideap (Python PRIDE Affinity Proteomics) is a library for reading, validating, and analyzing affinity proteomics datasets from the PRIDE Affinity Archive (PAD).
It supports the Olink (Explore, Explore HT, Target, Reveal) and SomaScan platforms.
Install pyprideap directly from PyPI:
pip install pyprideap
Or from source:
pip install "pyprideap[all] @ git+https://github.com/PRIDE-Archive/pyprideap.git"
With plotting and QC report support:
pip install "pyprideap[plots]"
With statistical testing:
pip install "pyprideap[all]"

import pyprideap as pp
# Auto-detect format from file extension and content
dataset = pp.read("olink_npx.csv")
dataset = pp.read("raw_data.adat")
dataset = pp.read("data.parquet")
# Force platform when auto-detection is ambiguous
dataset = pp.read("ambiguous.csv", platform="olink")
dataset = pp.read("ambiguous.csv", platform="somascan")

dataset = pp.read("olink_npx.csv")
pp.qc_report(dataset, "my_report.html")
The report includes a dataset summary table with traffic-light quality indicators and interactive plots: expression distributions, PCA/t-SNE, LOD analysis, sample correlation, completeness, CV distributions, and more. All plots are rendered with Plotly and include help tooltips explaining how to interpret each visualization.
Generate individual plot files for embedding:
pp.qc_report_split(dataset, "plots_dir/")

results = pp.validate(dataset)
for r in results:
    print(f"[{r.level.value}] {r.rule}: {r.message}")

stats = pp.compute_stats(dataset)
print(stats.summary())

client = pp.PrideClient()
project = client.get_project("PAD000001")
files = client.list_files("PAD000001")
urls = client.get_download_urls("PAD000001")

pyprideap includes a CLI (powered by Click) for generating QC reports:
# From a local file (format auto-detected)
pyprideap report data.npx.csv
pyprideap report data.parquet -o my_report.html
# Force platform type
pyprideap report data.csv -p olink
pyprideap report data.adat -p somascan
# From a PRIDE accession (downloads data automatically)
pyprideap report -a PAD000001
# Generate individual plot files instead of a single report
pyprideap report data.npx.csv --split -o plots_dir/
# Include SDRF metadata for volcano plots
pyprideap report data.npx.csv --sdrf samples.sdrf.tsv
# Enable verbose logging (shows format detection, LOD method, PCA variance, etc.)
pyprideap report data.npx.csv -v
# List proteins above LOD from a local file
pyprideap proteins-above-lod data.npx.csv
pyprideap proteins-above-lod data.npx.csv -t 80 -o proteins.txt
# List proteins above LOD from a PRIDE accession
pyprideap proteins-above-lod -a PAD000001
Or via python -m:
python -m pyprideap report data.npx.csv

Use -v / --verbose to enable detailed debug logging. This shows progress through each processing stage:
Reading olink_npx.csv...
08:12:01 [DEBUG] pyprideap.io.readers.registry: Format detected: olink_csv
08:12:01 [DEBUG] pyprideap.io.readers.olink_csv: Sample key selected: SampleID
08:12:01 [DEBUG] pyprideap.io.readers.olink_csv: Pivot shape: 20 samples x 1470 features
20 samples, 1470 features (olink_explore)
08:12:01 [DEBUG] pyprideap.processing.lod: LOD method selected: REPORTED
08:12:02 [DEBUG] pyprideap.viz.qc.compute: Computing PCA...
08:12:02 [DEBUG] pyprideap.viz.qc.compute: PCA: variance explained=[0.42, 0.18]
...
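When pyprideap is called from Python rather than through the CLI, comparable output can probably be obtained with the standard logging module. The sketch below assumes the library logs under the pyprideap logger namespace, as the logger names in the output above suggest. The helper name enable_debug_logging and the format string are illustrative guesses that reproduce the layout shown; they are not taken from the pyprideap source.

```python
import logging
import sys


def enable_debug_logging(logger_name="pyprideap", stream=None):
    """Attach a DEBUG-level stream handler to a logger namespace.

    The format mimics the CLI output shown above. It is an assumption,
    not pyprideap's actual logging configuration.
    """
    handler = logging.StreamHandler(stream if stream is not None else sys.stderr)
    handler.setFormatter(
        logging.Formatter(
            fmt="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
            datefmt="%H:%M:%S",
        )
    )
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    enable_debug_logging()
    # Child loggers propagate their records to the handler on "pyprideap".
    logging.getLogger("pyprideap.io.readers.registry").debug("Format detected: olink_csv")
```

Because child loggers propagate to their parent, a single handler on the top-level namespace is enough to surface messages from every submodule.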
The HTML report is a self-contained, interactive document with a sidebar table of contents. It includes:
| Section | Plots |
|---|---|
| Quality Overview | LOD source comparison, QC x LOD stacked bar |
| Signal & Distribution | Per-sample expression histograms, protein detectability |
| Sample Completeness | Per-sample above/below LOD stacked bars |
| Missing Frequency Distribution | Per-protein missing rate histogram with 30% threshold |
| Sample Relationships | PCA / t-SNE (toggle switch), sample correlation heatmap, clustered expression heatmap |
| Normalization QC | Hybridization control scale (SomaScan) |
| Variability | CV distribution, intra/inter-plate CV |
| Assay QC | IQR/Median outlier detection, UniProt duplicate mapping (Olink) |
| SomaScan QC | ColCheck pass/flag summary (SomaScan) |
| Differential Expression | Volcano plots per variable (requires SDRF metadata) |
Each plot has a ? help button with guidance on interpretation.
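To make the Missing Frequency Distribution row concrete, here is a minimal sketch of how a per-protein missing rate could be computed and compared with the 30% threshold. It assumes "missing" means a value that is absent or below that protein's LOD, and that a protein is flagged when its rate is strictly above the threshold. The function names and toy values are hypothetical and do not reflect pyprideap internals.

```python
def missing_frequency(values, lod):
    """Fraction of samples whose value is absent (None) or below the LOD."""
    if not values:
        raise ValueError("need at least one sample")
    missing = sum(1 for v in values if v is None or v < lod)
    return missing / len(values)


def flag_proteins(expression, lods, threshold=0.30):
    """Return {protein: (missing_rate, exceeds_threshold)}.

    expression maps a protein ID to its list of per-sample values;
    lods maps a protein ID to its LOD on the same scale.
    """
    result = {}
    for protein, values in expression.items():
        rate = missing_frequency(values, lods[protein])
        result[protein] = (rate, rate > threshold)
    return result


if __name__ == "__main__":
    # Toy values for illustration only.
    expression = {
        "P1": [2.1, 3.4, 0.2, 2.8, 3.0],
        "P2": [0.1, None, 0.3, 1.9, 0.2],
    }
    lods = {"P1": 0.5, "P2": 0.5}
    for protein, (rate, flagged) in flag_proteins(expression, lods).items():
        print(f"{protein}: missing={rate:.0%} flagged={flagged}")
```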
Reports automatically detect when loaded inside an <iframe> and switch to an embedded mode that hides the header, sidebar, and footer:
<iframe
src="my_report.html"
style="width: 100%; border: none; min-height: 600px;"
id="qc-report">
</iframe>
<script>
// Auto-resize iframe to fit content
window.addEventListener('message', function(e) {
if (e.data && e.data.type === 'pride-qc-resize') {
document.getElementById('qc-report').style.height = e.data.height + 'px';
}
});
</script>
The embedded report posts pride-qc-resize messages with the document height, allowing the parent page to resize the iframe automatically. The CSS class pride-embedded is added to the body, which:
- Removes the sidebar navigation, header, and footer
- Makes the background transparent
- Removes card shadows for a seamless look
pyprideap can read SDRF (Sample and Data Relationship Format) files and merge sample metadata into datasets:
from pyprideap.io.readers.sdrf import read_sdrf, merge_sdrf, get_grouping_columns
# Read and parse an SDRF file
sdrf = read_sdrf("samples.sdrf.tsv")
# Merge SDRF metadata into an existing dataset
dataset = pp.read("olink_npx.csv")
dataset = merge_sdrf(dataset, sdrf)
# Identify columns suitable for differential expression grouping
group_cols = get_grouping_columns(sdrf)
# e.g. ["disease", "sex", "treatment"]

Column names are automatically shortened from the full SDRF syntax (e.g. characteristics[disease] becomes disease). Duplicate column names are disambiguated with numeric suffixes.
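The renaming rule can be illustrated with a small standalone sketch. The function shorten_sdrf_columns is hypothetical, and the exact suffix format (disease, disease_2, ...) is an assumption; the real pyprideap behaviour may differ in detail.

```python
import re
from collections import Counter

# Matches "prefix[name]" and captures the bracketed name.
_BRACKETED = re.compile(r"^[^\[\]]+\[(.+)\]$")


def shorten_sdrf_columns(columns):
    """Shorten 'characteristics[disease]' to 'disease' and de-duplicate.

    The first occurrence keeps the short name; later occurrences get a
    numeric suffix. The suffix format is assumed for illustration.
    """
    seen = Counter()
    shortened = []
    for col in columns:
        match = _BRACKETED.match(col.strip())
        name = match.group(1).strip() if match else col.strip()
        seen[name] += 1
        shortened.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return shortened


if __name__ == "__main__":
    header = [
        "source name",
        "characteristics[disease]",
        "characteristics[sex]",
        "factor value[disease]",
    ]
    print(shorten_sdrf_columns(header))
```

In this sketch, columns without brackets (such as source name) pass through unchanged, and the second disease column becomes disease_2.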
| Format | Platform | Function |
|---|---|---|
| .npx.csv | Olink Explore / Target | pp.read() |
| .parquet | Olink Explore HT | pp.read() |
| .xlsx | Olink | pp.read() |
| .adat | SomaScan | pp.read() |
| .csv (SomaScan) | SomaScan | pp.read() |
| .sdrf.tsv | Any | read_sdrf() |
All readers produce an AffinityDataset with a unified structure regardless of input format.
@dataclass
class AffinityDataset:
    platform: Platform        # OLINK_EXPLORE, OLINK_EXPLORE_HT, SOMASCAN, etc.
    samples: pd.DataFrame     # Sample metadata (SampleID, SampleType, QC flags, ...)
    features: pd.DataFrame    # Protein/aptamer annotations (OlinkID, UniProt, Panel, ...)
    expression: pd.DataFrame  # Quantification matrix (NPX or RFU)
    metadata: dict            # Platform-specific extras

pyprideap supports multiple LOD sources with automatic fallback:
- Reported LOD — from the LOD column in the data file
- NCLOD — computed from negative control samples (requires >= 10 controls)
- FixedLOD — pre-computed Olink reference values (bundled for Explore, Explore HT, Reveal)
- eLOD — estimated from buffer samples using MAD formula (SomaScan)
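The fallback idea and the MAD-based estimate can be sketched as follows. This is an illustration under stated assumptions, not pyprideap's implementation: the fallback order simply follows the list above, only the REPORTED label appears in the library's log output (the other labels are made up here), and the constants k=3.3 and 1.4826 are commonly quoted defaults for MAD-based LOD estimates that should be checked against the platform documentation.

```python
import statistics


def select_lod_method(has_reported_lod, n_negative_controls, has_fixed_lod, n_buffer_samples):
    """Pick the first usable LOD source, in the order listed above.

    Only the ">= 10 negative controls" rule comes from the text; the
    rest of the ordering logic is an assumption.
    """
    if has_reported_lod:
        return "REPORTED"
    if n_negative_controls >= 10:
        return "NCLOD"
    if has_fixed_lod:
        return "FIXED"
    if n_buffer_samples > 0:
        return "ELOD"
    return None


def elod_from_buffers(buffer_values, k=3.3, mad_scale=1.4826):
    """Estimate an LOD as median + k * scaled MAD of buffer measurements."""
    values = list(buffer_values)
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med + k * mad_scale * mad


if __name__ == "__main__":
    # Nine negative controls are too few for NCLOD, so the sketch falls back.
    print(select_lod_method(False, 9, True, 0))
    # Toy buffer RFU values for illustration only.
    print(elod_from_buffers([10, 12, 11, 13, 9]))
```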
With pip install "pyprideap[stats]":
# Per-protein t-test between groups
results = pp.ttest(dataset, group_var="Treatment")
# Wilcoxon rank-sum test
results = pp.wilcoxon(dataset, group_var="Treatment")
# ANOVA with covariates
results = pp.anova(dataset, group_var="Treatment", covariates=["Age", "Sex"])
# Post-hoc pairwise comparisons
posthoc = pp.anova_posthoc(dataset, group_var="Treatment")

# Bridge normalization (combining two runs with shared samples)
normalized = pp.bridge_normalize(dataset1, dataset2, bridge_samples=["S1", "S2"])
# Subset normalization using reference proteins
normalized = pp.subset_normalize(dataset1, dataset2, reference_proteins=["P1", "P2"])
# Reference median normalization
normalized = pp.reference_median_normalize(dataset, reference_medians=medians)
# Select optimal bridge samples
bridges = pp.select_bridge_samples(dataset, n=8)
# Assess bridgeability between product versions
report = pp.assess_bridgeability(dataset1, dataset2)

Additional normalization methods are available via direct import:
from pyprideap.processing.normalization import (
lift_somascan, # Cross-version SomaScan calibration (5k ↔ 7k ↔ 11k)
quantile_smooth_normalize, # Quantile normalization with smoothing
scale_analytes, # Per-analyte multiplicative scaling
normalize_n, # Multi-step normalization pipeline
)

Platform-specific preprocessing pipelines bundle common QC and filtering steps:
from pyprideap.processing.olink import preprocess_olink
from pyprideap.processing.somascan import preprocess_somascan
# Olink: filter controls, detect outliers, LOD filtering, UniProt dedup
dataset, report = preprocess_olink(
dataset,
filter_controls=True,
filter_qc_outliers=True,
filter_lod=False,
)
# SomaScan: filter features/controls, RowCheck QC, outlier detection
dataset, report = preprocess_somascan(
dataset,
filter_features=True,
filter_controls=True,
filter_rowcheck=True,
)
print(report.summary())

# Randomize samples to plates
plate_assignment = pp.randomize_plates(
samples=sample_df,
n_plates=4,
keep_paired="SubjectID", # keep longitudinal samples on same plate
seed=42,
)

Apache License 2.0