Discussion: replace ConsensusID with a more robust multi-engine rescoring path #707

@ypriverol

Context

In the multi-engine search path (e.g. Comet + MSGF+, optionally + Sage), quantms currently rescores each engine's PSMs with PercolatorAdapter, merges the per-engine PSMs with CONSENSUSID, and then applies FDR. This fans out into one CONSENSUSID job per file and requires IDRipper whenever the rescoring step was merged per sample or per project.

With ms2features_enable=true (quantms-rescoring via ms2rescore), every engine's PSMs receive the same run-invariant features (MS²PIP, DeepLC, precursor-ppm). Those features are numerically comparable across engines by construction, which raises the question: can we simplify the multi-engine path by dropping CONSENSUSID and merging engines directly into a single PercolatorAdapter call?

This issue is a discussion thread to decide which direction to take.

What we learned from the OpenMS source

Short version: the naive path IDMerger → PercolatorAdapter is not supported by OpenMS.

  • IDMerger is a dumb concatenator. It preserves per-PeptideHit meta-values verbatim as a sparse flat-map. A Comet PSM carries xcorr and no hyperscore; a Sage PSM carries hyperscore and no xcorr. There is no unified feature schema in the merged idXML — just ragged per-hit meta-values (IDMergerAlgorithm.cpp:170-237).
  • PercolatorAdapter explicitly refuses multi-engine input (PercolatorAdapter.cpp:493-498):
    if (se1 != se2) {
      OPENMS_LOG_ERROR << "... Use TOPP_PSMFeatureExtractor to merge ...";
      return INPUT_FILE_CORRUPT;
    }

  • Even if you bypassed that check, PercolatorInfile::store silently drops any PSM missing any feature column (PercolatorInfile.cpp:531-566). Ragged rows never reach the percolator binary.
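
To make the "ragged rows" problem concrete, here is a minimal sketch in Python. It is an illustrative model only, not OpenMS code: plain dicts stand in for PeptideHit meta-values, and the helper names (`concatenate`, `feature_columns`, `drop_ragged`) are hypothetical. It mimics a verbatim concatenation followed by a writer that drops any PSM missing any feature column.

```python
# Illustrative model only: dicts stand in for per-PeptideHit meta-values.
comet_psm = {"engine": "Comet", "peptide": "PEPTIDEK",
             "features": {"xcorr": 2.9, "precursor_ppm": 1.2}}
sage_psm = {"engine": "Sage", "peptide": "PEPTIDEK",
            "features": {"hyperscore": 31.4, "precursor_ppm": -0.8}}


def concatenate(*psm_lists):
    """Concatenate PSM lists verbatim, without harmonising the features."""
    return [psm for psm_list in psm_lists for psm in psm_list]


def feature_columns(psms):
    """Union of all feature names seen across the merged PSMs."""
    return sorted({name for psm in psms for name in psm["features"]})


def drop_ragged(psms, columns):
    """Keep only PSMs that carry a value for every feature column."""
    return [psm for psm in psms
            if all(col in psm["features"] for col in columns)]


merged = concatenate([comet_psm], [sage_psm])
columns = feature_columns(merged)
kept = drop_ragged(merged, columns)
print(columns)    # union of columns across both engines
print(len(kept))  # no PSM carries every column, so nothing survives
```

Under these assumptions every PSM from every engine is dropped, because no single hit carries both `xcorr` and `hyperscore`.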

The sanctioned multi-engine path in OpenMS is:

IDMerger -merge_proteins_add_PSMs
  → PSMFeatureExtractor -multiple_search_engines -impute
  → PercolatorAdapter
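
As a rough sketch, the chain above could be assembled as argument lists for a process runner. The three tool-specific flags come from the pipeline above; the `-in`/`-out` options are assumed to be the usual TOPP input/output parameters, and the function name and intermediate file names are hypothetical.

```python
from typing import List


def build_multi_engine_commands(per_engine_idxml: List[str],
                                workdir: str = ".") -> List[List[str]]:
    """Build argv lists for the three-step chain (nothing is executed here)."""
    merged = f"{workdir}/merged.idXML"                # hypothetical file name
    features = f"{workdir}/merged_features.idXML"     # hypothetical file name
    rescored = f"{workdir}/merged_percolator.idXML"   # hypothetical file name
    return [
        # Step 1: concatenate the per-engine idXML files.
        ["IDMerger", "-in", *per_engine_idxml, "-out", merged,
         "-merge_proteins_add_PSMs"],
        # Step 2: build a unified feature table and impute missing values.
        ["PSMFeatureExtractor", "-in", merged, "-out", features,
         "-multiple_search_engines", "-impute"],
        # Step 3: rescore the merged, feature-complete PSMs.
        ["PercolatorAdapter", "-in", features, "-out", rescored],
    ]


commands = build_multi_engine_commands(["run1_comet.idXML", "run1_msgf.idXML"])
for cmd in commands:
    print(" ".join(cmd))
```

Each list could then be passed to `subprocess.run(cmd, check=True)`; in quantms the equivalent would live in Nextflow modules rather than in a Python driver.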

PSMFeatureExtractor -impute fills missing features with observed min/max (or float::max with -limit_imputation) — that's the "NaN handling" step.
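
The idea behind min/max imputation can be sketched as follows. This is not the OpenMS implementation: the rule used here (a missing value is filled with the worst observed value of that column, i.e. the minimum for higher-is-better scores and the maximum for lower-is-better scores) is an assumption made for illustration, as is the `impute_observed` helper.

```python
def impute_observed(rows, columns, higher_is_better):
    """Fill missing feature values with the worst observed value per column.

    Assumption for illustration: "worst" is the observed minimum when a
    higher score is better, and the observed maximum otherwise.
    """
    filled = [dict(row) for row in rows]  # copy so the input stays untouched
    for col in columns:
        observed = [row[col] for row in rows if col in row]
        if not observed:
            continue  # nothing observed, so nothing to impute from
        worst = min(observed) if higher_is_better[col] else max(observed)
        for row in filled:
            row.setdefault(col, worst)
    return filled


rows = [
    {"xcorr": 2.9}, {"xcorr": 1.1},             # Comet-like hits
    {"hyperscore": 31.4}, {"hyperscore": 18.0},  # Sage-like hits
]
imputed = impute_observed(
    rows,
    columns=["xcorr", "hyperscore"],
    higher_is_better={"xcorr": True, "hyperscore": True},
)
for row in imputed:
    print(row)
```

After imputation every row has a value in every column, so a writer that rejects incomplete rows no longer discards anything.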

Gap: PercolatorFeatureSetHelper::addMULTISEFeatures (lines 502-529) branches only on Comet, MSGF+, X!Tandem, Mascot. Sage is not in this code path. PSMFeatureExtractor is also marked @experimental.
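
On the quantms side, a small guard could flag engine combinations that this path cannot handle before any job is launched. The sketch below is hypothetical: the engine labels are written as they appear in this issue, and the exact identifier strings OpenMS compares against may differ.

```python
# Engines covered by the multi-engine feature code path, as listed above.
# The exact identifier strings used inside OpenMS are an assumption here.
MULTISE_ENGINES = {"Comet", "MSGF+", "X!Tandem", "Mascot"}


def unsupported_engines(engines):
    """Return the requested engines that the multi-engine path does not cover."""
    return sorted(set(engines) - MULTISE_ENGINES)


missing = unsupported_engines(["Comet", "MSGF+", "Sage"])
print(missing)
```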

Candidate approaches for discussion

1. Adopt the OpenMS-sanctioned path. Use IDMerger -merge_proteins_add_PSMs → PSMFeatureExtractor -multiple_search_engines -impute → PercolatorAdapter. Works for Comet + MSGF+ (+ X!Tandem, Mascot). Does not work for Sage without an upstream patch to PercolatorFeatureSetHelper. Conservative, uses well-tested OpenMS primitives.

2. Bypass per-engine features, rely only on rescoring features. With ms2rescore enabled, every engine's PSMs carry identical run-invariant features (MS²PIP / DeepLC / precursor-ppm). Feed the merged idXML to PercolatorAdapter with -generic_feature_set (or a similar restriction) so only the shared columns drive the SVM, ignoring engine-native scores. Works for any engine including Sage. Loses engine-native features from the SVM, which may or may not hurt PSM recovery; this needs a benchmark.
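
A minimal sketch of what "only the shared columns" means for option 2: intersect the feature names within and across engines, then project every PSM onto that intersection. The feature names (`ms2pip_corr`, `deeplc_rt_error`, `precursor_ppm`) and both helpers are hypothetical placeholders, not the names quantms-rescoring actually writes.

```python
def shared_feature_columns(psms):
    """Feature names present on every PSM of every engine."""
    per_engine = {}
    for psm in psms:
        keys = set(psm["features"])
        engine = psm["engine"]
        # Intersect within an engine first, so a column must be on all its PSMs.
        per_engine[engine] = per_engine[engine] & keys if engine in per_engine else keys
    if not per_engine:
        return []
    return sorted(set.intersection(*per_engine.values()))


def restrict(psms, columns):
    """Project each PSM onto the shared columns, dropping engine-native scores."""
    return [
        {**psm, "features": {col: psm["features"][col] for col in columns}}
        for psm in psms
    ]


psms = [
    {"engine": "Comet", "features": {"xcorr": 2.9, "ms2pip_corr": 0.91,
                                     "deeplc_rt_error": 0.4, "precursor_ppm": 1.2}},
    {"engine": "Sage", "features": {"hyperscore": 31.4, "ms2pip_corr": 0.88,
                                    "deeplc_rt_error": 0.7, "precursor_ppm": -0.8}},
]
shared = shared_feature_columns(psms)
restricted = restrict(psms, shared)
print(shared)
```

The restricted rows are rectangular by construction, so no imputation is required; the cost is that `xcorr` and `hyperscore` no longer reach the SVM.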

3. Keep CONSENSUSID. It was purpose-built for harmonising scores across engines. The per-file fan-out is real but the step is cheap. In light of the blockers above, this may remain the right tool and the scope of this issue shrinks to "stop discussing the replacement, invest elsewhere."

4. Contribute upstream. Open a PR against OpenMS to add Sage support to PercolatorFeatureSetHelper::addMULTISEFeatures and lift PSMFeatureExtractor's experimental label. This unblocks option 1 for every engine and benefits the broader OpenMS user base. Long timeline.

What would help

  • Benchmark evidence on Comet + MSGF+ datasets comparing options 1, 2, and 3 at 1 % FDR (PSM / peptide counts, calibration). No published benchmark covers this at quantms scale.
  • Scientific input on whether option 2 (rescoring-only feature set) is defensible — is it equivalent to option 1 empirically, or does dropping engine-native features cost us?
  • Maintainer preference on upstreaming Sage support vs. working around it in quantms.
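
For the benchmark, the comparison metric could be as simple as the number of target PSMs accepted at a q-value threshold. The sketch below uses the plain target-decoy estimate (decoys / targets above a score), which is an assumption chosen for illustration and not necessarily the estimator Percolator or OpenMS applies; the function names are hypothetical.

```python
def q_values(psms):
    """psms: iterable of (score, is_decoy), higher score is better.

    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    fdrs = []
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # q-value: the lowest FDR at this rank or at any lower-scoring rank.
    qs = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qs.append(running_min)
    qs.reverse()
    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, qs)]


def count_targets_at(psms, threshold=0.01):
    """Number of target PSMs with q-value <= threshold."""
    return sum(1 for _, is_decoy, q in q_values(psms)
               if not is_decoy and q <= threshold)


toy = [(10.0, False), (9.0, False), (8.0, False), (7.0, True), (6.0, False)]
print(count_targets_at(toy, threshold=0.01))
```

Running the same counter on the output of options 1, 2, and 3 for the same raw files would give directly comparable PSM counts; peptide-level counts would need deduplication by sequence first.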

Prerequisite already in flight

#706 removes ms2features_range = by_sample | by_project and the IDRipper Nextflow module. These exist only to enable pooled Percolator runs that get split back per-run for CONSENSUSID, and removing them is a no-op for the default path while simplifying any future CONSENSUSID replacement work. That PR is independent of the discussion here — the multi-engine topology is untouched.

Not in scope for this issue

  • Changes to the DIA workflow (no CONSENSUSID usage).
  • Pooling Percolator across files (separate discussion, requires its own benchmark).
  • Native-Percolator bypass (running percolator outside PercolatorAdapter on merged pin files).
