Discussion: replace ConsensusID with a more robust multi-engine rescoring path #707

## Context

In the multi-engine search path (e.g. Comet + MSGF+, optionally + Sage), quantms currently rescores each engine's PSMs with `PercolatorAdapter`, merges the per-engine results with `CONSENSUSID`, and then applies FDR. This produces a per-file `CONSENSUSID` fan-out, and it needs `IDRipper` to split results back per run whenever rescoring was pooled per sample or per project.

With `ms2features_enable=true` (`quantms-rescoring` via `ms2rescore`), every engine's PSMs receive the same run-invariant features (MS²PIP, DeepLC, precursor-ppm). Those features are numerically comparable across engines by construction, which raises the question: can we simplify the multi-engine path by dropping `CONSENSUSID` and merging engines directly into a single `PercolatorAdapter` call?

This issue is a discussion thread to decide which direction to take.

## What we learned from the OpenMS source

Short version: the naive path `IDMerger → PercolatorAdapter` is **not supported** by OpenMS.

- **`IDMerger` is a dumb concatenator.** It preserves per-`PeptideHit` meta-values verbatim as a sparse flat-map. A Comet PSM carries `xcorr` and no `hyperscore`; a Sage PSM carries `hyperscore` and no `xcorr`. There is no unified feature schema in the merged idXML — just ragged per-hit meta-values ([`IDMergerAlgorithm.cpp:170-237`](https://github.com/OpenMS/OpenMS/blob/develop/src/openms/source/ANALYSIS/ID/IDMergerAlgorithm.cpp#L170-L237)).
- **`PercolatorAdapter` explicitly refuses multi-engine input** ([`PercolatorAdapter.cpp:493-498`](https://github.com/OpenMS/OpenMS/blob/develop/src/topp/PercolatorAdapter.cpp#L493-L498)):
  ```
  if (se1 != se2) {
    OPENMS_LOG_ERROR << "... Use TOPP_PSMFeatureExtractor to merge ...";
    return INPUT_FILE_CORRUPT;
  }
  ```
- **Even if you bypassed that check, `PercolatorInfile::store` silently drops any PSM missing any feature column** ([`PercolatorInfile.cpp:531-566`](https://github.com/OpenMS/OpenMS/blob/develop/src/openms/source/FORMAT/PercolatorInfile.cpp#L531-L566)). Ragged rows never reach the `percolator` binary.
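
A toy Python sketch of why ragged rows are a problem. It does not call OpenMS: plain dicts stand in for per-`PeptideHit` meta-values, `ms2pip_corr` is a made-up feature name, and `drop_incomplete` is a hypothetical helper that only mimics the "drop any PSM missing a feature column" behaviour described above.

```python
# Toy model only: plain dicts stand in for per-PeptideHit meta-values.
comet_psm = {"engine": "Comet", "features": {"xcorr": 2.9, "ms2pip_corr": 0.91}}
sage_psm = {"engine": "Sage", "features": {"hyperscore": 31.4, "ms2pip_corr": 0.88}}

# A plain concatenation keeps per-hit values verbatim; there is no shared schema.
merged = [comet_psm, sage_psm]


def feature_columns(psms):
    """Union of all feature names seen across the merged PSMs."""
    return sorted({name for psm in psms for name in psm["features"]})


def drop_incomplete(psms, columns):
    """Hypothetical helper: keep only PSMs that have a value for every column."""
    return [psm for psm in psms if all(c in psm["features"] for c in columns)]


columns = feature_columns(merged)
kept = drop_incomplete(merged, columns)
print(columns)    # ['hyperscore', 'ms2pip_corr', 'xcorr']
print(len(kept))  # 0
```

With two engines that share no native score, the union of columns leaves every row incomplete, so nothing survives the filter.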

The sanctioned multi-engine path in OpenMS is:

```
IDMerger -merge_proteins_add_PSMs
  → PSMFeatureExtractor -multiple_search_engines -impute
  → PercolatorAdapter
```

`PSMFeatureExtractor -impute` fills missing features with observed min/max (or `float::max` with `-limit_imputation`) — that's the "NaN handling" step.
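
As a rough illustration of the imputation idea (not the OpenMS implementation), the sketch below fills each missing feature with the worst value observed for it. The rule "min if higher is better, max otherwise" is an assumption made for this example, and `impute_missing`, `score_a`, `score_b`, and `shared` are hypothetical names.

```python
def impute_missing(rows, higher_is_better):
    """Fill missing features with the worst value observed for that feature.

    rows: list of dicts mapping feature name -> float (keys may be missing).
    higher_is_better: dict mapping feature name -> bool (assumed to be known).
    """
    names = sorted({name for row in rows for name in row})
    fill = {}
    for name in names:
        observed = [row[name] for row in rows if name in row]
        # Assumption: "worst observed" is the min for higher-is-better scores.
        fill[name] = min(observed) if higher_is_better[name] else max(observed)
    return [{name: row.get(name, fill[name]) for name in names} for row in rows]


rows = [
    {"score_a": 2.9, "shared": 0.91},   # engine A hit
    {"score_a": 1.2, "shared": 0.40},   # engine A hit
    {"score_b": 31.4, "shared": 0.88},  # engine B hit
]
orientation = {"score_a": True, "score_b": True, "shared": True}
imputed = impute_missing(rows, orientation)
print(imputed[2])  # {'score_a': 1.2, 'score_b': 31.4, 'shared': 0.88}
```

Every row now has every column, so none would be dropped, at the cost of feeding artificial values to the SVM.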

**Gap**: `PercolatorFeatureSetHelper::addMULTISEFeatures` ([lines 502-529](https://github.com/OpenMS/OpenMS/blob/develop/src/openms/source/ANALYSIS/ID/PercolatorFeatureSetHelper.cpp#L502)) branches only on Comet, MSGF+, X!Tandem, and Mascot. **Sage is not in this code path.** `PSMFeatureExtractor` is also marked `@experimental`.

## Candidate approaches for discussion

**1. Adopt the OpenMS-sanctioned path.** Use `IDMerger -merge_proteins_add_PSMs → PSMFeatureExtractor -multiple_search_engines -impute → PercolatorAdapter`. Works for Comet + MSGF+ (+ X!Tandem, Mascot). **Does not work for Sage** without an upstream patch to `PercolatorFeatureSetHelper`. Conservative: it reuses existing OpenMS tools, although `PSMFeatureExtractor` is still marked `@experimental`.

**2. Bypass per-engine features, rely only on rescoring features.** With `ms2rescore` enabled, every engine's PSMs carry identical run-invariant features (MS²PIP / DeepLC / precursor-ppm). Feed the merged idXML to `PercolatorAdapter` with `-generic_feature_set` (or a similar restriction) so that only the shared columns drive the SVM and engine-native scores are ignored. Works for any engine, including Sage. The SVM loses the engine-native features, which may or may not hurt PSM recovery; this needs a benchmark.

**3. Keep `CONSENSUSID`.** It was purpose-built for harmonising scores across engines. The per-file fan-out is real but the step is cheap. In light of the blockers above, this may remain the right tool and the scope of this issue shrinks to "stop discussing the replacement, invest elsewhere."

**4. Contribute upstream.** Open a PR against OpenMS to add Sage support to `PercolatorFeatureSetHelper::addMULTISEFeatures` and lift `PSMFeatureExtractor`'s experimental label. This unblocks option 1 for every engine and benefits the broader OpenMS user base. Long timeline.
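
To make option 2 concrete, here is a minimal sketch of restricting merged PSMs to the feature columns that every hit carries. It is a toy model, not OpenMS or quantms code: the dict layout, the helper names, and the feature names `ms2pip_corr`, `rt_error`, and `precursor_ppm` are all hypothetical.

```python
def shared_feature_names(psms):
    """Names present on every PSM: the intersection of per-hit feature keys."""
    names = None
    for psm in psms:
        keys = set(psm["features"])
        names = keys if names is None else names & keys
    return sorted(names or [])


def restrict_to_shared(psms):
    """Return the shared column names and a dense feature matrix over them."""
    shared = shared_feature_names(psms)
    return shared, [[psm["features"][name] for name in shared] for psm in psms]


comet = {"engine": "Comet", "features": {
    "xcorr": 2.9, "ms2pip_corr": 0.91, "rt_error": 0.8, "precursor_ppm": 1.5}}
sage = {"engine": "Sage", "features": {
    "hyperscore": 31.4, "ms2pip_corr": 0.88, "rt_error": 1.1, "precursor_ppm": -0.7}}

names, matrix = restrict_to_shared([comet, sage])
print(names)   # ['ms2pip_corr', 'precursor_ppm', 'rt_error']
print(matrix)  # [[0.91, 1.5, 0.8], [0.88, -0.7, 1.1]]
```

The matrix is dense without any imputation, but `xcorr` and `hyperscore` never reach the learner, which is exactly the trade-off option 2 would have to justify.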

## What would help

- **Benchmark evidence** on Comet + MSGF+ datasets comparing options 1, 2, and 3 at 1 % FDR (PSM / peptide counts, calibration). No published benchmark covers this at quantms scale.
- **Scientific input** on whether option 2 (rescoring-only feature set) is defensible — is it equivalent to option 1 empirically, or does dropping engine-native features cost us?
- **Maintainer preference** on upstreaming Sage support vs. working around it in quantms.
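
For the benchmark comparison, the headline number per option is the count of target PSMs accepted at 1 % FDR. The sketch below shows one simplified way to compute it from (score, is_decoy) pairs using a plain target-decoy estimate; it is an assumption-laden illustration (higher score is better, ties ignored, FDR estimated as decoys / targets), not the procedure quantms or OpenMS actually uses, and `count_at_fdr` is a hypothetical helper.

```python
def count_at_fdr(psms, threshold=0.01):
    """Count target PSMs whose q-value is <= threshold.

    psms: iterable of (score, is_decoy) pairs; higher score means better.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Running FDR estimate at each rank: decoys / targets seen so far.
    fdrs, targets, decoys = [], 0, 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # q-value: smallest FDR at this rank or at any lower-scoring rank.
    qvals, running = [], float("inf")
    for fdr in reversed(fdrs):
        running = min(running, fdr)
        qvals.append(running)
    qvals.reverse()

    return sum(
        1 for (_, is_decoy), q in zip(ranked, qvals)
        if not is_decoy and q <= threshold
    )


# Ten high-scoring targets, then one decoy, then one more target.
psms = [(100 - i, False) for i in range(10)] + [(50, True), (40, False)]
print(count_at_fdr(psms, 0.01))  # 10
print(count_at_fdr(psms, 0.10))  # 11
```

Running the same counter on the output of options 1, 2, and 3 for the same files would give directly comparable PSM counts; peptide-level counts and calibration checks would need additional steps.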

## Prerequisite already in flight

#706 removes `ms2features_range = by_sample | by_project` and the `IDRipper` Nextflow module. These exist only to enable pooled Percolator runs that are split back per run for `CONSENSUSID`. Removing them is a no-op for the default path and simplifies any future `CONSENSUSID` replacement work. That PR is independent of the discussion here: the multi-engine topology is untouched.

## Not in scope for this issue

- Changes to the DIA workflow (no `CONSENSUSID` usage).
- Pooling Percolator across files (separate discussion, requires its own benchmark).
- Native-Percolator bypass (running `percolator` outside `PercolatorAdapter` on merged `pin` files).
