Open
35 commits
7cc7c32
feat: CIDConsensus — CID-optimal consensus tree search (T-150)
ms609 Mar 19, 2026
f2f4a53
perf: 7x CID scoring speedup via precomputed input splits
ms609 Mar 19, 2026
282489a
perf: further 20% scorer speedup via for-loop and raw split matrices
ms609 Mar 19, 2026
cab2190
feat: add CID scoring engine (ts_cid.cpp, ts_cid.h)
ms609 Mar 21, 2026
0d4d5eb
refactor: reframe CIDConsensus as MCI maximization
ms609 Mar 21, 2026
d1dad8f
feat(CID): wire score_budget early termination, fix parallel data race
ms609 Mar 21, 2026
deeea94
feat(CID): wire CID scoring into drift, ratchet, and sector search
ms609 Mar 23, 2026
75fa111
fix: post-merge fixes + inst/analysis -> dev/analysis
ms609 Mar 25, 2026
d22d33d
Rename CIDConsensus -> InfoConsensus; replace ape with TreeTools
ms609 Mar 25, 2026
8a7ce30
fix: populate phi/eff_k in build_mrp_dataset to prevent SIGSEGV in TB…
ms609 Mar 25, 2026
0015f06
perf: optimize InfoConsensus search
ms609 Mar 26, 2026
36a0b5c
Merge origin/cpp-search into feature/cid-consensus
ms609 Mar 26, 2026
6cf32e7
perf: C++ batch CID scoring + rogue prescreen; reduce CID defaults
ms609 Mar 26, 2026
7ad6669
perf: MRP split deduplication in build_mrp_dataset()
ms609 Mar 26, 2026
a595267
feat: plateau stopping for CID convergence detection
ms609 Mar 26, 2026
c8ec3e8
fix: reduce CID scoreTol default from 0.001 to 1e-5
ms609 Mar 26, 2026
fe8503f
Merge origin/cpp-search into feature/cid-consensus
ms609 Mar 26, 2026
55c04c7
feat: automatic tree subsampling for Phase 1 CID search
ms609 Mar 26, 2026
596612a
docs: CID scaling analysis — T_sub benchmarks and per-candidate profi…
ms609 Mar 26, 2026
3d09892
fix: close ts_test_strategy_tracker in RcppExports.cpp; remove stale …
ms609 Mar 26, 2026
85838ca
fix: remove duplicate RcppExports entries (ts_wagner_bias_bench, ts_t…
ms609 Mar 26, 2026
c8a6904
fix: remove duplicate ts_wagner_bias_bench and ts_test_strategy_track…
ms609 Mar 26, 2026
c9b04b9
fix: close ts_test_strategy_tracker and remove stale ts_cid_consensus…
ms609 Mar 26, 2026
17f6260
bench: treeSample auto vs Inf benchmark + incremental CID analysis
ms609 Mar 26, 2026
0aaa4a6
fix: treeSample Inf coercion in bench script
ms609 Mar 26, 2026
01ce631
feat: batch top-k CID candidate evaluation in TBR (screeningTopK param)
ms609 Mar 26, 2026
0d1ee31
bench: top-k CID screening benchmark script
ms609 Mar 26, 2026
6a8b50d
Merge remote-tracking branch 'origin/cpp-search' into feature/cid-con…
ms609 Mar 26, 2026
6636924
feat: SPIC scoring for InfoConsensus (T-cid-consensus)
ms609 Mar 27, 2026
f9fee3b
Merge remote-tracking branch 'origin/cpp-search' into feature/cid-con…
ms609 Mar 27, 2026
0c1f0fe
fix: use ape::consensus (lowercase) in test-Morphy.R — Consensus() im…
ms609 Mar 27, 2026
23d93f2
fix: use TreeTools::Consensus() in vignettes (bare Consensus() not on…
ms609 Mar 27, 2026
5db96f4
fix: update InfoConsensus.Rd and SearchControl.Rd to match code
ms609 Mar 27, 2026
9b7ee66
fix: add Splitwise to WORDLIST (T-150 spell-check fix)
ms609 Mar 27, 2026
f8bfee4
fix: qualify bare Consensus() as TreeTools::Consensus() in vignettes …
ms609 Mar 27, 2026
@@ -0,0 +1,253 @@
# Plan: CID-Optimal Consensus Tree Search (T-150)

**Date:** 2026-03-19
**Task:** T-150 — CID-optimal consensus tree search
**Branch:** `feature/cid-consensus`

---

## Goal

Implement `CIDConsensus()`: a function that finds a consensus tree minimizing
mean Clustering Information Distance (CID) to a set of input trees, using
TreeSearch's existing R-level search infrastructure (`TreeSearch()`, `Ratchet()`,
`EdgeListSearch()`) with a custom CID scorer.
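
Written out, the objective stated above can be read as follows, where $T_1, \dots, T_N$ denote the input trees and $T$ ranges over candidate topologies on the same tip set (this notation is introduced here for illustration and is not taken from the briefing):

```latex
\hat{T} \;=\; \arg\min_{T} \; \frac{1}{N} \sum_{i=1}^{N} \mathrm{CID}(T, T_i)
```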

The proof of concept (briefing-cid-consensus.md) showed that CID hill-climbing
from a transfer consensus (TC) reduces mean CID to **2.16** on a 20-tip test
case, compared with 3.03 for the majority-rule (MR) consensus and 2.85 for TC.

---

## Architecture

Plug into the existing `TreeScorer` / `EdgeSwapper` / `Bootstrapper` interfaces:

```
CIDConsensus(trees, ...)
  → Ratchet(startTree, cidData,
            InitializeData = identity,
            CleanUpData    = .NoOp,
            TreeScorer     = .CIDScorer,
            Bootstrapper   = .CIDBootstrap,
            swappers = list(RootedTBRSwap, RootedSPRSwap, RootedNNISwap))
```

**Dataset representation:** An environment (reference semantics) containing:
- `$trees` — the input `multiPhylo`
- `$tipLabels` — shared tip labels
- `$nTip` — number of tips
- `$metric` — distance function (default: `ClusteringInfoDistance`)

Using an environment allows `.CIDBootstrap()` to temporarily swap the tree list
(resample with replacement) without copying large objects.
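
A minimal base-R sketch of the reference-semantics pattern described above. `MakeToyData()` and `WithResampledTrees()` are hypothetical names used only for illustration, and character strings stand in for the input trees:

```r
# Hypothetical stand-in for the dataset environment; strings replace trees.
MakeToyData <- function(trees) {
  env <- new.env(parent = emptyenv())
  env$trees <- trees
  env
}

# Temporarily resample the tree list held in the shared environment,
# run f() against it, then restore the original list.
WithResampledTrees <- function(data, f) {
  original <- data$trees
  on.exit(data$trees <- original, add = TRUE)  # restore even if f() errors
  data$trees <- original[sample.int(length(original), replace = TRUE)]
  f(data)
}

toyData <- MakeToyData(c("t1", "t2", "t3", "t4"))
nDuring <- WithResampledTrees(toyData, function(d) length(d$trees))
```

Because `toyData` is an environment, the caller sees the resampled list while `f()` runs and the original list afterwards, with no copy of the environment itself being made.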

---

## Phases

### Phase 1: Core CID consensus (binary trees)

Use the existing `Ratchet()` / `TreeSearch()` on **binary** trees with CID
scoring. This alone delivered the POC's best result (2.16 CID via SPR from TC).

#### New file: `R/CIDConsensus.R`

**Exported:**

| Function | Purpose |
|----------|---------|
| `CIDConsensus(trees, ...)` | Main user-facing function |

**Internal:**

| Function | Purpose |
|----------|---------|
| `.CIDScorer(parent, child, dataset, ...)` | TreeScorer: build phylo, compute `mean(metric(candidate, trees))` |
| `.MakeCIDData(trees, metric)` | Create env with trees, tipLabels, metric |
| `.CIDBootstrap(edgeList, cidData, ...)` | Bootstrapper: resample input trees, search, restore |
| `.EdgeListToPhylo(parent, child, tipLabels)` | Helper: parent/child → phylo object |
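
The scoring step in the table above amounts to averaging a pluggable metric over the input trees. The sketch below shows that shape in base R only. It is not the planned `.CIDScorer()`: each "tree" is a character vector of split names, `SplitDifference()` is a made-up stand-in for a real tree distance, and `ToyScorer()` takes the candidate directly rather than `parent`/`child` vectors.

```r
# Toy metric: number of splits present in one tree but not the other.
# Accepts one candidate and a list of trees, returning one value per tree.
SplitDifference <- function(tree1, trees) {
  vapply(trees, function(tree2) {
    as.numeric(length(setdiff(tree1, tree2)) + length(setdiff(tree2, tree1)))
  }, numeric(1))
}

# Score = mean of the metric between the candidate and every input tree.
ToyScorer <- function(candidate, dataset) {
  mean(dataset$metric(candidate, dataset$trees))
}

dataset <- new.env()
dataset$trees <- list(c("ab", "abc"), c("ab", "abd"), c("ac", "abc"))
dataset$metric <- SplitDifference

score <- ToyScorer(c("ab", "abc"), dataset)  # distances 0, 2, 2
```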

**`CIDConsensus()` signature:**

```r
CIDConsensus <- function(
  trees,
  metric = ClusteringInfoDistance,
  start = NULL,        # phylo, or NULL → majority rule consensus
  method = c("ratchet", "spr", "tbr", "nni"),
  ratchIter = 100L,
  ratchHits = 10L,
  searchIter = 500L,
  searchHits = 20L,
  verbosity = 1L,
  ...
)
```

**Design decisions:**

1. **`start = NULL`** defaults to `MakeTreeBinary(Consensus(trees, p = 0.5))`.
Uses `TreeTools::Consensus` (rooted, Day 1985) and
`TreeTools::MakeTreeBinary` (uniform topology sampling, no spurious
zero-length edges). User can pass any phylo (e.g., a transfer consensus
from TreeDist dev). Starting tree is always resolved with
`MakeTreeBinary()` for Phase 1.

2. **`method = "ratchet"`** is the default. Delegates to `Ratchet()` with
`.CIDBootstrap` and `list(RootedTBRSwap, RootedSPRSwap, RootedNNISwap)`.
Other methods delegate to `TreeSearch()` with the corresponding swapper.

3. **Lower default `searchIter`/`searchHits`** than parsimony (500/20 vs
4000/42) because CID scoring is ~100× slower per candidate than Fitch.

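4. **`metric` parameter** accepts any function with signature
`f(tree1, tree2)`, such as `MutualClusteringInfo` or `SharedPhylogeneticInfo`.
Default is `ClusteringInfoDistance`. Because the search minimizes the mean
score, similarity measures such as these must be negated or converted to a
distance before use.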

5. **Return value:** A `phylo` tree with attributes `"score"` (mean CID)
and `"hits"`.

6. **`InitializeData = identity`**, **`CleanUpData`** = no-op function.
The dataset is the env created by `.MakeCIDData()`, passed directly.

**`.CIDBootstrap()` design:**

```r
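.CIDBootstrap <- function(edgeList, cidData, EdgeSwapper, maxIter,
                          maxHits, verbosity, stopAtPeak, stopAtPlateau, ...) {
  origTrees <- cidData$trees
  nTree <- length(origTrees)
  # Register the restore first, so the original trees come back on any exit
  on.exit(cidData$trees <- origTrees, add = TRUE)
  # Resample input trees with replacement to perturb the objective
  cidData$trees <- origTrees[sample.int(nTree, replace = TRUE)]
  # Search from the current topology against the resampled trees
  res <- EdgeListSearch(edgeList[1:2], cidData,
                        TreeScorer = .CIDScorer,
                        EdgeSwapper = EdgeSwapper,
                        maxIter = maxIter, maxHits = maxHits,
                        verbosity = verbosity,
                        stopAtPeak = stopAtPeak,
                        stopAtPlateau = stopAtPlateau, ...)
  res[1:2]  # return the parent and child vectors only
}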
```

Resampling input trees with replacement is the CID analogue of character
bootstrapping — it perturbs the objective function to escape local optima.
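
A toy illustration of how resampling can change which candidate looks best. The distance values are invented and the resample is fixed by hand for reproducibility; a real run would draw it with `sample.int(nTree, replace = TRUE)`.

```r
# Rows: hypothetical candidate topologies; columns: five input trees.
# Entries stand in for precomputed distances (made-up values).
dists <- rbind(
  candA = c(1, 1, 1, 5, 5),
  candB = c(3, 3, 3, 3, 3)
)

# Mean distance from each candidate to the selected input trees.
MeanScore <- function(d, treeIndex) rowMeans(d[, treeIndex, drop = FALSE])

full <- MeanScore(dists, 1:5)               # unperturbed objective
boot <- MeanScore(dists, c(4, 4, 5, 5, 1))  # one possible bootstrap resample

bestFull <- names(which.min(full))
bestBoot <- names(which.min(boot))
```

Under the full tree set `candA` scores lower, but this resample over-weights the trees that `candA` fits poorly, so `candB` becomes the better candidate and the search is free to move away from `candA`.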

#### New file: `tests/testthat/test-CIDConsensus.R`

Tier 2 (skip on CRAN). Tests:

1. `.CIDScorer()` returns correct mean CID for a known tree/tree-set pair.
2. `.CIDBootstrap()` returns a valid edgeList (2 elements, valid topology).
3. `CIDConsensus()` with `method = "spr"` improves or equals starting score.
4. `CIDConsensus()` with `method = "ratchet"` runs without error on a small
case (10 tips, 20 trees, `ratchIter = 3`).
5. `CIDConsensus()` accepts a custom `metric` (e.g., `MutualClusteringInfo`).
6. `CIDConsensus()` rejects non-multiPhylo input.
7. Starting from user-supplied tree works.
8. Score attribute is set on returned tree.

#### File modifications

| File | Change |
|------|--------|
| `DESCRIPTION` | Add `CIDConsensus.R` to Collate field |
| `NAMESPACE` | `export(CIDConsensus)` + any new TreeDist imports |
| `R/CIDConsensus.R` | New file |
| `tests/testthat/test-CIDConsensus.R` | New file |

### Phase 2: Collapse and Resolve moves (non-binary trees)

Add EdgeSwapper functions that operate on potentially non-binary trees,
enabling the search to find partially-resolved consensus optima.

#### New functions in `R/CIDConsensus.R`:

| Function | Purpose |
|----------|---------|
| `.CollapseSwap(parent, child, nTip)` | Contract a random internal edge → polytomy |
| `.ResolveSwap(parent, child, nTip)` | Resolve a random polytomy → new binary split |
| `.CollapseAllSwap(parent, child, nTip)` | Return list of all single-collapse candidates |
| `.ResolveAllSwap(parent, child, nTip)` | Return list of all single-resolve candidates |

**Collapse algorithm:**
1. Find internal edges (both endpoints > nTip, excluding root edge).
2. Pick one (random, or enumerate all with `edgeToBreak = -1`).
3. Reparent all children of the collapsed child node to its parent.
4. Remove the child node, renumber, return new parent/child.
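
The collapse steps above can be sketched on plain `parent`/`child` integer vectors. This assumes the usual convention that tips are numbered `1..nTip` and internal nodes follow; `CollapseEdge()` is a hypothetical helper, not the planned `.CollapseSwap()`, and it only closes the numbering gap rather than fully canonicalizing edge order.

```r
CollapseEdge <- function(parent, child, nTip, edge) {
  stopifnot(child[edge] > nTip)      # only internal edges can be collapsed
  gone <- child[edge]                # node that will disappear
  keep <- parent[edge]               # node that absorbs its children
  parent[parent == gone] <- keep     # step 3: reparent the children
  parent <- parent[-edge]            # step 4: drop the collapsed edge
  child <- child[-edge]
  # Close the gap left in the node numbering
  parent[parent > gone] <- parent[parent > gone] - 1L
  child[child > gone] <- child[child > gone] - 1L
  list(parent = parent, child = child)
}

# ((1,2),(3,4)): root 5, node 6 = (1,2), node 7 = (3,4)
parent <- c(5L, 6L, 6L, 5L, 7L, 7L)
child  <- c(6L, 1L, 2L, 7L, 3L, 4L)
collapsed <- CollapseEdge(parent, child, nTip = 4L, edge = 1L)
```

Collapsing the edge 5→6 leaves the root with three children (tips 1 and 2 plus the remaining cherry), which is the polytomy the move is meant to create.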

**Resolve algorithm:**
1. Find polytomy nodes (nodes with more than two children, i.e. appearing more than twice in the parent vector).
2. Pick one polytomy node and at least two, but not all, of its children.
3. Insert a new internal node as intermediate parent of the selected children.
4. Renumber, return new parent/child.
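
A matching sketch of the resolve steps, again on plain integer vectors. `ResolveNode()` is a hypothetical helper rather than the planned `.ResolveSwap()`. It takes the polytomy node and the children to group as explicit arguments, appends the new edge at the end, and leaves renumbering into a canonical order to the caller.

```r
ResolveNode <- function(parent, child, node, kids) {
  siblings <- child[parent == node]
  stopifnot(length(siblings) > 2L,            # node must be a polytomy
            all(kids %in% siblings),
            length(kids) >= 2L,
            length(kids) < length(siblings))  # grouping all would add no split
  newNode <- max(parent) + 1L                 # next unused node number
  parent[child %in% kids] <- newNode          # step 3: regroup the selection
  list(parent = c(parent, node), child = c(child, newNode))
}

# Star tree on four tips: root 5 with children 1, 2, 3, 4
parent <- rep(5L, 4)
child  <- 1:4
resolved <- ResolveNode(parent, child, node = 5L, kids = c(1L, 2L))
```

After the call, tips 1 and 2 hang from the new node 6, and the root keeps three children (3, 4 and 6).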

**Search strategy with non-binary trees:**

A new composite swapper `ConsensusSwap` picks one of three move types per
iteration, with p₁ + p₂ + p₃ = 1:
1. With probability p₁: collapse a random edge
2. With probability p₂: resolve a random polytomy
3. With probability p₃: SPR on the fully-resolved version
   (temporarily resolve all polytomies, SPR, then re-collapse original polytomies)

OR, simpler: use `Ratchet()` with a mixed swapper list:
```r
swappers = list(
  .ResolveAndSPRSwap,  # resolve → SPR (coarse)
  .CollapseSwap,       # collapse weak edges
  .ResolveSwap         # refine polytomies
)
```
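
For the probabilistic variant (the first option above), the per-iteration choice of move type might look like the sketch below. `ChooseMove()` and the probability values are placeholders chosen for illustration; the plan does not fix p₁, p₂ or p₃.

```r
# Draw one move type per call according to the supplied probabilities.
ChooseMove <- function(probs = c(collapse = 0.2, resolve = 0.2, spr = 0.6)) {
  stopifnot(abs(sum(probs) - 1) < 1e-8)  # probabilities must sum to one
  sample(names(probs), size = 1L, prob = probs)
}

set.seed(1)
moves <- replicate(1000, ChooseMove())
moveCounts <- table(moves)  # roughly proportional to probs
```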

The key insight: `RearrangeEdges()` and `EdgeListSearch()` don't enforce
bifurcation — only the existing `*Swap` functions do. New swappers that
handle polytomies plug in cleanly.

**Updated `CIDConsensus()` behavior:**
When `start` is a non-binary tree (or when the search reaches a non-binary
optimum via collapse), the mixed swapper list is used automatically.

#### Additional tests

9. `.CollapseSwap()` produces valid non-binary topology.
10. `.ResolveSwap()` produces valid topology in which the resolved polytomy has fewer children.
11. Collapse then resolve is reversible (same split count).
12. CIDConsensus with collapse/resolve finds equal-or-better score than
binary-only search on a case where optimal consensus is non-binary.

### Phase 3: Performance optimizations (stretch)

Not blocking initial delivery; document as future work.

- **Precompute per-tree split entropies** and cache in cidData to avoid
recomputation across candidates.
- **Parallel candidate evaluation**: `parallel::mclapply` or `future`
over candidates within `RearrangeEdges`.
- **C++ CID scorer**: Avoid R dispatch overhead by computing CID entirely
in C++ (would require porting LAP solver).
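
As a rough sketch of the first bullet, per-tree split entropies could be computed once and stored alongside the trees. This assumes the entropy meant here is that of each split viewed as a two-cluster partition of the tips. `SplitEntropy()`, the `cache` environment, and the split sizes are all hypothetical.

```r
# Entropy (in bits) of a split that separates k of n tips from the rest.
SplitEntropy <- function(k, n) {
  p <- k / n
  -(p * log2(p) + (1 - p) * log2(1 - p))
}

cache <- new.env()
# Hypothetical input: size of one side of each non-trivial split, per tree.
splitSizes <- list(tree1 = c(2, 3, 4), tree2 = c(2, 2, 4))
# Computed once up front; candidates can then reuse the cached values.
cache$entropy <- lapply(splitSizes, SplitEntropy, n = 8)
```

An even split (4 of 8 tips) carries exactly one bit, and more lopsided splits carry less.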

---

## Implementation order

1. Create `R/CIDConsensus.R` with Phase 1 functions.
2. Write `tests/testthat/test-CIDConsensus.R` (Tier 2).
3. Update `DESCRIPTION` (Collate) and `NAMESPACE`.
4. Run tests, verify on briefing's competing-topology example.
5. Add Phase 2 collapse/resolve functions.
6. Add Phase 2 tests.
7. Verify full search on non-binary case.
8. Update documentation/vignette.

---

## Risks and mitigations

| Risk | Mitigation |
|------|------------|
| CID scoring too slow for large trees (>50 tips, >200 input trees) | Lower default `searchIter`; document performance expectations; Phase 3 optimizations |
| SPR local optima on binary trees miss non-binary optimum | Phase 2 adds collapse/resolve; multi-start with `nSearch` parameter |
| `Ratchet()` bifurcation check (lines 89-90) rejects non-binary trees | Phase 2: bypass by calling `EdgeListSearch` directly for the non-binary case, or add a `bifurcating = TRUE` parameter |
| TreeDist `TransferConsensus` not yet on CRAN | Default start = majority rule via `TreeTools::Consensus`; user can pass any starting tree |
| Edge renumbering after collapse/resolve may break node numbering conventions | Use `ape::collapse.singles()` and `TreeTools::Renumber()` for canonicalization |