Skip to content

Enable smc(k) model#1829

Open
currocam wants to merge 2 commits into
popsim-consortium:mainfrom
currocam:main
Open

Enable smc(k) model#1829
currocam wants to merge 2 commits into
popsim-consortium:mainfrom
currocam:main

Conversation

@currocam
Copy link
Copy Markdown

@currocam currocam commented Apr 30, 2026

Related to #1807

Hi! I’m interested in using the new SMC(k) model in msprime

(1) Passing strings

The current code relies in many places on being able to specify models with strings (e.g. the command line), whereas the new "msprime.SMCK" requires a k=X argument. To get the (significant) speedup of the new SMCK, one could simply update

model_class_map = {
"hudson": msprime.StandardCoalescent,
"dtwf": msprime.DiscreteTimeWrightFisher,
"smc": msprime.SmcApproxCoalescent,
"smc_prime": msprime.SmcPrimeApproxCoalescent,
}

into:

model_class_map = {
    "hudson": msprime.StandardCoalescent,
    "dtwf": msprime.DiscreteTimeWrightFisher,
    "smc": msprime.SMCK(k=0),
    "smc_prime": msprime.SMCK(k=1),
}

but, of course, doesn't follow the current msprime convention. Do you know if there are plans to "deprecate" legacy msprime.SmcPrimeApproxCoalescent and msprime.SmcApproxCoalescent in favour of msprime.SMCK(k=0) and msprime.SMCK(k=1)?

If not, another option would be to come up with string aliases for the models (smc_k0 vs. smc_k1?) but it doesn't feel right to come up with aliases outside of the msprime (although it would be neat for the CLI).

(2) Passing msprime.AncestryModel (this PR)

The last alternative (I can think of) would be to allow passing msprime.AncestryModel or [msprime.AncestryModel, ...] directly. This fits nicely, but doesn't play nicely with the command-line nor the msprime_change_model argument. I've outlined this option in the PR. It's not complete, but perhaps a good start to start thinking about how to enable this feature?

Quick test (Vaquita2Epoch_1R22, chr1, n=4 lineages)

I made a quick test and it looks like it’s working.

The smc(k=1) is very fast for this long chromosome (1.86M)! Almost 1000x speedup compared to the naive rejection SMC' and a 10X speedup compare to Hudson.

msprime_model_benchmark msprime_model_num_trees

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.82%. Comparing base (3fc5413) to head (8e56bea).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1829   +/-   ##
=======================================
  Coverage   99.82%   99.82%           
=======================================
  Files         143      143           
  Lines        5030     5053   +23     
  Branches      515      519    +4     
=======================================
+ Hits         5021     5044   +23     
  Misses          6        6           
  Partials        3        3           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@gregorgorjanc
Copy link
Copy Markdown
Contributor

FYI we have done some testing and for large number of samples SMC is in fact slower than Hudson, but the opposite is true with small number of samples, as you see here.

@currocam
Copy link
Copy Markdown
Author

Thanks @gregorgorjanc. Yes, I have also observed that.

I have a very limited understanding of Hudson and the generalised SMC algorithm, but, tbh, I thought it was some implementation issue (e.g. the msprime Hudson being far more optimised than SMCK). Also, for a fixed number of lineages, say, n=100, I often see that if I increase Ne enough, eventually smc(k=1) gets faster than Hudson, although I haven't tested this idea rigorously at all. Do you know if this is something expected / intrinsic to complexity the algorithm?

P.S sorry for the digression.

@gregorgorjanc
Copy link
Copy Markdown
Contributor

I have only tested one specific model, so can't comment more. Konrad L tells me that there is something O(n^2) in the SMC(k) which makes it slow in large regime, but you should talk to him, or Jerome.

@petrelharp
Copy link
Copy Markdown
Contributor

FYI we have done some testing and for large number of samples SMC is in fact slower than Hudson, but the opposite is true with small number of samples, as you see here.

What's a "large number"? (Sounds like it depends on the ratio of sample size squared and Ne, but, roughly?)

I think changing this to allow passing an msprime.AncestryModel is a good idea. We could also parse a string of the form "smc(4)", say, but maybe we don't need to worry about making it available on the CLI for now because of these questions about efficiency and sample size? (Is it clear that people would def want to be using it for some common range of parameter values?)

Comment thread stdpopsim/engines.py Outdated
@petrelharp
Copy link
Copy Markdown
Contributor

About the PR - I don't think you've got the API for model changes right yet? I think (based on a quick read) that what happens there is that model applies from time 0 back to the first model change time (if any); and then at each time the model changes to the corresponding model, for the (time, model) tuples in model_changes. So, I think that model should not be a list; and you should extend the model in those tuples to be AncestryModels as well as strings. (But I'm sure you'd have got that after a quick read.)

Comment thread tests/test_engines.py Outdated
Comment thread tests/test_engines.py Outdated
@petrelharp
Copy link
Copy Markdown
Contributor

I'm happy to review this when you've got a reasonable PR? LMK.

@currocam
Copy link
Copy Markdown
Author

currocam commented Apr 30, 2026

About the PR - I don't think you've got the API for model changes right yet? I think (based on a quick read) that what happens there is that model applies from time 0 back to the first model change time (if any); and then at each time the model changes to the corresponding model, for the (time, model) tuples in model_changes. So, I think that model should not be a list; and you should extend the model in those tuples to be AncestryModels as well as strings. (But I'm sure you'd have got that after a quick read.)

Yeah, I tried to avoid dealing with it. Conceptually, my problem is msprime.AncestryModel takes a duration argument, which might disagree with the stdpopsim specific tuples msprime_change_model.

# current API (I think)
engine.simulate(msprime_model="dtwf", msprime_change_model = [(100, "hudson")],**kwargs)
# This could also work now
engine.simulate(msprime_model="dtwf", msprime_change_model = [(100, msprime.SMCK(k=1))],**kwargs) 
# But what if the user also uses duration? User might specify an invalid state
engine.simulate(msprime_model=msprime.SMCK(k=1, duration=50), msprime_change_model = [(100, msprime.SMCK(k=0))],**kwargs) 

IMO, copying the MSPrime API would be cleanest (having a list of MSPrime.AncestryModel with durations), but that breaks the existing signature. But I guess it's OK to assert duration is set to None here and force the user to use the tuple signature.

I'm happy to review this when you've got a reasonable PR? LMK.

Yes, let me get back to you :=)

@jeromekelleher
Copy link
Copy Markdown
Member

Regarding the complexity question, the SMC(k) implementation is best used for small n (sample size). This is an inherent tradeoff in tracking pairs of lineages that can potentially coalesce, given the overlap constraints. If there's a lot of such lineages, it slows down. So it's great for small sample sizes and large chromosomes, but it's not a silver bullet, and good ole Hudson still wins generally.

@currocam
Copy link
Copy Markdown
Author

currocam commented May 4, 2026

@petrelharp I think I now got model_changes right. Ready for review (I think?)

Copy link
Copy Markdown
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks right to me, but we need a better way of testing we're doing the right models at the right times. (Also a test of whatever we're needing to deepcopy for.) Could you see whether this (model switching in msprime) a thing we were testing robustly before?

Comment thread stdpopsim/engines.py Outdated
Comment thread stdpopsim/engines.py Outdated
Comment thread stdpopsim/engines.py Outdated
Comment thread stdpopsim/engines.py Outdated
Comment thread tests/test_engines.py Outdated
@petrelharp
Copy link
Copy Markdown
Contributor

Also, I should have said - that code looks a tricky; nice job sorting that out.

@currocam
Copy link
Copy Markdown
Author

currocam commented May 6, 2026

I took a closer look and I don’t think msprime_model_changes was being tested directly. I’ve added a couple of tests that directly test the execution order is correct (parsing the tree-sequence provenance, which is a bit hacky…) and added some better (IMP) error handling.

@petrelharp
Copy link
Copy Markdown
Contributor

Thanks! I'll have a look.

parsing the tree-sequence provenance, which is a bit hacky

Yeah, but I'm not sure where else we'd get it? I guess there are things we can check for most models, like if times are node integer for DTWF. This seems fine, anyhow.

Copy link
Copy Markdown
Contributor

@petrelharp petrelharp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the testing; but there's still the non-public function being used in the testing code.

Comment thread stdpopsim/engines.py Outdated
Comment thread tests/test_engines.py Outdated
Comment thread stdpopsim/engines.py Outdated
Comment thread tests/test_engines.py Outdated
Comment thread tests/test_engines.py Outdated
Comment thread tests/test_engines.py Outdated
Comment thread tests/test_engines.py Outdated
],
)

def test_msprime_ancestry_model(self):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about factor out the "check execution order" code above into a method (like maybe self.check_execution_order(ts, expected)) and apply it to these, so we're checking more than just "did this run"?

Comment thread tests/test_engines.py Outdated
Comment thread tests/test_engines.py Outdated
Comment thread tests/test_engines.py
Comment thread tests/test_engines.py Outdated
Comment on lines +187 to +189
)
if msprime_change_model is not None:
kwargs["msprime_change_model"] = msprime_change_model
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None is the default for this parameter, so this should work?

Suggested change
)
if msprime_change_model is not None:
kwargs["msprime_change_model"] = msprime_change_model
msprime_model = msprime_change_model,
)

Comment thread tests/test_engines.py Outdated
samples={"YRI": 5, "CHB": 5, "CEU": 5},
dry_run=True,
)
# Valid single None, string or `msprime.AncestryModel`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test needs some refactoring so it's clear what's going on: currently it first sets up the kwargs for a HomSap simulation, then it calls check_execution_order a few times, which doesn't use those kwargs, then it goes on to use the kwargs for something else.

It looks like it should be three tests: checking execution order, testing errors, and testing we don't mutate the model?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure what's the distinction between the check_execution_order tests here and in the next one?

Comment thread tests/test_engines.py Outdated
msprime_change_model=[(12, model), (99, model)],
)
assert model.duration == 10
model = msprime.SMCK(k=1, duration=10)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
model = msprime.SMCK(k=1, duration=10)

@petrelharp
Copy link
Copy Markdown
Contributor

Almost there - the tests need a bit of organization. I've gone ahead and rebased and squashed this also, so the tests would run; I hope that doesn't mess you up!

@currocam
Copy link
Copy Markdown
Author

No worries 🙂 Thanks for keep the PR going :)

@petrelharp
Copy link
Copy Markdown
Contributor

lmk when I should take a look again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants