Enable smc(k) model by currocam · Pull Request #1829 · popsim-consortium/stdpopsim

currocam · 2026-04-30T10:29:42Z

Related to #1807

Hi! I’m interested in using the new SMC(k) model in msprime

(1) Passing strings

The current code relies in many places on being able to specify models with strings (e.g. the command line), whereas the new "msprime.SMCK" requires a k=X argument. To get the (significant) speedup of the new SMCK, one could simply update

stdpopsim/stdpopsim/engines.py

Lines 149 to 154 in c96504f

    
           model_class_map = { 
        
               "hudson": msprime.StandardCoalescent, 
        
               "dtwf": msprime.DiscreteTimeWrightFisher, 
        
               "smc": msprime.SmcApproxCoalescent, 
        
               "smc_prime": msprime.SmcPrimeApproxCoalescent, 
        
           }

into:

model_class_map = {
    "hudson": msprime.StandardCoalescent,
    "dtwf": msprime.DiscreteTimeWrightFisher,
    "smc": msprime.SMCK(k=0),
    "smc_prime": msprime.SMCK(k=1),
}

but, of course, doesn't follow the current msprime convention. Do you know if there are plans to "deprecate" legacy msprime.SmcPrimeApproxCoalescent and msprime.SmcApproxCoalescent in favour of msprime.SMCK(k=0) and msprime.SMCK(k=1)?

If not, another option would be to come up with string aliases for the models (smc_k0 vs. smc_k1?) but it doesn't feel right to come up with aliases outside of the msprime (although it would be neat for the CLI).

(2) Passing `msprime.AncestryModel` (this PR)

The last alternative (I can think of) would be to allow passing msprime.AncestryModel or [msprime.AncestryModel, ...] directly. This fits nicely, but doesn't play nicely with the command-line nor the msprime_change_model argument. I've outlined this option in the PR. It's not complete, but perhaps a good start to start thinking about how to enable this feature?

Quick test (Vaquita2Epoch_1R22, chr1, n=4 lineages)

I made a quick test and it looks like it’s working.

The smc(k=1) is very fast for this long chromosome (1.86M)! Almost 1000x speedup compared to the naive rejection SMC' and a 10X speedup compare to Hudson.

codecov · 2026-04-30T10:37:34Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.82%. Comparing base (3fc5413) to head (8e56bea).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1829   +/-   ##
=======================================
  Coverage   99.82%   99.82%           
=======================================
  Files         143      143           
  Lines        5030     5053   +23     
  Branches      515      519    +4     
=======================================
+ Hits         5021     5044   +23     
  Misses          6        6           
  Partials        3        3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

gregorgorjanc · 2026-04-30T12:44:02Z

FYI we have done some testing and for large number of samples SMC is in fact slower than Hudson, but the opposite is true with small number of samples, as you see here.

currocam · 2026-04-30T13:03:35Z

Thanks @gregorgorjanc. Yes, I have also observed that.

I have a very limited understanding of Hudson and the generalised SMC algorithm, but, tbh, I thought it was some implementation issue (e.g. the msprime Hudson being far more optimised than SMCK). Also, for a fixed number of lineages, say, n=100, I often see that if I increase Ne enough, eventually smc(k=1) gets faster than Hudson, although I haven't tested this idea rigorously at all. Do you know if this is something expected / intrinsic to complexity the algorithm?

P.S sorry for the digression.

gregorgorjanc · 2026-04-30T13:24:52Z

I have only tested one specific model, so can't comment more. Konrad L tells me that there is something O(n^2) in the SMC(k) which makes it slow in large regime, but you should talk to him, or Jerome.

petrelharp · 2026-04-30T16:35:25Z

FYI we have done some testing and for large number of samples SMC is in fact slower than Hudson, but the opposite is true with small number of samples, as you see here.

What's a "large number"? (Sounds like it depends on the ratio of sample size squared and Ne, but, roughly?)

I think changing this to allow passing an msprime.AncestryModel is a good idea. We could also parse a string of the form "smc(4)", say, but maybe we don't need to worry about making it available on the CLI for now because of these questions about efficiency and sample size? (Is it clear that people would def want to be using it for some common range of parameter values?)

petrelharp · 2026-04-30T16:40:37Z

About the PR - I don't think you've got the API for model changes right yet? I think (based on a quick read) that what happens there is that model applies from time 0 back to the first model change time (if any); and then at each time the model changes to the corresponding model, for the (time, model) tuples in model_changes. So, I think that model should not be a list; and you should extend the model in those tuples to be AncestryModels as well as strings. (But I'm sure you'd have got that after a quick read.)

petrelharp · 2026-04-30T16:42:28Z

I'm happy to review this when you've got a reasonable PR? LMK.

currocam · 2026-04-30T18:12:24Z

About the PR - I don't think you've got the API for model changes right yet? I think (based on a quick read) that what happens there is that model applies from time 0 back to the first model change time (if any); and then at each time the model changes to the corresponding model, for the (time, model) tuples in model_changes. So, I think that model should not be a list; and you should extend the model in those tuples to be AncestryModels as well as strings. (But I'm sure you'd have got that after a quick read.)

Yeah, I tried to avoid dealing with it. Conceptually, my problem is msprime.AncestryModel takes a duration argument, which might disagree with the stdpopsim specific tuples msprime_change_model.

# current API (I think)
engine.simulate(msprime_model="dtwf", msprime_change_model = [(100, "hudson")],**kwargs)
# This could also work now
engine.simulate(msprime_model="dtwf", msprime_change_model = [(100, msprime.SMCK(k=1))],**kwargs) 
# But what if the user also uses duration? User might specify an invalid state
engine.simulate(msprime_model=msprime.SMCK(k=1, duration=50), msprime_change_model = [(100, msprime.SMCK(k=0))],**kwargs)

IMO, copying the MSPrime API would be cleanest (having a list of MSPrime.AncestryModel with durations), but that breaks the existing signature. But I guess it's OK to assert duration is set to None here and force the user to use the tuple signature.

I'm happy to review this when you've got a reasonable PR? LMK.

Yes, let me get back to you :=)

jeromekelleher · 2026-05-01T05:40:41Z

Regarding the complexity question, the SMC(k) implementation is best used for small n (sample size). This is an inherent tradeoff in tracking pairs of lineages that can potentially coalesce, given the overlap constraints. If there's a lot of such lineages, it slows down. So it's great for small sample sizes and large chromosomes, but it's not a silver bullet, and good ole Hudson still wins generally.

currocam · 2026-05-04T14:04:18Z

@petrelharp I think I now got model_changes right. Ready for review (I think?)

petrelharp

The code looks right to me, but we need a better way of testing we're doing the right models at the right times. (Also a test of whatever we're needing to deepcopy for.) Could you see whether this (model switching in msprime) a thing we were testing robustly before?

petrelharp · 2026-05-06T04:00:40Z

Also, I should have said - that code looks a tricky; nice job sorting that out.

currocam · 2026-05-06T13:46:02Z

I took a closer look and I don’t think msprime_model_changes was being tested directly. I’ve added a couple of tests that directly test the execution order is correct (parsing the tree-sequence provenance, which is a bit hacky…) and added some better (IMP) error handling.

petrelharp · 2026-05-07T19:10:12Z

Thanks! I'll have a look.

parsing the tree-sequence provenance, which is a bit hacky

Yeah, but I'm not sure where else we'd get it? I guess there are things we can check for most models, like if times are node integer for DTWF. This seems fine, anyhow.

petrelharp

I like the testing; but there's still the non-public function being used in the testing code.

petrelharp · 2026-05-10T14:09:20Z

+                ],
+            )
+
+    def test_msprime_ancestry_model(self):


How about factor out the "check execution order" code above into a method (like maybe self.check_execution_order(ts, expected)) and apply it to these, so we're checking more than just "did this run"?

petrelharp · 2026-05-28T06:52:28Z

+        )
+        if msprime_change_model is not None:
+            kwargs["msprime_change_model"] = msprime_change_model


None is the default for this parameter, so this should work?

Suggested change

)

if msprime_change_model is not None:

kwargs["msprime_change_model"] = msprime_change_model

msprime_model = msprime_change_model,

)

petrelharp · 2026-05-28T06:55:33Z

+            samples={"YRI": 5, "CHB": 5, "CEU": 5},
+            dry_run=True,
+        )
+        # Valid single None, string or `msprime.AncestryModel`


This test needs some refactoring so it's clear what's going on: currently it first sets up the kwargs for a HomSap simulation, then it calls check_execution_order a few times, which doesn't use those kwargs, then it goes on to use the kwargs for something else.

It looks like it should be three tests: checking execution order, testing errors, and testing we don't mutate the model?

I'm also not sure what's the distinction between the check_execution_order tests here and in the next one?

petrelharp · 2026-05-28T06:57:19Z

+            msprime_change_model=[(12, model), (99, model)],
+        )
+        assert model.duration == 10
+        model = msprime.SMCK(k=1, duration=10)


Suggested change

model = msprime.SMCK(k=1, duration=10)

petrelharp · 2026-05-28T07:00:57Z

Almost there - the tests need a bit of organization. I've gone ahead and rebased and squashed this also, so the tests would run; I hope that doesn't mess you up!

currocam · 2026-05-28T07:11:27Z

No worries 🙂 Thanks for keep the PR going :)

petrelharp · 2026-05-31T15:30:26Z

lmk when I should take a look again!

petrelharp reviewed Apr 30, 2026

View reviewed changes

Comment thread stdpopsim/engines.py Outdated

petrelharp reviewed Apr 30, 2026

View reviewed changes

Comment thread tests/test_engines.py Outdated

petrelharp reviewed Apr 30, 2026

View reviewed changes

Comment thread tests/test_engines.py Outdated

petrelharp reviewed May 6, 2026

View reviewed changes

Comment thread stdpopsim/engines.py Outdated

Comment thread stdpopsim/engines.py Outdated

Comment thread stdpopsim/engines.py Outdated

Comment thread stdpopsim/engines.py Outdated

Comment thread tests/test_engines.py Outdated

petrelharp reviewed May 10, 2026

View reviewed changes

Comment thread tests/test_engines.py

petrelharp reviewed May 28, 2026

View reviewed changes

add support for arbitrary msprime.AncestryModel

8713f04

petrelharp force-pushed the main branch from 22e20b7 to 8713f04 Compare May 28, 2026 07:00

refactor tests

8e56bea

	model_class_map = {
	"hudson": msprime.StandardCoalescent,
	"dtwf": msprime.DiscreteTimeWrightFisher,
	"smc": msprime.SmcApproxCoalescent,
	"smc_prime": msprime.SmcPrimeApproxCoalescent,
	}

Conversation

currocam commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

(1) Passing strings

(2) Passing msprime.AncestryModel (this PR)

Quick test (Vaquita2Epoch_1R22, chr1, n=4 lineages)

Uh oh!

codecov Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

gregorgorjanc commented Apr 30, 2026

Uh oh!

currocam commented Apr 30, 2026

Uh oh!

gregorgorjanc commented Apr 30, 2026

Uh oh!

petrelharp commented Apr 30, 2026

Uh oh!

Uh oh!

petrelharp commented Apr 30, 2026

Uh oh!

Uh oh!

Uh oh!

petrelharp commented Apr 30, 2026

Uh oh!

currocam commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeromekelleher commented May 1, 2026

Uh oh!

currocam commented May 4, 2026

Uh oh!

petrelharp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

petrelharp commented May 6, 2026

Uh oh!

currocam commented May 6, 2026

Uh oh!

petrelharp commented May 7, 2026

Uh oh!

petrelharp left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

petrelharp May 10, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

petrelharp May 28, 2026

Choose a reason for hiding this comment

Uh oh!

petrelharp May 28, 2026

Choose a reason for hiding this comment

Uh oh!

petrelharp May 28, 2026

Choose a reason for hiding this comment

Uh oh!

petrelharp May 28, 2026

Choose a reason for hiding this comment

Uh oh!

petrelharp commented May 28, 2026

Uh oh!

currocam commented May 28, 2026

Uh oh!

petrelharp commented May 31, 2026

Uh oh!

currocam commented Apr 30, 2026 •

edited

Loading

(2) Passing `msprime.AncestryModel` (this PR)

codecov Bot commented Apr 30, 2026 •

edited

Loading

currocam commented Apr 30, 2026 •

edited

Loading