Enable smc(k) model#1829
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1829 +/- ##
=======================================
Coverage 99.82% 99.82%
=======================================
Files 143 143
Lines 5030 5053 +23
Branches 515 519 +4
=======================================
+ Hits 5021 5044 +23
Misses 6 6
Partials 3 3 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
FYI we have done some testing and for large number of samples SMC is in fact slower than Hudson, but the opposite is true with small number of samples, as you see here. |
|
Thanks @gregorgorjanc. Yes, I have also observed that. I have a very limited understanding of Hudson and the generalised SMC algorithm, but, tbh, I thought it was some implementation issue (e.g. the msprime Hudson being far more optimised than SMCK). Also, for a fixed number of lineages, say, n=100, I often see that if I increase Ne enough, eventually smc(k=1) gets faster than Hudson, although I haven't tested this idea rigorously at all. Do you know if this is something expected / intrinsic to complexity the algorithm? P.S sorry for the digression. |
|
I have only tested one specific model, so can't comment more. Konrad L tells me that there is something O(n^2) in the SMC(k) which makes it slow in large regime, but you should talk to him, or Jerome. |
What's a "large number"? (Sounds like it depends on the ratio of sample size squared and Ne, but, roughly?) I think changing this to allow passing an msprime.AncestryModel is a good idea. We could also parse a string of the form |
|
About the PR - I don't think you've got the API for model changes right yet? I think (based on a quick read) that what happens there is that |
|
I'm happy to review this when you've got a reasonable PR? LMK. |
Yeah, I tried to avoid dealing with it. Conceptually, my problem is IMO, copying the MSPrime API would be cleanest (having a list of MSPrime.AncestryModel with durations), but that breaks the existing signature. But I guess it's OK to assert duration is set to None here and force the user to use the tuple signature.
Yes, let me get back to you :=) |
|
Regarding the complexity question, the SMC(k) implementation is best used for small n (sample size). This is an inherent tradeoff in tracking pairs of lineages that can potentially coalesce, given the overlap constraints. If there's a lot of such lineages, it slows down. So it's great for small sample sizes and large chromosomes, but it's not a silver bullet, and good ole Hudson still wins generally. |
|
@petrelharp I think I now got |
petrelharp
left a comment
There was a problem hiding this comment.
The code looks right to me, but we need a better way of testing we're doing the right models at the right times. (Also a test of whatever we're needing to deepcopy for.) Could you see whether this (model switching in msprime) a thing we were testing robustly before?
|
Also, I should have said - that code looks a tricky; nice job sorting that out. |
|
I took a closer look and I don’t think |
|
Thanks! I'll have a look.
Yeah, but I'm not sure where else we'd get it? I guess there are things we can check for most models, like if times are node integer for DTWF. This seems fine, anyhow. |
petrelharp
left a comment
There was a problem hiding this comment.
I like the testing; but there's still the non-public function being used in the testing code.
| ], | ||
| ) | ||
|
|
||
| def test_msprime_ancestry_model(self): |
There was a problem hiding this comment.
How about factor out the "check execution order" code above into a method (like maybe self.check_execution_order(ts, expected)) and apply it to these, so we're checking more than just "did this run"?
| ) | ||
| if msprime_change_model is not None: | ||
| kwargs["msprime_change_model"] = msprime_change_model |
There was a problem hiding this comment.
None is the default for this parameter, so this should work?
| ) | |
| if msprime_change_model is not None: | |
| kwargs["msprime_change_model"] = msprime_change_model | |
| msprime_model = msprime_change_model, | |
| ) |
| samples={"YRI": 5, "CHB": 5, "CEU": 5}, | ||
| dry_run=True, | ||
| ) | ||
| # Valid single None, string or `msprime.AncestryModel` |
There was a problem hiding this comment.
This test needs some refactoring so it's clear what's going on: currently it first sets up the kwargs for a HomSap simulation, then it calls check_execution_order a few times, which doesn't use those kwargs, then it goes on to use the kwargs for something else.
It looks like it should be three tests: checking execution order, testing errors, and testing we don't mutate the model?
There was a problem hiding this comment.
I'm also not sure what's the distinction between the check_execution_order tests here and in the next one?
| msprime_change_model=[(12, model), (99, model)], | ||
| ) | ||
| assert model.duration == 10 | ||
| model = msprime.SMCK(k=1, duration=10) |
There was a problem hiding this comment.
| model = msprime.SMCK(k=1, duration=10) |
|
Almost there - the tests need a bit of organization. I've gone ahead and rebased and squashed this also, so the tests would run; I hope that doesn't mess you up! |
|
No worries 🙂 Thanks for keep the PR going :) |
|
lmk when I should take a look again! |
Related to #1807
Hi! I’m interested in using the new SMC(k) model in msprime
(1) Passing strings
The current code relies in many places on being able to specify models with strings (e.g. the command line), whereas the new "msprime.SMCK" requires a
k=Xargument. To get the (significant) speedup of the new SMCK, one could simply updatestdpopsim/stdpopsim/engines.py
Lines 149 to 154 in c96504f
into:
but, of course, doesn't follow the current
msprimeconvention. Do you know if there are plans to "deprecate" legacy msprime.SmcPrimeApproxCoalescent and msprime.SmcApproxCoalescent in favour ofmsprime.SMCK(k=0)andmsprime.SMCK(k=1)?If not, another option would be to come up with string aliases for the models (smc_k0 vs. smc_k1?) but it doesn't feel right to come up with aliases outside of the
msprime(although it would be neat for the CLI).(2) Passing
msprime.AncestryModel(this PR)The last alternative (I can think of) would be to allow passing msprime.AncestryModel or [msprime.AncestryModel, ...] directly. This fits nicely, but doesn't play nicely with the command-line nor the
msprime_change_modelargument. I've outlined this option in the PR. It's not complete, but perhaps a good start to start thinking about how to enable this feature?Quick test (Vaquita2Epoch_1R22, chr1, n=4 lineages)
I made a quick test and it looks like it’s working.
The smc(k=1) is very fast for this long chromosome (1.86M)! Almost 1000x speedup compared to the naive rejection SMC' and a 10X speedup compare to Hudson.