Skip to content

Commit 1a2a0ae

Browse files
authored
Merge pull request #305 from hyanwong/simplification-tutes
Minor advanced simplification updates
2 parents 09827e1 + 56dad59 commit 1a2a0ae

1 file changed

Lines changed: 30 additions & 26 deletions

File tree

advanced_simplification.md

Lines changed: 30 additions & 26 deletions
Original file line numberDiff line numberDiff line change
@@ -19,7 +19,7 @@ kernelspec:
1919
# _Advanced simplification_
2020
% remove underscores in title when tutorial is complete or near-complete
2121

22-
:::{todo}
22+
:::{note}
2323
This tutorial is only partly complete, and there are a number of sections containing TODO items.
2424
:::
2525

@@ -41,8 +41,8 @@ import tskit_arg_visualizer as argviz
4141
4242
arg = msprime.sim_ancestry(
4343
samples=10,
44-
sequence_length=1.8e4,
45-
recombination_rate=1e-8,
44+
sequence_length=1e4,
45+
recombination_rate=2e-8,
4646
population_size=1e4,
4747
record_full_arg=True,
4848
random_seed=123,
@@ -84,7 +84,7 @@ is always returned. To avoid compacting the node table, and leave node IDs uncha
8484
### Obtaining the reverse map
8585

8686
Often you might want a reverse map, mapping the new node IDs to the old ones. Here's
87-
a simple way to do this:
87+
a simple way to do that:
8888

8989
```{code-cell}
9090
def invert_map(node_mapping):
@@ -98,32 +98,33 @@ reverse_map = invert_map(node_map)
9898
print("New sample ID 0", "maps to old ID", int(reverse_map[0]))
9999
```
100100

101-
Here's how the IDs in the first tree have changed:
101+
Here's a visualization of how the IDs in the first tree have changed:
102102

103103
```{code-cell}
104104
simp.first().draw_svg(
105105
size=(800, 300),
106+
time_scale="rank",
106107
node_labels={nd.id: f"id:{nd.id} (old id:{reverse_map[nd.id]})" for nd in simp.nodes()}
107108
)
108109
```
109110

110111
## 2) Information above the local roots
111112

112113
Usually there are no nodes or mutations in a tree sequence above each local root.
113-
However, as simplification deletes topology, it can create new local roots, leading to this
114-
expectation being broken.
114+
However, because simplification deletes topology, it can cause a node lower down a tree to
115+
become a new local root. This new root can therefore have nodes and mutations above it.
115116

116-
In some cases, you might want to retain nodes above the local roots, which is possible by
117+
In some cases, you might want to retain nodes above new roots, which is possible by
117118
setting `keep_input_roots=True`. The most common reason for this is to allow
118119
{ref}`recapitation <sec_completing_forward_simulations>` of forward-time simulations.
119120
See {ref}`this tutorial section <sec_completing_forward_simulations_input_roots>`
120121
for details.
121122

122123
### Removing mutations above the root
123124

124-
You can also end up with mutations above the root, for instance when all the chosen samples
125-
are monomorphic, sharing a single derived mutation. In long running forward-time
126-
simulations with mutation, this many mutations like this can gather above each local root.
125+
You can also end up with mutations above local roots, for instance when all the chosen
126+
samples share a single derived mutation. In long running forward-time
127+
simulations with mutation, many such mutations can gather above each local root.
127128
These can be removed by re-setting the `ancestral_state` of a site to the `derived_state`
128129
of the mutation immediately above the root node. There is currently no method
129130
provided to do this, but the following code should work
@@ -147,7 +148,7 @@ def remove_root_mutations(ts):
147148
if all([anc == anc_state for anc in root_states.values()]):
148149
if anc_state != s.ancestral_state:
149150
print(
150-
f"Changed ancestral state from {s.ancestral_state} to {anc_state} "
151+
f"- changed ancestral state from {s.ancestral_state} to {anc_state} "
151152
f"for site {s.id} at position {s.position}"
152153
)
153154
tables.sites.append(s.replace(ancestral_state=anc_state))
@@ -164,7 +165,7 @@ for var1, var2 in zip(simp.variants(), simp_no_root_muts.variants()):
164165
assert (var1.states() == var2.states()).all()
165166
print(
166167
f"Original simplified tree sequence had {simp.num_mutations} mutations, "
167-
f"now has {simp_no_root_muts.num_mutations} mutations")
168+
f"and now has {simp_no_root_muts.num_mutations} mutations")
168169
```
169170

170171
## 3) Keeping ancestral individuals
@@ -179,10 +180,10 @@ which is relevant to the genetic ancestry (see also discussion in the SLiM manua
179180
[SLiM issue #139](https://github.com/MesserLab/SLiM/issues/139)).
180181

181182
To keep all the individuals associated with genetic ancestry, you can use
182-
`keep_unary_in_individuals=True`. In particular, this means
183-
that ancestral nodes which are not coalescent anywhere along the genome,
184-
but which are associated with an individual, will be retained (and
185-
so the referenced individuals will be retained too).
183+
`keep_unary_in_individuals=True`. In particular this will retain
184+
ancestral nodes which are are associated with an individual but not
185+
coalescent anywhere along the genome, and cause
186+
the referenced individuals to be retained too.
186187

187188
:::{todo}
188189
Should we have a demonstration here? {ref}`sec_tskit_forward_simulations` could be used to
@@ -226,12 +227,20 @@ all such nodes, which can be done by adding them to the focal node list but usin
226227
This is usually followed by an additional simplification pass to remove any topology
227228
above the additional nodes which is not ancestral to the true samples. Here are some examples:
228229

230+
:::{todo}
231+
Currently the code below wrongly includes a few extra RE and CA nodes, because nodes
232+
above local roots are retained when `keep_unary=True`, see
233+
https://github.com/tskit-dev/tskit/issues/3450.
234+
:::
235+
229236
#### Keeping non-coalescent regions of coalescent nodes
230237

231238
By default, `simplify()` deletes not just non-coalescent nodes, but also
232239
removes from the ancestry those regions of coalescent nodes which do not represent
233-
a coalescence in the local tree (i.e. are "locally unary"). We can identify coalescent
234-
nodes using an initial round of simplification:
240+
a coalescence in the local tree (i.e. are "locally unary"). Until tskit has this
241+
functionality [built-in](https://github.com/tskit-dev/tskit/issues/2127), we can
242+
implement similar functionality by identifying coalescent nodes using an initial
243+
round of simplification, then keeping those specific nodes through simplification:
235244

236245
```{code-cell}
237246
simp, node_map = arg.simplify(focal, map_nodes=True)
@@ -244,8 +253,8 @@ part_simp_ts = tmp_ts.simplify(keep_unary=True)
244253
```
245254

246255
Often, leaving in the non-coalescent regions of coalescent nodes can lead to a reduction
247-
in the number of edges. This is one reason that the {meth}`TreeSequence.extend_haplotypes` exists
248-
(see the documentation of that method for more detail).
256+
in the number of edges. This is one reason that the {meth}`TreeSequence.extend_haplotypes`
257+
exists (see the documentation of that method for more detail).
249258

250259
```{code-cell}
251260
print(f"Full simplification to samples {focal} leads to {simp.num_edges} edges")
@@ -284,11 +293,6 @@ print("RE nodes before simplification\n", re_node_pairs)
284293
Identifying the _msprime_ recombination nodes that stay as pairs after simplification requires
285294
a little work:
286295

287-
:::{todo}
288-
Currently the code below wrongly includes a few extra RE and CA nodes, because nodes
289-
above local roots are retained when `keep_unary=True`, see
290-
https://github.com/tskit-dev/tskit/issues/3450.
291-
:::
292296

293297
```{code-cell} ipython3
294298
simp, node_map = arg.simplify(focal, keep_unary=True, map_nodes=True)

0 commit comments

Comments
 (0)