@@ -19,7 +19,7 @@ kernelspec:
1919# _ Advanced simplification_
2020% remove underscores in title when tutorial is complete or near-complete
2121
22- :::{todo }
22+ :::{note }
2323This tutorial is only partly complete, and there are a number of sections containing TODO items.
2424:::
2525
@@ -41,8 +41,8 @@ import tskit_arg_visualizer as argviz
4141
4242arg = msprime.sim_ancestry(
4343 samples=10,
44- sequence_length=1.8e4 ,
45- recombination_rate=1e -8,
44+ sequence_length=1e4 ,
45+ recombination_rate=2e -8,
4646 population_size=1e4,
4747 record_full_arg=True,
4848 random_seed=123,
@@ -84,7 +84,7 @@ is always returned. To avoid compacting the node table, and leave node IDs uncha
8484### Obtaining the reverse map
8585
8686Often you might want a reverse map, mapping the new node IDs to the old ones. Here's
87- a simple way to do this :
87+ a simple way to do that :
8888
8989``` {code-cell}
9090def invert_map(node_mapping):
@@ -98,32 +98,33 @@ reverse_map = invert_map(node_map)
9898print("New sample ID 0", "maps to old ID", int(reverse_map[0]))
9999```
100100
101- Here's how the IDs in the first tree have changed:
101+ Here's a visualization of how the IDs in the first tree have changed:
102102
103103``` {code-cell}
104104simp.first().draw_svg(
105105 size=(800, 300),
106+ time_scale="rank",
106107 node_labels={nd.id: f"id:{nd.id} (old id:{reverse_map[nd.id]})" for nd in simp.nodes()}
107108)
108109```
109110
110111## 2) Information above the local roots
111112
112113Usually there are no nodes or mutations in a tree sequence above each local root.
113- However, as simplification deletes topology, it can create new local roots, leading to this
114- expectation being broken .
114+ However, because simplification deletes topology, it can cause a node lower down a tree to
115+ become a new local root. This new root can therefore have nodes and mutations above it .
115116
116- In some cases, you might want to retain nodes above the local roots, which is possible by
117+ In some cases, you might want to retain nodes above new roots, which is possible by
117118setting ` keep_input_roots=True ` . The most common reason for this is to allow
118119{ref}` recapitation <sec_completing_forward_simulations> ` of forward-time simulations.
119120See {ref}` this tutorial section <sec_completing_forward_simulations_input_roots> `
120121for details.
121122
122123### Removing mutations above the root
123124
124- You can also end up with mutations above the root , for instance when all the chosen samples
125- are monomorphic, sharing a single derived mutation. In long running forward-time
126- simulations with mutation, this many mutations like this can gather above each local root.
125+ You can also end up with mutations above local roots , for instance when all the chosen
126+ samples share a single derived mutation. In long running forward-time
127+ simulations with mutation, many such mutations can gather above each local root.
127128These can be removed by re-setting the ` ancestral_state ` of a site to the ` derived_state `
128129of the mutation immediately above the root node. There is currently no method
129130provided to do this, but the following code should work
@@ -147,7 +148,7 @@ def remove_root_mutations(ts):
147148 if all([anc == anc_state for anc in root_states.values()]):
148149 if anc_state != s.ancestral_state:
149150 print(
150- f"Changed ancestral state from {s.ancestral_state} to {anc_state} "
151+ f"- changed ancestral state from {s.ancestral_state} to {anc_state} "
151152 f"for site {s.id} at position {s.position}"
152153 )
153154 tables.sites.append(s.replace(ancestral_state=anc_state))
@@ -164,7 +165,7 @@ for var1, var2 in zip(simp.variants(), simp_no_root_muts.variants()):
164165 assert (var1.states() == var2.states()).all()
165166print(
166167 f"Original simplified tree sequence had {simp.num_mutations} mutations, "
167- f"now has {simp_no_root_muts.num_mutations} mutations")
168+ f"and now has {simp_no_root_muts.num_mutations} mutations")
168169```
169170
170171## 3) Keeping ancestral individuals
@@ -179,10 +180,10 @@ which is relevant to the genetic ancestry (see also discussion in the SLiM manua
179180[ SLiM issue #139 ] ( https://github.com/MesserLab/SLiM/issues/139 ) ).
180181
181182To keep all the individuals associated with genetic ancestry, you can use
182- ` keep_unary_in_individuals=True ` . In particular, this means
183- that ancestral nodes which are not coalescent anywhere along the genome,
184- but which are associated with an individual, will be retained ( and
185- so the referenced individuals will be retained too) .
183+ ` keep_unary_in_individuals=True ` . In particular this will retain
184+ ancestral nodes which are are associated with an individual but not
185+ coalescent anywhere along the genome, and cause
186+ the referenced individuals to be retained too.
186187
187188:::{todo}
188189Should we have a demonstration here? {ref}` sec_tskit_forward_simulations ` could be used to
@@ -226,12 +227,20 @@ all such nodes, which can be done by adding them to the focal node list but usin
226227This is usually followed by an additional simplification pass to remove any topology
227228above the additional nodes which is not ancestral to the true samples. Here are some examples:
228229
230+ :::{todo}
231+ Currently the code below wrongly includes a few extra RE and CA nodes, because nodes
232+ above local roots are retained when ` keep_unary=True ` , see
233+ https://github.com/tskit-dev/tskit/issues/3450 .
234+ :::
235+
229236#### Keeping non-coalescent regions of coalescent nodes
230237
231238By default, ` simplify() ` deletes not just non-coalescent nodes, but also
232239removes from the ancestry those regions of coalescent nodes which do not represent
233- a coalescence in the local tree (i.e. are "locally unary"). We can identify coalescent
234- nodes using an initial round of simplification:
240+ a coalescence in the local tree (i.e. are "locally unary"). Until tskit has this
241+ functionality [ built-in] ( https://github.com/tskit-dev/tskit/issues/2127 ) , we can
242+ implement similar functionality by identifying coalescent nodes using an initial
243+ round of simplification, then keeping those specific nodes through simplification:
235244
236245``` {code-cell}
237246simp, node_map = arg.simplify(focal, map_nodes=True)
@@ -244,8 +253,8 @@ part_simp_ts = tmp_ts.simplify(keep_unary=True)
244253```
245254
246255Often, leaving in the non-coalescent regions of coalescent nodes can lead to a reduction
247- in the number of edges. This is one reason that the {meth}` TreeSequence.extend_haplotypes ` exists
248- (see the documentation of that method for more detail).
256+ in the number of edges. This is one reason that the {meth}` TreeSequence.extend_haplotypes `
257+ exists (see the documentation of that method for more detail).
249258
250259``` {code-cell}
251260print(f"Full simplification to samples {focal} leads to {simp.num_edges} edges")
@@ -284,11 +293,6 @@ print("RE nodes before simplification\n", re_node_pairs)
284293Identifying the _ msprime_ recombination nodes that stay as pairs after simplification requires
285294a little work:
286295
287- :::{todo}
288- Currently the code below wrongly includes a few extra RE and CA nodes, because nodes
289- above local roots are retained when ` keep_unary=True ` , see
290- https://github.com/tskit-dev/tskit/issues/3450 .
291- :::
292296
293297``` {code-cell} ipython3
294298simp, node_map = arg.simplify(focal, keep_unary=True, map_nodes=True)
0 commit comments