Skip to content

Commit 35e08a9

Browse files
authored
Merge pull request #312 from hyanwong/multi-viz
More phylogen editiing
2 parents 5d94489 + 110984c commit 35e08a9

1 file changed

Lines changed: 66 additions & 66 deletions

File tree

phylogen.md

Lines changed: 66 additions & 66 deletions
Original file line numberDiff line numberDiff line change
@@ -18,9 +18,9 @@ kernelspec:
1818

1919
# {program}`Tskit` for phylogenetics
2020

21-
{program}`Tskit`, the tree sequence toolkit, can be used as an efficient library for
22-
very large evolutionary trees. {program}`Tskit` makes it easy to deal with trees with millions of
23-
tips, as in the example below:
21+
{program}`Tskit`, the tree sequence toolkit, is an efficient library for very large
22+
evolutionary trees. It makes trees with millions of tips straightforward to store
23+
and analyse, as in the example below:
2424

2525
```{code-cell}
2626
:"tags": ["hide-input"]
@@ -130,12 +130,12 @@ def collapse_tree_for_plot(tree, max_tips=20, style=None):
130130
%%time
131131
132132
num_tips = 1_000_000
133-
big_tree = tskit.Tree.generate_comb(num_tips);
133+
big_tree = tskit.Tree.generate_comb(num_tips)
134134
print("Tree sequence takes up", big_tree.tree_sequence.nbytes / 1024**2, "Mb")
135135
print(f"Generating a 'comb' (pectinate) tree of {num_tips} tips took:")
136136
```
137137

138-
Clearly, plotting a tree with a million tips is problematic, but we can draw a summary
138+
Plotting a tree with a million tips is impractical, but we can draw a summary
139139
(see the {ref}`visualization tutorial<sec_tskit_viz_SVG_examples_larger_plots>` for details):
140140

141141

@@ -149,7 +149,7 @@ plot_tree.draw_svg(
149149
```
150150

151151

152-
Calculations on these huge trees can be very efficient:
152+
Calculations on these large trees can be very efficient:
153153

154154
```{code-cell}
155155
%%time
@@ -181,27 +181,27 @@ import tsconvert # used for reading tree sequences from different formats
181181
182182
# example code reading in a large file, timed
183183
184-
# Or read smaller trees from strings (here we create a tree spanning 1000 genomic units)
184+
# Read a smaller tree from a string (here we create a tree spanning 1000 genomic units)
185185
ts = tsconvert.from_newick("(A:6,((B:1,C:1):2,(D:2,E:2):1):3);", span=1000)
186186
```
187187

188188
The "succinct tree sequence" format used by {program}`tskit` can also store mutations
189189
(and optionally a reference genome) along with the tree(s). This results in a
190190
single unified representation of large genomic datasets, storing trees,
191-
sequence data and metadata in a single efficient structure. Examples are given
192-
in the section below entitled {ref}`sec_phylogen_unified_structure`.
191+
sequence data, and metadata in a single efficient structure. See
192+
{ref}`sec_phylogen_unified_structure` for examples.
193193

194194
As the name suggests, a tree sequence can also store and analyse a sequence of
195195
trees along a genome (i.e. a "phylogenetic network"). This is necessary to
196196
account for recombination between lineages, and may be important even when looking at
197197
species-level phylogenies due to the effects of hybridization and incomplete lineage
198-
sorting. An overview, and links to further details are given at the
198+
sorting. An overview and links to further details are given at the
199199
{ref}`end of this page <sec_phylogen_multiple_trees>`.
200200

201201
## Hints for phylogeneticists
202202

203-
Unlike other phylogenetic libraries, {program}`tskit` is designed to efficiently store not just
204-
single trees, but sequences of correlated trees along a genome. This means that the
203+
Unlike other phylogenetic libraries, {program}`tskit` is designed to efficiently store not
204+
just single trees, but also sequences of correlated trees along a genome. This means that the
205205
library has some features not found in more standard phylogenetic libraries.
206206
Here we focus on the {ref}`sec_python_api`,
207207
introducing seven {program}`tskit` concepts that may be useful to those with a background in
@@ -226,10 +226,10 @@ phylogenetics (each is linked to a separate section below):
226226
(sec_phylogen_tree_in_sequence)=
227227
### Trees are always part of a tree sequence
228228

229-
In tskit, all trees are stored as a "tree sequence" of correlated trees. This allows
230-
easy extension of the library to multiple trees produced e.g. by hybridization.
229+
In {program}`tskit`, all trees are stored as a "tree sequence" of correlated trees.
230+
This extends naturally to multiple trees, such as those produced by hybridization.
231231
In the simplest case, however, the tree sequence can contain just a single tree. This
232-
can be obtained using the {meth}`~TreeSequence.first()` method.
232+
tree can be obtained using the {meth}`~TreeSequence.first()` method.
233233

234234
```{code-cell}
235235
:"tags": ["hide-input"]
@@ -257,7 +257,7 @@ tree.tree_sequence # When output in a notebook, prints a summary of the tree se
257257
(sec_phylogen_ids)=
258258
### Integer node and edge IDs
259259

260-
The plot above labels nodes by their name, but internally the {program}`tskit` library relies
260+
The plot above labels nodes by name, but internally the {program}`tskit` library relies
261261
heavily on integer IDs. Here's the same tree with node IDs plotted instead:
262262

263263
```{code-cell}
@@ -268,35 +268,35 @@ tree.draw_svg()
268268
#### Nodes
269269

270270
Each {ref}`node<sec_terminology_nodes>` in a tree sequence is allocated an
271-
integer ID starting from 0 to `ts.num_nodes - 1` (IDs can be allocated in any order;
272-
often the tips are labelled starting from 0 but this is not necessarily so, and
271+
integer ID from 0 to `ts.num_nodes - 1` (IDs can be allocated in any order;
272+
often the tips are labelled starting from 0, but this is not guaranteed, and
273273
is not the case in the example above).
274274

275275
For efficiency reasons, tree traversal routines, as well as many other {program}`tskit`
276-
methods, tend to return integer IDs. You can use this ID to get specific information
276+
methods, tend to return integer IDs. You can use these IDs to get specific information
277277
about the node and its position in the tree, for example
278278

279279
```{code-cell}
280280
node_id = 4
281281
parent_id = tree.parent(node_id)
282282
child_ids = tree.children(node_id)
283283
print("The parent of", node_id, "is", parent_id, "and its children are", child_ids)
284-
# or e.g. get all parents them as an array (where -1 means there is no parent)
284+
# or get all parents as an array (where -1 means there is no parent)
285285
print(f"The parents of nodes 0..{ts.num_nodes-1} are", tree.parent_array)
286286
```
287287

288288
Other methods also exist to
289289
{ref}`examine nodes in a tree<sec_python_api_trees_node_measures>`, e.g.
290290
{meth}`Tree.is_leaf`, {meth}`Tree.mrca` for the most recent common ancestor between
291-
2 or more nodes, etc.
291+
two or more nodes, etc.
292292

293293
#### Edges
294294

295-
Rather than refer to "branches" of a tree, tskit tends to refer to
295+
Rather than refer to "branches" of a tree, {program}`tskit` tends to refer to
296296
{ref}`sec_terminology_edges` (the term "edge" emphasises that these can span
297297
{ref}`sec_phylogen_multiple_trees`, although for tree sequences containing a single
298298
tree, the terms are interchangeable). Like other entities in {program}`tskit`, edges
299-
are referred to by an integer ID. For instance, here is the edge above the internal node 4
299+
are referred to by an integer ID. For instance, here is the edge above the internal node 4:
300300

301301
```{code-cell}
302302
node_id = 4
@@ -313,10 +313,10 @@ important in tree sequences that contain more than one tree.
313313
### The `Tree` object
314314

315315
The {class}`Tree` object has {ref}`methods<sec_python_api_trees>` to perform basic operations
316-
on a tree such as traversing the nodes, identifying parents, children, and common
317-
ancestors, etc. {ref}`Several methods<sec_python_api_trees_node_measures_array>`
316+
on a tree, such as traversing nodes and identifying parents, children, and common
317+
ancestors. {ref}`Several methods<sec_python_api_trees_node_measures_array>`
318318
also return numpy arrays for use in
319-
{ref}`efficient algorithms using numba<sec_trees_numba>`
319+
{ref}`efficient algorithms using numba<sec_trees_numba>`.
320320

321321
```{code-cell}
322322
for n_id in tree.nodes(order="postorder"):
@@ -326,25 +326,25 @@ for n_id in tree.nodes(order="postorder"):
326326
print("Node IDs in postorder:", tree.postorder())
327327
```
328328

329-
Various phylogenetic statistics are also available on trees, e.g
329+
Various phylogenetic statistics are also available on trees, for example:
330330

331331
```{code-cell}
332-
print(f"The colless imbalance index is {tree.colless_index()}")
332+
print(f"The Colless imbalance index is {tree.colless_index()}")
333333
```
334334

335335
See {ref}`sec_phylogen_methods` for more examples.
336336

337337
(sec_phylogen_samples)=
338338
### Sample nodes
339339

340-
Often we are only have detailed information about specific nodes that we have sampled,
340+
Often, we only have detailed information about specific nodes that we have sampled,
341341
such as genomes A, B, C, D, and E in the example above. These are designated as
342342
*sample nodes*, and are plotted as square nodes. The concept of
343343
{ref}`sample nodes<sec_data_model_definitions_sample>` is integral
344344
to the {program}`tskit` format. They can be identified by using the
345345
{meth}`Node.is_sample` and {meth}`Tree.is_sample` methods, or can be listed using
346346
{meth}`TreeSequence.samples` or {meth}`Tree.samples()` (internally, the `node.flags`
347-
field is used to {ref}`flag up<sec_node_table_definition>` which nodes are samples):
347+
field is used to {ref}`record<sec_node_table_definition>` which nodes are samples):
348348

349349
```{code-cell}
350350
for n_id in tree.nodes():
@@ -355,7 +355,7 @@ print("Sample nodes are", tree.tree_sequence.samples())
355355
```
356356

357357
Often the sample nodes are the leaves of a tree, but this need not be the case. There
358-
are fast methods for identifying the sample nodes under an internal node in the tree,
358+
are fast methods for identifying the sample nodes below an internal node in the tree,
359359
etc.
360360

361361

@@ -371,16 +371,16 @@ node object via the `tree_sequence` to which this tree belongs:
371371
tree.tree_sequence.node(node_id) # or simply ts.node(node_id)
372372
```
373373

374-
Attributes such as `id`, `flags` and `time` are always present. Arbitrary information,
375-
such a name or e.g. bootstrap values, are stored in *metadata*
374+
Attributes such as `id`, `flags`, and `time` are always present. Arbitrary information,
375+
such as a name or bootstrap values, is stored in *metadata*.
376376

377377
```{code-cell}
378378
for n_id in tree.nodes():
379379
print("Node", n_id, tree.tree_sequence.node(n_id).metadata.get("name", "<no name>"))
380380
```
381381

382-
However, for large datasets, it may be more efficient to access the array of e.g.
383-
times for all nodes, which provides direct memory access into the
382+
However, for large datasets, it may be more efficient to access the time array for
383+
all nodes, which provides direct memory access into the
384384
{ref}`tables<sec_tables>` that underlie the tree sequence format:
385385

386386
```{code-cell}
@@ -391,22 +391,23 @@ tree.tree_sequence.nodes_time
391391
(sec_phylogen_node_time)=
392392
### Nodes must have times
393393

394-
Perhaps the most noticable different between a {program}`tskit` tree and the encoding of trees
394+
Perhaps the most noticeable difference between a {program}`tskit` tree and the encoding of trees
395395
in other phylogenetic libraries is that {program}`tskit` does not explicitly store branch lengths.
396396
Instead, each node has a *time* associated with it. Branch lengths can therefore be
397397
found by calculating the difference between the time of a node and the time of its
398398
parent node.
399399

400-
Since nodes *must* have a time, {program}`tskit` trees aways have these (implicit) branch
400+
Since nodes *must* have a time, {program}`tskit` trees always have implicit branch
401401
lengths. To represent a tree ("cladogram") in which the branch lengths are not
402402
meaningful, the {attr}`TreeSequence.time_units` of a tree sequence can be
403-
specified as `"uncalibrated"` (see below)
403+
specified as `"uncalibrated"` (see below).
404404

405405
Another implication of storing node times rather than branch lengths is that {program}`tskit`
406406
trees are always directional (i.e. they are "rooted"). The reason that {program}`tskit` stores
407-
times of nodes (rather than e.g. genetic distances between them) is to ensure temporal
408-
consistency. In particular it makes it impossible for a node to be an ancestor of a
409-
node in one tree, and a descendant of the same node in another tree in the tree sequence.
407+
node times (rather than e.g. genetic distances between them) is to ensure temporal
408+
consistency. In particular, it makes it impossible for one node to be an ancestor of
409+
another node in one tree, and a descendant of that same node in another tree in the
410+
tree sequence.
410411
This is of critical importance when extending the concept of genetic ancestry to
411412
{ref}`sec_phylogen_multiple_trees` along a genome.
412413

@@ -442,13 +443,13 @@ print(
442443
"and",
443444
target_node_2,
444445
"is",
445-
tree.distance_between(target_node_1, target_node_2), # beware: tree.path_length counts number of edges
446+
tree.distance_between(target_node_1, target_node_2), # beware: tree.path_length counts edges
446447
)
447448
```
448449

449450
It is worth noting that this distance is the basis for the "genetic divergence"
450451
between two samples in a tree. For this reason, an equivalent way to carry out the
451-
calculation is to use {meth}`TreeSequence.divergence`, part of the the standard {program}`tskit`
452+
calculation is to use {meth}`TreeSequence.divergence`, part of the standard {program}`tskit`
452453
{ref}`sec_stats` framework, setting `mode="branch"` and
453454
`windows="trees"`. This is a more flexible approach, as it allows the distance between
454455
multiple sets of samples in {ref}`sec_phylogen_multiple_trees` to be calculated
@@ -467,7 +468,7 @@ print(
467468
(sec_phylogen_multiroot)=
468469
### Roots and multiroot trees
469470

470-
In {program}`tskit`, {ref}`sec_data_model_tree_roots` of trees are defined with respect to the
471+
In {program}`tskit`, {ref}`tree roots<sec_data_model_tree_roots>` are defined with respect to the
471472
sample nodes. In particular, if we move back in time along the tree branches from a
472473
sample, the oldest node that we encounter is defined as a root. The ID of a root can be
473474
obtained using {attr}`Tree.root`:
@@ -477,9 +478,9 @@ print("The root node of the following tree has ID", tree.root)
477478
tree.draw_svg()
478479
```
479480

480-
But in {program}`tskit`, we can also create a single "tree" consisting of multiple unlinked
481+
We can also create a single "tree" consisting of multiple unlinked
481482
clades. In our example, we can create one of these phylogenetically unusual objects
482-
if we remove the edge above node 4, by
483+
by removing the edge above node 4 via
483484
{ref}`editing the underlying tables<sec_tables_editing>`:
484485

485486
```{code-cell}
@@ -492,15 +493,15 @@ new_tree = new_ts.first()
492493
new_tree.draw_svg()
493494
```
494495

495-
Although there are two separate topologies in this plot, in {program}`tskit` terminology,
496-
it is considered a single tree, but with two roots:
496+
Although there are two separate topologies in this plot, in {program}`tskit` terminology
497+
this is considered a single tree with two roots:
497498

498499
```{code-cell}
499500
print("The first tree has", len(new_tree.roots), "roots:", new_tree.roots)
500501
```
501502

502503
This also means that if we have no topology at all (i.e. an "empty tree"), each
503-
sample is its own root.
504+
sample is its own root:
504505

505506
```{code-cell}
506507
tables.edges.clear()
@@ -510,9 +511,9 @@ print("This empty tree has", len(empty_tree.roots), "roots:", empty_tree.roots)
510511
empty_tree.draw_svg()
511512
```
512513

513-
The samples here are {ref}`sec_data_model_tree_isolated_nodes`. This may seem like a
514+
The samples here are {ref}`isolated nodes<sec_data_model_tree_isolated_nodes>`. This may seem like a
514515
strange corner case, but in {program}`tskit`, isolated sample nodes are used to represent
515-
{ref}`sec_data_model_missing_data`. This therefore represents a tree in which
516+
{ref}`sec_data_model_missing_data`. The empty tree therefore represents a case in which
516517
relationships between the samples are not known. This could apply, for instance,
517518
in regions of the genome where no genetic data exists, or where genetic ancestry
518519
has not been simulated.
@@ -532,16 +533,16 @@ Demo some phylogenetic methods. e.g.
532533
(sec_phylogen_unified_structure)=
533534
## Storing and accessing genetic data
534535

535-
{program}`Tskit` has been designed to capture both evolutionary tree topologies and the genetic
536-
sequences that evolve along the branches of these trees. This is achieved by defining
537-
{ref}`sec_terminology_mutations_and_sites` which are associated with specific positions
536+
{program}`Tskit` is designed to capture both evolutionary tree topologies and the genetic
537+
sequences that evolve along the branches of these trees. It does this by defining
538+
{ref}`mutations and sites<sec_terminology_mutations_and_sites>`, which are associated with specific positions
538539
along the genome.
539540

540541
```{code-cell}
541-
import msprime # The `msprime` package can throw mutations onto a tree sequence
542+
import msprime # The `msprime` package can add mutations to a tree sequence
542543
mutated_ts = msprime.sim_mutations(ts, rate=3e-3, random_seed=321)
543544
mutated_tree = mutated_ts.first()
544-
print("Variable sites with the following IDs generated")
545+
print("Variable sites with the following IDs were generated:")
545546
for site in mutated_tree.sites():
546547
print(
547548
f"Site ID {site.id} @ genomic position {site.position:g}:",
@@ -550,9 +551,9 @@ for site in mutated_tree.sites():
550551
mutated_tree.draw_svg()
551552
```
552553

553-
Mutations occur above nodes in a tree, with all the descendant
554-
nodes inheriting that specific mutation (unless replaced by a subsequent
555-
mutation at the same site). This allows genetic variation to be
554+
Mutations occur above nodes in a tree, and all descendant nodes inherit
555+
that specific mutation (unless it is replaced by a subsequent mutation at the
556+
same site). This allows genetic variation to be
556557
{ref}`efficiently represented<sec_what_is_dna_data>` using the tree topology.
557558
To obtain the genetic variation at each site across the entire genome, you can use the
558559
{meth}`TreeSequence.sites` method, or (less efficiently), you can use
@@ -571,7 +572,7 @@ for node_id, alignment in zip(
571572
(sec_phylogen_multiple_trees)=
572573
## Multiple trees
573574

574-
Where {program}`tskit` really shines is when the ancestry of your dataset cannot be adequately
575+
{program}`Tskit` is particularly useful when the ancestry of your dataset cannot be adequately
575576
represented by a single tree. This is a pervasive issue in genomes (even from different
576577
species) that have undergone recombination in the past. The resulting series of
577578
{ref}`local trees<sec_what_is_local_trees>` along a genome are highly correlated
@@ -588,20 +589,19 @@ tables = ts.dump_tables()
588589
edge_id_above_node_4 = ts.first().edge(4)
589590
left_coord_for_edges = tables.edges.left
590591
left_coord_for_edges[edge_id_above_node_4] = 50
591-
tables.edges.left = left_coord_for_edges # reset the right coords
592+
tables.edges.left = left_coord_for_edges # apply the modified left coordinates
592593
tables.sort()
593594
multi_ts = tables.tree_sequence()
594595
595596
multi_ts.draw_svg()
596597
```
597598

598-
For the left hand side of the genome we lack information about the ancestry of
599-
node 4, but for the right hand side we know this information. The result is to
600-
generate 2 trees in the tree sequence, which differ only in the presence of absence of
601-
a single branch. We do not have to separately store the entire tree on the right: all
599+
For the left-hand side of the genome, we lack information about the ancestry of
600+
node 4, but for the right-hand side, we know this information. This generates two
601+
trees in the tree sequence, which differ only in the presence or absence of
602+
a single branch. We do not have to store the entire tree on the right separately: all
602603
the edges that are shared between trees are stored only once.
603604

604605
The rest of the {program}`tskit` tutorials will lead you through the concepts involved with
605606
storing and analysing sequences of many correlated trees. For a simple introduction, you
606607
might want to start with {ref}`sec_what_is`.
607-

0 commit comments

Comments
 (0)