@@ -18,9 +18,9 @@ kernelspec:
1818
1919# {program}` Tskit ` for phylogenetics
2020
21- {program}` Tskit ` , the tree sequence toolkit, can be used as an efficient library for
22- very large evolutionary trees. {program} ` Tskit ` makes it easy to deal with trees with millions of
23- tips , as in the example below:
21+ {program}` Tskit ` , the tree sequence toolkit, is an efficient library for very large
22+ evolutionary trees. It makes trees with millions of tips straightforward to store
23+ and analyse , as in the example below:
2424
2525``` {code-cell}
2626:"tags": ["hide-input"]
@@ -130,12 +130,12 @@ def collapse_tree_for_plot(tree, max_tips=20, style=None):
130130%%time
131131
132132num_tips = 1_000_000
133- big_tree = tskit.Tree.generate_comb(num_tips);
133+ big_tree = tskit.Tree.generate_comb(num_tips)
134134print("Tree sequence takes up", big_tree.tree_sequence.nbytes / 1024**2, "Mb")
135135print(f"Generating a 'comb' (pectinate) tree of {num_tips} tips took:")
136136```
137137
138- Clearly, plotting a tree with a million tips is problematic , but we can draw a summary
138+ Plotting a tree with a million tips is impractical , but we can draw a summary
139139(see the {ref}` visualization tutorial<sec_tskit_viz_SVG_examples_larger_plots> ` for details):
140140
141141
@@ -149,7 +149,7 @@ plot_tree.draw_svg(
149149```
150150
151151
152- Calculations on these huge trees can be very efficient:
152+ Calculations on these large trees can be very efficient:
153153
154154``` {code-cell}
155155%%time
@@ -181,27 +181,27 @@ import tsconvert # used for reading tree sequences from different formats
181181
182182# example code reading in a large file, timed
183183
184- # Or read smaller trees from strings (here we create a tree spanning 1000 genomic units)
184+ # Read a smaller tree from a string (here we create a tree spanning 1000 genomic units)
185185ts = tsconvert.from_newick("(A:6,((B:1,C:1):2,(D:2,E:2):1):3);", span=1000)
186186```
187187
188188The "succinct tree sequence" format used by {program}` tskit ` can also store mutations
189189(and optionally a reference genome) along with the tree(s). This results in a
190190single unified representation of large genomic datasets, storing trees,
191- sequence data and metadata in a single efficient structure. Examples are given
192- in the section below entitled {ref}` sec_phylogen_unified_structure ` .
191+ sequence data, and metadata in a single efficient structure. See
192+ {ref}` sec_phylogen_unified_structure ` for examples .
193193
194194As the name suggests, a tree sequence can also store and analyse a sequence of
195195trees along a genome (i.e. a "phylogenetic network"). This is necessary to
196196account for recombination between lineages, and may be important even when looking at
197197species-level phylogenies due to the effects of hybridization and incomplete lineage
198- sorting. An overview, and links to further details are given at the
198+ sorting. An overview and links to further details are given at the
199199{ref}` end of this page <sec_phylogen_multiple_trees> ` .
200200
201201## Hints for phylogeneticists
202202
203- Unlike other phylogenetic libraries, {program}` tskit ` is designed to efficiently store not just
204- single trees, but sequences of correlated trees along a genome. This means that the
203+ Unlike other phylogenetic libraries, {program}` tskit ` is designed to efficiently store not
204+ just single trees, but also sequences of correlated trees along a genome. This means that the
205205library has some features not found in more standard phylogenetic libraries.
206206Here we focus on the {ref}` sec_python_api ` ,
207207introducing seven {program}` tskit ` concepts that may be useful to those with a background in
@@ -226,10 +226,10 @@ phylogenetics (each is linked to a separate section below):
226226(sec_phylogen_tree_in_sequence)=
227227### Trees are always part of a tree sequence
228228
229- In tskit, all trees are stored as a "tree sequence" of correlated trees. This allows
230- easy extension of the library to multiple trees produced e.g. by hybridization.
229+ In {program} ` tskit ` , all trees are stored as a "tree sequence" of correlated trees.
230+ This extends naturally to multiple trees, such as those produced by hybridization.
231231In the simplest case, however, the tree sequence can contain just a single tree. This
232- can be obtained using the {meth}` ~TreeSequence.first() ` method.
232+ tree can be obtained using the {meth}` ~TreeSequence.first() ` method.
233233
234234``` {code-cell}
235235:"tags": ["hide-input"]
@@ -257,7 +257,7 @@ tree.tree_sequence # When output in a notebook, prints a summary of the tree se
257257(sec_phylogen_ids)=
258258### Integer node and edge IDs
259259
260- The plot above labels nodes by their name, but internally the {program}` tskit ` library relies
260+ The plot above labels nodes by name, but internally the {program}` tskit ` library relies
261261heavily on integer IDs. Here's the same tree with node IDs plotted instead:
262262
263263``` {code-cell}
@@ -268,35 +268,35 @@ tree.draw_svg()
268268#### Nodes
269269
270270Each {ref}` node<sec_terminology_nodes> ` in a tree sequence is allocated an
271- integer ID starting from 0 to ` ts.num_nodes - 1 ` (IDs can be allocated in any order;
272- often the tips are labelled starting from 0 but this is not necessarily so , and
271+ integer ID from 0 to ` ts.num_nodes - 1 ` (IDs can be allocated in any order;
272+ often the tips are labelled starting from 0, but this is not guaranteed , and
273273is not the case in the example above).
274274
275275For efficiency reasons, tree traversal routines, as well as many other {program}` tskit `
276- methods, tend to return integer IDs. You can use this ID to get specific information
276+ methods, tend to return integer IDs. You can use these IDs to get specific information
277277about the node and its position in the tree, for example
278278
279279``` {code-cell}
280280node_id = 4
281281parent_id = tree.parent(node_id)
282282child_ids = tree.children(node_id)
283283print("The parent of", node_id, "is", parent_id, "and its children are", child_ids)
284- # or e.g. get all parents them as an array (where -1 means there is no parent)
284+ # or get all parents as an array (where -1 means there is no parent)
285285print(f"The parents of nodes 0..{ts.num_nodes-1} are", tree.parent_array)
286286```
287287
288288Other methods also exist to
289289{ref}` examine nodes in a tree<sec_python_api_trees_node_measures> ` , e.g.
290290{meth}` Tree.is_leaf ` , {meth}` Tree.mrca ` for the most recent common ancestor between
291- 2 or more nodes, etc.
291+ two or more nodes, etc.
292292
293293#### Edges
294294
295- Rather than refer to "branches" of a tree, tskit tends to refer to
295+ Rather than refer to "branches" of a tree, {program} ` tskit ` tends to refer to
296296{ref}` sec_terminology_edges ` (the term "edge" emphasises that these can span
297297{ref}` sec_phylogen_multiple_trees ` , although for tree sequences containing a single
298298tree, the terms are interchangeable). Like other entities in {program}` tskit ` , edges
299- are referred to by an integer ID. For instance, here is the edge above the internal node 4
299+ are referred to by an integer ID. For instance, here is the edge above the internal node 4:
300300
301301``` {code-cell}
302302node_id = 4
@@ -313,10 +313,10 @@ important in tree sequences that contain more than one tree.
313313### The ` Tree ` object
314314
315315The {class}` Tree ` object has {ref}` methods<sec_python_api_trees> ` to perform basic operations
316- on a tree such as traversing the nodes, identifying parents, children, and common
317- ancestors, etc . {ref}` Several methods<sec_python_api_trees_node_measures_array> `
316+ on a tree, such as traversing nodes and identifying parents, children, and common
317+ ancestors. {ref}` Several methods<sec_python_api_trees_node_measures_array> `
318318also return numpy arrays for use in
319- {ref}` efficient algorithms using numba<sec_trees_numba> `
319+ {ref}` efficient algorithms using numba<sec_trees_numba> ` .
320320
321321``` {code-cell}
322322for n_id in tree.nodes(order="postorder"):
@@ -326,25 +326,25 @@ for n_id in tree.nodes(order="postorder"):
326326print("Node IDs in postorder:", tree.postorder())
327327```
328328
329- Various phylogenetic statistics are also available on trees, e.g
329+ Various phylogenetic statistics are also available on trees, for example:
330330
331331``` {code-cell}
332- print(f"The colless imbalance index is {tree.colless_index()}")
332+ print(f"The Colless imbalance index is {tree.colless_index()}")
333333```
334334
335335See {ref}` sec_phylogen_methods ` for more examples.
336336
337337(sec_phylogen_samples)=
338338### Sample nodes
339339
340- Often we are only have detailed information about specific nodes that we have sampled,
340+ Often, we only have detailed information about specific nodes that we have sampled,
341341such as genomes A, B, C, D, and E in the example above. These are designated as
342342* sample nodes* , and are plotted as square nodes. The concept of
343343{ref}` sample nodes<sec_data_model_definitions_sample> ` is integral
344344to the {program}` tskit ` format. They can be identified by using the
345345{meth}` Node.is_sample ` and {meth}` Tree.is_sample ` methods, or can be listed using
346346{meth}` TreeSequence.samples ` or {meth}` Tree.samples() ` (internally, the ` node.flags `
347- field is used to {ref}` flag up <sec_node_table_definition>` which nodes are samples):
347+ field is used to {ref}` record <sec_node_table_definition>` which nodes are samples):
348348
349349``` {code-cell}
350350for n_id in tree.nodes():
@@ -355,7 +355,7 @@ print("Sample nodes are", tree.tree_sequence.samples())
355355```
356356
357357Often the sample nodes are the leaves of a tree, but this need not be the case. There
358- are fast methods for identifying the sample nodes under an internal node in the tree,
358+ are fast methods for identifying the sample nodes below an internal node in the tree,
359359etc.
360360
361361
@@ -371,16 +371,16 @@ node object via the `tree_sequence` to which this tree belongs:
371371tree.tree_sequence.node(node_id) # or simply ts.node(node_id)
372372```
373373
374- Attributes such as ` id ` , ` flags ` and ` time ` are always present. Arbitrary information,
375- such a name or e.g. bootstrap values, are stored in * metadata*
374+ Attributes such as ` id ` , ` flags ` , and ` time ` are always present. Arbitrary information,
375+ such as a name or bootstrap values, is stored in * metadata* .
376376
377377``` {code-cell}
378378for n_id in tree.nodes():
379379 print("Node", n_id, tree.tree_sequence.node(n_id).metadata.get("name", "<no name>"))
380380```
381381
382- However, for large datasets, it may be more efficient to access the array of e.g.
383- times for all nodes, which provides direct memory access into the
382+ However, for large datasets, it may be more efficient to access the time array for
383+ all nodes, which provides direct memory access into the
384384{ref}` tables<sec_tables> ` that underlie the tree sequence format:
385385
386386``` {code-cell}
@@ -391,22 +391,23 @@ tree.tree_sequence.nodes_time
391391(sec_phylogen_node_time)=
392392### Nodes must have times
393393
394- Perhaps the most noticable different between a {program}` tskit ` tree and the encoding of trees
394+ Perhaps the most noticeable difference between a {program}` tskit ` tree and the encoding of trees
395395in other phylogenetic libraries is that {program}` tskit ` does not explicitly store branch lengths.
396396Instead, each node has a * time* associated with it. Branch lengths can therefore be
397397found by calculating the difference between the time of a node and the time of its
398398parent node.
399399
400- Since nodes * must* have a time, {program}` tskit ` trees aways have these ( implicit) branch
400+ Since nodes * must* have a time, {program}` tskit ` trees always have implicit branch
401401lengths. To represent a tree ("cladogram") in which the branch lengths are not
402402meaningful, the {attr}` TreeSequence.time_units ` of a tree sequence can be
403- specified as ` "uncalibrated" ` (see below)
403+ specified as ` "uncalibrated" ` (see below).
404404
405405Another implication of storing node times rather than branch lengths is that {program}` tskit `
406406trees are always directional (i.e. they are "rooted"). The reason that {program}` tskit ` stores
407- times of nodes (rather than e.g. genetic distances between them) is to ensure temporal
408- consistency. In particular it makes it impossible for a node to be an ancestor of a
409- node in one tree, and a descendant of the same node in another tree in the tree sequence.
407+ node times (rather than e.g. genetic distances between them) is to ensure temporal
408+ consistency. In particular, it makes it impossible for one node to be an ancestor of
409+ another node in one tree, and a descendant of that same node in another tree in the
410+ tree sequence.
410411This is of critical importance when extending the concept of genetic ancestry to
411412{ref}` sec_phylogen_multiple_trees ` along a genome.
412413
@@ -442,13 +443,13 @@ print(
442443 "and",
443444 target_node_2,
444445 "is",
445- tree.distance_between(target_node_1, target_node_2), # beware: tree.path_length counts number of edges
446+ tree.distance_between(target_node_1, target_node_2), # beware: tree.path_length counts edges
446447)
447448```
448449
449450It is worth noting that this distance is the basis for the "genetic divergence"
450451between two samples in a tree. For this reason, an equivalent way to carry out the
451- calculation is to use {meth}` TreeSequence.divergence ` , part of the the standard {program}` tskit `
452+ calculation is to use {meth}` TreeSequence.divergence ` , part of the standard {program}` tskit `
452453{ref}` sec_stats ` framework, setting ` mode="branch" ` and
453454` windows="trees" ` . This is a more flexible approach, as it allows the distance between
454455multiple sets of samples in {ref}` sec_phylogen_multiple_trees ` to be calculated
@@ -467,7 +468,7 @@ print(
467468(sec_phylogen_multiroot)=
468469### Roots and multiroot trees
469470
470- In {program}` tskit ` , {ref}` sec_data_model_tree_roots ` of trees are defined with respect to the
471+ In {program}` tskit ` , {ref}` tree roots< sec_data_model_tree_roots> ` are defined with respect to the
471472sample nodes. In particular, if we move back in time along the tree branches from a
472473sample, the oldest node that we encounter is defined as a root. The ID of a root can be
473474obtained using {attr}` Tree.root ` :
@@ -477,9 +478,9 @@ print("The root node of the following tree has ID", tree.root)
477478tree.draw_svg()
478479```
479480
480- But in {program} ` tskit ` , we can also create a single "tree" consisting of multiple unlinked
481+ We can also create a single "tree" consisting of multiple unlinked
481482clades. In our example, we can create one of these phylogenetically unusual objects
482- if we remove the edge above node 4, by
483+ by removing the edge above node 4 via
483484{ref}` editing the underlying tables<sec_tables_editing> ` :
484485
485486``` {code-cell}
@@ -492,15 +493,15 @@ new_tree = new_ts.first()
492493new_tree.draw_svg()
493494```
494495
495- Although there are two separate topologies in this plot, in {program}` tskit ` terminology,
496- it is considered a single tree, but with two roots:
496+ Although there are two separate topologies in this plot, in {program}` tskit ` terminology
497+ this is considered a single tree with two roots:
497498
498499``` {code-cell}
499500print("The first tree has", len(new_tree.roots), "roots:", new_tree.roots)
500501```
501502
502503This also means that if we have no topology at all (i.e. an "empty tree"), each
503- sample is its own root.
504+ sample is its own root:
504505
505506``` {code-cell}
506507tables.edges.clear()
@@ -510,9 +511,9 @@ print("This empty tree has", len(empty_tree.roots), "roots:", empty_tree.roots)
510511empty_tree.draw_svg()
511512```
512513
513- The samples here are {ref}` sec_data_model_tree_isolated_nodes ` . This may seem like a
514+ The samples here are {ref}` isolated nodes< sec_data_model_tree_isolated_nodes> ` . This may seem like a
514515strange corner case, but in {program}` tskit ` , isolated sample nodes are used to represent
515- {ref}` sec_data_model_missing_data ` . This therefore represents a tree in which
516+ {ref}` sec_data_model_missing_data ` . The empty tree therefore represents a case in which
516517relationships between the samples are not known. This could apply, for instance,
517518in regions of the genome where no genetic data exists, or where genetic ancestry
518519has not been simulated.
@@ -532,16 +533,16 @@ Demo some phylogenetic methods. e.g.
532533(sec_phylogen_unified_structure)=
533534## Storing and accessing genetic data
534535
535- {program}` Tskit ` has been designed to capture both evolutionary tree topologies and the genetic
536- sequences that evolve along the branches of these trees. This is achieved by defining
537- {ref}` sec_terminology_mutations_and_sites ` which are associated with specific positions
536+ {program}` Tskit ` is designed to capture both evolutionary tree topologies and the genetic
537+ sequences that evolve along the branches of these trees. It does this by defining
538+ {ref}` mutations and sites< sec_terminology_mutations_and_sites> ` , which are associated with specific positions
538539along the genome.
539540
540541``` {code-cell}
541- import msprime # The `msprime` package can throw mutations onto a tree sequence
542+ import msprime # The `msprime` package can add mutations to a tree sequence
542543mutated_ts = msprime.sim_mutations(ts, rate=3e-3, random_seed=321)
543544mutated_tree = mutated_ts.first()
544- print("Variable sites with the following IDs generated")
545+ print("Variable sites with the following IDs were generated: ")
545546for site in mutated_tree.sites():
546547 print(
547548 f"Site ID {site.id} @ genomic position {site.position:g}:",
@@ -550,9 +551,9 @@ for site in mutated_tree.sites():
550551mutated_tree.draw_svg()
551552```
552553
553- Mutations occur above nodes in a tree, with all the descendant
554- nodes inheriting that specific mutation (unless replaced by a subsequent
555- mutation at the same site). This allows genetic variation to be
554+ Mutations occur above nodes in a tree, and all descendant nodes inherit
555+ that specific mutation (unless it is replaced by a subsequent mutation at the
556+ same site). This allows genetic variation to be
556557{ref}` efficiently represented<sec_what_is_dna_data> ` using the tree topology.
557558To obtain the genetic variation at each site across the entire genome, you can use the
558559{meth}` TreeSequence.sites ` method, or (less efficiently), you can use
@@ -571,7 +572,7 @@ for node_id, alignment in zip(
571572(sec_phylogen_multiple_trees)=
572573## Multiple trees
573574
574- Where {program}` tskit ` really shines is when the ancestry of your dataset cannot be adequately
575+ {program}` Tskit ` is particularly useful when the ancestry of your dataset cannot be adequately
575576represented by a single tree. This is a pervasive issue in genomes (even from different
576577species) that have undergone recombination in the past. The resulting series of
577578{ref}` local trees<sec_what_is_local_trees> ` along a genome are highly correlated
@@ -588,20 +589,19 @@ tables = ts.dump_tables()
588589edge_id_above_node_4 = ts.first().edge(4)
589590left_coord_for_edges = tables.edges.left
590591left_coord_for_edges[edge_id_above_node_4] = 50
591- tables.edges.left = left_coord_for_edges # reset the right coords
592+ tables.edges.left = left_coord_for_edges # apply the modified left coordinates
592593tables.sort()
593594multi_ts = tables.tree_sequence()
594595
595596multi_ts.draw_svg()
596597```
597598
598- For the left hand side of the genome we lack information about the ancestry of
599- node 4, but for the right hand side we know this information. The result is to
600- generate 2 trees in the tree sequence, which differ only in the presence of absence of
601- a single branch. We do not have to separately store the entire tree on the right: all
599+ For the left- hand side of the genome, we lack information about the ancestry of
600+ node 4, but for the right- hand side, we know this information. This generates two
601+ trees in the tree sequence, which differ only in the presence or absence of
602+ a single branch. We do not have to store the entire tree on the right separately : all
602603the edges that are shared between trees are stored only once.
603604
604605The rest of the {program}` tskit ` tutorials will lead you through the concepts involved with
605606storing and analysing sequences of many correlated trees. For a simple introduction, you
606607might want to start with {ref}` sec_what_is ` .
607-
0 commit comments