Skip to content
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
131 changes: 60 additions & 71 deletions docs/data-model.md
Original file line number Diff line number Diff line change
Expand Up @@ -298,11 +298,20 @@ required for a valid set of mutations.
In simulations, trees can be thought of as spread across space, and it is
helpful for inferring demographic history to record this history.
Migrations are performed by individual ancestors, but most likely not by an
individual whose genome is tracked as a `node` (as in a discrete-deme model they are
unlikely to be both a migrant and a most recent common ancestor). So,
`tskit` records when a segment of ancestry has moved between
populations. This table is not required, even if different nodes come from
different populations.
individual whose genome is tracked as a `tskit` node (as in a discrete-deme
model they are unlikely to be both a migrant and a most recent common ancestor).
The migrations table can be used to record when a segment of ancestry has moved
between populations. Note that recording migrations is not compulsory: the migrations
table can be left empty even if different nodes come from different populations.

In particular, migrations chart how edges are associated with a population.
Specifically, a migration from source population X to destination population Y
at a given time, node, and left/right coordinate means that at the edge (or edges)
denoted by the node and coordinates, ancestors younger than the time of
the migration on the relevant edges can be thought of as belonging to population X
and the older ancestors along the relevant edges can be thought of as belonging to
population Y (at least until an intervening migration occurs). Migrations are
recorded as follows:

| Column | Type | Description |
| :--------- | -------- | -----------------------------------------------------: |
Expand All @@ -316,18 +325,22 @@ different populations.


The `left` and `right` columns are floating point values defining the
half-open segment of genome affected. The `source` and `dest` columns
record the IDs of the respective populations. The `node` column records the
ID of the node that was associated with the ancestry segment in question
at the time of the migration event. The `time` column is holds floating
point values recording the time of the event.
half-open segment of genome affected (these need not exactly correspond to
breakpoints between edges). The `source` column records the ID of the
population from which, considered in forward time, migration originates. The
`dest` column records where the migrating material ends up. The `time` column
holds floating point values recording the time of the event, with migrations
assumed to occur instantaneously. The `node` column records the ID of the child
node of the migrating segment; in consequence the population ID of the `node` will
match the `dest` ID (unless sequential migrations affect the same `node`, in which
case it will match the `dest` value of the youngest of those migrations).

The `metadata` column provides a location for client code to store
information about each migration. See the {ref}`sec_metadata_definition` section for
more details on how metadata columns should be used.

See the {ref}`sec_migration_requirements` section for details on the properties
required for a valid set of mutations.
required for a valid set of migrations.


(sec_population_table_definition)=
Expand Down Expand Up @@ -356,10 +369,14 @@ required for a valid set of populations.

#### Provenance Table

:::{todo}
Document the provenance table.
:::

| Column | Type | Description |
| :-------- | ----- | ----------------------------------------------------------------------: |
| timestamp | char | Timestamp in [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) format. |
| record | char | Provenance record as JSON. |
| record | char | Provenance record. |


(sec_metadata_definition)=
Expand All @@ -370,16 +387,10 @@ Each table (excluding provenance) has a metadata column for storing and passing
information that tskit does not use or interpret. See {ref}`sec_metadata` for details.
The metadata columns are {ref}`binary columns <sec_tables_api_binary_columns>`.

When using the {ref}`sec_text_file_format`, metadata values are written as opaque
text. By default, :meth:`TreeSequence.dump_text` will base64-encode metadata values
that are stored as raw bytes (when ``base64_metadata=True``) so that binary data can
be safely printed and exchanged; in this case :func:`tskit.load_text` will base64-decode
the corresponding text fields back to bytes. When metadata has already been decoded
to a structured Python object (for example via a metadata schema), the textual
representation written by :meth:`TreeSequence.dump_text` is the ``repr`` of that
object, and :func:`tskit.load_text` does not attempt to reconstruct the original
structured value from this representation. For reliable metadata round-tripping,
prefer the native binary tree sequence file format over the text formats.
When using the {ref}`sec_text_file_format`, to ensure that metadata can be safely
interchanged, each row is [base 64 encoded](https://en.wikipedia.org/wiki/Base64).
Thus, binary information can be safely printed and exchanged, but may not be
human readable.

The tree sequence itself also has metadata stored as a byte array.

Expand All @@ -401,10 +412,6 @@ error message. Some more complex requirements may not be detectable at load-time
and errors may not occur until certain operations are attempted.
These are documented below.

At the tree-sequence level, we require that the coordinate space has a finite,
strictly positive length; that is, the `sequence_length` attribute must be a
finite value greater than zero.

The Python API also provides tools that can transform a collection of
tables into a valid collection of tables, so long as they are logically
consistent, see {ref}`sec_tables_api_creating_valid_tree_sequence`.
Expand All @@ -416,8 +423,7 @@ consistent, see {ref}`sec_tables_api_creating_valid_tree_sequence`.

Individuals are a basic type in a tree sequence and are not defined with
respect to any other tables. Individuals can have a reference to their parent
individuals, if present these references must be valid or null (-1). An
individual cannot list itself as its own parent.
individuals, if present these references must be valid or null (-1).

A valid tree sequence does not require individuals to be sorted in any
particular order, and sorting a set of tables using {meth}`TableCollection.sort`
Expand All @@ -431,7 +437,6 @@ using {meth}`TableCollection.sort_individuals`.
Given a valid set of individuals and populations, the requirements for
each node are:

- `time` must be a finite (non-NaN, non-infinite) value;
- `population` must either be null (-1) or refer to a valid population ID;
- `individual` must either be null (-1) or refer to a valid individual ID.

Expand All @@ -451,7 +456,7 @@ has no effect on nodes.
Given a valid set of nodes and a sequence length {math}`L`, the simple
requirements for each edge are:

- We must have finite coordinates with {math}`0 \leq` `left` {math}`<` `right` {math}`\leq L`;
- We must have {math}`0 \leq` `left` {math}`<` `right` {math}`\leq L`;
- `parent` and `child` must be valid node IDs;
- `time[parent]` > `time[child]`;
- edges must be unique (i.e., no duplicate edges are allowed).
Expand Down Expand Up @@ -488,7 +493,7 @@ properties are fulfilled.
Given a valid set of nodes and a sequence length {math}`L`, the simple
requirements for a valid set of sites are:

- We must have a finite coordinate with {math}`0 \leq` `position` {math}`< L`;
- We must have {math}`0 \leq` `position` {math}`< L`;
- `position` values must be unique.

For simplicity and algorithmic efficiency, sites must also:
Expand All @@ -515,14 +520,13 @@ requirements for a valid set of mutations are:
`time` and equal to or less than the `time` of the `parent` mutation
if this mutation has one. If one mutation on a site has `UNKNOWN_TIME` then
all mutations at that site must, as a mixture of known and unknown is not valid.
- `parent` must either be the null ID (-1), if the mutation has no parent, or a
valid mutation ID within the current table.
- `parent` must either be the null ID (-1) or a valid mutation ID within the
current table

Furthermore,

- The `parent` value must be consistent with the topology of the tree at the site
of the mutation, such that a path from the child mutation to the parent mutation
exists without passing through any other mutations at the same site.
- If another mutation occurs on the tree above the mutation in
question, its ID must be listed as the `parent`.

For simplicity and algorithmic efficiency, mutations must also:

Expand Down Expand Up @@ -554,33 +558,19 @@ will always fail. Use `tskit.is_unknown_time` to detect unknown values.

#### Migration requirements

Given a valid set of nodes and edges, the requirements for a valid set of
Given a valid set of nodes and edges, the requirements for a value set of
migrations are:

- `left` and `right` must be finite values that lie within the tree sequence
coordinate space (i.e., from 0 to `sequence_length`), with {math}`0 \leq`
`left` {math}`<` `right` {math}`\leq L`;
- `node` must be a valid node ID;
- if population references are checked, `source` and `dest` must be valid
population IDs;
- `time` must be a finite value.

To enable efficient processing, migrations must also be sorted by
nondecreasing `time` value.
- `left` and `right` must lie within the tree sequence coordinate space (i.e.,
from 0 to `sequence_length`).
- `time` must be strictly between the time of its `node` and the time of any
ancestral node from which that node inherits on the segment `[left, right)`.
- The `population` of any such ancestor matching `source`, if another
`migration` does not intervene.

Conceptually, a migration records that a segment of ancestry for the given
`node` moves between populations along the tree. In typical demographic
models we expect:
To enable efficient processing, migrations must also be:

- `time` to lie strictly between the time of the migrating `node` and the time
of any ancestral node from which that node inherits on the segment
`[left, right)`;
- the `population` of any such ancestor to match the `source` population,
until another `migration` intervenes.

These conceptual relationships are not currently validated. It is
the responsibility of code that creates migrations to satisfy them where
required.
- Sorted by nondecreasing `time` value.

Note in particular that there is no requirement that adjacent migration records
should be "squashed". That is, we can have two records `m1` and `m2`
Expand All @@ -604,10 +594,8 @@ There are no requirements on a population table.
The `timestamp` column of a provenance table should be in
[ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) format.

The `record` column stores a JSON document describing how and where the tree sequence
was produced. For tree sequences generated by tskit and related tools, this JSON is
expected to conform to the :ref:`provenance schema <sec_provenance_schema>` described
in {ref}`sec_provenance`.
The `record` should be valid JSON with structure defined in the Provenance
Schema section (TODO).


(sec_table_indexes)=
Expand Down Expand Up @@ -1053,14 +1041,14 @@ Node 7 in the example above provides a good case study. Note that it is a root n
at least one child (i.e. not a leaf) in trees 0 and 2; in contrast in tree 1 it is
isolated. Strictly, because it is isolated in tree 1, it is also a leaf node there,
although it is not attached to a root, not a sample, and is therefore not plotted. In
this case, in that tree we can think of node 7 as a "dead leaf" (and we don't normally
plot dead leaves). In fact, in a large tree sequence of many trees, most ancestral nodes
will be isolated in any given tree, and therefore most nodes in such a tree will be of
this sort. However, these dead leaves are excluded from most calculations on trees,
because algorithms usually traverse the tree by starting at a root and working down,
or by starting at a sample and working up. Hence when we refer to the leaves of a tree,
it is usually shorthand for the leaves **on** the tree (that is, attached via branches,
to one of the the tree roots). Dead leaves are excluded from this definition.
this case, in that tree we can think of node 7 as a "dead leaf" (not attached to a root:
we don't normally plot dead leaves). In fact, in a large tree sequence of many trees,
most ancestral nodes will be isolated in any given tree, and therefore most nodes in
such a tree will be of this sort. However, dead leaves are excluded from most
calculations on trees, because algorithms usually traverse the tree by starting at a
root and working down, or by starting at a sample and working up. Hence when we refer
to the leaves of a tree, it is usually shorthand for the active leaves **on** the tree
(that is, attached via branches, to one of the the tree roots). Dead leaves are excluded from this definition.

Note that it is also possible to have trees in which there are "dead branches": that is
sections of topology which are not accessible from a root, and whose tips are all
Expand Down Expand Up @@ -1172,3 +1160,4 @@ you won't see those parts of the tree sequence that are unrelated to the samples
If you need to get those, too, you could either
work with the {meth}`TreeSequence.edge_diffs` directly,
or iterate over all nodes (instead of over {meth}`Tree.nodes`).