tskit-dev · hyanwong · Dec 3, 2025 · Dec 3, 2025 · Dec 3, 2025
diff --git a/docs/data-model.md b/docs/data-model.md
@@ -298,11 +298,20 @@ required for a valid set of mutations.
 In simulations, trees can be thought of as spread across space, and it is
 helpful for inferring demographic history to record this history.
 Migrations are performed by individual ancestors, but most likely not by an
-individual whose genome is tracked as a `node` (as in a discrete-deme model they are
-unlikely to be both a migrant and a most recent common ancestor).  So,
-`tskit` records when a segment of ancestry has moved between
-populations. This table is not required, even if different nodes come from
-different populations.
+individual whose genome is tracked as a `tskit` node (as in a discrete-deme
+model they are unlikely to be both a migrant and a most recent common ancestor).
+The migrations table can be used to record when a segment of ancestry has moved
+between populations. Note that recording migrations is not compulsory: the migrations
+table can be left empty even if different nodes come from different populations.
+
+In particular, migrations chart how edges are associated with a population.
+Specifically, a migration from source population X to destination population Y
+at a given time, node, and left/right coordinate means that at the edge (or edges)
+denoted by the node and coordinates, ancestors younger than the time of
+the migration on the relevant edges can be thought of as belonging to population X
+and the older ancestors along the relevant edges can be thought of as belonging to
+population Y (at least until an intervening migration occurs). Migrations are
+recorded as follows:
 
 | Column     | Type     | Description                                            |
 | :--------- | -------- | -----------------------------------------------------: |
@@ -316,18 +325,22 @@ different populations.
 
 
 The `left` and `right` columns are floating point values defining the
-half-open segment of genome affected. The `source` and `dest` columns
-record the IDs of the respective populations. The `node` column records the
-ID of the node that was associated with the ancestry segment in question
-at the time of the migration event. The `time` column is holds floating
-point values recording the time of the event.
+half-open segment of genome affected (these need not exactly correspond to
+breakpoints between edges). The `source` column records the ID of the
+population from which, considered in forward time, migration originates. The
+`dest` column records where the migrating material ends up. The `time` column
+holds floating point values recording the time of the event, with migrations
+assumed to occur instantaneously. The `node` column records the ID of the child
+node of the migrating segment; in consequence the population ID of the `node` will
+match the `dest` ID (unless sequential migrations affect the same `node`, in which
+case it will match the `dest` value of the youngest of those migrations).
 
 The `metadata` column provides a location for client code to store
 information about each migration. See the {ref}`sec_metadata_definition` section for
 more details on how metadata columns should be used.
 
 See the {ref}`sec_migration_requirements` section for details on the properties
-required for a valid set of mutations.
+required for a valid set of migrations.
 
 
 (sec_population_table_definition)=
@@ -356,10 +369,14 @@ required for a valid set of populations.
 
 #### Provenance Table
 
+:::{todo}
+Document the provenance table.
+:::
+
 | Column    | Type  | Description                                                             |
 | :-------- | ----- | ----------------------------------------------------------------------: |
 | timestamp | char  | Timestamp in [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) format. |
-| record    | char  | Provenance record as JSON.                                              |
+| record    | char  | Provenance record.                                                      |
 
 
 (sec_metadata_definition)=
@@ -370,16 +387,10 @@ Each table (excluding provenance) has a metadata column for storing and passing
 information that tskit does not use or interpret. See {ref}`sec_metadata` for details.
 The metadata columns are {ref}`binary columns <sec_tables_api_binary_columns>`.
 
-When using the {ref}`sec_text_file_format`, metadata values are written as opaque
-text. By default, :meth:`TreeSequence.dump_text` will base64-encode metadata values
-that are stored as raw bytes (when ``base64_metadata=True``) so that binary data can
-be safely printed and exchanged; in this case :func:`tskit.load_text` will base64-decode
-the corresponding text fields back to bytes. When metadata has already been decoded
-to a structured Python object (for example via a metadata schema), the textual
-representation written by :meth:`TreeSequence.dump_text` is the ``repr`` of that
-object, and :func:`tskit.load_text` does not attempt to reconstruct the original
-structured value from this representation. For reliable metadata round-tripping,
-prefer the native binary tree sequence file format over the text formats.
+When using the {ref}`sec_text_file_format`, to ensure that metadata can be safely
+interchanged, each row is [base 64 encoded](https://en.wikipedia.org/wiki/Base64).
+Thus, binary information can be safely printed and exchanged, but may not be
+human readable.
 
 The tree sequence itself also has metadata stored as a byte array.
 
@@ -401,10 +412,6 @@ error message. Some more complex requirements may not be detectable at load-time
 and errors may not occur until certain operations are attempted.
 These are documented below.
 
-At the tree-sequence level, we require that the coordinate space has a finite,
-strictly positive length; that is, the `sequence_length` attribute must be a
-finite value greater than zero.
-
 The Python API also provides tools that can transform a collection of
 tables into a valid collection of tables, so long as they are logically
 consistent, see {ref}`sec_tables_api_creating_valid_tree_sequence`.
@@ -416,8 +423,7 @@ consistent, see {ref}`sec_tables_api_creating_valid_tree_sequence`.
 
 Individuals are a basic type in a tree sequence and are not defined with
 respect to any other tables. Individuals can have a reference to their parent
-individuals, if present these references must be valid or null (-1). An
-individual cannot list itself as its own parent.
+individuals, if present these references must be valid or null (-1).
 
 A valid tree sequence does not require individuals to be sorted in any
 particular order, and sorting a set of tables using {meth}`TableCollection.sort`
@@ -431,7 +437,6 @@ using {meth}`TableCollection.sort_individuals`.
 Given a valid set of individuals and populations, the requirements for
 each node are:
 
-- `time` must be a finite (non-NaN, non-infinite) value;
 - `population` must either be null (-1) or refer to a valid population ID;
 - `individual` must either be null (-1) or refer to a valid individual ID.
 
@@ -451,7 +456,7 @@ has no effect on nodes.
 Given a valid set of nodes and a sequence length {math}`L`, the simple
 requirements for each edge are:
 
-- We must have finite coordinates with {math}`0 \leq` `left` {math}`<` `right` {math}`\leq L`;
+- We must have {math}`0 \leq` `left` {math}`<` `right` {math}`\leq L`;
 - `parent` and `child` must be valid node IDs;
 - `time[parent]` > `time[child]`;
 - edges must be unique (i.e., no duplicate edges are allowed).
@@ -488,7 +493,7 @@ properties are fulfilled.
 Given a valid set of nodes and a sequence length {math}`L`, the simple
 requirements for a valid set of sites are:
 
-- We must have a finite coordinate with {math}`0 \leq` `position` {math}`< L`;
+- We must have {math}`0 \leq` `position` {math}`< L`;
 - `position` values must be unique.
 
 For simplicity and algorithmic efficiency, sites must also:
@@ -515,14 +520,13 @@ requirements for a valid set of mutations are:
   `time` and equal to or less than the `time` of the `parent` mutation
   if this mutation has one. If one mutation on a site has `UNKNOWN_TIME` then
   all mutations at that site must, as a mixture of known and unknown is not valid.
-- `parent` must either be the null ID (-1), if the mutation has no parent, or a
-  valid mutation ID within the current table. 
+- `parent` must either be the null ID (-1) or a valid mutation ID within the
+  current table
 
 Furthermore,
 
-- The `parent` value must be consistent with the topology of the tree at the site
-  of the mutation, such that a path from the child mutation to the parent mutation
-  exists without passing through any other mutations at the same site.
+- If another mutation occurs on the tree above the mutation in
+  question, its ID must be listed as the `parent`.
 
 For simplicity and algorithmic efficiency, mutations must also:
 
@@ -554,33 +558,19 @@ will always fail. Use `tskit.is_unknown_time` to detect unknown values.
 
 #### Migration requirements
 
-Given a valid set of nodes and edges, the requirements for a valid set of
+Given a valid set of nodes and edges, the requirements for a value set of
 migrations are:
 
-- `left` and `right` must be finite values that lie within the tree sequence
-  coordinate space (i.e., from 0 to `sequence_length`), with {math}`0 \leq`
-  `left` {math}`<` `right` {math}`\leq L`;
-- `node` must be a valid node ID;
-- if population references are checked, `source` and `dest` must be valid
-  population IDs;
-- `time` must be a finite value.
-
-To enable efficient processing, migrations must also be sorted by
-nondecreasing `time` value.
+- `left` and `right` must lie within the tree sequence coordinate space (i.e.,
+  from 0 to `sequence_length`).
+- `time` must be strictly between the time of its `node` and the time of any
+  ancestral node from which that node inherits on the segment `[left, right)`.
+- The `population` of any such ancestor matching `source`, if another
+  `migration` does not intervene.
 
-Conceptually, a migration records that a segment of ancestry for the given
-`node` moves between populations along the tree. In typical demographic
-models we expect:
+To enable efficient processing, migrations must also be:
 
-- `time` to lie strictly between the time of the migrating `node` and the time
-  of any ancestral node from which that node inherits on the segment
-  `[left, right)`;
-- the `population` of any such ancestor to match the `source` population,
-  until another `migration` intervenes.
-
-These conceptual relationships are not currently validated. It is
-the responsibility of code that creates migrations to satisfy them where
-required.
+- Sorted by nondecreasing `time` value.
 
 Note in particular that there is no requirement that adjacent migration records
 should be "squashed". That is, we can have two records `m1` and `m2`
@@ -604,10 +594,8 @@ There are no requirements on a population table.
 The `timestamp` column of a provenance table should be in
 [ISO-8601](https://en.wikipedia.org/wiki/ISO_8601) format.
 
-The `record` column stores a JSON document describing how and where the tree sequence
-was produced. For tree sequences generated by tskit and related tools, this JSON is
-expected to conform to the :ref:`provenance schema <sec_provenance_schema>` described
-in {ref}`sec_provenance`. 
+The `record` should be valid JSON with structure defined in the Provenance
+Schema section (TODO).
 
 
 (sec_table_indexes)=
@@ -1053,14 +1041,14 @@ Node 7 in the example above provides a good case study. Note that it is a root n
 at least one child (i.e. not a leaf) in trees 0 and 2; in contrast in tree 1 it is
 isolated. Strictly, because it is isolated in tree 1, it is also a leaf node there,
 although it is not attached to a root, not a sample, and is therefore not plotted. In
-this case, in that tree we can think of node 7 as a "dead leaf" (and we don't normally
-plot dead leaves). In fact, in a large tree sequence of many trees, most ancestral nodes
-will be isolated in any given tree, and therefore most nodes in such a tree will be of
-this sort. However, these dead leaves are excluded from most calculations on trees,
-because algorithms usually traverse the tree by starting at a root and working down,
-or by starting at a sample and working up. Hence when we refer to the leaves of a tree,
-it is usually shorthand for the leaves **on** the tree (that is, attached via branches,
-to one of the the tree roots). Dead leaves are excluded from this definition.
+this case, in that tree we can think of node 7 as a "dead leaf" (not attached to a root:
+we don't normally plot dead leaves). In fact, in a large tree sequence of many trees,
+most ancestral nodes will be isolated in any given tree, and therefore most nodes in
+such a tree will be of this sort. However, dead leaves are excluded from most
+calculations on trees, because algorithms usually traverse the tree by starting at a
+root and working down, or by starting at a sample and working up. Hence when we refer
+to the leaves of a tree, it is usually shorthand for the active leaves **on** the tree 
+(that is, attached via branches, to one of the the tree roots). Dead leaves are excluded from this definition.
 
 Note that it is also possible to have trees in which there are "dead branches": that is
 sections of topology which are not accessible from a root, and whose tips are all
@@ -1172,3 +1160,4 @@ you won't see those parts of the tree sequence that are unrelated to the samples
 If you need to get those, too, you could either
 work with the {meth}`TreeSequence.edge_diffs` directly,
 or iterate over all nodes (instead of over {meth}`Tree.nodes`).
+