HoneyBadgerFish
The NeXML standard (https://github.com/nexml/nexml/wiki/NeXML-Manual) describes how to express the core data of a phylogenetic study in XML.
The standard also allows arbitrary key-value pairs to be added to any entity through the use of meta child elements.
Each meta can either be of type LiteralMeta or ResourceMeta.
Because Open Tree's study curation app's manipulations are primarily the addition, deletion, and changing of these meta elements, it makes sense for us to make them accessible.
In a naive transformation of NeXML to JSON, finding a meta property requires iterating through every child meta object, checking the "@property" for the desired property name, and then looking for the value in one of a few places ("@content" or "$" for LiteralMeta elements, and "@href" or "$" for ResourceMeta).
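To make that cost concrete, here is a minimal Python sketch of the naive lookup over a direct BadgerFish-style translation. The function name find_meta_values and the sample otu dict are hypothetical, and the sketch assumes that ResourceMeta elements carry their name in "@rel" (the rel attribute described in rule 8 below).

```python
def find_meta_values(element, prop_name):
    """Return every value stored under prop_name in a naively translated element."""
    metas = element.get("meta", [])
    if isinstance(metas, dict):
        # Strict BadgerFish writes a single child as a bare object.
        metas = [metas]
    values = []
    for meta in metas:
        # LiteralMeta names the key in "@property"; ResourceMeta uses "@rel".
        if prop_name not in (meta.get("@property"), meta.get("@rel")):
            continue
        # The value may live in any one of several places.
        for key in ("@content", "@href", "$"):
            if key in meta:
                values.append(meta[key])
                break
    return values


# Hypothetical sample of a naively translated otu element.
otu = {
    "@id": "otu88801",
    "meta": [
        {"@property": "ot:ottId", "@xsi:type": "nex:LiteralMeta",
         "@datatype": "xsd:integer", "$": "415973"},
        {"@rel": "ot:studyPublication", "@xsi:type": "nex:ResourceMeta",
         "@href": "http://dx.doi.org/10.3732/ajb.94.12.2026"},
    ],
}
print(find_meta_values(otu, "ot:ottId"))
```

Every lookup walks the whole meta list, and the caller still has to coerce the string "415973" to a number. The conventions described below are meant to remove both steps.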
The ot:* key-value pairs that the Open Tree project is using to add extra info are documented on the NexSON page.
The NexSON files are produced using a syntactic convention based on the BadgerFish convention (see below).
The XML tree will be mirrored as a tree of JS objects. The topmost object contains the root of the XML tree. Each element in the NeXML is processed using the following rules, such that an XML element becomes a JS object inside its parent.
The first 4 rules deviate only slightly from BadgerFish (see the Note in rule 3):
- Rule 1: The XML element name becomes the name of the property in the parent JS object.
- Rule 2: The text value of the XML element is contained in the $ property of the object. Whitespace is stripped from the ends. If the text value of an XML element is broken up by intervening child elements, the $ of the object is produced by stripping leading and trailing whitespace from each fragment and concatenating the fragments.
- Rule 3: Child elements in XML map to an array of objects. Note: in BadgerFish single elements are mapped to single JS objects. In the NeXML schema, all of the core objects can be repeated, so an array (of any length) is a more natural mapping. Missing elements are omitted (not written as empty arrays).
- Rule 4: XML attributes become properties of the object, with a name formed by prefixing @ to the attribute name. So <alice charlie="david">bob</alice> at the top level would become:
{"alice": [{
"$" : "bob",
"@charlie" : "david" }]}
Rules 5 and 6 deal with XML namespaces. They mainly differ from BadgerFish in that the namespaces are only added to the root object:
- Rule 5: The default namespace becomes the $ property of an @xmlns object, and other namespaces become properties of that object. The names of the properties are the names of the XML namespaces without the "xmlns:" qualifier. So
<alice xmlns="http://some-namespace" xmlns:charlie="http://some-other-namespace">bob</alice>
as a top-level object becomes:
{"alice": [{
"$" : "bob",
"@xmlns" : {
"$": "http://some-namespace",
"charlie": "http://some-other-namespace"}}]}
Unlike BadgerFish, this @xmlns is only added to the root object.
- Rule 6: A prefix in an element or attribute name is just treated as part of the name (there is no substitution of the URL and no cropping of the element name to exclude the prefix).
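Rules 1 to 5 can be sketched in Python with the standard library's xml.etree.ElementTree. This is an illustration under stated assumptions, not the project's converter: the function names are hypothetical, an empty "$" is assumed to be omitted, and element_to_obj is only meaningful for documents without namespaces, because ElementTree rewrites prefixed names into a {uri}local form and so does not preserve the prefixes as Rule 6 requires.

```python
import io
import xml.etree.ElementTree as ET


def element_to_obj(element):
    """Rules 1-4 for one element (assumes the document uses no namespaces)."""
    obj = {}
    # Rule 4: attributes become "@"-prefixed properties.
    for name, value in element.attrib.items():
        obj["@" + name] = value
    # Rule 2: strip each text fragment, then concatenate them into "$".
    fragments = [element.text or ""] + [child.tail or "" for child in element]
    text = "".join(fragment.strip() for fragment in fragments)
    if text:  # assumption: an empty "$" is simply omitted
        obj["$"] = text
    # Rules 1 and 3: each child name maps to an array of child objects.
    for child in element:
        obj.setdefault(child.tag, []).append(element_to_obj(child))
    return obj


def xmlns_object(xml_string):
    """Rule 5: gather namespace declarations into one "@xmlns" object."""
    xmlns = {}
    source = io.BytesIO(xml_string.encode("utf-8"))
    # "start-ns" events yield (prefix, uri); the default namespace has an empty prefix.
    for _event, (prefix, uri) in ET.iterparse(source, events=("start-ns",)):
        xmlns.setdefault(prefix if prefix else "$", uri)
    return xmlns


def document_to_obj(xml_string):
    """Wrap the root element in a one-item array under its own name."""
    root = ET.fromstring(xml_string)
    return {root.tag: [element_to_obj(root)]}


print(document_to_obj('<alice charlie="david">bob</alice>'))
print(xmlns_object('<alice xmlns="http://some-namespace" '
                   'xmlns:charlie="http://some-other-namespace">bob</alice>'))
```

A full converter would attach the result of xmlns_object to the root object only, and would need its own handling of prefixed names to satisfy Rule 6.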
Rules 7-9 are special case handling of meta elements:
- Rule 7: If an element has a meta child element with xsi:type="nex:LiteralMeta", then that meta must have:
  - a property attribute; we will call the value of this attribute prop-value;
  - a datatype attribute specifying which xsd: datatype the element holds; we will call the value of this attribute datatype-value; and
  - the data in a content attribute OR in the text content of the element; we will call this the content-value.
This sort of meta element will appear in the parent object under a name with a ^ prefix followed by prop-value. The content-value will be coerced to the JavaScript type that corresponds to datatype-value.
The exact representation of the property depends on what needs to be conveyed:
- Rule 7A: If there are no other attributes of the meta element needing to be mapped, then the key-value pair will have a JS primitive type as its value.
- Rule 7B: If there are other attributes that need to be written (such as an id attribute), then the value will be a JS object with content-value stored in the $ field.
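Rule 7 can be illustrated with a short Python sketch. The function name literal_meta_to_entry and the datatype table are assumptions made for this example; the source only says that the content-value is coerced to the type matching datatype-value, so the exact set of xsd: names handled here is a guess.

```python
# Hypothetical table mapping xsd datatypes to coercion functions.
_COERCE = {
    "xsd:integer": int, "xsd:int": int, "xsd:long": int,
    "xsd:float": float, "xsd:double": float, "xsd:decimal": float,
    "xsd:boolean": lambda s: s.strip().lower() in ("true", "1"),
}

# Attributes that the mapping consumes and therefore does not write out.
_CONSUMED = {"property", "datatype", "content", "xsi:type"}


def literal_meta_to_entry(attrs, text=None):
    """Map one LiteralMeta (attribute dict plus optional text) to (key, value)."""
    prop_value = attrs["property"]
    datatype_value = attrs["datatype"]
    # The data is in the content attribute OR in the element text.
    raw = attrs["content"] if "content" in attrs else (text or "").strip()
    content_value = _COERCE.get(datatype_value, str)(raw)
    extras = {"@" + k: v for k, v in attrs.items() if k not in _CONSUMED}
    if not extras:
        return "^" + prop_value, content_value  # Rule 7A: bare primitive
    extras["$"] = content_value                 # Rule 7B: object with "$"
    return "^" + prop_value, extras


print(literal_meta_to_entry(
    {"property": "ot:ottId", "xsi:type": "nex:LiteralMeta",
     "datatype": "xsd:integer"}, "415973"))
print(literal_meta_to_entry(
    {"content": "7002", "datatype": "xsd:long", "id": "m0",
     "property": "tb:identifier.taxon", "xsi:type": "nex:LiteralMeta"}))
```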
- Rule 8: If an element has a meta child element with xsi:type="nex:ResourceMeta", then that meta must have:
  - a rel attribute; we will call the value of this attribute prop-value;
  - the data in an href attribute OR a nested meta element; we will call this the content-value.
This sort of meta element will appear in the parent object under a name with a ^ prefix followed by prop-value. The value will be a JS object with:
- Rule 8A: if the data is in an href attribute, then the @href property will hold the href string.
- Rule 8B: if a nested meta element holds the data, then a $ property will map to a JavaScript object that holds the representation of the inner meta.
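A matching sketch for Rule 8, again in Python and again with a hypothetical function name. The nested argument stands for the already-converted representation of an inner meta; how that inner object is laid out is not spelled out above, so the sketch simply stores whatever it is given.

```python
def resource_meta_to_entry(attrs, nested=None):
    """Map one ResourceMeta (attribute dict plus optional nested meta) to (key, value)."""
    key = "^" + attrs["rel"]
    # rel and xsi:type are consumed by the mapping; other attributes get "@".
    value = {"@" + k: v for k, v in attrs.items() if k not in ("rel", "xsi:type")}
    if "href" not in attrs:
        # Rule 8B: the converted inner meta is stored under "$".
        value["$"] = nested
    # Rule 8A needs no extra work: href was already written as "@href".
    return key, value


print(resource_meta_to_entry(
    {"href": "http://purl.uniprot.org/taxonomy/94215", "id": "meta4912509",
     "rel": "skos:closeMatch", "xsi:type": "nex:ResourceMeta"}))
```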
- Rule 9: Many of the meta attributes can only occur once per element. To streamline the meta encoding (and as an exception to Rule 3 above) we use the BadgerFish convention for dealing with cardinality:
- Rule 9A: If there is one element that maps to a property name, the value is the object described above (either a primitive for simple nex:LiteralMeta-type elements, or a full JS object otherwise).
- Rule 9B: If there are multiple elements that map to a property name, then the value of the property is an array which holds each of the object representations as described above.
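The cardinality convention of Rule 9 means that writers promote a value to an array only when a second meta with the same name shows up, and readers must accept either shape. A minimal Python sketch, with hypothetical helper names:

```python
def add_meta_entry(parent, key, value):
    """Store a converted meta under key, promoting to an array on repeats."""
    if key not in parent:
        parent[key] = value                 # Rule 9A: single value
    elif isinstance(parent[key], list):
        parent[key].append(value)           # Rule 9B: already an array
    else:
        parent[key] = [parent[key], value]  # Rule 9B: promote to an array


def meta_values(parent, key):
    """Read side: always return a list, whichever shape was stored."""
    if key not in parent:
        return []
    value = parent[key]
    return value if isinstance(value, list) else [value]


nexml = {}
add_meta_entry(nexml, "^ot:tag", "cpDNA")
add_meta_entry(nexml, "^ot:tag", "ingroup added")
add_meta_entry(nexml, "^ot:candidateTreeForSynthesis", "tr1")
print(nexml)
print(meta_values(nexml, "^ot:candidateTreeForSynthesis"))
```

The list check is unambiguous because, under Rules 7 and 8, a single meta is always a primitive or an object and never an array.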
Note that the type hints (datatype and xsi:type attributes) are not present in the JSON.
Reverse translation is possible by relying on the following:
- If the value is a primitive, then nex:LiteralMeta will be used.
- If the value is an object with a $ that is a primitive, then nex:LiteralMeta will be used.
- If the value is an object with an @href property, then nex:ResourceMeta will be used.
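These checks translate directly into code. The Python sketch below uses a hypothetical function name; the branch for an object whose $ is itself an object is an assumption derived from Rule 8B, since that case is not listed among the reverse-translation checks.

```python
def infer_xsi_type(value):
    """Recover the xsi:type of a single meta value (not an array of them)."""
    if not isinstance(value, dict):
        return "nex:LiteralMeta"       # bare primitive
    if "@href" in value:
        return "nex:ResourceMeta"      # Rule 8A form
    if "$" in value:
        if isinstance(value["$"], dict):
            # Assumption: a nested object under "$" is the Rule 8B form.
            return "nex:ResourceMeta"
        return "nex:LiteralMeta"       # Rule 7B form
    raise ValueError("cannot infer xsi:type for %r" % (value,))


print(infer_xsi_type(415973))
print(infer_xsi_type({"$": 7002, "@id": "m0"}))
print(infer_xsi_type({"@href": "http://dx.doi.org/10.3732/ajb.94.12.2026"}))
```

For a Rule 9B array, a caller would apply the function to each item in turn.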
- Rule 10: If there is an about attribute with a value that refers to the same element's id, then the @about is not present in the JSON.
- Rule 11: The top-level object in JSON will have a @nexml2json property that maps to a version string such as "1.0.0a" or "1.0.0". Direct BadgerFish translations to JSON will lack this property, or will have a version string that starts with "0." (because most projects tweak the BadgerFish rules at least a little bit, it seems like a good idea to leave some room in the 0... namespace for distinguishing between versions of JSON produced by those conventions).
There are three ways (that we are aware of) that roundtrip of XML -> JSON -> XML might not result in identical syntax:
- The attribute and element order is not preserved. This is a trivial barrier to using diff to test roundtrips, but not a serious issue.
- Introspection will provide the datatype of nex:LiteralMeta elements. This means that xsd:integer and xsd:float values will be used for integer and floating point numbers. Thus the details of the meta properties (e.g. integer vs long or float vs double) may not be "round-trip-able". We do not know of cases in NeXML documents in which these fine-grained distinctions of type are needed.
- A LiteralMeta form of meta can store its value in a content attribute or the text body of the element. Both of these map to $ in JSON, so the exact placement cannot be recovered. This is not a substantive concern, as there is no indication in the NeXML standard that the two locations for the data should affect handling of the data.
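The datatype introspection mentioned in the second point might look like the Python sketch below. The function name is hypothetical, and the xsd:boolean and xsd:string branches are assumptions; the text above only names xsd:integer and xsd:float.

```python
def infer_datatype(value):
    """Guess the xsd datatype of a LiteralMeta from its JSON value."""
    if isinstance(value, bool):
        # Checked first because bool is a subclass of int in Python.
        return "xsd:boolean"
    if isinstance(value, int):
        return "xsd:integer"
    if isinstance(value, float):
        return "xsd:float"
    return "xsd:string"


# A value written as xsd:long in NeXML arrives in JSON as a plain number,
# so the back-translation can only report xsd:integer.
print(infer_datatype(7002))
print(infer_datatype("cpDNA"))
```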
The NeXML snippet below was pieced together from multiple files. So it does not make sense biologically. It was constructed to be valid NeXML and to show a diversity of the meta cases that introduce complexity:
The version-controlled home for the file is at https://github.com/OpenTreeOfLife/api.opentreeoflife.org/blob/roundtrip2xml/nexson-validator/tests/nexml/otu.xml
<?xml version="1.0" encoding="UTF-8"?>
<nex:nexml
xmlns:nex="http://www.nexml.org/2009"
xmlns="http://www.nexml.org/2009"
version="0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:ot="http://purl.org/opentree/nexson"
xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
xmlns:tb="http://purl.org/phylo/treebase/2.0/terms#"
xmlns:skos="http://www.w3.org/2004/02/skos/core#">
<meta property="ot:tag" xsi:type="nex:LiteralMeta" datatype="xsd:string">cpDNA</meta>
<meta property="ot:tag" xsi:type="nex:LiteralMeta" datatype="xsd:string">ingroup added</meta>
<meta property="ot:candidateTreeForSynthesis" xsi:type="nex:LiteralMeta" datatype="xsd:string">tr1</meta>
<otus id="ob1">
<otu about="#otu88801" id="otu88801" label="Ancyromonas sigmoides">
<meta property="ot:ottId" xsi:type="nex:LiteralMeta" datatype="xsd:integer">415973</meta>
<meta property="ot:originalLabel" id="bogus" xsi:type="nex:LiteralMeta" datatype="xsd:string">Ancyromonas sigmoides</meta>
<meta href="http://dx.doi.org/10.3732/ajb.94.12.2026" rel="ot:studyPublication" xsi:type="nex:ResourceMeta"/>
<meta content="7002" datatype="xsd:long" id="m0" property="tb:identifier.taxon" xsi:type="nex:LiteralMeta"/>
<meta href="http://purl.uniprot.org/taxonomy/94215" id="meta4912509" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
<meta href="http://purl.uniprot.org/taxonomy/102624" id="meta4912517" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
</otu>
</otus>
<trees id="tb1" otus="ob1">
<tree id="tr1" xsi:type="nex:FloatTree">
<node id="n1" otu="otu88801"/>
<node id="n0"/>
<edge id="e0" source="n0" target="n1"/>
</tree>
</trees>
</nex:nexml>
will be represented as (there is not much of interest after the otu object):
{
"nex:nexml": {
"@version": "0.9",
"@xmlns": {
"$": "http://www.nexml.org/2009",
"nex": "http://www.nexml.org/2009",
"ot": "http://purl.org/opentree/nexson",
"skos": "http://www.w3.org/2004/02/skos/core#",
"tb": "http://purl.org/phylo/treebase/2.0/terms#",
"xsd": "http://www.w3.org/2001/XMLSchema#",
"xsi": "http://www.w3.org/2001/XMLSchema-instance"
},
"^ot:candidateTreeForSynthesis": "tr1", # Rule 7A, 9A
"^ot:tag": ["cpDNA", "ingroup added"], # Rule 7A, 9B
"otus": [{
"@id": "ob1",
"otu": [{
"@id": "otu88801",
"@label": "Ancyromonas sigmoides",
"^ot:originalLabel": { # Rule 7B, 9A
"$": "Ancyromonas sigmoides",
"@id": "bogus"
},
"^ot:ottId": 415973, # Rule 7A, 9A
"^ot:studyPublication": { # Rule 8A, 9A
"@href": "http://dx.doi.org/10.3732/ajb.94.12.2026"
},
"^skos:closeMatch": [{ # Rule 8A, 9B
"@href": "http://purl.uniprot.org/taxonomy/94215",
"@id": "meta4912509"},{
"@href": "http://purl.uniprot.org/taxonomy/102624",
"@id": "meta4912517"
}],
"^tb:identifier.taxon": { # Rule 7B, 9A
"$": 7002,
"@id": "m0"
}
}
]
}],
"trees": [{
"@id": "tb1",
"@otus": "ob1",
"tree": [{
"@id": "tr1",
"@xsi:type": "nex:FloatTree",
"edge": [{
"@id": "e0",
"@source": "n0",
"@target": "n1"
}
],
"node": [{
"@id": "n1",
"@otu": "otu88801"
},{
"@id": "n0"
}
]
}
]
}
]
}
}
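With this layout a client reaches a meta value with a single key lookup. The Python sketch below copies a few fields of the otu object from the example into a dict; the helper literal_value is hypothetical and only unwraps the Rule 7B object form.

```python
# Excerpt of the otu object from the example above.
otu = {
    "@id": "otu88801",
    "@label": "Ancyromonas sigmoides",
    "^ot:originalLabel": {"$": "Ancyromonas sigmoides", "@id": "bogus"},
    "^ot:ottId": 415973,
    "^skos:closeMatch": [
        {"@href": "http://purl.uniprot.org/taxonomy/94215", "@id": "meta4912509"},
        {"@href": "http://purl.uniprot.org/taxonomy/102624", "@id": "meta4912517"},
    ],
}


def literal_value(value):
    """Return the primitive whether the meta was written per Rule 7A or 7B."""
    return value["$"] if isinstance(value, dict) else value


ott_id = literal_value(otu["^ot:ottId"])
original_label = literal_value(otu["^ot:originalLabel"])

# Rule 9: a repeated property is an array, a single one is a bare object.
matches = otu.get("^skos:closeMatch", [])
if not isinstance(matches, list):
    matches = [matches]
close_match_urls = [match["@href"] for match in matches]

print(ott_id, original_label, close_match_urls)
```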
We can probably avoid supporting this form - it was proposed in email, but not implemented.
This representation is very similar to the @nexml2json=1.0.* form with the following exception: a "byId" representation is used for some fields rather than an array. In this representation:
- a single object is used in place of an array in the 1.0.0 syntax,
- the only permitted keys in the object are the id attributes of the elements,
- the value associated with the key is an object identical to the 1.0.0 representation except that the @id is not included.
- The NeXML form of the object is a sequence of elements, one for each key-value pair.
Specifically:
- Instead of node and edge arrays, the tree representation is expressed as:
  - internalEdge and terminalEdge arrays instead of edge (which if concatenated would recreate the edge array of the 1.0.* representation).
  - leafById and internalNodeById objects are used instead of a node array, and:
  - The ^ot:isLeaf field is omitted (since the presence in leaf conveys this info).
- an otuByID object replaces an otu array.
- an otusByID object replaces an otus array and the parent (nexml) object will have a ^ot:otusElementOrder key with an array of otus IDs to supply the order of the otus elements.
- a treesByID object replaces a trees array and the parent (nexml) object will have a ^ot:treesElementOrder key with an array of trees IDs to supply the order of the trees elements.
- a trees group object will have a ^ot:treeElementOrder key with an array of tree IDs to supply the order of the tree elements.
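The "byId" conversion described above can be sketched as a pair of Python helpers. The names to_by_id and from_by_id are hypothetical; the sketch only shows the generic step of moving each @id into a key and recording the original order, not the tree-specific splitting into internal and terminal edges.

```python
def to_by_id(elements):
    """Convert a 1.0-style array into a byId object plus an ID order list."""
    by_id = {}
    order = []
    for element in elements:
        element_id = element["@id"]
        order.append(element_id)
        # The @id moves into the key, so it is dropped from the value.
        by_id[element_id] = {k: v for k, v in element.items() if k != "@id"}
    return by_id, order


def from_by_id(by_id, order):
    """Rebuild the array form, restoring each @id and the recorded order."""
    return [{"@id": element_id, **by_id[element_id]} for element_id in order]


trees = [{"@id": "tb1", "@otus": "ob1"}]
trees_by_id, trees_order = to_by_id(trees)
nexml = {"treesByID": trees_by_id, "^ot:treesElementOrder": trees_order}
print(nexml)
```

Keeping the order list beside the byId object is what lets the array form, and therefore the NeXML element sequence, be regenerated.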
This is the form (1.2.1) that MTH thinks should be stored in serialized form, but on-the-fly translation could make that decision less important for tools other than the api.opentree.org services.
This is the same as syntax 1.1.* except:
- the internalEdge and terminalEdge arrays are replaced by an edgeBySourceId object with the following rules:
  - The only permitted keys in the object are the @source attributes of the edges,
  - The value associated with the key is an object whose keys are the ids of the edges that have that @source. The @source would not need to be included in a minimally sized representation, but it is retained because most clients will want to create "edgeById" and/or "edgeByTargetId" maps; the duplication here allows all 3 maps to share references to the same object. Note: in 1.2.0 the value was an array of edges; that is no longer supported by peyotl.
- Each object in the tree array will have a "^ot:rootNodeId" property that holds the ID of the node of the tree that is not the @target of any edge. The @root property is still retained in that node. The "^ot:specifiedRoot" is not identical to this, because that property is used to determine if the rooting is arbitrary.
- Instead of leafById and internalNodeById there is just a nodeById object; there is still no ^ot:isLeaf required because internal node ids will be keys in edgeBySourceId, enabling a fast answer to the "isLeaf" question.
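A Python sketch of how an edge array could be regrouped into edgeBySourceId, how the root node ID follows from it, and how the "isLeaf" question is answered. All function names are hypothetical, and dropping @id from each edge value is an assumption carried over from the byId convention.

```python
def edges_to_by_source_id(edges):
    """Group edges first by their @source, then by edge id."""
    by_source = {}
    for edge in edges:
        # @source is retained in the value; @id becomes the inner key.
        value = {k: v for k, v in edge.items() if k != "@id"}
        by_source.setdefault(edge["@source"], {})[edge["@id"]] = value
    return by_source


def find_root_node_id(node_ids, edge_by_source_id):
    """Return the one node that is not the @target of any edge."""
    targets = {edge["@target"]
               for out_edges in edge_by_source_id.values()
               for edge in out_edges.values()}
    roots = [node_id for node_id in node_ids if node_id not in targets]
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %d" % len(roots))
    return roots[0]


def is_leaf(node_id, edge_by_source_id):
    """A node is a leaf when it is not the source of any edge."""
    return node_id not in edge_by_source_id


edges = [{"@id": "e0", "@source": "n0", "@target": "n1"}]
edge_by_source_id = edges_to_by_source_id(edges)
print(edge_by_source_id)
print(find_root_node_id(["n1", "n0"], edge_by_source_id))
```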
This representation allows for a very rapid construction of the tree:
- Start at "^ot:rootNodeId"
- build the tree in preorder by looking up all of the outgoing edges in edgeBySourceId
Each of these lookups can be done in constant time, so the tree can be built in O(N) time without any code to deal with partially connected trees during the building process or any additional memory. Subtrees can also be built by starting at the MRCA.
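The traversal described above can be sketched in Python as follows. The function name and the small sample tree are hypothetical; the sketch yields (parent ID, node ID) pairs in preorder, which is enough for a client to attach each node to an already-created parent.

```python
def preorder_pairs(tree, start_node_id=None):
    """Return (parent_id, node_id) pairs in preorder from the root or a given node."""
    edge_by_source_id = tree["edgeBySourceId"]
    root_id = start_node_id if start_node_id is not None else tree["^ot:rootNodeId"]
    pairs = []
    stack = [(None, root_id)]
    while stack:
        parent_id, node_id = stack.pop()
        pairs.append((parent_id, node_id))
        # One dictionary lookup gives all outgoing edges of this node.
        out_edges = edge_by_source_id.get(node_id, {})
        # Push in reverse so children are visited in their stored order.
        for edge in reversed(list(out_edges.values())):
            stack.append((node_id, edge["@target"]))
    return pairs


# Hypothetical five-node tree: n0 -> (n1, n2), n2 -> (n3, n4).
tree = {
    "^ot:rootNodeId": "n0",
    "edgeBySourceId": {
        "n0": {"e1": {"@source": "n0", "@target": "n1"},
               "e2": {"@source": "n0", "@target": "n2"}},
        "n2": {"e3": {"@source": "n2", "@target": "n3"},
               "e4": {"@source": "n2", "@target": "n4"}},
    },
}
print(preorder_pairs(tree))
print(preorder_pairs(tree, start_node_id="n2"))
```

Because a parent always appears before its children, there is never a partially connected node to patch up later. Passing the MRCA as start_node_id yields just that subtree.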
BadgerFish is one of several schemes for rendering XML as JSON. Several sites, including a site that appears to be the original, and several refinements were consulted in developing the mapping appropriate for NeXML.
Correctness of translation was verified by using a backtranslator and validating the resulting XML using the validator on the NeXML home page.
We were straying from strict BadgerFish by not emitting the active XML namespaces in each object, and occasionally omitting the "datatype" for "meta" elements.
Given that roundtripping a file required special tools, we decided to take the leap and clean up several aspects of the BadgerFish mapping to make data access easier on clients and reduce the size of NexSON.
MTH intends to add logic to the API code to produce our old (close to straight BadgerFish) conversion via the API layer if a call includes an output_nexml2json=0.* argument.
Jim Allman, Karen Cranston, Cody Hinchliff, Mark Holder, Peter Midford, and Jonathan Rees participated in discussions and design of NexSON.