Merged
52 changes: 30 additions & 22 deletions VOTable.tex
@@ -407,28 +407,31 @@ \subsection{Primitives}
UTF-8-encoded byte, not to a single character.
Since UTF-8 is a variable-width encoding,
a character may require multiple bytes, and for arrays the
string length (length in characters) and primitive count (length in bytes)
string length (defined e.g.\ as the length in Unicode code points)
and primitive count (length in bytes)
will in general differ.
7-bit ASCII characters are however all encoded as a single byte in UTF-8,
so in the case of ASCII characters, which were required for this
datatype in earlier VOTable versions, the primitive and character count
are equal.
This means that a single (non-array) \literalvalue{char}
is capable of storing a 7-bit ASCII character only.
Strings must not be truncated mid-character,
so truncation of a string to fit a fixed-length char array may result in
unused bytes at the end of the array.
The character count as distinct from the primitive count
Strings must not be truncated midway through the multi-byte
representation of a code point,
so truncation of a string to fit a fixed-length \literalvalue{char} array
may result in unused bytes at the end of the array.
The code point count as distinct from the primitive count
may optionally be recorded by the \attr{width} attribute,
see \Aref{sec:form}.
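
The truncation rule above can be sketched as follows. This is a minimal illustration rather than anything prescribed by the diff: the function name \texttt{truncate\_utf8} is hypothetical, and padding the unused trailing bytes with NUL is an assumption made here for the example.

```python
def truncate_utf8(text: str, arraysize: int) -> bytes:
    """Fit text into exactly `arraysize` UTF-8 bytes without splitting a code point.

    Hypothetical helper: NUL padding of unused bytes is an assumption
    of this sketch, not something stated in the text above.
    """
    encoded = text.encode("utf-8")
    if len(encoded) > arraysize:
        cut = arraysize
        # encoded[cut] is the first byte that would be dropped. If it is a
        # UTF-8 continuation byte (bit pattern 10xxxxxx), the cut falls inside
        # a multi-byte sequence, so back up to the start of that sequence.
        while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
            cut -= 1
        encoded = encoded[:cut]
    # Any bytes left over at the end of the fixed-length array are padded.
    return encoded.ljust(arraysize, b"\x00")


if __name__ == "__main__":
    # "abΛ" needs 4 bytes; with arraysize=3 the two-byte Λ is dropped whole,
    # leaving one unused byte at the end of the array.
    print(truncate_utf8("abΛ", 3))
```

Backing up to a lead byte is what produces the unused trailing bytes mentioned above: the array keeps its fixed byte length while the stored string ends on a whole code point.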

For historical reasons the \literalvalue{unicodeChar} type can also be used
for character storage, but from VOTable 1.6 this type is deprecated.
For this type the primitive size of two bytes corresponds to a 2-byte
UTF-16 {\em code unit}.
Only characters in the Unicode Basic Multilingual Plane,
which all have 2-byte representations, are permitted for this datatype,
so that the primitive count matches the character count.
Only code points in the Unicode Basic Multilingual Plane,
which all have single code unit representations in UTF-16,
are permitted for this datatype,
so that the primitive count matches the code point count.
This is identical to the obsolete UCS-2 encoding,
which was the description used in earlier VOTable versions.
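
As a rough illustration of the BMP restriction, the sketch below rejects code points above U+FFFF and encodes the rest as big-endian UTF-16, so that each code point occupies exactly one 2-byte primitive. The function name \texttt{encode\_unicodechar} is hypothetical; big-endian byte order is taken from the \literalvalue{unicodeChar} definition later in the diff.

```python
def encode_unicodechar(text: str) -> bytes:
    """Encode BMP-only text as big-endian UTF-16 (hypothetical helper)."""
    for ch in text:
        if ord(ch) > 0xFFFF:
            # Non-BMP code points would need a surrogate pair (two code units),
            # which would break the one-primitive-per-code-point property.
            raise ValueError(f"non-BMP code point U+{ord(ch):06X} not permitted")
    return text.encode("utf-16-be")


if __name__ == "__main__":
    data = encode_unicodechar("ΛCDM")
    # Primitive count (2-byte code units) equals the code point count.
    print(len(data) // 2, len("ΛCDM"))
```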

@@ -489,7 +492,7 @@ \subsection{Columns as Arrays}\label{array}
since the character $\Lambda$ (Lambda)
is encoded in two bytes (0xCE, 0x9B) by UTF-8
while the ASCII characters L, C, D, M are encoded in one byte.
As explained in \Aref{sec:form} the number of characters MAY
As explained in \Aref{sec:form} the number of code points MAY
be reported by the \attr{width} attribute.
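
The distinction between the two counts can be checked with a short sketch. The helper name \texttt{char\_field\_sizes} is hypothetical; it derives a byte-based \attr{arraysize} and a code-point-based \attr{width} for a set of string values, assuming both are taken as the maximum over the values.

```python
def char_field_sizes(values):
    """Return (arraysize, width) for a list of strings (hypothetical helper).

    arraysize: longest value measured in UTF-8-encoded bytes.
    width:     longest value measured in Unicode code points.
    """
    arraysize = max((len(v.encode("utf-8")) for v in values), default=0)
    width = max((len(v) for v in values), default=0)
    return arraysize, width


if __name__ == "__main__":
    # Λ takes two bytes (0xCE, 0x9B) in UTF-8; C, D and M take one byte each.
    print(char_field_sizes(["ΛCDM"]))
```

For purely ASCII values the two numbers coincide, which is why equal \attr{width} and \attr{arraysize} values suggest ASCII-only content.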

A 1D array of strings can be represented as a 2D array of characters, but
@@ -1163,15 +1166,18 @@ \subsection{The \attr{precision} and \attr{width} Attributes}

When used with \attrval{datatype}{char} arrays the
\attr{width} attribute is interpreted as an upper bound for
the array length in characters, i.e.\ Unicode code points.
the string length in Unicode code points.
This value is distinct from the \attr{arraysize} attribute
which gives the array length in UTF-8-encoded bytes.
VOTable producers are not required to supply the \attr{width} attribute
for such columns,
but if known the string length can be useful to clients
for e.g.\ resource allocation.
In the case of multi-dimensional \texttt{char} arrays the value refers
to the number of characters in each string element.
A \attrval{datatype}{char} \elem{FIELD}
for which \attr{width} is equal to \attr{arraysize}
may be assumed to contain only ASCII data.
In the case of multi-dimensional \texttt{char} arrays the \attr{width} refers
to the number of code points in \emph{each} string element.

\subsection{Extended Datatype \attr{xtype}}
\label{sec:xtype}
@@ -2011,8 +2017,8 @@ \section{Definitions of Primitive Datatypes}
the \attr{arraysize} value is not NULL terminated.
The value MUST represent a legal UTF-8 encoded string,
and therefore MUST NOT be truncated midway through the multi-byte
representation of a character.
Characters are represented in the \elem{TABLEDATA} serialization
representation of a code point.
Character data is represented in the \elem{TABLEDATA} serialization
using the XML encoding of the VOTable document, which is typically UTF-8.
Also note the significance of the {\em white space} characters
in the \elem{TABLEDATA} serialization
@@ -2024,20 +2030,22 @@ \section{Definitions of Primitive Datatypes}
attribute specifies data type {\literalvalue{unicodeChar}},
the field shall contain in the \elem{BINARY}/\elem{BINARY2} serialization
the 2-byte big-endian UTF-16 encoding
of a Unicode character from the Basic Multilingual Plane
(equivalent to the obsolete UCS-2 encoding).
of a Unicode code point from the Basic Multilingual Plane,
i.e.\ a non-surrogate UTF-16 code unit;
this is equivalent to the obsolete UCS-2 encoding.
The \attr{arraysize} attribute
indicates a string composed of Unicode BMP characters.
Characters are represented in the \elem{TABLEDATA} serialization
indicates a string composed of Unicode BMP code points.
Character data is represented in the \elem{TABLEDATA} serialization
using the XML encoding of the VOTable document, which is typically UTF-8.
Also note the significance of the {\em white space} characters
in the \elem{TABLEDATA} serialization
(\Arefs{elem:TD}).
Regardless of serialization, non-BMP characters are not permitted
by this standard, but readers MAY treat such characters normally
Regardless of serialization, non-BMP code points
are not permitted in \literalvalue{unicodeChar} data,
but readers MAY treat such characters normally
if encountered, for instance by using a UTF-16 decoder on BINARY data,
though note in this case the \attr{arraysize}
may no longer match the character count.
may no longer match the code point count.

\item {\bf 16-Bit Integer}\quad If the value of the {\attr{datatype}}
attribute specifies datatype {\literalvalue{short}},
@@ -2420,7 +2428,7 @@ \subsection{Differences Between Versions 1.5 and 1.6}
but enables inclusion of arbitrary Unicode content
using the usual UTF-8 encoding.
\item Related to the above, the \attr{width} attribute now has a meaning
for character data, namely field length in characters
for character data, namely string length in code points
(as opposed to code units).
\item \ARef{sec:mime} is renamed from ``MIME Type'' to ``Media Type''.
\item The {\tt content} parameter is defined for the