Tighten up language about characters in Unicode context #90
Following James Tocknell's comments at issue ivoa-std#89.
msdemlei left a comment
For the benefit of other reviewers, here's the glossary (cf. https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G14527):
- code point is roughly what you'd call a Unicode character
- the code unit is the "word" in the serialisation stream, i.e., bytes for our char type and 16-bit ints for our unicodeChar
- a code unit sequence is 1 .. n code units encoding a single code point. I'm not 100% sure if the "single" actually strictly holds, but I think so.
If all of this is right, I'd say let's merge with that one comment I have.
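To make the glossary concrete, here is a minimal sketch in Python that prints the code points of a few sample strings next to their UTF-8 and UTF-16 code units. The helper name `describe` and the sample characters are illustrative assumptions, and the mapping of UTF-8 bytes to char and 16-bit units to unicodeChar simply follows the glossary above; it is not checked against the VOTable text.

```python
def describe(text: str) -> dict:
    """Return the code points and the UTF-8 / UTF-16 code units of `text`.

    Hypothetical helper for illustration only.
    """
    utf16 = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return {
        # One entry per code point.
        "code_points": [f"U+{ord(ch):04X}" for ch in text],
        # UTF-8 code units are single bytes.
        "utf8_units": [f"{b:02X}" for b in text.encode("utf-8")],
        # UTF-16 code units are 16-bit values, shown here as 4 hex digits.
        "utf16_units": [
            utf16[i:i + 2].hex().upper() for i in range(0, len(utf16), 2)
        ],
    }


# "A" needs one code unit in both encodings; U+00C5 needs two UTF-8 code
# units; U+1F600 needs four UTF-8 code units or two UTF-16 code units.
for sample in ("A", "\u00C5", "\U0001F600"):
    print(describe(sample))
```

In each case the list of code units printed for a single-character string is the code unit sequence for that one code point.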
> so truncation of a string to fit a fixed-length char array may result in
> unused bytes at the end of the array.
> The character count as distinct from the primitive count
> Strings must not be truncated midway through a code point,
This needs to be "midway through a code unit sequence" (or "byte sequence", if you prefer).
I don't see the term "code unit sequence" defined at the reference you give (though I can see it used in that text). Is the definition there?
On Mon, May 04, 2026 at 08:47:04AM -0700, Mark Taylor wrote:
> mbtaylor left a comment (ivoa-std/VOTable#90)
> I don't see the term "code unit sequence" defined at the reference
> you give (though I can see it used in that text). Is the
> definition there?
I don't think so. I've looked for something like a glossary in the
document but failed to find one.
In the absence of a definition, it seems like a potentially confusing term to use ... "code unit sequence" (even worse, "byte sequence") just sounds to me like a sequence of code units [not necessarily aligned on a code point boundary]. However, in this context I suppose there's not much else it can mean. In section 6 I see I've used the form of words
> MUST NOT be truncated midway through the multi-byte representation of a code point
do you think that would be better?
On Tue, May 05, 2026 at 04:50:18AM -0700, Mark Taylor wrote:
> In section 6 I see I've used the form of words
> > MUST NOT be truncated midway through the multi-byte representation of a code point
> do you think that would be better?
"Better" I don't know about. Less jargony in any case, and I'd say
that makes it preferable.
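For reviewers who want to see what the rule means in practice, the following is a rough sketch in Python of truncating a UTF-8 string to a fixed byte budget without cutting through the multi-byte representation of a code point. The function name `truncate_utf8` and its signature are hypothetical, and the sketch covers UTF-8 only; it is not taken from the VOTable text or from any existing library.

```python
def truncate_utf8(text: str, max_bytes: int) -> bytes:
    """Encode `text` as UTF-8 and keep at most `max_bytes` bytes,
    never splitting the encoding of a single code point.

    Hypothetical helper for illustration only.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return encoded
    cut = max_bytes
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the first
    # byte we would drop is a continuation byte, the cut falls inside a
    # multi-byte sequence, so step back to the start of that sequence.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut]


# "aÅb" is 4 bytes in UTF-8 (61 C3 85 62). With room for only 2 bytes,
# the two-byte "Å" does not fit, so one byte of the array is left unused.
result = truncate_utf8("a\u00C5b", 2)
print(result, len(result))
```

The result can be shorter than the requested length, which is the "unused bytes at the end of the array" case described in the diff.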
@mbtaylor Yes, I think these changes remove the ambiguity that was there before. I don't think this is worth sticking in the VOTable document (I've never seen a standard which puts it in), but I presume there are no requirements around normalization forms? If you haven't come across it, here's some example Python code showing the different forms for the Angstrom character:
I suspect this is something clients defer to their language/library/framework to handle.
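As an illustration of the normalization point, a minimal sketch using Python's standard `unicodedata` module might look like the following. It is not the snippet from the original comment; the variable names are assumptions chosen for this example.

```python
import unicodedata

# Three ways of writing what a reader would see as the same character.
angstrom_sign = "\u212B"  # ANGSTROM SIGN
a_with_ring = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"    # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

# As raw code point sequences, the three strings are all different.
print(angstrom_sign == a_with_ring, a_with_ring == decomposed)

# Normalizing the ANGSTROM SIGN under each form gives these code points.
forms = {
    form: [f"U+{ord(ch):04X}" for ch in unicodedata.normalize(form, angstrom_sign)]
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
for form, code_points in forms.items():
    print(form, code_points)
```

After normalization to a common form the three strings compare equal, while the number of code points (and hence the encoded length) depends on which form is chosen.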
Gosh, there's always more to know about Unicode, isn't there? But yes, I agree that normalization is out of scope for VOTable, I don't think it needs to go beyond explaining how to encode a sequence of code points. Thanks for the approval.