Tighten up language about characters in Unicode context by mbtaylor · Pull Request #90 · ivoa-std/VOTable

mbtaylor · 2026-05-01T12:22:37Z

Following James Tocknell's comments at issue #89.

msdemlei

For the benefit of other reviewers, here's the glossary (cf. https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G14527):

- a code point is roughly what you'd call a Unicode character
- a code unit is the "word" in the serialisation stream, i.e., bytes for our char type and 16-bit ints for our unicodeChar
- a code unit sequence is 1 .. n code units encoding a single code point. I'm not 100% sure if the "single" actually strictly holds, but I think so.
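
To make the three terms concrete, here is a minimal Python sketch that counts the code units needed for a single code point in UTF-8 and UTF-16. The helper name `describe` is made up for illustration, and the mapping of char to UTF-8 bytes and unicodeChar to 16-bit units follows the glossary above rather than quoting the standard.

```python
def describe(ch: str) -> dict:
    """Summarise how one code point is serialised (hypothetical helper)."""
    utf8 = ch.encode("utf-8")        # code units are single bytes
    utf16 = ch.encode("utf-16-be")   # code units are 16-bit integers, no BOM
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf8_code_units": len(utf8),
        "utf16_code_units": len(utf16) // 2,
    }


# One code point each; the code unit sequence length depends on the encoding.
for ch in ("A", "\u00c5", "\U0001F600"):
    print(describe(ch))
```

In this sketch each input string holds exactly one code point, so the counts reported are the lengths of one code unit sequence per encoding.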

If all of this is right, I'd say let's merge with that one comment I have.

msdemlei · 2026-05-04T09:30:15Z

-so truncation of a string to fit a fixed-length char array may result in
-unused bytes at the end of the array.
-The character count as distinct from the primitive count
+Strings must not be truncated midway through a code point,


This needs to be "midway through a code unit sequence" (or "byte sequence", if you prefer).
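
For readers who want to see what the rule means in practice, the following is a rough sketch of truncating a string to a fixed-length UTF-8 byte array without cutting through the bytes of a code point. The function name `truncate_utf8` is hypothetical, and padding the unused tail with NUL bytes is an assumption made for this example, not wording taken from the PR.

```python
def truncate_utf8(text: str, nbytes: int) -> bytes:
    """Fit text into nbytes of UTF-8 without splitting a code point (sketch)."""
    data = text.encode("utf-8")
    if len(data) <= nbytes:
        # Assumption: unused trailing bytes are filled with NUL.
        return data.ljust(nbytes, b"\x00")
    cut = nbytes
    # UTF-8 continuation bytes look like 10xxxxxx. If the first dropped byte
    # is a continuation byte, the cut falls inside a sequence, so back up.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].ljust(nbytes, b"\x00")


# Two code points of two bytes each do not fit in three bytes, so only the
# first one is kept and one byte at the end of the array is left unused.
print(truncate_utf8("\u00c5\u00c5", 3))
```

This also shows where the "unused bytes at the end of the array" in the removed text come from: the last complete code point may end before the array does.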

mbtaylor · 2026-05-04T15:46:43Z

I don't see the term "code unit sequence" defined at the reference you give (though I can see it used in that text). Is the definition there?

msdemlei · 2026-05-05T11:12:35Z

On Mon, May 04, 2026 at 08:47:04AM -0700, Mark Taylor wrote:
> I don't see the term "code unit sequence" defined at the reference you give (though I can see it used in that text). Is the definition there?

I don't think so. I've looked for something like a glossary in the document but failed to find one.

mbtaylor · 2026-05-05T11:49:57Z

In absence of a definition, it seems like a potentially confusing term to use ... "code unit sequence" (even worse "byte sequence") just sounds to me like a sequence of code units [not necessarily aligned on a code point boundary]. However, in this context I suppose there's not much else it can mean.

In section 6 I see I've used the form of words

MUST NOT be truncated midway through the multi-byte representation of a code point

do you think that would be better?
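
The same concern applies to 16-bit code units, where a code point outside the Basic Multilingual Plane is represented by a surrogate pair. The sketch below assumes unicodeChar values are serialised as UTF-16 code units and padded with zeros; the function name `truncate_utf16` is hypothetical, and none of this is quoted from the VOTable text.

```python
import struct


def truncate_utf16(text: str, nunits: int) -> list[int]:
    """Fit text into nunits UTF-16 code units without splitting a pair (sketch)."""
    raw = text.encode("utf-16-be")
    units = list(struct.unpack(f">{len(raw) // 2}H", raw))
    if len(units) > nunits:
        cut = nunits
        # If the first dropped unit is a low surrogate (0xDC00-0xDFFF), the
        # high surrogate before it would be left dangling, so drop that too.
        if cut > 0 and 0xDC00 <= units[cut] <= 0xDFFF:
            cut -= 1
        units = units[:cut]
    # Assumption: unused trailing code units are filled with zeros.
    return units + [0] * (nunits - len(units))


# "a" followed by U+1F600 needs three code units; with room for only two,
# the whole surrogate pair is dropped rather than half of it.
print([hex(u) for u in truncate_utf16("a\U0001F600", 2)])
```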

msdemlei · 2026-05-05T12:24:55Z

On Tue, May 05, 2026 at 04:50:18AM -0700, Mark Taylor wrote:
> In section 6 I see I've used the form of words
> > MUST NOT be truncated midway through the multi-byte representation of a code point
> do you think that would be better?

"Better" I don't know about. Less jargony in any case, and I'd say that makes it preferable.

mbtaylor · 2026-05-05T12:37:21Z

@aragilar can you comment on whether this addresses your concerns at #89?

aragilar · 2026-05-06T06:56:35Z

@mbtaylor Yes, I think these changes remove the ambiguity that was there before.

I don't think this is worth sticking in the VOTable document (I've never seen a standard which puts it in), but I presume there are no requirements around normalization forms? If you haven't come across it, here's some example Python code showing the different forms for the Angstrom character:

>>> import unicodedata
>>> unicode_str = "Å"
>>> unicodedata.normalize("NFC", unicode_str)
'Å'
>>> unicodedata.normalize("NFC", unicode_str).encode("utf-8")
b'\xc3\x85'
>>> unicodedata.normalize("NFD", unicode_str).encode("utf-8")
b'A\xcc\x8a'
>>> unicodedata.normalize("NFKD", unicode_str).encode("utf-8")
b'A\xcc\x8a'
>>> unicodedata.normalize("NFKC", unicode_str).encode("utf-8")
b'\xc3\x85'

I suspect this is something clients defer to their language/library/framework to handle.
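
One practical consequence, sketched below under the same assumptions as the session above: the number of code points and the number of UTF-8 bytes both depend on the normalization form, so the same visible text can need different array sizes. The helper name `sizes` is made up for this illustration.

```python
import unicodedata


def sizes(text: str, form: str) -> tuple[int, int]:
    """Return (code points, UTF-8 bytes) after normalizing text (sketch)."""
    normalized = unicodedata.normalize(form, text)
    return len(normalized), len(normalized.encode("utf-8"))


# U+00C5 stays one code point under NFC but decomposes into "A" plus a
# combining ring (U+030A) under NFD.
for form in ("NFC", "NFD"):
    print(form, sizes("\u00c5", form))
```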

mbtaylor · 2026-05-06T07:56:03Z

Gosh, there's always more to know about Unicode, isn't there? But yes, I agree that normalization is out of scope for VOTable; I don't think it needs to go beyond explaining how to encode a sequence of code points.

Thanks for the approval.

Tighten up language about characters in Unicode context

e983dc9

Following James Tocknell's comments at issue ivoa-std#89.

mbtaylor mentioned this pull request May 1, 2026

Clairty around intended reading of "character" in section 4.2 and elsewhere (v1.6) #89

Closed

msdemlei approved these changes May 4, 2026

Reword "midway through a code point"

e5b1180

mbtaylor merged commit 69d0d22 into ivoa-std:master May 6, 2026
1 check passed

mbtaylor deleted the code-points branch May 6, 2026 08:05

mbtaylor merged 2 commits into ivoa-std:master from mbtaylor:code-points
