Tighten up language about characters in Unicode context #90

Merged: mbtaylor merged 2 commits into ivoa-std:master from mbtaylor:code-points on May 6, 2026

mbtaylor (Member) commented May 1, 2026

Following James Tocknell's comments at issue #89.

msdemlei (Collaborator) left a comment

For the benefit of other reviewers, here's the glossary (cf. https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G14527):

  • A code point is roughly what you'd call a Unicode character.
  • A code unit is the "word" in the serialisation stream, i.e., bytes for our char type and 16-bit ints for our unicodeChar.
  • A code unit sequence is 1..n code units encoding a single code point. I'm not 100% sure the "single" strictly holds, but I think so.

If all of this is right, I'd say let's merge with that one comment I have.
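
As a rough illustration of the three glossary terms, here is a minimal Python sketch. It assumes UTF-8 for the byte-oriented case and UTF-16 for the 16-bit case, as in the glossary's description; the sample string and variable names are arbitrary choices for illustration, not taken from VOTable.

```python
# Sample string: 'A', U+00C5 (A with ring above), U+1F600 (outside the BMP).
text = "A\u00c5\U0001F600"

# Code points: a Python str is a sequence of code points.
code_points = [ord(ch) for ch in text]

# UTF-8 code units are bytes; each code point is encoded as 1 to 4 of them.
utf8_units_per_cp = [len(ch.encode("utf-8")) for ch in text]

# UTF-16 code units are 16-bit integers; each code point takes 1 or 2 of them.
utf16_units_per_cp = [len(ch.encode("utf-16-be")) // 2 for ch in text]

print(code_points)         # [65, 197, 128512]
print(utf8_units_per_cp)   # [1, 2, 4]
print(utf16_units_per_cp)  # [1, 1, 2]
```

Each entry in the last two lists is the length of the code unit sequence for one code point.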

Comment thread VOTable.tex Outdated
so truncation of a string to fit a fixed-length char array may result in
unused bytes at the end of the array.
The character count as distinct from the primitive count
Strings must not be truncated midway through a code point,

This needs to be "midway through a code unit sequence" (or "byte sequence", if you prefer).

mbtaylor (Member, Author) commented May 4, 2026

I don't see the term "code unit sequence" defined at the reference you give (though I can see it used in that text). Is the definition there?

msdemlei (Collaborator) commented May 5, 2026 via email

mbtaylor (Member, Author) commented May 5, 2026

In the absence of a definition, it seems like a potentially confusing term to use: "code unit sequence" (even worse, "byte sequence") just sounds to me like a sequence of code units [not necessarily aligned on a code point boundary]. However, in this context I suppose there's not much else it can mean.

In section 6 I see I've used the form of words:

"MUST NOT be truncated midway through the multi-byte representation of a code point"

Do you think that would be better?
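
To make the truncation rule concrete, here is a minimal Python sketch of cutting a UTF-8 string to fit a fixed number of bytes without splitting the multi-byte representation of a code point. The function name `truncate_utf8` and the sample string are hypothetical, chosen only for illustration; this is not code from VOTable or any particular library.

```python
def truncate_utf8(text: str, max_bytes: int) -> bytes:
    """Encode text as UTF-8, keeping at most max_bytes (assumed >= 0)
    and never cutting inside the encoding of a code point."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return encoded
    cut = max_bytes
    # UTF-8 continuation bytes look like 10xxxxxx. If the first dropped byte
    # is one of them, the cut is mid-code-point, so step back to a boundary.
    while cut > 0 and (encoded[cut] & 0xC0) == 0x80:
        cut -= 1
    return encoded[:cut]


sample = "\u00c5ngstr\u00f6m"         # 8 code points, 10 UTF-8 bytes
truncated = truncate_utf8(sample, 8)  # fit into an 8-byte array
print(truncated)                      # b'\xc3\x85ngstr'
print(len(truncated))                 # 7
```

Here the eighth byte would have been the first half of the two-byte encoding of U+00F6, so only 7 bytes are kept and one byte of the 8-byte array is left unused, which is the situation the VOTable.tex text describes.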

msdemlei (Collaborator) commented May 5, 2026 via email

mbtaylor (Member, Author) commented May 5, 2026

@aragilar can you comment on whether this addresses your concerns at #89?

aragilar commented May 6, 2026

@mbtaylor Yes, I think these changes remove the ambiguity that was there before.

I don't think this is worth sticking in the VOTable document (I've never seen a standard which puts it in), but I presume there are no requirements around normalization forms? If you haven't come across it, here's some example Python code showing the different forms for the Angstrom character:

>>> import unicodedata
>>> unicode_str = "Å"
>>> unicodedata.normalize("NFC", unicode_str)
'Å'
>>> unicodedata.normalize("NFC", unicode_str).encode("utf-8")
b'\xc3\x85'
>>> unicodedata.normalize("NFD", unicode_str).encode("utf-8")
b'A\xcc\x8a'
>>> unicodedata.normalize("NFKD", unicode_str).encode("utf-8")
b'A\xcc\x8a'
>>> unicodedata.normalize("NFKC", unicode_str).encode("utf-8")
b'\xc3\x85'

I suspect this is something clients defer to their language/library/framework to handle.
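
As a side note on why the choice of form can matter when counting, here is a small sketch showing that the composed and decomposed forms differ in the number of code points and code units. It assumes the character is U+00C5, written as an escape to avoid ambiguity, and the variable names are illustrative only.

```python
import unicodedata

angstrom = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

# For each form: (code points, UTF-8 code units, UTF-16 code units).
counts = {}
for form in ("NFC", "NFD"):
    s = unicodedata.normalize(form, angstrom)
    counts[form] = (len(s), len(s.encode("utf-8")), len(s.encode("utf-16-be")) // 2)

print(counts)  # {'NFC': (1, 2, 1), 'NFD': (2, 3, 2)}

# The dedicated ANGSTROM SIGN (U+212B) normalizes to U+00C5 under NFC.
print(unicodedata.normalize("NFC", "\u212b") == angstrom)  # True
```

So the same visible character can occupy a different number of array elements depending on which form the producer happened to write.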

mbtaylor (Member, Author) commented May 6, 2026

Gosh, there's always more to know about Unicode, isn't there? But yes, I agree that normalization is out of scope for VOTable; I don't think it needs to go beyond explaining how to encode a sequence of code points.

Thanks for the approval.

mbtaylor merged commit 69d0d22 into ivoa-std:master on May 6, 2026
1 check passed
mbtaylor deleted the code-points branch on May 6, 2026 08:05

3 participants