A String in ProtoScript is an immutable finite sequence of Unicode glyphs.
Internally, strings are stored using UTF-8 encoding, but all observable semantics are defined in terms of glyphs, not bytes.
UTF-8 is considered an implementation detail.

typedef struct PSString {
    char *utf8;              /* UTF-8 buffer, not null-terminated */
    size_t byte_len;         /* length in bytes */
    uint32_t *glyph_offsets; /* byte offsets of each glyph */
    size_t glyph_count;      /* number of glyphs */
} PSString;

- utf8 always contains a valid UTF-8 sequence
- glyph_offsets[i] points to the first byte of the i-th glyph
- glyph_offsets is strictly increasing
- glyph_count is fixed at creation time
- Strings are immutable after creation
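
The structural invariants above can be written down as a checking function. The following is a minimal sketch, not part of the specification: the name ps_string_check_offsets is hypothetical, and it assumes that glyph_offsets holds exactly glyph_count entries and that the first glyph starts at byte 0. It checks only the offset table and does not re-validate the UTF-8 content.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct PSString {
    char *utf8;              /* UTF-8 buffer, not null-terminated */
    size_t byte_len;         /* length in bytes */
    uint32_t *glyph_offsets; /* byte offsets of each glyph */
    size_t glyph_count;      /* number of glyphs */
} PSString;

/* Hypothetical helper: returns true if the offset table satisfies the
 * invariants listed above. */
static bool ps_string_check_offsets(const PSString *s)
{
    assert(s != NULL);
    if (s->glyph_count == 0) {
        /* Empty string: no bytes and no offset table. */
        return s->byte_len == 0 && s->glyph_offsets == NULL;
    }
    /* Assumption: the first glyph starts at the beginning of the buffer. */
    if (s->glyph_offsets[0] != 0) {
        return false;
    }
    /* Offsets must be strictly increasing. */
    for (size_t i = 1; i < s->glyph_count; i++) {
        if (s->glyph_offsets[i] <= s->glyph_offsets[i - 1]) {
            return false;
        }
    }
    /* The last glyph must start inside the buffer. */
    return s->glyph_offsets[s->glyph_count - 1] < s->byte_len;
}
```

A check like this would typically be enabled only in debug builds, since the invariants are established once at creation time.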
The empty string has:
- byte_len = 0
- glyph_count = 0
- glyph_offsets = NULL
When a string is created:
- The UTF-8 input is validated
- Glyph boundaries are detected
- glyph_offsets is built
- glyph_count is finalized
All strings are therefore normalized and validated at creation time.
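
The creation steps can be sketched in C as follows. This is an illustration under stated assumptions, not the engine's actual constructor: the names ps_string_new and utf8_seq_len are hypothetical, and one glyph is taken to be one Unicode code point (which is what charCodeAt's definition suggests). Invalid input is rejected by returning NULL.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct PSString {
    char *utf8;
    size_t byte_len;
    uint32_t *glyph_offsets;
    size_t glyph_count;
} PSString;

/* Returns the byte length (1-4) of the valid UTF-8 sequence at p,
 * or 0 if the bytes are not valid UTF-8. avail = bytes remaining. */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    size_t n;
    uint32_t cp, min;
    if (p[0] < 0x80) return 1;
    if ((p[0] & 0xE0) == 0xC0)      { n = 2; cp = p[0] & 0x1F; min = 0x80; }
    else if ((p[0] & 0xF0) == 0xE0) { n = 3; cp = p[0] & 0x0F; min = 0x800; }
    else if ((p[0] & 0xF8) == 0xF0) { n = 4; cp = p[0] & 0x07; min = 0x10000; }
    else return 0;                        /* stray continuation or bad lead byte */
    if (avail < n) return 0;              /* truncated sequence */
    for (size_t i = 1; i < n; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (uint32_t)(p[i] & 0x3F);
    }
    if (cp < min) return 0;                       /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                  /* beyond Unicode range */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   /* surrogate */
    return n;
}

/* Hypothetical constructor: validates, detects glyph boundaries, builds
 * glyph_offsets, and finalizes glyph_count. Returns NULL on failure. */
static PSString *ps_string_new(const char *bytes, size_t len)
{
    assert(bytes != NULL || len == 0);
    if (len > UINT32_MAX) return NULL;    /* offsets are stored as uint32_t */
    PSString *s = malloc(sizeof *s);
    if (!s) return NULL;
    s->utf8 = NULL;
    s->byte_len = 0;
    s->glyph_offsets = NULL;
    s->glyph_count = 0;
    if (len == 0) return s;               /* the empty string */

    s->utf8 = malloc(len);
    /* Upper bound: at most one glyph per byte. */
    s->glyph_offsets = malloc(len * sizeof *s->glyph_offsets);
    if (!s->utf8 || !s->glyph_offsets) goto fail;
    memcpy(s->utf8, bytes, len);

    size_t pos = 0;
    while (pos < len) {
        size_t n = utf8_seq_len((const unsigned char *)bytes + pos, len - pos);
        if (n == 0) goto fail;            /* validation failed */
        s->glyph_offsets[s->glyph_count++] = (uint32_t)pos;
        pos += n;
    }
    s->byte_len = len;
    return s;

fail:
    free(s->utf8);
    free(s->glyph_offsets);
    free(s);
    return NULL;
}
```

The offset table is allocated at its worst-case size for simplicity; a production implementation might count glyphs first or shrink the allocation afterwards.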
string.length

Returns the number of Unicode glyphs contained in the string.
Formally:
string.length === glyph_count
All string indexing operations are glyph-based and zero-based.
Valid indices are:
0 <= index < string.length
string.charAt(index)

Behavior:
- If index is out of bounds → returns the empty string ""
- Otherwise → returns a new String containing exactly one glyph
The glyph corresponds to the UTF-8 byte range:
utf8[glyph_offsets[index] .. glyph_offsets[index + 1] - 1]
For the last glyph (index = glyph_count - 1), the range ends at byte_len - 1.
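
The byte-range lookup behind charAt can be sketched as a constant-time function. This is a sketch under assumptions: the name ps_glyph_range is hypothetical, the range is returned half-open as [start, end), and glyph_offsets is assumed to hold exactly glyph_count entries, so the last glyph ends at byte_len.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct PSString {
    char *utf8;
    size_t byte_len;
    uint32_t *glyph_offsets;
    size_t glyph_count;
} PSString;

/* Hypothetical helper: writes the half-open byte range [*start, *end) of
 * the glyph at `index`. Returns false when index is out of bounds, which
 * is the case where charAt yields the empty string. */
static bool ps_glyph_range(const PSString *s, size_t index,
                           size_t *start, size_t *end)
{
    assert(s != NULL && start != NULL && end != NULL);
    if (index >= s->glyph_count) {
        return false;
    }
    *start = s->glyph_offsets[index];
    /* The last glyph has no successor entry, so it ends at byte_len. */
    *end = (index + 1 < s->glyph_count) ? s->glyph_offsets[index + 1]
                                        : s->byte_len;
    return true;
}
```

A charAt implementation would then copy the bytes in that range into a new one-glyph string, leaving the source untouched.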
string.charCodeAt(index)

Behavior:
- If index is out of bounds → returns NaN
- Otherwise → returns the Unicode code point of the glyph at index
The code point is obtained by decoding the corresponding UTF-8 sequence.
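
As an illustration of the decoding step, here is a minimal sketch of charCodeAt in C. The names utf8_decode and ps_char_code_at are hypothetical, the result is modeled as a double so that NaN can be returned, and the decoder assumes the bytes were already validated at creation time (it performs no error checking of its own).

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>
#include <stdint.h>

typedef struct PSString {
    char *utf8;
    size_t byte_len;
    uint32_t *glyph_offsets;
    size_t glyph_count;
} PSString;

/* Decodes one code point from an n-byte UTF-8 sequence (n in 1..4) that
 * is already known to be valid. */
static uint32_t utf8_decode(const unsigned char *p, size_t n)
{
    switch (n) {
    case 1:
        return p[0];
    case 2:
        return ((uint32_t)(p[0] & 0x1F) << 6) | (uint32_t)(p[1] & 0x3F);
    case 3:
        return ((uint32_t)(p[0] & 0x0F) << 12)
             | ((uint32_t)(p[1] & 0x3F) << 6)
             | (uint32_t)(p[2] & 0x3F);
    default: /* 4 bytes */
        return ((uint32_t)(p[0] & 0x07) << 18)
             | ((uint32_t)(p[1] & 0x3F) << 12)
             | ((uint32_t)(p[2] & 0x3F) << 6)
             | (uint32_t)(p[3] & 0x3F);
    }
}

/* Hypothetical counterpart of string.charCodeAt(index). */
static double ps_char_code_at(const PSString *s, size_t index)
{
    assert(s != NULL);
    if (index >= s->glyph_count) {
        return NAN;                       /* out of bounds */
    }
    size_t start = s->glyph_offsets[index];
    size_t end = (index + 1 < s->glyph_count) ? s->glyph_offsets[index + 1]
                                              : s->byte_len;
    return (double)utf8_decode((const unsigned char *)s->utf8 + start,
                               end - start);
}
```

For a string holding "a", "é" and U+1F600, this returns 97, 233 and 128512 for indices 0 to 2, and NaN for index 3.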
The + operator concatenates strings as follows:
- UTF-8 buffers are concatenated
- The resulting buffer is rescanned
- A new string is created
Source strings remain unchanged.
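
The concatenation steps might look like the sketch below. The name ps_string_concat is hypothetical. Because both inputs are already valid, the rescan here only looks for bytes that are not UTF-8 continuation bytes (10xxxxxx) instead of re-running full validation; that shortcut is an assumption of this sketch, not something the specification states.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct PSString {
    char *utf8;
    size_t byte_len;
    uint32_t *glyph_offsets;
    size_t glyph_count;
} PSString;

/* Hypothetical implementation of the + operator on two strings.
 * Returns a newly allocated string, or NULL on failure. */
static PSString *ps_string_concat(const PSString *a, const PSString *b)
{
    assert(a != NULL && b != NULL);
    size_t len = a->byte_len + b->byte_len;
    if (len > UINT32_MAX) return NULL;    /* offsets are stored as uint32_t */

    PSString *r = malloc(sizeof *r);
    if (!r) return NULL;
    r->utf8 = NULL;
    r->byte_len = len;
    r->glyph_offsets = NULL;
    r->glyph_count = 0;
    if (len == 0) return r;               /* "" + "" is the empty string */

    r->utf8 = malloc(len);
    r->glyph_offsets = malloc((a->glyph_count + b->glyph_count)
                              * sizeof *r->glyph_offsets);
    if (!r->utf8 || !r->glyph_offsets) {
        free(r->utf8);
        free(r->glyph_offsets);
        free(r);
        return NULL;
    }

    /* Step 1: concatenate the UTF-8 buffers (sources are only read). */
    if (a->byte_len) memcpy(r->utf8, a->utf8, a->byte_len);
    if (b->byte_len) memcpy(r->utf8 + a->byte_len, b->utf8, b->byte_len);

    /* Step 2: rescan. Every non-continuation byte starts a glyph. */
    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)r->utf8[i] & 0xC0) != 0x80) {
            r->glyph_offsets[r->glyph_count++] = (uint32_t)i;
        }
    }
    return r;
}
```

An implementation could also derive the new offset table by copying a's offsets and shifting b's by a->byte_len, which avoids the rescan entirely.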
String comparisons are:
- Lexicographical
- Based on Unicode code point order
- Independent of UTF-8 byte length
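
One way to implement this ordering is sketched below, with the hypothetical name ps_string_compare. It relies on a property of UTF-8: for valid sequences, byte-wise order is the same as code point order, so no decoding is needed. When one buffer is a byte prefix of the other, the shorter string sorts first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct PSString {
    char *utf8;
    size_t byte_len;
    uint32_t *glyph_offsets;
    size_t glyph_count;
} PSString;

/* Hypothetical comparison: returns -1, 0, or 1 according to
 * lexicographical code point order. */
static int ps_string_compare(const PSString *a, const PSString *b)
{
    assert(a != NULL && b != NULL);
    size_t n = a->byte_len < b->byte_len ? a->byte_len : b->byte_len;
    /* memcmp compares bytes as unsigned char, which matches code point
     * order for valid UTF-8. Skip the call when there is nothing to compare. */
    int c = (n > 0) ? memcmp(a->utf8, b->utf8, n) : 0;
    if (c != 0) {
        return c < 0 ? -1 : 1;
    }
    if (a->byte_len == b->byte_len) {
        return 0;
    }
    /* One string is a prefix of the other: the shorter one comes first. */
    return a->byte_len < b->byte_len ? -1 : 1;
}
```

For example, "z" (one byte) sorts before "é" (two bytes) because U+007A is less than U+00E9, and "é" sorts after "zz" even though both occupy two bytes.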
Strings are strictly immutable.
Any operation producing a string returns a new instance.
This property is mandatory for:
- memory safety
- object sharing
- garbage collection simplicity
Conversion rules:
- String → unchanged
- Number → ASCII decimal representation
- Boolean → "true" / "false"
- null → "null"
- undefined → "undefined"
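
The non-String conversion rules can be sketched as a simple switch. Everything about the value model here is hypothetical (PSTag, PSValue, ps_to_cstring), and the output is written to a plain C buffer for brevity. The number format "%.15g" is an assumption: the rules above only require an ASCII decimal representation and do not fix the exact format. The String case is omitted because it returns its input unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical tagged value, for illustration only. */
typedef enum { PS_UNDEFINED, PS_NULL, PS_BOOLEAN, PS_NUMBER } PSTag;

typedef struct PSValue {
    PSTag tag;
    union {
        bool b;
        double n;
    } as;
} PSValue;

/* Writes the string form of v into out (capacity cap, null-terminated).
 * Returns the number of characters the full text needs, as snprintf does. */
static int ps_to_cstring(PSValue v, char *out, size_t cap)
{
    assert(out != NULL && cap > 0);
    switch (v.tag) {
    case PS_UNDEFINED:
        return snprintf(out, cap, "undefined");
    case PS_NULL:
        return snprintf(out, cap, "null");
    case PS_BOOLEAN:
        return snprintf(out, cap, "%s", v.as.b ? "true" : "false");
    case PS_NUMBER:
        /* Assumed format; the exact decimal representation is unspecified. */
        return snprintf(out, cap, "%.15g", v.as.n);
    }
    return -1; /* unknown tag */
}
```

In an engine, the resulting ASCII bytes would be passed through the normal string creation path, so the converted value is validated like any other string.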
This specification intentionally diverges from ECMAScript 1 in the following areas:
- characters are Unicode glyphs, not UTF-16 code units
- length counts glyphs
- charCodeAt returns full Unicode code points
These deviations are explicit, documented, and stable.
- UTF-8 internal storage
- glyph-based semantics
- O(1) glyph access
- immutable strings
- Unicode-correct behavior
This document is normative for the ProtoScript engine.
