feat(string): add UTF-8 string conversion and validation functions #2528

Open
bobtista wants to merge 8 commits into TheSuperHackers:main from bobtista:bobtista/feat/utf8-string-functions

bobtista commented Apr 3, 2026

Adds UTF-8 string handling to WWLib and plumbs it through the codebase, replacing the GameSpy-specific Win32 wrappers with a shared implementation.

Picks up the work proposed in #2045 by @slurmlord, with API adjustments per the review from @xezon.

New: WWLib/utf8.h / utf8.cpp

  • Utf8_Num_Bytes(char lead) — byte count of a UTF-8 character from its lead byte
  • Utf8_Trailing_Invalid_Bytes(const char* str, size_t length) — count of invalid trailing bytes due to an incomplete multi-byte sequence
  • Utf8_Validate(const char* str) / Utf8_Validate(const char* str, size_t length) — returns true if the string is valid UTF-8 per RFC 3629 (rejects overlong encodings and codepoints above U+10FFFF)
  • Utf16Le_To_Utf8_Len(const wchar_t* src, size_t srcLen) / Utf8_To_Utf16Le_Len(const char* src, size_t srcLen) — required output size, not counting null terminator
  • Utf16Le_To_Utf8(char* dest, size_t destLen, const wchar_t* src, size_t srcLen)
  • Utf8_To_Utf16Le(wchar_t* dest, size_t destLen, const char* src, size_t srcLen)
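To illustrate the lead-byte classification and the RFC 3629 checks named in the list above, here is a minimal portable sketch. The names `sketch_num_bytes` and `sketch_validate` are hypothetical and this is not the PR's implementation; in particular, returning 0 for a byte that cannot start a sequence is an assumption of the sketch.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for Utf8_Num_Bytes: sequence length implied by a lead byte.
// Assumption: returns 0 for bytes that cannot start a sequence.
inline std::size_t sketch_num_bytes(char lead)
{
	const unsigned char b = static_cast<unsigned char>(lead);
	if ((b & 0x80) == 0x00)
		return 1;
	if ((b & 0xE0) == 0xC0)
		return 2;
	if ((b & 0xF0) == 0xE0)
		return 3;
	if ((b & 0xF8) == 0xF0)
		return 4;
	return 0;
}

// Hypothetical stand-in for Utf8_Validate(str, length).
inline bool sketch_validate(const char* str, std::size_t length)
{
	// Smallest code point that legitimately needs n bytes (index = n).
	static const std::uint32_t minForLen[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };

	std::size_t i = 0;
	while (i < length)
	{
		const std::size_t n = sketch_num_bytes(str[i]);
		if (n == 0 || n > length - i)
			return false; // invalid lead byte or truncated sequence

		const unsigned char lead = static_cast<unsigned char>(str[i]);
		std::uint32_t cp = (n == 1) ? lead : (lead & (0xFFu >> (n + 1)));
		for (std::size_t k = 1; k < n; ++k)
		{
			const unsigned char c = static_cast<unsigned char>(str[i + k]);
			if ((c & 0xC0) != 0x80)
				return false; // not a continuation byte
			cp = (cp << 6) | (c & 0x3Fu);
		}

		if (cp < minForLen[n])
			return false; // overlong encoding
		if (cp > 0x10FFFF)
			return false; // beyond the Unicode range
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return false; // UTF-16 surrogate, not a valid scalar value
		i += n;
	}
	return true;
}
```

The table of minimum code points is what rejects overlong forms such as `C0 80`, which decodes to U+0000 but uses two bytes.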

Naming follows the Snake_Case convention used in WWVegas. The conversion functions return the number of units required: if the return is <= destLen the conversion was written (with a null terminator if room remains); if > destLen the buffer was too small and the return value tells the caller how much to allocate; 0 indicates a conversion failure. Implementation is Windows-only and treats wchar_t as UTF-16LE, wrapping Win32 WideCharToMultiByte / MultiByteToWideChar.
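The return-value contract described above can be sketched portably as follows. This is an illustration only: `sketch_utf16_to_utf8` and its helpers are hypothetical names, the sketch uses `char16_t` instead of `wchar_t` and does not call Win32, and it assumes that nothing is written when the buffer is too small and that an unpaired surrogate counts as a conversion failure.

```cpp
#include <cstddef>
#include <cstdint>

// Encodes one code point as UTF-8. Writes only when out is non-null; returns the byte count.
inline std::size_t sketch_encode(std::uint32_t cp, char* out)
{
	static const unsigned char leadBits[5] = { 0, 0, 0xC0, 0xE0, 0xF0 };
	const std::size_t n = (cp < 0x80) ? 1 : (cp < 0x800) ? 2 : (cp < 0x10000) ? 3 : 4;
	if (out == nullptr)
		return n;
	if (n == 1)
	{
		out[0] = static_cast<char>(cp);
		return n;
	}
	for (std::size_t k = n - 1; k > 0; --k)
	{
		out[k] = static_cast<char>(0x80 | (cp & 0x3F));
		cp >>= 6;
	}
	out[0] = static_cast<char>(leadBits[n] | cp);
	return n;
}

// Decodes one code point starting at src[i] and advances i. Returns false on an unpaired surrogate.
inline bool sketch_decode(const char16_t* src, std::size_t srcLen, std::size_t& i, std::uint32_t& cp)
{
	const std::uint32_t u = src[i++];
	if (u < 0xD800 || u > 0xDFFF)
	{
		cp = u;
		return true;
	}
	if (u > 0xDBFF || i >= srcLen)
		return false; // lone low surrogate, or high surrogate at end of input
	const std::uint32_t lo = src[i];
	if (lo < 0xDC00 || lo > 0xDFFF)
		return false;
	++i;
	cp = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
	return true;
}

// Returns the number of bytes required (0 on failure). Writes only if that fits in destLen,
// and appends a null terminator only if one more byte of room remains.
inline std::size_t sketch_utf16_to_utf8(char* dest, std::size_t destLen, const char16_t* src, std::size_t srcLen)
{
	std::size_t required = 0;
	for (std::size_t i = 0; i < srcLen; )
	{
		std::uint32_t cp = 0;
		if (!sketch_decode(src, srcLen, i, cp))
			return 0;
		required += sketch_encode(cp, nullptr);
	}
	if (required > destLen)
		return required; // too small: caller learns how much to allocate

	std::size_t pos = 0;
	for (std::size_t i = 0; i < srcLen; )
	{
		std::uint32_t cp = 0;
		sketch_decode(src, srcLen, i, cp); // already validated in the first pass
		pos += sketch_encode(cp, dest + pos);
	}
	if (pos < destLen)
		dest[pos] = '\0';
	return required;
}
```

A caller with a fixed buffer compares the return value against the buffer size: a result of exactly `destLen` means the text was written without a terminator, and a smaller nonzero result means the terminator is present. An empty input also yields 0 under this contract, so callers may want to special-case `srcLen == 0` before treating 0 as an error.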

AsciiString::translate / UnicodeString::translate

Replaces the broken implementations that only worked for 7-bit ASCII (marked @todo since the original code) with proper UTF-8 conversion using the new WWLib functions.

ThreadUtils.cpp

Replaces raw Win32 API calls in MultiByteToWideCharSingleLine and WideCharStringToMultiByte with the new WWLib functions, using std::string::resize / std::wstring::resize to avoid duplicate allocation.
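The measure-then-resize pattern mentioned above might look roughly like this. It is a sketch under assumptions: `sketch_to_utf8_string`, `SketchConvertFn`, and the toy converter `sketch_ascii_only` are hypothetical, and the sketch measures by calling the converter with a zero-length destination rather than a separate length function.

```cpp
#include <cstddef>
#include <string>

// Signature shaped like the conversion functions in the PR description: (dest, destLen, src, srcLen).
typedef std::size_t (*SketchConvertFn)(char* dest, std::size_t destLen, const wchar_t* src, std::size_t srcLen);

// Hypothetical helper: measure, resize the std::string once, then convert straight into its storage.
inline std::string sketch_to_utf8_string(const std::wstring& src, SketchConvertFn convert)
{
	std::string out;
	const std::size_t required = convert(nullptr, 0, src.data(), src.size());
	if (required == 0)
		return out; // empty input or conversion failure

	out.resize(required);
	const std::size_t written = convert(&out[0], out.size(), src.data(), src.size());
	if (written != required)
		out.clear();
	return out;
}

// Toy converter used only to exercise the helper: accepts 7-bit ASCII and fails on anything else.
inline std::size_t sketch_ascii_only(char* dest, std::size_t destLen, const wchar_t* src, std::size_t srcLen)
{
	for (std::size_t i = 0; i < srcLen; ++i)
	{
		if (src[i] < 0 || src[i] > 0x7F)
			return 0;
	}
	if (srcLen > destLen)
		return srcLen; // too small: report the required size without writing
	for (std::size_t i = 0; i < srcLen; ++i)
	{
		dest[i] = static_cast<char>(src[i]);
	}
	if (srcLen < destLen)
		dest[srcLen] = '\0';
	return srcLen;
}
```

Because the string is resized to exactly the required length, the converter writes no terminator of its own; `std::string` already maintains one past the end, so no temporary buffer or second copy is needed.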

greptile-apps Bot commented Apr 3, 2026

Greptile Summary

This PR introduces a new WWLib/utf8.h / utf8.cpp module implementing RFC 3629-compliant UTF-8 validation and UTF-16LE ↔ UTF-8 conversion via Win32 APIs, then wires it through AsciiString::translate, UnicodeString::translate, and the GameSpy ThreadUtils helpers, replacing the long-standing 7-bit-ASCII-only implementations. It also contains a latent bug fix — moving the null-terminator assignment in ensureUniqueBufferOfSize outside the if (strToCopy) guard in both string classes.

Confidence Score: 5/5

Safe to merge; all prior P0/P1 validator concerns have been resolved in this iteration.

Overlong-encoding rejection, surrogate rejection, and U+10FFFF capping are all present in the updated validator. The only remaining finding is a P2 style violation in Utf8_Num_Bytes (same-line if bodies), which does not block correctness.

Core/Libraries/Source/WWVegas/WWLib/utf8.cpp — minor style issue in Utf8_Num_Bytes

Important Files Changed

| Filename | Overview |
| --- | --- |
| Core/Libraries/Source/WWVegas/WWLib/utf8.cpp | New UTF-8 implementation; validator now covers overlong encodings, surrogates, and U+10FFFF cap; Utf8_Num_Bytes uses same-line if bodies (style rule violation). |
| Core/Libraries/Source/WWVegas/WWLib/utf8.h | New header with clear API documentation; uses #pragma once, correct TheSuperHackers copyright, and accurate contract comments. |
| Core/GameEngine/Source/Common/System/AsciiString.cpp | Replaces 7-bit ASCII loop with proper UTF-8 conversion; also fixes null-terminator placement outside the if (strToCopy) guard in ensureUniqueBufferOfSize. |
| Core/GameEngine/Source/Common/System/UnicodeString.cpp | Mirrors AsciiString changes: proper UTF-16LE ↔ UTF-8 conversion in translate, same null-terminator guard fix. |
| Core/GameEngine/Source/GameNetwork/GameSpy/Thread/ThreadUtils.cpp | Replaces raw Win32 API calls with WWLib wrappers; return values of the conversion calls are discarded (pre-existing concern from prior review, addressed per developer). |
| Core/Libraries/Source/WWVegas/WWLib/CMakeLists.txt | Adds utf8.cpp/utf8.h to WWLIB_SRC; placement outside the WIN32 block is intentional per maintainer decision. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant AsciiString
    participant UnicodeString
    participant utf8
    participant Win32

    Caller->>AsciiString: translate(UnicodeString)
    AsciiString->>utf8: Utf16Le_To_Utf8_Len(src, srcLen)
    utf8->>Win32: WideCharToMultiByte(CP_UTF8, query)
    Win32-->>utf8: byte count
    utf8-->>AsciiString: len
    AsciiString->>AsciiString: ensureUniqueBufferOfSize(len+1)
    AsciiString->>utf8: Utf16Le_To_Utf8(buf, len+1, src, srcLen)
    utf8->>Win32: WideCharToMultiByte(CP_UTF8, convert)
    Win32-->>utf8: bytes written
    utf8-->>AsciiString: written (0 = failure)
    AsciiString-->>Caller: done

    Caller->>UnicodeString: translate(AsciiString)
    UnicodeString->>utf8: Utf8_To_Utf16Le_Len(src, srcLen)
    utf8->>Win32: MultiByteToWideChar(CP_UTF8, query)
    Win32-->>utf8: wchar count
    utf8-->>UnicodeString: len
    UnicodeString->>UnicodeString: ensureUniqueBufferOfSize(len+1)
    UnicodeString->>utf8: Utf8_To_Utf16Le(buf, len+1, src, srcLen)
    utf8->>Win32: MultiByteToWideChar(CP_UTF8, convert)
    Win32-->>utf8: wchars written
    utf8-->>UnicodeString: written (0 = failure)
    UnicodeString-->>Caller: done
Prompt To Fix All With AI
This is a comment left during a code review.
Path: Core/Libraries/Source/WWVegas/WWLib/utf8.cpp
Line: 34-37

Comment:
**Same-line `if` bodies in `Utf8_Num_Bytes`**

All four branches place the return statement on the same line as the condition, which violates the team's debugger breakpoint rule — you can't break on the return without also stopping at the condition.

```suggestion
	if ((lead & 0x80) == 0x00)
		return 1;
	if ((lead & 0xE0) == 0xC0)
		return 2;
	if ((lead & 0xF0) == 0xE0)
		return 3;
	if ((lead & 0xF8) == 0xF0)
		return 4;
```

**Rule Used:** Always place if/else/for/while statement bodies on... ([source](https://app.greptile.com/review/custom-context?memory=16b9b669-b823-49be-ba5b-2bd30ff3ba6d))

**Learned From**
[TheSuperHackers/GeneralsGameCode#2067](https://github.com/TheSuperHackers/GeneralsGameCode/pull/2067#discussion_r2706274626)

How can I resolve this? If you propose a fix, please make it concise.

Reviews (13): Last reviewed commit: "style(utf8): Use >= 0 in length return t..."

bobtista commented Apr 3, 2026

  • Fixed the if formatting.
  • Added RFC 3629 overlong and out-of-range checks.
  • Re: the theoretical memory leak, can that even happen here? set() allocates via the engine's custom memory allocator, which crashes on failure rather than throwing, so the leak path can't really be reached, right?

DEBUG_LOG(("ParseAsciiStringToGameInfo - slotValue name is empty, quitting"));
break;
}
// TheSuperHackers @fix bobtista 02/04/2026 Validate UTF-8 encoding before processing player name

This appears to be beyond the scope of this change; it is not described in the title. Perhaps it should be a separate change?

bobtista (Author):

Ok reverted

xezon added the Enhancement and Minor labels Apr 3, 2026

xezon left a comment

Get_Utf8_Size should not include the null terminator in its size.

if (dest_size == 0)
return;
return false;
int result = MultiByteToWideChar(CP_UTF8, 0, src, -1, dest, (int)dest_size);

What happens if dest_size does not have enough room for a null terminator?

bobtista (Author):

The doc says "Does not write a null terminator" - should we add more comments? Change the functions to always null-terminate? What do you want here?

Maybe make it behave like strncpy? Writes null if there is room, otherwise not.

The only issue with the current interface then is that we will not know if it wrote the null terminator. Maybe it should return size_t instead, returning the number of characters it writes or would like to write? MultiByteToWideChar also does that.

I suggest thinking this through and designing the function interface so that it can conveniently be used for fixed size strings (std::string, AsciiString) and large throwaway buffers (char arr[512]).

The behavior definitely needs to be documented.

xezon commented Apr 6, 2026

The diff now shows unrelated changes.

bobtista force-pushed the bobtista/feat/utf8-string-functions branch from 39d7229 to 40393b8 April 6, 2026 20:02

bobtista commented Apr 6, 2026

> The diff now shows unrelated changes.

Try again: I cleaned up the commits and force-pushed.

}
ensureUniqueBufferOfSize((Int)size + 1, false, nullptr, nullptr);
char* buf = peek();
if (!Unicode_To_Utf8(buf, src, srcLen, size))

Mauller commented Apr 7, 2026

So is this translating UTF-16LE from Windows into UTF-8 that is then stored within AsciiString?

If so, this may help with the issue of usernames and paths that do not use Latin characters, but file handling functions will need updating to use the Unicode variants instead of the ASCII ones.

Mauller commented Apr 7, 2026

I wonder if we should also add a flag to state that the Ascii string is holding a UTF8 string?

I guess all normal ascii characters will display properly, it's just extended character sets that will look garbled.

OmniBlade commented

Slight nitpick on function naming: shouldn't it be Utf16 rather than Wchar? Wchar is only UTF-16 on Windows, yet we will likely want the conversion functions on other platforms as well, at least for CSF parsing if nothing else.

bobtista commented Apr 8, 2026

> Slight nitpick on function naming, shouldn't it be Utf16 rather than Wchar as Wchar is only Utf16 on windows yet we will likely want the conversion functions on other platforms as well for at least csf parsing if nothing else.

Yeah, but it's consistent with the other naming as it is for now. How about we keep the naming and use a uint16_t/char16_t type internally rather than wchar_t when we add non-Windows paths? Or would you rather we rename to something like Utf16_To_Utf8?

Mauller commented Apr 13, 2026

> > Slight nitpick on function naming, shouldn't it be Utf16 rather than Wchar as Wchar is only Utf16 on windows yet we will likely want the conversion functions on other platforms as well for at least csf parsing if nothing else.
>
> Yeah, but it's consistent with the other naming as it is for now. How about we keep the naming and using a uint16_t/char16_t type internally rather than wchar_t when we make non windows paths? Or would you rather we rename to something like Utf16_To_Utf8?

If anything it would be Utf16Le_To_Utf8, since Windows uses the little-endian UTF-16 format. Not sure if Utf16Be is used much anywhere, but it's worth being precise about it.

return required;
}
}
WWASSERT(destLen >= Utf16Le_To_Utf8_Len(src, srcLen));

Can we perhaps assert on written instead of making another call to WideCharToMultiByte?

@@ -109,24 +112,9 @@ size_t Utf8_To_Utf16Le_Len(const char* src, size_t srcLen)
return (wchars > 0) ? (size_t)wchars : 0;

Maybe do >=, so the branch predictor is 100% correct.

Or max(0, wchars)

@@ -109,24 +112,9 @@ size_t Utf8_To_Utf16Le_Len(const char* src, size_t srcLen)

Minor: const

}
WWASSERT(destLen >= Utf16Le_To_Utf8_Len(src, srcLen));
const int written = WideCharToMultiByte(CP_UTF8, 0, src, (int)srcLen, dest, (int)destLen, nullptr, nullptr);
if (written <= 0)

Is this now a contradiction to the assert? Would this only be true if the assert was failing?

bobtista (Author):

Yeah this branch is dead code when the assert holds, but WWASSERT compiles out in release, so do we keep it?

You can replace this entire branch and substitute it with written = max(0, written)
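The clamp suggested here could be sketched as a small helper that maps a Win32-style `int` result (zero or negative on failure) onto a `size_t` where 0 means failure. The name `sketch_clamp_result` is hypothetical and not part of the PR.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper: clamp negative results to 0 before widening to size_t,
// so a failure never turns into a huge unsigned value.
inline std::size_t sketch_clamp_result(int written)
{
	return static_cast<std::size_t>(std::max(0, written));
}
```

One practical caveat: if `<windows.h>` is included without `NOMINMAX`, its `max` macro collides with `std::max`, so the call may need to be written as `(std::max)(0, written)` or use the codebase's own helper.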

githubawn commented

Could the unconditional replacement of translate() silently break callers passing legacy CP1252 data, causing strings that are valid CP1252 but invalid UTF-8 to be corrupted or cleared?

Maybe something like:

  • Check if the source is valid UTF-8 via Utf8_Validate.
  • If valid: Proceed with UTF-8 conversion.
  • If invalid: Fall back to the legacy 1:1 byte-to-wide cast (treating it as CP1252).
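The fallback proposed above can be sketched portably as follows. This is an illustration, not code from the PR: `sketch_decode_utf8` and `sketch_translate` are hypothetical names, `std::u16string` stands in for the engine's wide string type, and the strict decoder doubles as the validity check instead of calling Utf8_Validate separately.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Strict UTF-8 decode into UTF-16 code units. Returns false on any malformed input.
inline bool sketch_decode_utf8(const std::string& in, std::u16string& out)
{
	static const std::uint32_t minForLen[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
	out.clear();
	std::size_t i = 0;
	while (i < in.size())
	{
		const unsigned char lead = static_cast<unsigned char>(in[i]);
		const std::size_t n = (lead < 0x80) ? 1
			: ((lead & 0xE0) == 0xC0) ? 2
			: ((lead & 0xF0) == 0xE0) ? 3
			: ((lead & 0xF8) == 0xF0) ? 4 : 0;
		if (n == 0 || n > in.size() - i)
			return false; // invalid lead byte or truncated sequence

		std::uint32_t cp = (n == 1) ? lead : (lead & (0xFFu >> (n + 1)));
		for (std::size_t k = 1; k < n; ++k)
		{
			const unsigned char c = static_cast<unsigned char>(in[i + k]);
			if ((c & 0xC0) != 0x80)
				return false;
			cp = (cp << 6) | (c & 0x3Fu);
		}
		if (cp < minForLen[n] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
			return false; // overlong, out of range, or surrogate

		if (cp < 0x10000)
		{
			out.push_back(static_cast<char16_t>(cp));
		}
		else
		{
			cp -= 0x10000;
			out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
			out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
		}
		i += n;
	}
	return true;
}

// Valid UTF-8 is decoded; anything else falls back to the legacy 1:1 byte-to-wide cast.
inline std::u16string sketch_translate(const std::string& src)
{
	std::u16string out;
	if (sketch_decode_utf8(src, out))
		return out;

	out.clear();
	for (std::size_t i = 0; i < src.size(); ++i)
	{
		out.push_back(static_cast<char16_t>(static_cast<unsigned char>(src[i])));
	}
	return out;
}
```

Two limits of this heuristic are worth keeping in mind. A 1:1 cast matches Latin-1 rather than CP1252 for bytes 0x80 to 0x9F (for example, CP1252 0x80 is the euro sign U+20AC), and a CP1252 string whose bytes happen to form valid UTF-8 would be decoded as UTF-8.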

xezon commented Apr 23, 2026

Maybe put a breakpoint and check usage patterns.


Labels: Enhancement, Minor

5 participants