Investigate: JNI type name encoding — Modified UTF-8 vs standard UTF-8 across the codebase #11026

@simonrozsival

Summary

The JNI specification uses Modified UTF-8 (MUTF-8), not standard UTF-8, for class names, method names, and field names. Our codebase consistently treats these names as plain ASCII — and in practice, they always are. This issue documents the full analysis for awareness and to inform future work (e.g. #10795 trimmable type maps).

Practical impact: zero. A search across dotnet/android, dotnet/java-interop, dotnet/maui, and dotnet/runtime found no bug reports related to MUTF-8 encoding. The ASCII-only assumption has held across 6+ years and millions of apps.


Background: Standard UTF-8 vs Modified UTF-8

The JNI spec describes two differences from standard UTF-8:

| Situation | Standard UTF-8 | Modified UTF-8 |
|---|---|---|
| NUL character (U+0000) | 0x00 (1 byte) | 0xC0 0x80 (2 bytes) |
| Supplementary (non-BMP) characters (U+10000+) | 0xF0... (4 bytes) | Two 3-byte sequences 0xED 0xA... + 0xED 0xB... (surrogate pair, 6 bytes) |

For class names the NUL case is irrelevant. The surrogate pair / non-BMP case is the only theoretical risk: if a class name contained an emoji or a CJK Extension B+ character, its MUTF-8 form would be a 6-byte surrogate pair, which Encoding.UTF8 rejects as invalid UTF-8 and decodes as replacement characters (U+FFFD). In the JVM test described later in this issue, that was six replacement characters, one per byte.
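A minimal C++ sketch of the two rules in the table above, for illustration only. The function name `encode_mutf8` is hypothetical and does not exist in any of the codebases discussed here. It takes UTF-16 code units as input, since that is how Java strings are represented.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: encode UTF-16 code units as Modified UTF-8.
// Rule 1: U+0000 is written as C0 80 instead of a single 00 byte.
// Rule 2: each surrogate is encoded on its own as a 3-byte sequence,
//         so a non-BMP character (a surrogate pair) takes 6 bytes.
std::string encode_mutf8(std::u16string_view utf16) {
    std::string out;
    for (char16_t unit : utf16) {
        const std::uint32_t c = unit;
        if (c != 0 && c < 0x80) {
            // 1 byte: ASCII except NUL
            out.push_back(static_cast<char>(c));
        } else if (c < 0x800) {
            // 2 bytes: U+0080..U+07FF, plus NUL as C0 80
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {
            // 3 bytes: U+0800..U+FFFF, including lone surrogates
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

With this sketch, U+1F600 (UTF-16 D83D DE00) encodes to ED A0 BD ED B8 80, and an embedded U+0000 encodes to C0 80. Pure ASCII input comes out unchanged.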

JNI encoding by API — with citations

| JNI API | Encoding | Source |
|---|---|---|
| NewString / GetStringChars / GetStringRegion | UTF-16 (jchar*) | JNI spec - String Operations |
| NewStringUTF / GetStringUTFChars / GetStringUTFRegion | Modified UTF-8 (char*) | JNI spec - String Operations |
| FindClass name arg | Modified UTF-8 | JNI spec - FindClass |
| GetMethodID / GetFieldID name+sig args | Modified UTF-8 | JNI spec - GetMethodID |
| RegisterNatives name+sig fields | Modified UTF-8 | JNI spec - RegisterNatives |
| Java String heap representation | UTF-16 (compact 8-bit for ASCII since Android 8) | Android JNI Tips |

Key quote from Android JNI Tips:

"The Java programming language uses UTF-16. For convenience, JNI also provides methods that work with Modified UTF-8... Data passed to NewStringUTF must be in Modified UTF-8 format. ... CheckJNI — enabled by default for emulators — scans strings and aborts the VM if it receives invalid input."

What's safe: normal string marshalling

JniEnvironment.Strings in dotnet/java-interop uses UTF-16 JNI APIs exclusively (NewString/GetStringChars) for all normal Java-to-C# string marshalling — method arguments, return values, field reads/writes. This is completely immune to MUTF-8 issues.

Encoding inconsistencies in the codebase

These are not bugs in practice (all real-world class names are ASCII), but are worth documenting for awareness.

1. TypeManager.GetClassName — MUTF-8 decoded as Latin-1

The native function get_java_class_name_for_TypeManager calls GetStringUTFChars (returns MUTF-8), strdups the result, replaces . with /, and returns a char*. The native code is aware this is MUTF-8 — the local variable is even named mutf8:

const char *mutf8 = env->GetStringUTFChars(name, nullptr);
char *ret = strdup(mutf8);
// ... replace '.' with '/' ...
return ret;

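The managed caller decodes the returned bytes with Marshal.PtrToStringAnsi, which this analysis treats as a Latin-1 (ISO-8859-1) decode. The "ANSI" encoding is platform-dependent, though: .NET documents it as the system code page on Windows and UTF-8 on Unix-like platforms, so the exact decode on Android is worth confirming before relying on the Latin-1 characterization: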

IntPtr ptr = RuntimeNativeMethods.monodroid_TypeManager_get_java_class_name(class_ptr);
return Marshal.PtrToStringAnsi(ptr);

For ASCII, Latin-1/UTF-8/MUTF-8 are byte-identical, so this works. For non-ASCII it would produce mojibake, but this path is only used for fallback type lookup/error logging.
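To show why a Latin-1 decode is harmless for ASCII and wrong for anything else, here is a small C++ sketch. It assumes the Latin-1 interpretation described above. `decode_latin1` is a hypothetical name used only for this illustration.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: Latin-1 (ISO-8859-1) decoding maps every byte to the
// code point with the same numeric value. It can never fail, but it reads
// each byte of a multi-byte UTF-8/MUTF-8 sequence as a separate character.
std::u16string decode_latin1(std::string_view bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (char b : bytes) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(b)));
    }
    return out;
}
```

For the bytes of "com/example/Foo" the result is the same text. For C3 A9 (the UTF-8 and MUTF-8 encoding of U+00E9, "é") the result is the two characters U+00C3 U+00A9, i.e. the mojibake "Ã©".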

History: the original 2016 implementation (initial import) used the UTF-16 JNI path and handled all Unicode correctly:

return JNIEnv.GetString(
    JNIEnv.CallObjectMethod(class_ptr, JNIEnv.mid_Class_getName),
    JniHandleOwnership.TransferLocalRef).Replace(".", "/");

This was replaced in PR #3729 (Oct 2019, "JNIEnv.Initialize optimization") to save ~30ms on startup by moving the work to native code. PtrToStringAnsi was the natural P/Invoke idiom for decoding a returned char*; encoding was not discussed in the PR.

2. FindClass(string) in java-interop — standard UTF-8 sent to a MUTF-8 API

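JniEnvironment.Types.TryRawFindClass uses Marshal.StringToCoTaskMemUTF8 to encode the class name before passing it to FindClass. This produces standard UTF-8, which for class names differs from MUTF-8 only when the name contains non-BMP characters: standard UTF-8 emits one 4-byte sequence where MUTF-8 expects two 3-byte surrogate sequences. (The other difference, the encoding of U+0000, does not arise here because a NUL byte would simply terminate the C string.)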

The ReadOnlySpan<byte> overload (FindClass(ReadOnlySpan<byte>) using u8 literals) bypasses this entirely and is the preferred path.
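For contrast with MUTF-8, the following C++ sketch encodes UTF-16 as standard UTF-8, which is the kind of output a standard UTF-8 encoder produces. It is not the .NET implementation. `encode_utf8` is a hypothetical name, and substituting U+FFFD for an unpaired surrogate is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: encode UTF-16 as standard UTF-8. Unlike MUTF-8, a
// surrogate pair is first combined into one code point and then written as
// a single 4-byte sequence.
std::string encode_utf8(std::u16string_view utf16) {
    std::string out;
    for (std::size_t i = 0; i < utf16.size(); ++i) {
        std::uint32_t cp = utf16[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < utf16.size() &&
            utf16[i + 1] >= 0xDC00 && utf16[i + 1] <= 0xDFFF) {
            // Combine high + low surrogate into a supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (utf16[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate: sketch-specific substitution
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

For U+1F600 this yields F0 9F 98 80, four bytes that a MUTF-8 consumer such as FindClass is not specified to accept. For ASCII and other BMP names the output is the same as MUTF-8.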

3. ConstantPool.cs — already correct

Xamarin.Android.Tools.Bytecode/ConstantPool.cs in dotnet/java-interop already implements a correct MUTF-8 fixup pass before calling Encoding.UTF8.GetString, handling both 0xC0 0x80 NUL and surrogate-pair supplementary characters. This is the reference implementation if a proper MUTF-8 decoder is ever needed elsewhere.
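For readers who want to see what a MUTF-8 decoder involves, here is a rough C++ sketch. It is not the ConstantPool.cs code and does not follow its fixup-then-decode structure; it decodes directly to UTF-16. `decode_mutf8` is a hypothetical name, and the sketch is lenient in that it does not reject overlong encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: decode Modified UTF-8 into UTF-16 code units.
// Every 1-, 2- or 3-byte sequence yields exactly one code unit, so C0 80
// becomes U+0000 and a 6-byte surrogate pair becomes two surrogates.
// Returns std::nullopt on input that is not MUTF-8 (e.g. 4-byte forms).
std::optional<std::u16string> decode_mutf8(std::string_view bytes) {
    std::u16string out;
    // True when the byte at `pos` exists and has the form 10xxxxxx.
    auto is_cont = [&](std::size_t pos) {
        return pos < bytes.size() &&
               (static_cast<unsigned char>(bytes[pos]) & 0xC0) == 0x80;
    };
    auto at = [&](std::size_t pos) {
        return static_cast<unsigned>(static_cast<unsigned char>(bytes[pos]));
    };
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned b0 = at(i);
        if (b0 < 0x80) {
            out.push_back(static_cast<char16_t>(b0));
            i += 1;
        } else if ((b0 & 0xE0) == 0xC0 && is_cont(i + 1)) {
            out.push_back(static_cast<char16_t>(
                ((b0 & 0x1F) << 6) | (at(i + 1) & 0x3F)));
            i += 2;
        } else if ((b0 & 0xF0) == 0xE0 && is_cont(i + 1) && is_cont(i + 2)) {
            out.push_back(static_cast<char16_t>(
                ((b0 & 0x0F) << 12) | ((at(i + 1) & 0x3F) << 6) |
                (at(i + 2) & 0x3F)));
            i += 3;
        } else {
            return std::nullopt;  // 4-byte lead, stray continuation, truncation
        }
    }
    return out;
}
```

Because the output is UTF-16, the surrogate pair needs no special handling: ED A0 BD ED B8 80 decodes to D83D DE00, which is U+1F600 to any UTF-16 consumer.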

Risk summary

| Path | Risk | Notes |
|---|---|---|
| Normal string marshalling (JniEnvironment.Strings) | None | Uses UTF-16 JNI APIs |
| Typemap keys from [Register("...")] attributes | None | Compile-time ASCII C# string literals |
| FindClass(string) via Marshal.StringToCoTaskMemUTF8 | Theoretical | Differs from MUTF-8 only for non-BMP class names |
| TypeManager.GetClassName via PtrToStringAnsi | Theoretical | Latin-1 decode of MUTF-8; fallback/error path only |
| ConstantPool.cs bytecode parser | None | Already implements correct MUTF-8 fixup |

Real-world precedent: Android 12 MUTF-8 enforcement

Android 12 (API 31) added strict MUTF-8 validation to NewStringUTF. Invalid input causes a hard SIGABRT:

JNI DETECTED ERROR IN APPLICATION: input is not valid Modified UTF-8

This was triggered in the wild by facebook/react-native#34363 / facebook/flipper#3175, where an app name with diacritics (Romanian characters) was passed to NewStringUTF after incorrect percent-encoding produced invalid MUTF-8 byte sequences. 53+ GitHub issues across different projects match this error pattern.

This is not directly applicable to dotnet/android (we don't call NewStringUTF with user-provided strings), but illustrates that MUTF-8 issues can be latent for years and surface only when Android tightens enforcement.

Scenarios that could theoretically trigger issues

  1. Non-ASCII BMP class names (e.g. CJK, accented Latin, Cyrillic): work fine today. Verified against a real JVM (OpenJDK 21): MUTF-8 and standard UTF-8 encode every BMP character except U+0000 identically (U+0001–U+FFFF, leaving aside the surrogate code units), and Encoding.UTF8 decodes them correctly. This covers virtually all scripts in everyday use, including the ~27,000 common CJK ideographs and the Latin, Cyrillic, Greek, and Arabic scripts.
  2. Non-BMP class names (U+10000+: emoji, rare CJK extensions, historic scripts): would break. Verified against a real JVM: GetStringUTFChars returns 6-byte MUTF-8 surrogate pairs (e.g. ED-A0-BD-ED-B8-80 for 😀), which Encoding.UTF8 decodes as 6 replacement characters (�). Essentially non-existent in real class names.
  3. ProGuard/R8 with Unicode obfuscation dictionaries: advanced obfuscators like dProtect can rename classes to arbitrary Unicode strings. If a bound AAR uses such obfuscation, it could produce non-ASCII JNI names. BMP obfuscation would work fine; non-BMP would break.
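The three scenarios above reduce to a simple check on the UTF-16 form of a name. The C++ sketch below is only an illustration of that classification; `NameClass` and `classify_jni_name` are hypothetical names, not existing APIs.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical classification matching the scenarios above.
enum class NameClass {
    Ascii,   // every encoding discussed here agrees byte for byte
    Bmp,     // UTF-8 and MUTF-8 agree; a Latin-1 decode would not
    NonBmp,  // UTF-8 (4 bytes) and MUTF-8 (6 bytes) disagree
};

NameClass classify_jni_name(std::u16string_view name) {
    NameClass result = NameClass::Ascii;
    for (char16_t unit : name) {
        // Any surrogate code unit means the name has a non-BMP character
        // (or is malformed UTF-16); either way it is outside the safe set.
        if (unit >= 0xD800 && unit <= 0xDFFF) {
            return NameClass::NonBmp;
        }
        if (unit >= 0x80) {
            result = NameClass::Bmp;
        }
    }
    return result;
}
```

A check like this could be used in a test or diagnostic to confirm that the names flowing through a given path really are ASCII, rather than assuming it.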

Conclusion

The ASCII-only assumption is deeply embedded and has been validated by years of production use with zero bug reports. Future work touching type name lookup paths (e.g. #10795) should simply maintain this same assumption and document it. No fix is needed at this time.

Open questions and follow-up

Connection to the trimmable type map (#10795)

The trimmable type map (NativeHashtable) stores JNI class names as UTF-16 characters in a native blob. At runtime, the type map lookup API accepts a string key.

Important: the trimmable type map path goes through java-interop's JniRuntime.JniTypeManager, which resolves class names via GetJniTypeNameFromClass. This calls Class.getName() and decodes the result using GetStringChars (UTF-16, not MUTF-8) into new string(char*, 0, len). So MUTF-8 is not involved in the trimmable type map lookup path at all — the class name arrives as a proper .NET string via the UTF-16 JNI API.

The MUTF-8 / GetStringUTFChars path only exists in the legacy TypeManager.GetClassName native helper (see the encoding inconsistencies section above).

The current flow for the trimmable type map is:

jclass -> Class.getName() via JNI
       -> GetStringChars (UTF-16 jchar*)
       -> new string(char*, 0, len)     // heap allocation
       -> .Replace('.', '/')
       -> GetTypesForSimpleReference(string)
       -> NativeHashtable lookup

The idea in #10795 is that the lookup table could also accept ReadOnlySpan<char> instead of just string. Since the class name is already available as UTF-16 chars from GetStringChars, we could copy them into a stackalloc buffer (with the . -> / replacement) instead of creating a heap-allocated string. For ASCII inputs (which is all real-world cases), this is a trivial and fast operation.

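Even more aggressively, it may be possible to perform the lookup directly against the pointer returned by GetStringChars, viewed as a ReadOnlySpan<char>. Several constraints would need to be considered. The . to / replacement requires either a writable copy or a lookup that treats the two separators as equal. The pointer is only valid until ReleaseStringChars is called. GetStringChars is also allowed to return a copy rather than a direct pointer to the JVM's internal character data (the isCopy out-parameter reports which), and the stricter JNI critical-section rules apply only if GetStringCritical is used instead.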

The TypeManager.GetClassName history provides additional confidence in the ASCII-only assumption: it has used PtrToStringAnsi (Latin-1, equivalent to ASCII widening) since 2019 with zero issues.

The benchmark data posted on #10795 shows a span-based lookup path is ~30% faster with zero heap allocation compared to the string-allocating path.

| Strategy | Key type | Source | Allocation | Notes |
|---|---|---|---|---|
| Current | string | GetStringChars -> new string(char*) | 56-112 B/lookup | Heap-allocated string |
| Span from JNI | ReadOnlySpan<char> | GetStringChars -> stackalloc copy (with . -> / fixup) | 0 B | ~30% faster; requires TryGetValue(ReadOnlySpan<char>) on the hashtable |
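The "look up by a non-owning view instead of an owned string" idea can be shown in C++ as an analogy. This is not the NativeHashtable API and not C#: `TypeMap` and `try_get_type` are hypothetical names, `std::map` stands in for the real hashtable, and the integer value is a made-up type index.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical stand-in for the type map: JNI name -> type index.
// std::less<> is a transparent comparator, which enables heterogeneous
// lookup: find() accepts a std::u16string_view directly, so no
// std::u16string key has to be constructed for the query.
using TypeMap = std::map<std::u16string, int, std::less<>>;

std::optional<int> try_get_type(const TypeMap& map,
                                std::u16string_view jni_name) {
    const auto it = map.find(jni_name);
    if (it == map.end()) {
        return std::nullopt;
    }
    return it->second;
}
```

The view passed in can point at a stack buffer or at memory owned by someone else, which is the property the ReadOnlySpan<char> overload would give the managed lookup.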

Staying on UTF-16 end-to-end

This approach has a significant advantage beyond performance: it sidesteps the MUTF-8 question entirely. Since GetStringChars returns UTF-16 and the native blob stores UTF-16, the entire lookup stays in UTF-16 from start to finish. No encoding conversion, no ASCII assumption needed for correctness, no fallback path for non-ASCII names. It's correct for all Unicode inputs by construction.

The only transformation needed is the . to / replacement (package separator to JNI separator; the $ for nested classes is already present in Class.getName() output and left untouched). This can be done during the stackalloc copy in a single pass — trivially vectorizable (compare against '.', blend with '/').
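The single-pass copy can be sketched in C++ as follows. This is an illustration of the technique, not the planned C# implementation; `copy_as_jni_name` is a hypothetical name, and returning false when the buffer is too small (so the caller can fall back to a heap allocation) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: copy `len` UTF-16 code units from `src` into the
// caller-provided buffer `dst`, replacing the package separator '.' with
// the JNI separator '/'. Everything else, including '$', is copied as-is.
// Returns false without writing anything if the name does not fit.
bool copy_as_jni_name(const char16_t* src, std::size_t len,
                      char16_t* dst, std::size_t dst_capacity) {
    if (len > dst_capacity) {
        return false;
    }
    // Branch-free select per element; compilers can usually vectorize this.
    for (std::size_t i = 0; i < len; ++i) {
        dst[i] = (src[i] == u'.') ? u'/' : src[i];
    }
    return true;
}
```

Given "java.util.Map$Entry" and a 64-element stack array, the buffer ends up holding "java/util/Map$Entry". No encoding conversion happens at any point, so non-ASCII and non-BMP names pass through unchanged.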

This makes the MUTF-8 analysis in this issue nicely self-contained: the MUTF-8 encoding concern is real but only affects the legacy TypeManager.GetClassName native helper path. The trimmable type map can avoid it entirely by staying on UTF-16.

Verified against a real JVM

All of the above has been tested against a desktop OpenJDK 21 JVM using java-interop's JreRuntime. Key results:

  • GetStringChars (UTF-16) round-trips all characters correctly: ASCII, CJK, emoji
  • GetStringUTFChars (MUTF-8) returns 6-byte surrogate pairs for non-BMP characters (e.g. [ED-A0-BD-ED-B8-80] for U+1F600), which Encoding.UTF8.GetString() decodes as ������
  • The zero-allocation lookup (GetStringChars → stackalloc copy with . -> / replacement → ReadOnlySpan<char> lookup) works end-to-end with the JVM and produces 0 bytes of managed allocation across 1000 lookups

Test code is in the Utf16LookupTest experiment project.

Labels: copilot (`copilot-cli` or other AIs were used to author this), trimmable-type-map
