Investigate: JNI type name encoding — Modified UTF-8 vs standard UTF-8 across the codebase

## Summary

The JNI specification uses **Modified UTF-8** (MUTF-8), not standard UTF-8, for class names, method names, and field names. Our codebase consistently treats these names as plain ASCII — and in practice, they always are. This issue documents the full analysis for awareness and to inform future work (e.g. #10795 trimmable type maps).

**Practical impact: zero.** A search across dotnet/android, dotnet/java-interop, dotnet/maui, and dotnet/runtime found no bug reports related to MUTF-8 encoding. The ASCII-only assumption has held across 6+ years and millions of apps.

---

## Background: Standard UTF-8 vs Modified UTF-8

The [JNI spec](https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/types.html) describes two differences from standard UTF-8:

| Situation | Standard UTF-8 | Modified UTF-8 |
|---|---|---|
| NUL character | `0x00` (1 byte) | `0xC0 0x80` (2 bytes) |
| Supplementary (non-BMP) characters (U+10000+) | `0xF0...` (4 bytes) | Two 3-byte sequences `0xED 0xA...` + `0xED 0xB...` (surrogate pair, 6 bytes) |

For class names the NUL case is irrelevant. The surrogate pair / non-BMP case is the only theoretical risk: if a class name contained an emoji or a CJK Extension B+ character, its MUTF-8 form would be a 6-byte encoded surrogate pair, which `Encoding.UTF8` rejects as ill-formed and decodes as six replacement characters (U+FFFD), one per byte.
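The two differences in the table can be made concrete with a small encoder. This is a minimal sketch, not code from the codebase: `encode_utf8`, `encode_mutf8`, and `append_up_to_3_bytes` are hypothetical names introduced here, and each function encodes a single Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Appends a value <= 0xFFFF using the ordinary 1-3 byte UTF-8 forms.
// MUTF-8 reuses this for each half of a surrogate pair.
static void append_up_to_3_bytes(std::string& out, unsigned v) {
    if (v <= 0x7F) {
        out.push_back(static_cast<char>(v));
    } else if (v <= 0x7FF) {
        out.push_back(static_cast<char>(0xC0 | (v >> 6)));
        out.push_back(static_cast<char>(0x80 | (v & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (v >> 12)));
        out.push_back(static_cast<char>(0x80 | ((v >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (v & 0x3F)));
    }
}

// Standard UTF-8: non-BMP scalars use a single 4-byte sequence.
std::string encode_utf8(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp <= 0xFFFF) {
        append_up_to_3_bytes(out, cp);
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Modified UTF-8: NUL becomes C0 80, and non-BMP scalars are split into a
// UTF-16 surrogate pair whose halves are each written as 3 bytes.
std::string encode_mutf8(char32_t cp) {
    assert(cp <= 0x10FFFF);
    std::string out;
    if (cp == 0) {
        out.append("\xC0\x80", 2);
    } else if (cp <= 0xFFFF) {
        append_up_to_3_bytes(out, cp);
    } else {
        unsigned v = cp - 0x10000;
        append_up_to_3_bytes(out, 0xD800 + (v >> 10));    // high surrogate
        append_up_to_3_bytes(out, 0xDC00 + (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

For U+1F600 the two functions yield `F0 9F 98 80` and `ED A0 BD ED B8 80` respectively; for any BMP character other than NUL they yield the same bytes.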

## JNI encoding by API — with citations

| JNI API | Encoding | Source |
|---|---|---|
| `NewString` / `GetStringChars` / `GetStringRegion` | **UTF-16** (`jchar*`) | [JNI spec - String Operations](https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#NewString) |
| `NewStringUTF` / `GetStringUTFChars` / `GetStringUTFRegion` | **Modified UTF-8** (`char*`) | [JNI spec - String Operations](https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#NewStringUTF) |
| `FindClass` name arg | **Modified UTF-8** | [JNI spec - FindClass](https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#FindClass) |
| `GetMethodID` / `GetFieldID` name+sig args | **Modified UTF-8** | [JNI spec - GetMethodID](https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#GetMethodID) |
| `RegisterNatives` name+sig fields | **Modified UTF-8** | [JNI spec - RegisterNatives](https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/functions.html#RegisterNatives) |
| Java `String` heap representation | **UTF-16** (compact 8-bit for ASCII since Android 8) | [Android JNI Tips](https://developer.android.com/training/articles/perf-jni#utf-8-and-utf-16-strings) |

Key quote from [Android JNI Tips](https://developer.android.com/training/articles/perf-jni#utf-8-and-utf-16-strings):
> *"The Java programming language uses UTF-16. For convenience, JNI also provides methods that work with Modified UTF-8... **Data passed to `NewStringUTF` must be in Modified UTF-8 format.** ... CheckJNI — enabled by default for emulators — scans strings and aborts the VM if it receives invalid input."*

## What's safe: normal string marshalling

`JniEnvironment.Strings` in dotnet/java-interop uses **UTF-16 JNI APIs exclusively** (`NewString`/`GetStringChars`) for all normal Java-to-C# string marshalling — method arguments, return values, field reads/writes. This is completely immune to MUTF-8 issues.

## Encoding inconsistencies in the codebase

These are not bugs in practice (all real-world class names are ASCII), but are worth documenting for awareness.

### 1. `TypeManager.GetClassName` — MUTF-8 decoded as Latin-1

The native function [`get_java_class_name_for_TypeManager`](https://github.com/dotnet/android/blob/6cc34f56b64582cda81b9b0115a94b36b0ec5bf1/src/native/clr/host/host-shared.cc) calls `GetStringUTFChars` (returns MUTF-8), `strdup`s the result, replaces `.` with `/`, and returns a `char*`. The native code is aware this is MUTF-8 — the local variable is even named `mutf8`:

```cpp
const char *mutf8 = env->GetStringUTFChars(name, nullptr);
char *ret = strdup(mutf8);
// ... replace '.' with '/' ...
return ret;
```

The managed caller decodes the returned bytes with `Marshal.PtrToStringAnsi`, which interprets them as Latin-1 (ISO-8859-1):

```csharp
IntPtr ptr = RuntimeNativeMethods.monodroid_TypeManager_get_java_class_name(class_ptr);
return Marshal.PtrToStringAnsi(ptr);
```

For ASCII, Latin-1/UTF-8/MUTF-8 are byte-identical, so this works. For non-ASCII it would produce mojibake, but this path is only used for fallback type lookup/error logging.
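To illustrate why ASCII survives and everything else turns into mojibake, here is a minimal sketch of a Latin-1 interpretation of a byte buffer. It models only the byte-to-character widening described above; `widen_latin1` is a hypothetical helper, not the implementation of `Marshal.PtrToStringAnsi`.

```cpp
#include <cassert>
#include <string>

// Widens each byte to one UTF-16 code unit, which is what a Latin-1
// (ISO-8859-1) interpretation of the buffer amounts to.
std::u16string widen_latin1(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (char c : bytes) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    }
    return out;
}
```

An ASCII name such as `java/lang/String` comes out unchanged, whereas the two bytes `C3 A9` (the UTF-8 and MUTF-8 form of `é`) come out as the two characters U+00C3 and U+00A9.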

**History:** the original 2016 implementation ([initial import](https://github.com/dotnet/android/commit/5777337e)) used the UTF-16 JNI path and handled all Unicode correctly:

```csharp
return JNIEnv.GetString(
    JNIEnv.CallObjectMethod(class_ptr, JNIEnv.mid_Class_getName),
    JniHandleOwnership.TransferLocalRef).Replace(".", "/");
```

This was replaced in [PR #3729](https://github.com/dotnet/android/pull/3729) (Oct 2019, "JNIEnv.Initialize optimization") to save ~30ms on startup by moving the work to native code. `PtrToStringAnsi` was the natural P/Invoke idiom for decoding a returned `char*`; encoding was not discussed in the PR.

### 2. `FindClass(string)` in java-interop — standard UTF-8 sent to a MUTF-8 API

`JniEnvironment.Types.TryRawFindClass` uses `Marshal.StringToCoTaskMemUTF8` to encode the class name before passing it to `FindClass`. This produces standard UTF-8, which differs from MUTF-8 only for non-BMP characters (and for an embedded NUL, which is irrelevant for class names).

The `ReadOnlySpan<byte>` overload (`FindClass(ReadOnlySpan<byte>)` using `u8` literals) bypasses this entirely and is the preferred path.
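If a guard were ever wanted on the `string` path, the check is cheap. The sketch below is an assumption-laden illustration, not existing java-interop code: `utf8_is_also_mutf8` is a hypothetical name, and the input is assumed to be well-formed standard UTF-8 already.

```cpp
#include <cassert>
#include <string>

// Returns true when a well-formed standard UTF-8 buffer is byte-identical
// to its Modified UTF-8 encoding, so it can be passed to a MUTF-8 API as-is.
bool utf8_is_also_mutf8(const std::string& utf8) {
    for (char c : utf8) {
        unsigned char b = static_cast<unsigned char>(c);
        if (b == 0x00) return false;  // MUTF-8 spells NUL as C0 80
        if (b >= 0xF0) return false;  // 4-byte lead byte: a non-BMP character
    }
    return true;
}
```

Every ASCII and BMP-only name passes; only a name containing a non-BMP character (or an embedded NUL) would need re-encoding.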

### 3. `ConstantPool.cs` — already correct

`Xamarin.Android.Tools.Bytecode/ConstantPool.cs` in dotnet/java-interop already implements a correct MUTF-8 fixup pass before calling `Encoding.UTF8.GetString`, handling both `0xC0 0x80` NUL and surrogate-pair supplementary characters. This is the reference implementation if a proper MUTF-8 decoder is ever needed elsewhere.
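For readers who want to see what such a fixup pass involves, the following is an independent minimal sketch in C++. It is not a port of `ConstantPool.cs`; `mutf8_to_utf8` is a hypothetical name, and bytes that match neither special case are copied through unchanged without further validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Rewrites Modified UTF-8 into standard UTF-8 so that an ordinary UTF-8
// decoder can consume it.
std::string mutf8_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    const std::size_t n = in.size();
    auto at = [&in](std::size_t i) {
        return static_cast<unsigned>(static_cast<unsigned char>(in[i]));
    };
    auto is_cont = [](unsigned b) { return (b & 0xC0) == 0x80; };

    std::size_t i = 0;
    while (i < n) {
        // C0 80 is the MUTF-8 spelling of U+0000.
        if (i + 1 < n && at(i) == 0xC0 && at(i + 1) == 0x80) {
            out.push_back('\0');
            i += 2;
            continue;
        }
        // ED A0..AF xx followed by ED B0..BF xx is an encoded surrogate pair.
        if (i + 5 < n &&
            at(i) == 0xED && (at(i + 1) & 0xF0) == 0xA0 && is_cont(at(i + 2)) &&
            at(i + 3) == 0xED && (at(i + 4) & 0xF0) == 0xB0 && is_cont(at(i + 5))) {
            unsigned hi = 0xD000 | ((at(i + 1) & 0x3F) << 6) | (at(i + 2) & 0x3F);
            unsigned lo = 0xD000 | ((at(i + 4) & 0x3F) << 6) | (at(i + 5) & 0x3F);
            unsigned cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
            i += 6;
            continue;
        }
        out.push_back(in[i]);  // ASCII and BMP bytes are already standard UTF-8
        ++i;
    }
    return out;
}
```

With this pass in front of the decoder, `ED A0 BD ED B8 80` becomes `F0 9F 98 80` (U+1F600), and ASCII or BMP-only input is returned unchanged.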

## Risk summary

| Path | Risk | Notes |
|---|---|---|
| Normal string marshalling (`JniEnvironment.Strings`) | None | Uses UTF-16 JNI APIs |
| Typemap keys from `[Register("...")]` attributes | None | Compile-time ASCII C# string literals |
| `FindClass(string)` via `Marshal.StringToCoTaskMemUTF8` | Theoretical | Differs from MUTF-8 only for non-BMP class names |
| `TypeManager.GetClassName` via `PtrToStringAnsi` | Theoretical | Latin-1 decode of MUTF-8; fallback/error path only |
| `ConstantPool.cs` bytecode parser | None | Already implements correct MUTF-8 fixup |

## Real-world precedent: Android 12 MUTF-8 enforcement

Android 12 (API 31) added strict MUTF-8 validation to `NewStringUTF`. Invalid input causes a hard `SIGABRT`:
> `JNI DETECTED ERROR IN APPLICATION: input is not valid Modified UTF-8`

This was triggered in the wild by [facebook/react-native#34363](https://github.com/facebook/react-native/issues/34363) / [facebook/flipper#3175](https://github.com/facebook/flipper/issues/3175), where an app name with diacritics (Romanian characters) was passed to `NewStringUTF` after incorrect percent-encoding produced invalid MUTF-8 byte sequences. 53+ GitHub issues across different projects match this error pattern.

This is not directly applicable to dotnet/android (we don't call `NewStringUTF` with user-provided strings), but illustrates that MUTF-8 issues can be latent for years and surface only when Android tightens enforcement.

## Scenarios that could theoretically trigger issues

1. **Non-ASCII BMP class names** (e.g. CJK ideographs, accented Latin, Cyrillic): **work fine today**. Verified against a real JVM (OpenJDK 21): MUTF-8 and standard UTF-8 encode every BMP character except NUL (U+0001–U+FFFF) identically, so `Encoding.UTF8` decodes them correctly. This covers all living languages, including the ~27,000 common CJK ideographs and the Latin, Cyrillic, Greek, and Arabic scripts.
2. **Non-BMP class names** (U+10000+: emoji, rare CJK extensions, historic scripts): **would break**. Verified against a real JVM: `GetStringUTFChars` returns 6-byte MUTF-8 surrogate pairs (e.g. `ED-A0-BD-ED-B8-80` for 😀), which `Encoding.UTF8` decodes as 6 replacement characters (�). Essentially non-existent in real class names.
3. **ProGuard/R8 with Unicode obfuscation dictionaries**: advanced obfuscators like dProtect can rename classes to arbitrary Unicode strings. If a bound AAR uses such obfuscation, it could produce non-ASCII JNI names. BMP obfuscation would work fine; non-BMP would break.
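The difference between the BMP and non-BMP scenarios can be reproduced without a JVM. The sketch below counts the U+FFFD substitutions a strict UTF-8 decoder would make when it follows the Unicode "maximal subpart" replacement practice. It is an illustration under the assumption that the .NET decoder follows that practice (the six replacement characters observed for 😀 are consistent with it); `count_utf8_replacements` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts the U+FFFD replacements a strict UTF-8 decoder would emit, with
// each maximal ill-formed subsequence replaced by one U+FFFD.
std::size_t count_utf8_replacements(const std::string& bytes) {
    std::size_t replacements = 0;
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t need = 0;                // continuation bytes expected
        unsigned char lo = 0x80, hi = 0xBF;  // allowed range for the first one
        if (b0 <= 0x7F) { ++i; continue; }
        else if (b0 >= 0xC2 && b0 <= 0xDF) { need = 1; }
        else if (b0 == 0xE0) { need = 2; lo = 0xA0; }  // no overlong forms
        else if (b0 >= 0xE1 && b0 <= 0xEC) { need = 2; }
        else if (b0 == 0xED) { need = 2; hi = 0x9F; }  // no surrogates
        else if (b0 == 0xEE || b0 == 0xEF) { need = 2; }
        else if (b0 == 0xF0) { need = 3; lo = 0x90; }
        else if (b0 >= 0xF1 && b0 <= 0xF3) { need = 3; }
        else if (b0 == 0xF4) { need = 3; hi = 0x8F; }  // at most U+10FFFF
        else { ++replacements; ++i; continue; }  // stray continuation or bad lead

        std::size_t j = i + 1;
        bool ok = true;
        for (std::size_t k = 0; k < need; ++k, ++j) {
            if (j >= n) { ok = false; break; }
            unsigned char b = static_cast<unsigned char>(bytes[j]);
            unsigned char first = (k == 0) ? lo : 0x80;
            unsigned char last = (k == 0) ? hi : 0xBF;
            if (b < first || b > last) { ok = false; break; }
        }
        if (!ok) ++replacements;  // bytes [i, j) collapse into one U+FFFD
        i = j;                    // resume at the first byte not consumed
    }
    return replacements;
}
```

Because `ED` may only be followed by `80..9F` in standard UTF-8, each of the six bytes of an encoded surrogate pair is rejected individually, whereas BMP text and genuine 4-byte sequences produce no replacements.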

## Conclusion

The ASCII-only assumption is deeply embedded and has been validated by years of production use with zero bug reports. Future work touching type name lookup paths (e.g. #10795) should simply maintain this same assumption and document it. No fix is needed at this time.

## Open questions and follow-up

### Connection to the trimmable type map (#10795)

The trimmable type map (`NativeHashtable`) stores JNI class names as **UTF-16 characters** in a native blob. At runtime, the type map lookup API accepts a `string` key.

**Important:** the trimmable type map path goes through java-interop's `JniRuntime.JniTypeManager`, which resolves class names via `GetJniTypeNameFromClass`. This calls `Class.getName()` and decodes the result using `GetStringChars` (**UTF-16**, not MUTF-8) into `new string(char*, 0, len)`. So MUTF-8 is **not involved** in the trimmable type map lookup path at all — the class name arrives as a proper .NET `string` via the UTF-16 JNI API.

The MUTF-8 / `GetStringUTFChars` path only exists in the legacy `TypeManager.GetClassName` native helper (see the encoding inconsistencies section above).

The current flow for the trimmable type map is:

```
jclass -> Class.getName() via JNI
       -> GetStringChars (UTF-16 jchar*)
       -> new string(char*, 0, len)     // heap allocation
       -> .Replace('.', '/')
       -> GetTypesForSimpleReference(string)
       -> NativeHashtable lookup
```

The idea in #10795 is that the lookup table could also accept `ReadOnlySpan<char>` instead of just `string`. Since the class name is already available as UTF-16 chars from `GetStringChars`, we could copy them into a stackalloc buffer (with the `.` -> `/` replacement) instead of creating a heap-allocated `string`. For ASCII inputs (which is all real-world cases), this is a trivial and fast operation.
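The shape of that lookup can be sketched as follows. This is a C++ illustration of the idea only, not the proposed C# API: `TypeMapEntry`, `lookup_type`, and `kMaxStackChars` are hypothetical names, and a sorted vector with binary search stands in for `NativeHashtable`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical type-map entry: slash-separated JNI name in UTF-16.
struct TypeMapEntry {
    std::u16string jni_name;
    int managed_type_index;
};

// Three-way comparison of two UTF-16 spans by code unit.
static int compare_utf16(const char16_t* a, std::size_t alen,
                         const char16_t* b, std::size_t blen) {
    const std::size_t n = std::min(alen, blen);
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) return a[i] < b[i] ? -1 : 1;
    }
    if (alen == blen) return 0;
    return alen < blen ? -1 : 1;
}

// Looks up a dot-separated Class.getName() result in a table sorted by
// jni_name. Returns the managed type index, or -1 when there is no match.
int lookup_type(const std::vector<TypeMapEntry>& sorted_table,
                const char16_t* java_name, std::size_t len) {
    assert(java_name != nullptr || len == 0);
    constexpr std::size_t kMaxStackChars = 256;
    if (len > kMaxStackChars) return -1;  // a real version would fall back to the heap

    // Single pass: copy into a stack buffer while turning '.' into '/'.
    char16_t key[kMaxStackChars];
    for (std::size_t i = 0; i < len; ++i) {
        key[i] = (java_name[i] == u'.') ? u'/' : java_name[i];
    }

    // Binary search over the span; no string object is created for the key.
    std::size_t lo = 0, hi = sorted_table.size();
    while (lo < hi) {
        const std::size_t mid = lo + (hi - lo) / 2;
        const TypeMapEntry& e = sorted_table[mid];
        const int c = compare_utf16(e.jni_name.data(), e.jni_name.size(), key, len);
        if (c == 0) return e.managed_type_index;
        if (c < 0) lo = mid + 1; else hi = mid;
    }
    return -1;
}
```

Since both the key and the table are compared as UTF-16 code units, a name containing surrogate pairs is matched the same way as an ASCII one.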

Even more aggressively, when `GetStringChars` returns a direct pointer to the VM's internal character data rather than a copy (it reports which through its `isCopy` out-parameter), it may be possible to perform the lookup directly against that pointer as a `ReadOnlySpan<char>`, though the `.` to `/` replacement and JNI critical section constraints would need to be considered.

The `TypeManager.GetClassName` history provides additional confidence in the ASCII-only assumption: it has used `PtrToStringAnsi` (Latin-1, equivalent to ASCII widening) since 2019 with zero issues.

The [benchmark data posted on #10795](https://github.com/dotnet/android/issues/10795#issuecomment-4134229084) shows a span-based lookup path is **~30% faster with zero heap allocation** compared to the string-allocating path.

| Strategy | Key type | Source | Allocation | Notes |
|---|---|---|---|---|
| Current | `string` | `GetStringChars` -> `new string(char*)` | 56-112 B/lookup | Heap-allocated string |
| Span from JNI | `ReadOnlySpan<char>` | `GetStringChars` -> stackalloc copy (with `.`->`/` fixup) | 0 B | ~30% faster; requires `TryGetValue(ROS<char>)` on the hashtable |

### Staying on UTF-16 end-to-end

This approach has a significant advantage beyond performance: it **sidesteps the MUTF-8 question entirely**. Since `GetStringChars` returns UTF-16 and the native blob stores UTF-16, the entire lookup stays in UTF-16 from start to finish. No encoding conversion, no ASCII assumption needed for correctness, no fallback path for non-ASCII names. It's correct for all Unicode inputs by construction.

The only transformation needed is the `.` to `/` replacement (package separator to JNI separator; the `$` for nested classes is already present in `Class.getName()` output and left untouched). This can be done during the stackalloc copy in a single pass — trivially vectorizable (compare against `'.'`, blend with `'/'`).

This makes the MUTF-8 analysis in this issue nicely self-contained: the MUTF-8 encoding concern is real but only affects the legacy `TypeManager.GetClassName` native helper path. The trimmable type map can avoid it entirely by staying on UTF-16.

### Verified against a real JVM

All of the above has been tested against a desktop OpenJDK 21 JVM using java-interop's `JreRuntime`. Key results:

- `GetStringChars` (UTF-16) round-trips all characters correctly: ASCII, CJK, emoji
- `GetStringUTFChars` (MUTF-8) returns 6-byte surrogate pairs for non-BMP characters (e.g. `[ED-A0-BD-ED-B8-80]` for U+1F600), which `Encoding.UTF8.GetString()` decodes as `������`
- The zero-allocation lookup (`GetStringChars` → stackalloc copy with `.`→`/` → `ReadOnlySpan<char>` lookup) works end-to-end with the JVM and produces 0 bytes of managed allocation across 1000 lookups

Test code is in the [`Utf16LookupTest`](https://github.com/simonrozsival/experiment-utf8-to-utf16) experiment project.

## Related

- #10795 — JNI type name lookup performance (trimmable type maps)
- [`ConstantPool.cs` in dotnet/java-interop](https://github.com/dotnet/java-interop/blob/main/src/Xamarin.Android.Tools.Bytecode/ConstantPool.cs) — correct MUTF-8 fixup implementation
- [JNI spec: Modified UTF-8](https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/types.html#wp16542)
- [Android JNI Tips: UTF-8 and UTF-16 Strings](https://developer.android.com/training/articles/perf-jni#utf-8-and-utf-16-strings)
