Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
136 changes: 136 additions & 0 deletions dev/design/utf8_flag_parity.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,136 @@
# UTF-8 Flag Parity: Byte-String Preservation

## Problem

PerlOnJava has a systemic issue where operations that should produce byte strings
(SvUTF8=0 in Perl) instead produce UTF-8-flagged strings (STRING type). This
causes data corruption when binary data (JPEG, TIFF, PNG, GIF) is round-tripped
through ExifTool's write path, because `Encode::is_utf8()` returns true and
`Encode::encode('utf8', $data)` re-encodes bytes >127 as multi-byte sequences.

### PerlOnJava Type Model

| PerlOnJava type | Perl equivalent | UTF-8 flag |
|-----------------|-----------------|------------|
| `BYTE_STRING` | SvUTF8=0 | off |
| `STRING` | SvUTF8=1 | on |
| `INTEGER` | IV | N/A |
| `DOUBLE` | NV | N/A |
| `UNDEF` | undef | N/A |

### Perl Rule

An operation produces a UTF-8-flagged string **only** when at least one input
has the UTF-8 flag on. Types without a flag (integers, floats, undef, byte
strings) never upgrade the result to UTF-8.

## Completed Fixes

### 1. `join` — StringOperators.joinInternal()

**File:** `src/main/java/org/perlonjava/runtime/operators/StringOperators.java`

- 1-element fast path: preserve source type instead of always creating STRING
- 2+ element path: track `hasUtf8` — only set if an element is STRING type
- Non-STRING types (INTEGER, DOUBLE, UNDEF, BYTE_STRING) are byte-compatible

### 2. String concatenation — StringOperators.stringConcat()

**File:** `src/main/java/org/perlonjava/runtime/operators/StringOperators.java`

- Only return STRING when at least one operand is STRING type
- When both operands are non-STRING, produce BYTE_STRING (with Latin-1 safety check)
- Previously, if neither was BYTE_STRING (e.g. INTEGER + BYTE_STRING), it fell
through to the default STRING return

### 3. `sprintf` — SprintfOperator.sprintfInternal()

**File:** `src/main/java/org/perlonjava/runtime/operators/SprintfOperator.java`

- Track `hasUtf8Input` by checking format string and all argument types
- Return BYTE_STRING when no input has STRING type

## Remaining Work

### 4. `unpack` — Format handlers return STRING for string results

**Status:** Not yet fixed — needs per-handler analysis

**Problem:** `unpack` format handlers (`HexStringFormatHandler`, `StringFormatHandler`,
`NumericFormatHandler`, `BitStringFormatHandler`, etc.) create results with
`new RuntimeScalar(someString)` which defaults to STRING type. In Perl, `unpack`
returns byte strings for all formats except `U` with wide characters.

**Impact:** ExifTool's `ImageInfo` path uses `unpack("n", ...)`, `unpack("H*", ...)`,
etc. to extract tag values. These values carry the UTF-8 flag, and when
round-tripped through SetNewValue + WriteInfo, the flag propagates to the
output buffer.

**Approach — per-handler fixes:**

Each handler that produces string results via `new RuntimeScalar(String)` needs
to set `type = BYTE_STRING` on the result. This is safe because:

- Numeric formats (n, N, v, V, s, S, i, I, l, L, q, Q): return integers, already OK
- String formats (a, A, Z): should return BYTE_STRING
- Hex/bit formats (H, h, B, b): produce ASCII hex/bit strings, should be BYTE_STRING
- `U` format: returns code points — may legitimately need STRING for chars > 0xFF
- `C` format: returns byte values as integers, already OK

**Files to audit:**
- `src/main/java/org/perlonjava/runtime/operators/unpack/HexStringFormatHandler.java`
- `src/main/java/org/perlonjava/runtime/operators/unpack/StringFormatHandler.java`
- `src/main/java/org/perlonjava/runtime/operators/unpack/BitStringFormatHandler.java`
- `src/main/java/org/perlonjava/runtime/operators/unpack/PointerFormatHandler.java`

**NOT a blanket post-process:** A post-processing step on all unpack results
was considered but rejected as dangerous — it could break `unpack("U", $wide_char)`
which legitimately produces UTF-8 strings.

### 5. Other potential sources

These operations may also need auditing for byte-string preservation:

| Operation | Risk | Notes |
|-----------|------|-------|
| `chr()` | Low | Likely OK — returns BYTE_STRING for 0-255 |
| `substr()` | Medium | Result should inherit source type |
| `lc/uc/ucfirst/lcfirst` | Medium | Should inherit source type |
| `reverse()` | Low | Should inherit source type |
| Hash/array stringification | Low | Produces addresses, should be byte |

### 6. GPX/Geotag parsing (separate issue)

**Status:** Not yet investigated

ExifTool's Geotag.pm reads GPX files using `$/ = '>'` and regex with `\3`
backreference. 5 test failures in Geotag.t and Geolocation.t. This is a
separate issue from UTF-8 flag handling — likely I/O or regex related.

## Verification

```bash
# Quick sanity check
./jperl -e 'use Encode; print Encode::is_utf8(join("", "\xff")) ? "BAD" : "OK", "\n"'
./jperl -e 'use Encode; print Encode::is_utf8(sprintf "%d", 42) ? "BAD" : "OK", "\n"'
./jperl -e 'use Encode; print Encode::is_utf8("" . 42) ? "BAD" : "OK", "\n"'
./jperl -e 'use Encode; print Encode::is_utf8(unpack("H*", "AB")) ? "BAD" : "OK", "\n"'

# ExifTool test suite
cd /path/to/Image-ExifTool-13.55-0
../jperl -Ilib t/IPTC.t # Test 4 should pass after unpack fix
../jperl -Ilib t/Writer.t # Multiple write tests
../jperl -Ilib t/GIF.t # GIF header byte integrity
```

## Progress Tracking

### Current Status: Fixes 1-3 completed, Fix 4 pending

| Fix | Status | Impact |
|-----|--------|--------|
| join byte-string | Done | High — ExifTool Write path |
| stringConcat byte-string | Done | High — all concat ops |
| sprintf byte-string | Done | Medium — tag value formatting |
| unpack per-handler | Pending | High — ExifTool ImageInfo path |
| GPX/Geotag parsing | Pending | 5 test failures |
166 changes: 166 additions & 0 deletions dev/modules/exiftool_parity.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,166 @@
# Image::ExifTool Parity Fixes for PerlOnJava

## Overview

Image::ExifTool 13.55 has 113 test programs. When run under PerlOnJava,
98 pass cleanly, 4 are false-timeout (pass when run individually), and
11 have real failures totalling ~24 broken subtests across 6 root-cause
categories.

**Branch:** `fix/http-tiny-redirect-mirror` (started with HTTP::Tiny fix)
**Module version:** Image::ExifTool 13.55 (113 test programs)

### Results History

| Date | Programs OK | Subtests Failed | Key Fix |
|------|-------------|-----------------|---------|
| Baseline (pre-fix) | 0/113 | N/A | HTTP::Tiny 301 redirect + binary mirror — `jcpan` couldn't download |
| After HTTP fix | 98/113 | ~24 | Redirect following + binary-safe mirror |

### Current Failure Summary

| Test File | Fail/Total | Category |
|-----------|-----------|----------|
| GIF.t | 3/5 | Binary write (GIF header byte corruption) |
| IPTC.t | 1/8 | `join` UTF-8 flag on byte strings |
| CanonRaw.t | 1/9 | Binary write (maker notes offsets) |
| FujiFilm.t | 1/6 | Binary write (TIFF header corruption) |
| Geolocation.t | 1/8 | GPX/XML parsing (`$/` or regex) |
| Geotag.t | 4/12 | GPX/XML parsing (`$/` or regex) |
| MIE.t | 1/6 | Binary write (thumbnail offset +26 bytes) |
| Nikon.t | 1/9 | Binary write (IFD format corruption) |
| PNG.t | 1/7 | `join` UTF-8 flag corrupts written file |
| Writer.t | 8/61 | Binary write (multiple patterns) |
| XMP.t | 2/54 | Binary write + UTF-8 encoding |

False-timeout (pass when run individually): CanonVRD, FotoStation, Olympus, Pentax

---

## Root Cause Analysis

### RC1: `join` corrupts byte-string flag (~17 failures)

**Impact:** Categories 1-4, 6a, 6b — the dominant root cause

`StringOperators.joinInternal()` has three bugs:

1. **1-element fast path** (line 650): `new RuntimeScalar(scalar.toString())` always
creates type `STRING`, discarding `BYTE_STRING`. When ExifTool does
`$$outfile .= join('', @data)` with binary data, the UTF-8 flag
propagates to the output buffer. On re-read, `Encode::is_utf8()` is true
and `Encode::encode('utf8', $$arg)` re-encodes bytes >127 as multi-byte
sequences, destroying JPEG/TIFF/GIF/PNG signatures.

2. **0-element fast path** (line 640): `new RuntimeScalar("")` creates STRING;
should create BYTE_STRING when separator is byte-string.

3. **2+ element path** (line 679): `isByteString` requires ALL elements to be
`BYTE_STRING`. In Perl, `join` on a mix of integers and byte-strings
should produce a byte-string (integers have no UTF-8 flag). The check
should treat non-STRING types (INTEGER, DOUBLE, UNDEF) as byte-compatible.

**Reproducer:**
```bash
./jperl -e 'use Encode; my $b = "\xff\xd8"; print Encode::is_utf8(join("", $b)), "\n"'
# PerlOnJava: 1 (wrong)
# Perl: (empty, i.e. false)
```

### RC2: GPX/XML parsing failures (5 failures)

**Impact:** Geotag.t (4 failures), Geolocation.t (1 failure)

ExifTool's `Geotag.pm` reads GPX files using `$/ = '>'` as the input
record separator, then parses each chunk with regex. The "No track points
found" error means the parser can't match `<trkpt>` elements.

Possible sub-causes:
- `File::RandomAccess::ReadLine()` not honoring custom `$/`
- Regex backreference `\3` in attribute parser not working
- Floating-point interpolation issue (Geotag test 9 — wrong coordinates)

### RC3: UTF-8 / XML encoding (1 failure)

**Impact:** XMP.t test 35

XMP structured write with `AOTitle-de=pr\xc3\xbcfung` (UTF-8 "prüfung")
differs from reference output at line 13. May be double-encoding or
incorrect byte-to-character handling.

---

## Fix Plan

### Phase 1: Fix `join` byte-string preservation (RC1)

**File:** `src/main/java/org/perlonjava/runtime/operators/StringOperators.java`

**Changes:**
1. In 1-element fast path: preserve `BYTE_STRING` type from input element
2. In 0-element fast path: return BYTE_STRING when separator is byte-string
3. In 2+ element path: treat INTEGER, DOUBLE, UNDEF as byte-compatible
(only STRING type upgrades to UTF-8)

**Expected impact:** Should fix most binary write corruption (IPTC, PNG,
Writer, CanonRaw, FujiFilm, GIF, MIE, Nikon, XMP test 3). This is the
highest-value fix.

**Verification:**
```bash
make # Unit tests pass
# Individual ExifTool tests:
cd /Users/fglock/.cpan/build/Image-ExifTool-13.55-0
../jperl -Ilib t/IPTC.t
../jperl -Ilib t/GIF.t
../jperl -Ilib t/Writer.t
../jperl -Ilib t/PNG.t
```

### Phase 2: Investigate GPX/Geotag parsing (RC2)

**Files to investigate:**
- `Image::ExifTool::Geotag` — GPX parser using `$/ = '>'`
- `File::RandomAccess` — `ReadLine()` method
- PerlOnJava regex engine — `\3` backreference support

**Steps:**
1. Test `$/` with a simple script: `$/ = '>'; while (<FH>) { ... }`
2. Test regex backreference: `'key="val"' =~ /(\w+)=(['"])(.*?)\2/`
3. If `$/` is the issue, fix in the I/O layer
4. If regex backreference, fix in `RegexPreprocessor.java`

**Verification:**
```bash
cd /Users/fglock/.cpan/build/Image-ExifTool-13.55-0
../jperl -Ilib t/Geotag.t
../jperl -Ilib t/Geolocation.t
```

### Phase 3: Investigate UTF-8 / XMP encoding (RC3)

**Steps:**
1. Diff XMP.t test 35 output vs reference file
2. Determine if it's double-encoding or byte-handling issue
3. May be fixed by Phase 1 (join fix)

**Verification:**
```bash
cd /Users/fglock/.cpan/build/Image-ExifTool-13.55-0
../jperl -Ilib t/XMP.t
```

---

## Progress Tracking

### Current Status: Phase 1 in progress

### Completed
- [x] HTTP::Tiny redirect + binary mirror fix (PR #472)
- [x] Full failure analysis and categorization

### Next Steps
1. Fix `joinInternal` byte-string preservation
2. Run full ExifTool test suite to measure improvement
3. Investigate remaining failures (GPX, UTF-8)
Original file line number Diff line number Diff line change
Expand Up @@ -4307,16 +4307,21 @@ void compileVariableReference(OperatorNode node, String op) {
// Also set isSymbolicReference so defined(\&stub) returns true, matching
// the JVM backend's createCodeReference behavior.
if (node.operand instanceof OperatorNode operandOp
&& operandOp.operator.equals("&")
&& operandOp.operand instanceof IdentifierNode idNode) {
// Set isSymbolicReference before loading, so defined(\&Name) returns true
String subName = NameNormalizer.normalizeVariableName(
idNode.name, getCurrentPackage());
RuntimeScalar codeRef = GlobalVariable.getGlobalCodeRef(subName);
if (codeRef.type == RuntimeScalarType.CODE
&& codeRef.value instanceof RuntimeCode rc) {
rc.isSymbolicReference = true;
&& operandOp.operator.equals("&")) {
if (operandOp.operand instanceof IdentifierNode idNode) {
// \&name — regular package sub. Set isSymbolicReference before
// loading, so defined(\&Name) returns true
String subName = NameNormalizer.normalizeVariableName(
idNode.name, getCurrentPackage());
RuntimeScalar codeRef = GlobalVariable.getGlobalCodeRef(subName);
if (codeRef.type == RuntimeScalarType.CODE
&& codeRef.value instanceof RuntimeCode rc) {
rc.isSymbolicReference = true;
}
}
// For both \&name and \&$var (lexical subs), the & operator
// already produces a CODE value — no CREATE_REF wrapping needed.
// This matches the JVM backend's createCodeReference behavior.
node.operand.accept(this);
// lastResultReg already holds the CODE scalar — no wrapping needed
return;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -298,11 +298,31 @@ public static int executeStringConcatAssign(int[] bytecode, int pc, RuntimeBase[
registers[rd] = BytecodeInterpreter.ensureMutableScalar(registers[rd]);
}
RuntimeScalar target = (RuntimeScalar) registers[rd];
// Remember if target was BYTE_STRING before concatenation.
// In PerlOnJava, "upgrading" from BYTE_STRING to STRING doesn't change bytes
// (unlike Perl where bytes > 127 get re-encoded), so we preserve BYTE_STRING
// in .= to prevent false UTF-8 flag contamination of binary buffers.
boolean wasByteString = (target.type == RuntimeScalarType.BYTE_STRING);
RuntimeScalar result = StringOperators.stringConcat(
target,
(RuntimeScalar) registers[rs]
);
target.set(result);
// Preserve BYTE_STRING type when the target was byte string and the result
// still fits in Latin-1 (all chars <= 255)
if (wasByteString && target.type == RuntimeScalarType.STRING) {
String s = target.toString();
boolean fits = true;
for (int i = 0; i < s.length(); i++) {
if (s.charAt(i) > 255) {
fits = false;
break;
}
}
if (fits) {
target.type = RuntimeScalarType.BYTE_STRING;
}
}
// Invalidate pos() - any string modification should reset pos to undef
RuntimePosLvalue.invalidatePos(target);
return pc;
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -313,7 +313,12 @@ static void handleCompoundAssignment(EmitterVisitor emitterVisitor, BinaryOperat
throw new RuntimeException("No operator handler found for base operator: " + baseOperator);
}
// assign to the Lvalue
mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL, "org/perlonjava/runtime/runtimetypes/RuntimeScalar", "set", "(Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;)Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;", false);
// For .= use setPreservingByteString to prevent UTF-8 flag contamination of binary buffers
if (node.operator.equals(".=")) {
mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL, "org/perlonjava/runtime/runtimetypes/RuntimeScalar", "setPreservingByteString", "(Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;)Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;", false);
} else {
mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL, "org/perlonjava/runtime/runtimetypes/RuntimeScalar", "set", "(Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;)Lorg/perlonjava/runtime/runtimetypes/RuntimeScalar;", false);
}

// For string concat assign (.=), invalidate pos() since string was modified
if (node.operator.equals(".=")) {
Expand Down
4 changes: 2 additions & 2 deletions src/main/java/org/perlonjava/core/Configuration.java
Original file line number Diff line number Diff line change
Expand Up @@ -33,7 +33,7 @@ public final class Configuration {
* Automatically populated by Gradle/Maven during build.
* DO NOT EDIT MANUALLY - this value is replaced at build time.
*/
public static final String gitCommitId = "5ee00fd77";
public static final String gitCommitId = "f90f44dc2";

/**
* Git commit date of the build (ISO format: YYYY-MM-DD).
Expand All @@ -48,7 +48,7 @@ public final class Configuration {
* Parsed by App::perlbrew and other tools via: perl -V | grep "Compiled at"
* DO NOT EDIT MANUALLY - this value is replaced at build time.
*/
public static final String buildTimestamp = "Apr 9 2026 17:13:46";
public static final String buildTimestamp = "Apr 9 2026 18:35:44";

// Prevent instantiation
private Configuration() {
Expand Down
Loading
Loading