Skip to content

Text::CSV support: 52722/52723 subtests pass (99.998%)#438

Merged
fglock merged 30 commits intomasterfrom
feature/text-csv-support
Apr 4, 2026
Merged

Text::CSV support: 52722/52723 subtests pass (99.998%)#438
fglock merged 30 commits intomasterfrom
feature/text-csv-support

Conversation

@fglock
Copy link
Copy Markdown
Owner

@fglock fglock commented Apr 4, 2026

Summary

Comprehensive fixes to support Text::CSV 2.06 on PerlOnJava. The pure-Perl backend (Text::CSV_PP) now passes 39/40 test files and 52722/52723 subtests.

Key changes

  • @inc ordering + blib support for CPAN module testing via jcpan
  • BYTE_STRING type preservation across string operations (tr///, substr, readline, logical ops)
  • UTF-8/Latin-1 source encoding detection — files with non-UTF-8 bytes correctly handled as ISO-8859-1
  • Heredoc octet handling — without use utf8, heredoc content now produces byte strings matching Perl 5
  • utf8::valid() — byte strings (UTF-8 flag off) now always return true, matching Perl 5 semantics
  • namespace::autoclean — working implementation that cleans imported functions while preserving local methods
  • Wide character in print — fixed for heredocs containing non-ASCII chars in UTF-8 source files
  • ${q} ambiguity warning — suppressed inside string interpolation (matching Perl 5 behavior)
  • Various fixes: use bytes regex matching, Encode UTF-16/32 orphan byte handling, untie value retention, local %hash save/restore, PerlIO layer fixes, goto &sub label preservation, and more

Test results

Metric Result
Test files 39/40 pass
Subtests 52722/52723 pass (99.998%)
Only failure t/70_rt.t test 72 — pre-existing Text::CSV_PP issue (same in Perl 5 without XS backend)

Commits

28 commits covering Phases 1–9 of the Text::CSV fix plan (see dev/modules/text_csv_fix_plan.md).

Test plan

  • make passes (all unit tests)
  • ./jcpan -j 4 -t Text::CSV — 39/40 files, 52722/52723 subtests
  • ./jcpan -t LWP::UserAgent — 22/22 files pass
  • No Ambiguous use of ${q} warnings
  • No Wide character in print warnings

Generated with Devin

fglock and others added 30 commits April 4, 2026 20:57
- Add %_ to strict vars exemption lists (EmitVariable, BytecodeCompiler,
  Variable) - %_ is a valid Perl global hash like $_ and @_
- Fix Lib.java to unshift (prepend) directories instead of push (append),
  matching Perl lib.pm semantics. This allows use lib qw(./lib) in
  Makefile.PL to override bundled modules.
- Add Text::CSV fix plan documenting remaining issues

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Reorder @inc so user-installed modules override bundled ones:
  -I args > PERL5LIB > ~/.perlonjava/lib > jar:PERL5LIB
  This mirrors Perl 5 site_perl > core pattern.
- Add blib/lib population to MakeMaker-generated Makefiles so
  make test can find modules via PERL5LIB=./blib/lib

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
The bytecode compiler used loopStack.peek() for unlabeled last/next/redo,
which returned do-while pseudo-loops (isTrueLoop=false). This caused
errors when last was used inside a do-while nested in a real while loop.

Fix: iterate loopStack to find the first isTrueLoop=true entry, matching
the JVM backend findInnermostTrueLoopLabels behavior.

Impact: unblocks Text::CSV_PP core parser. Tests go from ~4 to 19/40.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- BytesPragma.java: Register bytes::length, bytes::chr, bytes::ord,
  bytes::substr as callable subroutines, delegating to existing
  StringOperators/ScalarOperators byte-aware methods.
  Text::CSV_PP calls bytes::length() directly at lines 1989/1995.

- RuntimeCode.java: Add GLOB type handling in method dispatch.
  Bare typeglobs (*FH, *DATA) used as method invocants now auto-bless
  to IO::File, matching the existing GLOBREFERENCE behavior.
  This fixes *FH->print(), *DATA->getline(), etc.

Text::CSV tests: 24/40 pass (up from 19), 31019 subtests ran.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
24/40 test programs pass, 31019 subtests ran, 118 actual failures.
Documented remaining issues: binary source reading (t/70_rt.t),
Scalar::Util::readonly (t/75_hashref.t), TieScalar (t/76_magic.t),
utf-32 encoding (t/85_util.t), and UTF-8 handling edge cases.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Bytecode compiler changes:
- Add isBytesEnabled() helper to BytecodeCompiler
- Check HINT_BYTES for length/chr/ord/fc/lc/uc/lcfirst/ucfirst
  and emit *_BYTES opcodes when 'use bytes' is active
- Add FC_BYTES, LC_BYTES, UC_BYTES, LCFIRST_BYTES, UCFIRST_BYTES
  opcodes with handler and disassembly support

DATA section changes:
- Store raw file bytes (after BOM removal) in CompilerOptions
- Extract DATA section content from raw bytes instead of
  UTF-8-decoded tokens, preserving non-UTF-8 bytes (e.g. Latin-1)
- Fall back to token-based extraction for eval/string contexts

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Pass VOID context through to RHS of &&/and, ||/or, // operators in
  both JVM backend (EmitLogicalOperator) and bytecode compiler
  (CompileBinaryOperator). Previously VOID was converted to SCALAR,
  causing side-effect-only expressions to leave values on the stack.
  Fixes t/80_diag.t tests 113-114.

- Add null check in PerlIO::get_layers for non-GLOB arguments,
  throwing "Not a GLOB reference" instead of NPE.
  Fixes t/90_csv.t test 104.

Text::CSV results: 27/40 programs pass (was 16/40).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Previously, `local %hash` only saved the hash contents internally
(via RuntimeHash.dynamicSaveState), but did not save the globalHashes
map entry. When `*glob = \%other` replaced the map entry via glob slot
assignment, the scope-exit restore put the saved contents into the
orphaned original hash, not the one in the global map.

This adds GlobalRuntimeHash (following the GlobalRuntimeScalar pattern)
which saves and restores the actual globalHashes map entry, including
glob alias handling.

Applied in both the JVM backend (EmitOperatorLocal.java) and the
bytecode interpreter (BytecodeInterpreter.java LOCAL_HASH handler).

Fixes Text::CSV t/91_csv_cb.t test 45 (%_ restoration after
`local %_` + `*_ = $hashref`).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…yers

In Perl, reading from file handles without encoding layers (e.g., :raw,
:bytes, or default mode) produces byte strings with the UTF-8 flag off.
PerlOnJava's readline methods (readUntilCharacter, readUntilString,
readParagraphMode, readFixedLength) were always creating STRING-typed
results, which made utf8::is_utf8() return true for all readline output.

This caused Text::CSV_PP's binary character detection to fail: CSV_PP
checks utf8::is_utf8($data) to decide whether to skip binary validation,
so bytes like \x08 (backspace) were silently accepted instead of
raising error 2037.

Changes:
- LayeredIOHandle: add hasEncodingLayer() to detect :utf8/:encoding(...)
- RuntimeIO: add isByteMode() to check if handle produces byte data
- Readline: all four read methods now check isByteMode() and set
  BYTE_STRING type on results when no encoding layers are active

Impact on Text::CSV tests:
- t/20_file.t, t/21_lexicalio.t: 104/109 -> 108/109 (+4 each)
- t/22_scalario.t: 131/136 -> 135/136 (+4)
- t/47_comment.t: 56/71 -> 71/71 (+15, all pass)
- t/51_utf8.t: 128/207 -> 132/167 (+4)
- t/85_util.t: 318/1448 -> 330/330 (all pass)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
30/40 test programs now pass (up from 27/40).
Phase 5 (readline BYTE_STRING propagation) fixed 27 subtest
failures across 6 test files. Notable improvements:
- t/47_comment.t: 71/71 (was 56/71)
- t/85_util.t: 330/330 (was 318/1448)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- TieScalar: cache last FETCHd value; untie restores it (not pre-tie
  value), matching Perl 5 behavior. Fixes t/76_magic.t (44/44).

- LayeredIOHandle: add decoded character buffer to prevent character loss
  when encoding layer decodes more characters than requested. Previously,
  reading 4 bytes of UTF-16BE produced 2 chars but only 1 was consumed;
  the other was silently discarded. Now excess chars are buffered for the
  next read. Also clear buffer on binmode/seek/close.

- Encode: add UTF-32, UTF-32BE, UTF-32LE charset aliases.

- perl_test_runner.pl: handle CPAN module paths with absolute directories
  so require ./t/util.pl works correctly.

Text::CSV t/85_util.t: 330/1448 -> 1350/1448

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
… for non-octets

- All IO write() methods (CustomFileChannel, StandardIO, PipeOutputChannel,
  CustomOutputStreamHandle): detect characters > 255 and auto-encode to
  UTF-8, matching Perl 5 'Wide character in print' behavior. Previously,
  wide chars were truncated to their low byte (e.g., U+FEFF -> 0xFF).

- Utf8.java decode(): return false without modification when string
  contains characters > 0xFF, since they cannot be valid UTF-8 octets.
  Previously, getBytes(ISO_8859_1) silently replaced them with '?',
  corrupting Text::CSV sep/quote chars and causing sanity check failures.

Text::CSV t/85_util.t: 1350 -> 1356/1448
Text::CSV t/51_utf8.t: 167/207 (crashed) -> 198/207 (all run)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Two fixes that significantly improve Text::CSV test pass rates:

1. use bytes regex matching: Under use bytes pragma, regex character
   classes like [\x7f-\xa0] now match against UTF-8 byte representation
   of strings rather than Unicode characters. This fixes Text::CSV_PP
   quote_binary detection for multi-byte characters (e.g., euro sign).
   Added toBytesString() to StringOperators, with support in both JVM
   and interpreter backends.

2. Latin-1 source encoding detection: Source files containing non-ASCII
   bytes that are not valid UTF-8 are now detected and read as ISO-8859-1
   instead of UTF-8. This matches Perl 5 behavior where source files
   without use utf8 are treated as Latin-1. Files are marked with
   isByteStringSource so the string parser does not re-encode characters.

Test improvements:
- t/50_utf8.t: 92/93 -> 93/93 (use bytes regex fix)
- t/20_file.t: 108/109 -> 109/109 (Latin-1 fix)
- t/21_lexicalio.t: 108/109 -> 109/109 (Latin-1 fix)
- t/22_scalario.t: 135/136 -> 136/136 (Latin-1 fix)
- t/70_rt.t: 1/20469 -> 20466/20469 (Latin-1 fix, +20465 tests!)
- Overall: 32255 total -> 52723 total tests, 9 -> 5 failing programs

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- Add Wide character in print warning to RuntimeIO.write() when writing
  characters > 0xFF to a filehandle without a UTF-8 encoding layer. The
  warning is on by default (matching Perl 5) and suppressible with
  no warnings utf8. It goes through WarnDie.warn() so it is catchable
  by $SIG{__WARN__} handlers.

- Fix utf8::upgrade() to simply flip the type from BYTE_STRING to STRING
  without decoding the bytes as UTF-8. In Perl 5, utf8::upgrade() only
  changes the internal storage flag; character codepoints remain identical.
  Previously, bytes like 0xE2,0x82,0xAC (UTF-8 for euro sign) were
  incorrectly decoded back to U+20AC, reversing a prior utf8::encode().

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
In Perl, `print` uses an internal copy of $\ (PL_ors_sv) that is only
updated by direct assignment to $\. Aliasing via `for $\ (@list)` changes
the Perl-visible variable but not the internal value print uses.

PerlOnJava was reading $\ directly from the global variable map, so
`for $\ ($rs) { print $fh $str }` would incorrectly append the aliased
iterator value instead of the original $\ value.

Fix: Create OutputRecordSeparator and OutputFieldSeparator classes that
maintain a static internal value updated only by set(). print reads these
internal values instead of the map entries. GlobalRuntimeScalar handles
save/restore of internal values during local/for scoping.

This fixes 12 failures in Text::CSV t/45_eol.t and all 12 failures in
t/46_eol_si.t.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
withCapturedVars() created a copy of InterpretedCode for closures but
didn't copy gotoLabelPcs or usesLocalization. This caused goto LABEL
to fail in interpreter-fallback subroutines that have closure variables
(like Text::CSV_PP's ____parse, because the label map was silently
dropped when binding captured variables.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
EOF
)
- RuntimeTransliterate: both /r return path and in-place modification
  path now preserve BYTE_STRING type from the input scalar
- RuntimeSubstrLvalue: substr() on a BYTE_STRING parent now returns
  BYTE_STRING instead of hardcoded STRING type

These fixes ensure that byte-oriented string operations maintain
their binary semantics, fixing Text::CSV t/51_utf8.t tests 122,
134, 144 where multi-byte separators were garbled.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…ions

- chomp/chop: preserve BYTE_STRING after removing separator
- Regex captures ($1, $2, $&, etc.): track lastMatchWasByteString flag
  and propagate to ScalarSpecialVariable results and list-context returns
- split: all result elements inherit BYTE_STRING from input string
- s///: preserve BYTE_STRING for both normal and /r substitution
- lc/uc/lcfirst/ucfirst/fc/quotemeta: preserve type from input
- reverse/repeat (x): preserve BYTE_STRING from input
- utf8::is_utf8: resolve ScalarSpecialVariable proxy before checking type
- RegexState: save/restore lastMatchWasByteString across scope boundaries

These fixes ensure binary-mode string operations maintain their byte
semantics throughout the parsing pipeline. Fixes Text::CSV t/51_utf8.t
(all 207 tests now pass, was 4 failures) and reduces t/85_util.t
from 92 to 24 failures (remaining are UTF-16/32 encoding layer issues).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Perl Encode::decode silently drops incomplete trailing code units
for fixed-width encodings (UTF-16, UTF-32). Java String(byte[],
Charset) replaces them with U+FFFD replacement characters instead.

This caused Text::CSV t/85_util.t to fail 24 tests when reading
BOM-prefixed UTF-16LE/UTF-32LE files with CR line endings: binary
readline consumed the entire file, CSV_PP header() padded the header
with a null byte for alignment, and the extra U+FFFD in the decoded
string was parsed as a second data row.

Fix: trim input bytes to a multiple of the code unit size (2 for
UTF-16, 4 for UTF-32) before decoding. Applied to decode(),
encoding_decode(), and from_to().

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- RuntimeRegex.java: When s/// result contains wide chars (codepoint > 255),
  upgrade from BYTE_STRING to STRING instead of preserving byte type.
  Fixes re/subst.t regression where e.g. s/a/\x{100}/g on a byte string
  incorrectly kept BYTE_STRING type.

- LayeredIOHandle.java: For non-encoding layers like :crlf, read
  conservatively (bytesToRead = charactersNeeded) to avoid over-consuming
  from the delegate, which made tell() inaccurate. Encoding layers
  (UTF-16/32) still read extra bytes to handle multi-byte characters.
  Fixes io/crlf.t regression.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
All 5 reported regressions for PR #424 investigated:
- re/subst.t: fixed (s/// wide char BYTE_STRING upgrade)
- io/crlf.t: fixed (:crlf read over-consumption)
- re/pat_advanced.t: not a regression (matches master)
- comp/parser_run.t: not a regression (matches master)
- op/anonsub.t: not a regression (pre-existing env issue)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Replace ICU4J's UnicodeSet.toPattern(false) with custom
unicodeSetToJavaPattern() that:

1. Uses \x{XXXX} notation for supplementary characters (U+10000+)
   to avoid Java misinterpreting UTF-16 surrogate pairs in char
   class ranges
2. Escapes # and whitespace characters so patterns work correctly
   when recompiled with Pattern.COMMENTS flag (Java's /x mode

Root cause: When an empty regex // reuses the last successful
pattern with different flags (e.g., adding /x), the pattern is
recompiled with Pattern.COMMENTS. Java's COMMENTS mode treats #
as a comment delimiter even inside character classes, breaking
ranges like [!-#] in the expanded \p{IsPunct} pattern.

This fixes the re/pat_advanced.t crash that killed the test at
test ~1521, preventing 157 remaining tests from running. Now all
1678 tests complete (1316 pass, matching master's test count).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
EOF
)
…dvanced.t

- B.pm: wrap require Sub::Util in eval in _introspect() so that
  Sub::Util loading failures (due to @inc reordering) fall back to
  __ANON__ defaults instead of dying (fixes op/anonsub.t test 9)
- IdentifierParser: format non-ASCII bytes as \xNN (uppercase, no braces)
  inside ${...} contexts to match Perl diagnostic format
  (fixes comp/parser_run.t test 66)
- re/pat_advanced.t: no longer crashes - the unicodeSetToJavaPattern()
  fix from previous commit properly handles supplementary characters

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…ions

Replace the no-op stub with a working implementation that:
- Uses B::Hooks::EndOfScope to register cleanup at end of compilation
- Uses Sub::Util::subname (XS) to detect imported vs local functions
- Removes imported functions from the stash while preserving methods
- Supports -cleanee, -also, -except parameters

This fixes DateTime test t/48rt-115983.t which verifies that
Try::Tiny's catch/try don't leak as callable methods on DateTime
objects. Previously the no-op stub left them in the namespace.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…e::autoclean

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Functions installed from companion packages (e.g. DateTime::PP into
DateTime) via glob assignment are now recognized as intentional methods,
not imports. The heuristic: if the origin package is a sub-package of
the cleanee (DateTime::PP starts with DateTime::), keep it.

This fixes DateTime::_ymd2rd (from DateTime::PP) being incorrectly
cleaned, which caused 'Can't locate object method _ymd2rd' errors.
Try::Tiny imports (try, catch) are still correctly cleaned.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
In Perl 5, utf8::valid() on byte strings (UTF-8 flag off) always
returns true — the bytes are not claiming to be UTF-8, so they are
considered valid. PerlOnJava was incorrectly trying to decode them
as UTF-8, causing false negatives (e.g. chr(0xfa)).

Fixes Text::CSV t/70_rt.t test 444 (52722/52723 subtests now pass).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
1. ${q} ambiguity warning: suppress inside string interpolation
   (matching Perl 5, which only warns in code context)

2. Wide character in print: fix heredoc octet handling to convert
   Unicode chars back to UTF-8 bytes without use utf8, matching
   Perl 5 treatment of source bytes as Latin-1. Skips conversion
   for ISO-8859-1 source files (isByteStringSource).

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@fglock fglock force-pushed the feature/text-csv-support branch from e946d28 to 82a5167 Compare April 4, 2026 19:00
@fglock fglock merged commit a7261e4 into master Apr 4, 2026
2 checks passed
@fglock fglock deleted the feature/text-csv-support branch April 4, 2026 19:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant