Skip to content

WWW::Mechanize CPAN module support#440

Open
fglock wants to merge 17 commits intomasterfrom
feature/www-mechanize-support
Open

WWW::Mechanize CPAN module support#440
fglock wants to merge 17 commits intomasterfrom
feature/www-mechanize-support

Conversation

@fglock
Copy link
Copy Markdown
Owner

@fglock fglock commented Apr 4, 2026

Summary

Comprehensive support for WWW::Mechanize and its dependency chain on PerlOnJava, achieved through 11 phases of fixes across the parser, runtime, IO system, and build configuration.

Key changes

Parser & Language fixes

  • Fix parser rejecting NEWLINE/WHITESPACE after trailing : in package names
  • Fix UNIVERSAL::isa() not recognizing CODE references
  • Fix strict subs to allow trailing :: barewords (package name constants)
  • Fix overload dispatch not setting $AUTOLOAD when resolution falls through to AUTOLOAD
  • Fix labeled blocks as valid targets for unlabeled last/next/redo

HTMLParser (HTML::Parser Java implementation)

  • Add array-ref accumulator handlers and argspec parsing (HTML::TokeParser/PullParser)
  • Fix incomplete tag buffering across parse() chunk boundaries
  • Fix self-closing /> handling: emit as attribute in non-XML mode
  • Add raw text element handling for <script>, <style>, <xmp>, <listing>, <plaintext>, <textarea>, <title>
  • Implement SGML marked_sections: <![CDATA[...]]>, <![IGNORE[...]]>, <![INCLUDE[...]]>
  • Script CDATA-skipping: </script> inside <![CDATA[...]]> not treated as closing tag
  • is_cdata argspec now correctly reflects CDATA text events

IO & Runtime fixes

  • Fix fileno() to return valid fd numbers for all file/pipe handles
  • Bridge findFileHandleByDescriptor() to RuntimeIO fileno registry
  • Add closeIOOnDrop(): close IO handles on undef/reassignment for gensym'd globs
  • Fix getRuntimeIO() fallback to use non-auto-vivifying getExistingGlobalIO()

Build & Dependencies

  • Add Devel::Cycle no-op stub for Test::Memory::Cycle compatibility
  • Bundle LWP/media.types in JAR resources for MIME type detection (.csstext/css, etc.)

Test Results

Non-server tests: ~542/545 (99.8%)
Local server tests: 18/18 (0 timeouts)

Metric Before After
Non-server tests 431/478 (90.2%) ~542/545 (99.8%)
Local server tests 14/139 pass 18/18 (100%)
HTTP::Daemon Not working Fully working (pure Perl)
Capture::Tiny 109/331 fail capture/capture_stdout/capture_stderr/capture_merged all working
find_link_xhtml.t 0/10 10/10
image-parse.t 0/47 47/47
submit_form.t 1/9 9/9
find_link.t 68/70 70/70

Remaining (known limitations)

  • cookies.t: Blocked on JVM fork() — uses open FH, '-|' pattern
  • field.t: 1 TODO subtest — HTML::TokeParser limitation (not a PerlOnJava bug)

See dev/modules/www_mechanize.md for detailed per-phase root cause analysis.

Test plan

  • make passes (no regressions in all 11 phases)
  • HTMLParser: chunked parsing, argspec, marked_sections, CDATA, raw text elements
  • HTML::TreeBuilder parses and renders HTML correctly
  • HTML::Form parse/submit/field access working
  • Capture::Tiny capture functions working
  • HTTP::Daemon serving requests, all local server tests pass
  • IO cleanup: gensym'd socket globs properly closed
  • LWP::MediaTypes MIME detection working (media.types bundled)
  • WWW::Mechanize non-server tests: ~99.8%
  • WWW::Mechanize local server tests: 18/18

Generated with Devin

fglock and others added 17 commits April 4, 2026 22:51
Document 6 PerlOnJava bugs discovered by running jcpan -t WWW::Mechanize:
- Parser rejects NEWLINE after trailing :: (IdentifierParser.java)
- UNIVERSAL::isa() missing CODE reference type (Universal.java)
- HTMLParser array-ref accumulator handlers missing (HTMLParser.java)
- Overload dispatch does not set AUTOLOAD (OverloadContext.java)
- Devel::Cycle stub needed for Test::Memory::Cycle
- Capture::Tiny fork dependency (known limitation)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…l::Cycle

Fix 5 PerlOnJava bugs blocking WWW::Mechanize and its dependency chain:

1. Parser: allow NEWLINE/WHITESPACE/comma after trailing :: in package
   names (e.g. Tie::RefHash:: followed by newline). The validation gate
   in parseSubroutineIdentifier() now permits these tokens, letting the
   loop continue to the top where they are already handled correctly.

2. UNIVERSAL::isa(): add CODE reference type to the switch statement so
   UNIVERSAL::isa(\&sub, 'CODE') correctly returns true. This fixes
   HTML::Element::traverse() which checks isa(callback, 'CODE').

3. HTMLParser: implement array-ref accumulator handlers and argspec
   parsing in fireEvent(). HTML::PullParser/TokeParser registers an
   array ref as the event callback; fireEvent() now builds event data
   per the argspec string and pushes it onto the accumulator. This
   makes HTML::Form::parse() return actual form objects.

4. OverloadContext: set $AUTOLOAD when overload method resolution falls
   through to AUTOLOAD. Fixes URI::WithBase stringification which uses
   overload '""' => 'as_string' with AUTOLOAD delegation.

5. Devel::Cycle: add no-op stub for JVM compatibility (same pattern as
   Test::LeakTrace). The JVM tracing GC handles cycles natively.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
When a STRING (method name) handler is used with argspec containing
"self", the XS behavior is to use "self" as the method invocant only,
not to pass it as an additional argument. This was causing
HTML::TreeBuilder to receive $self as $tag (doubled invocant),
breaking as_HTML() output.

- Added skipSelf parameter to buildEventDataFromArgspec()
- STRING callbacks: skipSelf=true (self is already the invocant)
- CODE/ARRAY callbacks: skipSelf=false (self included in args)

WWW::Mechanize tests: 431/478 pass (90.2%) on non-local tests

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
All phases complete. WWW::Mechanize non-local tests: 431/478 (90.2%).
Document remaining failures and Bug 7 (skipSelf) details.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Three fixes that improve WWW::Mechanize tests from 90.2% to 97.0%:

1. HTMLParser: Buffer incomplete tags across parse() chunk boundaries.
   When HTML::PullParser feeds 512-byte chunks, tags spanning boundaries
   were truncated. Now incomplete start/end tags are buffered for the
   next parse() call, matching Perl HTML::Parser XS behavior.

2. Strict subs: Allow barewords ending with :: (package name constants).
   In Perl, Tie::RefHash:: is always legal even under use strict subs
   and evaluates to the string without trailing ::. Fixed in both JVM
   and bytecode backends.

3. HTMLParser: Self-closing /> handling matches Perl HTML::Parser.
   In non-XML mode, / in /> is now emitted as boolean attribute.
   Synthetic end tags only in xml_mode.

WWW::Mechanize non-local tests: 513/529 pass (97.0%)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
… raw text, media.types

Three fixes improving WWW::Mechanize test pass rate from 97.0% to 98.1%:

1. EmitStatement.java: Make labeled blocks valid targets for unlabeled
   last/next/redo. In Perl, LABEL: { for (...) {} last; } should exit
   the labeled block, not the program. The previous logic incorrectly
   prevented unlabeled control flow from targeting labeled simple blocks.

2. HTMLParser.java: Add raw text element handling for script, style,
   xmp, listing, plaintext, textarea, and title elements. Content
   inside these elements is not parsed for HTML tags, matching Perl
   HTML::Parser behavior.

3. Bundle LWP/media.types data file for MIME type lookups. This file was
   missing from the project, causing LWP::MediaTypes to fail to map file
   extensions to content types.

Test results (non-server tests): 522/532 subtests pass (98.1%)
- upload.t: 3/5 -> 5/5 (labeled block fix)
- image-parse.t: 41/47 -> 41/42 (media.types)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…tation plans

Detailed root cause analysis and step-by-step implementation plans for:
- Phase 8: Fix fileno/dup chain for Capture::Tiny capture* (4 bugs identified)
- Phase 9: Test HTTP::Daemon pure Perl server mode end-to-end

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…::Daemon

Four fixes to the fileno/dup chain that enable Capture::Tiny capture*
functions and HTTP::Daemon pure Perl server mode:

1. RuntimeIO: assign sequential filenos to all file handles (files, pipes)
   on open, not just sockets. fileno() previously returned undef for
   regular files, breaking Capture::Tiny's dup-by-fd-number pattern.

2. IOOperator: bridge findFileHandleByDescriptor() to RuntimeIO's fileno
   registry. The dup-by-fd path (open *STDOUT, ">&3") couldn't find
   handles because it only checked its own empty map + hardcoded 0/1/2.

3. RuntimeIO: guard empty filename in dup mode. When fileno() returned
   undef, ">&" . undef became ">&" (empty fd), which silently replaced
   STDOUT with a stdin reader instead of reporting an error.

4. RuntimeIO: unregister fileno on close() to prevent fd number leaks.

5. IOOperator: assign fileno to duplicated handles so they can also be
   found by dup-by-fd.

Results:
- Capture::Tiny capture/capture_stdout/capture_stderr/capture_merged: WORKING
- HTTP::Daemon new/accept/get_request/send_response: WORKING (pure Perl)
- WWW::Mechanize non-server tests: 529/532 (99.4%, up from 522/532)
- WWW::Mechanize local server tests: 17/19 pass (0 failures, 2 timeouts)
- dump.t: 1/7 -> 7/7, mech-dump/file_not_found.t: 0/1 -> 1/1

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…::Daemon)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Since jperl doesn't implement DESTROY or reference counting, IO handles
on gensym'd globs (used by IO::Socket, HTTP::Daemon, etc.) were never
closed when variables went out of scope or were explicitly undef'd.

This caused HTTP clients like LWP to hang waiting for the server to
close connection sockets on 302 redirects.

Changes:
- RuntimeScalar: Add closeIOOnDrop() called from undefine() and
  setLarge() to close IO handles when a GLOBREFERENCE is dropped
- RuntimeScalar: Only close IO for globs NOT in the stash (checked
  via existsGlobalIO) to avoid closing named handles like *MYFILE
- RuntimeIO: Fix getRuntimeIO() fallback to use getExistingGlobalIO()
  instead of getGlobalIO() to prevent auto-vivifying stash entries
  for deleted gensym globs
- GlobalVariable: Add getExistingGlobalIO() non-auto-vivifying lookup

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
- All 18 local server tests now pass with 0 timeouts
- back.t: 47/47, get.t: 34/34 (previously had server-exit timeouts)
- Document remaining issues: CDATA, CSS url, cookies.t (fork)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
HTMLParser: implement <![CDATA[...]]>, <![IGNORE[...]]>, <![INCLUDE[...]]>
parsing when marked_sections or xml_mode is enabled. CDATA sections emit
text events with is_cdata=true. Script/style raw content handlers skip
CDATA sections when looking for closing tags (marked_sections only).
When marked_sections is off, <![...> is treated as a bogus comment per
Perl 5 HTML::Parser behavior.

Build: include LWP/media.types in JAR resources so LWP::MediaTypes can
map file extensions to MIME types (e.g. .css → text/css). This fixes
WWW::Mechanize CSS image extraction from standalone .css files.

Fixes: find_link_xhtml.t (10/10), image-parse.t (47/47)

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Root cause: Capture::Tiny's _copy_std() saves STDOUT/STDERR handles
by dup'ing them into a hash via a reused loop variable:
  $h = IO::Handle->new(); open($h, ">&STDOUT"); $old{stdout} = $h;
  $h = IO::Handle->new(); open($h, ">&STDERR"); $old{stderr} = $h;

When $h was reassigned, setLarge() called closeIOOnDrop(), which saw
the gensym'd glob was no longer in any stash and closed its IO. This
set ioHandle=ClosedIOHandle on the RuntimeIO still referenced by
$old{stdout}, causing fileno() to return undef and the later restore
to fail with "Bad file descriptor".

Additionally, the fd recycling mechanism (ConcurrentLinkedQueue) was
unsafe: the closed RuntimeIO's fd was added to the free queue, and
the next assignFileno() reused it, causing two handles to share the
same fd number.

Changes:
- RuntimeScalar.setLarge(): Remove closeIOOnDrop() from non-null
  assignment path. Without reference counting we cannot know if other
  variables still reference the same glob. Keep it in undefine() and
  setLarge(null) where explicit cleanup is intended.
- RuntimeIO: Remove fd recycling queue (freedFilenos). Fd numbers are
  now monotonically increasing and never reused. This is safe because
  they are virtual (not OS fds) and only consume map entries.
- Add dev/design/io_handle_lifecycle.md documenting the full analysis,
  design decisions, and future improvement options.

See: dev/design/io_handle_lifecycle.md

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant