Skip to content

Fix infinite entity expansion by detecting circular references#312

Open
naitoh wants to merge 3 commits intoruby:masterfrom
naitoh:fix_detecting_circular_references
Open

Fix infinite entity expansion by detecting circular references#312
naitoh wants to merge 3 commits intoruby:masterfrom
naitoh:fix_detecting_circular_references

Conversation

@naitoh
Copy link
Copy Markdown
Contributor

@naitoh naitoh commented May 5, 2026

Why?

Fix an issue where a stack level too deep (SystemStackError) error caused by a circular reference was not detected as an appropriate error.

REXML::Parsers::StreamParser

require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

source = <<~XML
<!DOCTYPE root [
  <!ENTITY x "&x;">
]>
<root>&x;</root>
XML

listener = Class.new { include REXML::StreamListener }.new
REXML::Parsers::StreamParser.new(source, listener).parse
  • before
lib/rexml/parsers/baseparser.rb:544:in 'REXML::Parsers::BaseParser#entity': stack level too deep (SystemStackError)
  • after
lib/rexml/parsers/baseparser.rb:545:in 'REXML::Parsers::BaseParser#entity': Detected an entity reference loop: x (RuntimeError)

REXML::Document

require 'rexml'

source = <<~XML
<!DOCTYPE root [
  <!ENTITY x "&x;">
]>
<root>&x;</root>
XML

REXML::Document.new(source).root.text
  • before
lib/rexml/text.rb:390:in 'String#gsub': stack level too deep (SystemStackError)
  • after
lib/rexml/entity.rb:80:in 'REXML::Entity#unnormalized': Detected an entity reference loop: x (RuntimeError)

Benchmark

$ benchmark-driver benchmark/parse_doctype.yaml
Calculating -------------------------------------
                             before       after  before(YJIT)  after(YJIT)
      single_entity(dom)    22.381k     24.492k        5.098k       5.211k i/s -     100.000 times in 0.004468s 0.004083s 0.019615s 0.019189s
   single_entity(stream)    46.882k     48.054k        8.796k       8.691k i/s -     100.000 times in 0.002133s 0.002081s 0.011369s 0.011506s
   chained_entities(dom)     2.665k      2.531k        2.349k       2.340k i/s -     100.000 times in 0.037520s 0.039509s 0.042574s 0.042726s
chained_entities(stream)     2.455k      2.387k        2.351k       2.290k i/s -     100.000 times in 0.040732s 0.041888s 0.042538s 0.043660s
      many_entities(dom)    675.927     679.094       684.308      674.686 i/s -     100.000 times in 0.147945s 0.147255s 0.146133s 0.148217s
   many_entities(stream)     2.602k      2.501k        2.403k       2.276k i/s -     100.000 times in 0.038427s 0.039986s 0.041607s 0.043944s
    repeated_entity(dom)    419.090     414.295       445.286      444.575 i/s -     100.000 times in 0.238612s 0.241374s 0.224575s 0.224934s
 repeated_entity(stream)     1.449k      1.401k        1.590k       1.582k i/s -     100.000 times in 0.069014s 0.071358s 0.062898s 0.063201s

Comparison:
                   single_entity(dom)
                   after:     24491.8 i/s
                  before:     22381.4 i/s - 1.09x  slower
             after(YJIT):      5211.3 i/s - 4.70x  slower
            before(YJIT):      5098.1 i/s - 4.80x  slower

                single_entity(stream)
                   after:     48053.8 i/s
                  before:     46882.3 i/s - 1.02x  slower
            before(YJIT):      8795.8 i/s - 5.46x  slower
             after(YJIT):      8691.1 i/s - 5.53x  slower

                chained_entities(dom)
                  before:      2665.2 i/s
                   after:      2531.1 i/s - 1.05x  slower
            before(YJIT):      2348.9 i/s - 1.13x  slower
             after(YJIT):      2340.5 i/s - 1.14x  slower

             chained_entities(stream)
                  before:      2455.1 i/s
                   after:      2387.3 i/s - 1.03x  slower
            before(YJIT):      2350.8 i/s - 1.04x  slower
             after(YJIT):      2290.4 i/s - 1.07x  slower

                   many_entities(dom)
            before(YJIT):       684.3 i/s
                   after:       679.1 i/s - 1.01x  slower
                  before:       675.9 i/s - 1.01x  slower
             after(YJIT):       674.7 i/s - 1.01x  slower

                many_entities(stream)
                  before:      2602.3 i/s
                   after:      2500.9 i/s - 1.04x  slower
            before(YJIT):      2403.4 i/s - 1.08x  slower
             after(YJIT):      2275.6 i/s - 1.14x  slower

                 repeated_entity(dom)
            before(YJIT):       445.3 i/s
             after(YJIT):       444.6 i/s - 1.00x  slower
                  before:       419.1 i/s - 1.06x  slower
                   after:       414.3 i/s - 1.07x  slower

              repeated_entity(stream)
            before(YJIT):      1589.9 i/s
             after(YJIT):      1582.3 i/s - 1.00x  slower
                  before:      1449.0 i/s - 1.10x  slower
                   after:      1401.4 i/s - 1.13x  slower
  • YJIT=ON : 0.95 - 1.09x faster
  • YJIT=OFF : 0.94 - 1.02x faster

@naitoh naitoh requested a review from kou May 5, 2026 12:37
@kou kou requested a review from Copilot May 6, 2026 03:28
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds circular entity-reference detection to prevent unbounded recursive expansion (previously resulting in SystemStackError) during streaming-style parsing.

Changes:

  • Track the current entity-expansion chain in REXML::Parsers::BaseParser and raise a clear error when a loop is detected.
  • Extend BaseParser#unnormalize/#entity to thread the expansion state through nested expansions.
  • Add regression tests covering direct/indirect circular references (including a long entity value) via StreamParser.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
lib/rexml/parsers/baseparser.rb Adds loop detection during entity expansion by propagating an “currently expanding” set through recursive unnormalization.
test/test_entity.rb Adds regression tests ensuring circular entity references raise an appropriate error instead of stack overflow.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread lib/rexml/parsers/baseparser.rb Outdated
Comment thread lib/rexml/parsers/baseparser.rb Outdated
@kou
Copy link
Copy Markdown
Member

kou commented May 9, 2026

Is this check heavy?

naitoh added 3 commits May 10, 2026 10:30
…ML::Parsers::BaseParser

## Why?
Fix an issue where a `stack level too deep (SystemStackError)` error caused by a circular reference was not detected as an appropriate error.

```
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

source = <<~XML
<!DOCTYPE root [
  <!ENTITY x "&x;">
]>
<root>&x;</root>
XML

listener = Class.new { include REXML::StreamListener }.new
REXML::Parsers::StreamParser.new(source, listener).parse
```

- before
```
lib/rexml/parsers/baseparser.rb:544:in 'REXML::Parsers::BaseParser#entity': stack level too deep (SystemStackError)
```
- after
```
lib/rexml/parsers/baseparser.rb:545:in 'REXML::Parsers::BaseParser#entity': Detected an entity reference loop: x (RuntimeError)
```
…ML::Document

## Why?
Fix an issue where a `stack level too deep (SystemStackError)` error caused by a circular reference was not detected as an appropriate error.

```
require 'rexml'

source = <<~XML
<!DOCTYPE root [
  <!ENTITY x "&x;">
]>
<root>&x;</root>
XML

REXML::Document.new(source).root.text
```

- before
```
lib/rexml/text.rb:390:in 'String#gsub': stack level too deep (SystemStackError)
```
- after
```
lib/rexml/entity.rb:80:in 'REXML::Entity#unnormalized': Detected an entity reference loop: x (RuntimeError)
```
## Benchmark
```
$ benchmark-driver benchmark/parse_doctype.yaml
Calculating -------------------------------------
                             before       after  before(YJIT)  after(YJIT)
      single_entity(dom)    22.381k     24.492k        5.098k       5.211k i/s -     100.000 times in 0.004468s 0.004083s 0.019615s 0.019189s
   single_entity(stream)    46.882k     48.054k        8.796k       8.691k i/s -     100.000 times in 0.002133s 0.002081s 0.011369s 0.011506s
   chained_entities(dom)     2.665k      2.531k        2.349k       2.340k i/s -     100.000 times in 0.037520s 0.039509s 0.042574s 0.042726s
chained_entities(stream)     2.455k      2.387k        2.351k       2.290k i/s -     100.000 times in 0.040732s 0.041888s 0.042538s 0.043660s
      many_entities(dom)    675.927     679.094       684.308      674.686 i/s -     100.000 times in 0.147945s 0.147255s 0.146133s 0.148217s
   many_entities(stream)     2.602k      2.501k        2.403k       2.276k i/s -     100.000 times in 0.038427s 0.039986s 0.041607s 0.043944s
    repeated_entity(dom)    419.090     414.295       445.286      444.575 i/s -     100.000 times in 0.238612s 0.241374s 0.224575s 0.224934s
 repeated_entity(stream)     1.449k      1.401k        1.590k       1.582k i/s -     100.000 times in 0.069014s 0.071358s 0.062898s 0.063201s

Comparison:
                   single_entity(dom)
                   after:     24491.8 i/s
                  before:     22381.4 i/s - 1.09x  slower
             after(YJIT):      5211.3 i/s - 4.70x  slower
            before(YJIT):      5098.1 i/s - 4.80x  slower

                single_entity(stream)
                   after:     48053.8 i/s
                  before:     46882.3 i/s - 1.02x  slower
            before(YJIT):      8795.8 i/s - 5.46x  slower
             after(YJIT):      8691.1 i/s - 5.53x  slower

                chained_entities(dom)
                  before:      2665.2 i/s
                   after:      2531.1 i/s - 1.05x  slower
            before(YJIT):      2348.9 i/s - 1.13x  slower
             after(YJIT):      2340.5 i/s - 1.14x  slower

             chained_entities(stream)
                  before:      2455.1 i/s
                   after:      2387.3 i/s - 1.03x  slower
            before(YJIT):      2350.8 i/s - 1.04x  slower
             after(YJIT):      2290.4 i/s - 1.07x  slower

                   many_entities(dom)
            before(YJIT):       684.3 i/s
                   after:       679.1 i/s - 1.01x  slower
                  before:       675.9 i/s - 1.01x  slower
             after(YJIT):       674.7 i/s - 1.01x  slower

                many_entities(stream)
                  before:      2602.3 i/s
                   after:      2500.9 i/s - 1.04x  slower
            before(YJIT):      2403.4 i/s - 1.08x  slower
             after(YJIT):      2275.6 i/s - 1.14x  slower

                 repeated_entity(dom)
            before(YJIT):       445.3 i/s
             after(YJIT):       444.6 i/s - 1.00x  slower
                  before:       419.1 i/s - 1.06x  slower
                   after:       414.3 i/s - 1.07x  slower

              repeated_entity(stream)
            before(YJIT):      1589.9 i/s
             after(YJIT):      1582.3 i/s - 1.00x  slower
                  before:      1449.0 i/s - 1.10x  slower
                   after:      1401.4 i/s - 1.13x  slower
```
- YJIT=ON : 0.95 - 1.09x faster
- YJIT=OFF : 0.94 - 1.02x faster
@naitoh naitoh force-pushed the fix_detecting_circular_references branch from 19dd4a9 to 26545fc Compare May 10, 2026 02:33
@naitoh
Copy link
Copy Markdown
Contributor Author

naitoh commented May 10, 2026

@kou

Is this check heavy?

I fixed an issue where unnecessary memory copies were occurring.
We've added some benchmarks, so please take a look.

I believe this falls within an acceptable range in terms of performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants