
feat(tooling): detect and mark out-of-gas-by-design tests#2941

Draft
leolara wants to merge 9 commits into ethereum:forks/amsterdam from leolara:leolara/detect-oog-by-design-fillers

Conversation

leolara (Member) commented May 29, 2026

Important

🛑 Stacked on #2906 — review and merge this PR only after #2906.

This branch is built on top of #2906 (leolara/mark-gas-checking-tests) and
reuses its scripts/mark_tests.py. Until #2906 merges into
forks/amsterdam, the diff below also contains #2906's commits.
Once #2906 is merged and this branch is rebased onto the updated base, the
diff collapses to just the out-of-gas detection commit. Please do not merge
before #2906.

🗒️ Description

Adds tooling to find tests that run out of gas by design — at the top-level
transaction or any sub-call depth — and mark them, complementing #2906's
gas-assertion detection.

scripts/detect_oog_by_design.py scans EELS EIP-3155 traces under an
--evm-dump-dir and flags every test whose execution ran out of gas. The
signal is an exact match on the trace error field, error == "OutOfGasError"
(the EELS exception class name written at
src/ethereum_spec_tools/evm_tools/t8n/evm_trace/eip3155.py), which is uniform
across every fork and distinct from all other halt classes. This gives zero
false positives against the other all-gas-consuming halts
(StackOverflowError, StackUnderflowError, InvalidOpcode, Revert). Fill
success is the "by design" filter: if a traced test filled successfully, any OOG
in its trace matched the author's declared post-state.
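
As a rough illustration of the exact-equality signal described above, the sketch below scans trace files for a step whose error field equals "OutOfGasError". It is not the code in scripts/detect_oog_by_design.py: the `*.jsonl` file pattern, the one-JSON-object-per-line layout, and the function names `trace_has_oog` and `scan_dump_dir` are assumptions made for the example.

```python
import json
from pathlib import Path
from typing import Iterable, List

# EELS exception class name, as stated in the description above.
OOG_ERROR = "OutOfGasError"


def trace_has_oog(lines: Iterable[str]) -> bool:
    """Return True if any trace entry has error == "OutOfGasError" exactly."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of guessing
        if isinstance(entry, dict) and entry.get("error") == OOG_ERROR:
            return True
    return False


def scan_dump_dir(dump_dir: Path) -> List[Path]:
    """List trace files under dump_dir that contain an out-of-gas step.

    Assumption: traces are stored as ``*.jsonl`` files, one JSON object
    per line. The real dump-dir layout may differ.
    """
    hits = []
    for path in sorted(dump_dir.rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            if trace_has_oog(handle):
                hits.append(path)
    return hits
```

Because the comparison is string equality rather than a substring or case-insensitive match, entries carrying other error names, such as StackOverflowError, are never counted.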

The output is the mark_tests.py wrapped form {marker, nodeids}, so the two
scripts chain directly. The lossy sanitized dump-dir names are reconstructed
into pytest node IDs and verified against the source AST (re-adding the
test_ prefix and locating the function, including any enclosing class) before
being emitted; names that cannot be verified are reported rather than guessed —
and mark_tests.py independently re-checks every node ID against the
filesystem, so a wrong ID can never silently mis-mark a test.

Also adds:

  • a mark_oog_tests just recipe that runs detect + mark against
    pre-produced traces. The heavy traced fill is documented in the recipe
    header rather than run inline, so the fork and worker count stay in the
    reviewer's hands (traces are large and memory-hungry; a single fork suffices
    because OOG-by-design is fork-invariant in practice);
  • scripts/fill_for_oog_detection.sh to produce the traces detached;
  • oog_detector_validation_report.md recording the validation: 6 positive
    variants detected (top-level, sub-call depth 2–4, ooge/oogm, CREATE-init,
    multi-depth-in-one-tx), 5 non-OOG halts correctly ignored, a 1:1 match with
    ground truth read straight from the traces.

The detector is stdlib-only and does not import execution_testing, so it
is unaffected by the substring-match limitation in that package's
_is_out_of_gas_error helper (it tests for "out of gas", which never matches
the EELS class name OutOfGasError).
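
To make the limitation concrete, the snippet below contrasts a substring test for "out of gas" with the exact class-name comparison. `substring_match` is a hypothetical stand-in written for this example; it is not the actual body of `_is_out_of_gas_error`, and whether that helper lowercases its input is an assumption.

```python
EELS_CLASS_NAME = "OutOfGasError"


def substring_match(error: str) -> bool:
    """Hypothetical stand-in for a check that looks for "out of gas"."""
    return "out of gas" in error.lower()


def exact_match(error: str) -> bool:
    """Exact comparison against the EELS exception class name."""
    return error == EELS_CLASS_NAME


# The class name contains no spaces, so the substring test cannot match it.
print(substring_match("OutOfGasError"))  # False
print(exact_match("OutOfGasError"))      # True
```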

🔗 Related Issues or PRs

Depends on and must merge after #2906.

✅ Checklist

  • All: Ran fast static checks to avoid unnecessary CI fails, see also Code Standards and Enabling Pre-commit Checks:
    just static
  • All: PR title adheres to the repo standard - it will be used as the squash commit message and should start with type(scope):.
  • All: Considered updating the online docs in the ./docs/ directory.
  • All: Set appropriate labels for the changes (only maintainers can apply labels).


leolara added 8 commits May 23, 2026 11:02
Adds a pytest plugin enabled with `--detect-gas-checks` (fill mode only)
that identifies tests asserting specific gas values. The plugin uses two
detection modes, chosen per sink:

- **Storage in post** (`post[addr].storage[slot]`) — taint propagation.
  Wraps `Bytecode.gas_cost`, per-fork `gas_costs` /
  `opcode_gas_calculator` / `transaction_intrinsic_cost_calculator`, and
  patches `Number.__new__` + `FixedSizeHexNumber.__new__` so a
  `GasTainted(int)` carrier survives `HashInt` / `HexNumber`
  construction. Tests whose post-storage holds a tainted value are
  flagged.
- **Receipt / header / block-level / benchmark fields** — field-presence.
  The field name (`cumulative_gas_used`, `expected_gas_used`,
  `blob_gas_used`, etc.) is itself the assertion signal; if set, flag.
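
The taint-propagation idea in the first bullet can be sketched with a small `int` subclass. This is an illustration only: the real `GasTainted` carrier in the plugin covers more operators and is installed by patching constructors, and the `origins` attribute and `_combine` helper here are assumptions for the example.

```python
class GasTainted(int):
    """Illustrative int subclass that remembers where a gas value came from."""

    def __new__(cls, value, origins=()):
        obj = super().__new__(cls, value)
        obj.origins = frozenset(origins)
        return obj

    def _combine(self, other, result):
        # Forward NotImplemented so Python can try the other operand,
        # e.g. bytes.__rmul__ for ``GasTainted(n) * b"..."``.
        if result is NotImplemented:
            return result
        other_origins = getattr(other, "origins", frozenset())
        return GasTainted(result, self.origins | other_origins)

    def __add__(self, other):
        return self._combine(other, int.__add__(self, other))

    def __radd__(self, other):
        return self._combine(other, int.__radd__(self, other))

    def __mul__(self, other):
        return self._combine(other, int.__mul__(self, other))

    def __rmul__(self, other):
        return self._combine(other, int.__rmul__(self, other))


intrinsic = GasTainted(21000, {"intrinsic"})
total = intrinsic + GasTainted(3, {"opcode"}) + 5
print(int(total), sorted(total.origins))  # 21008 ['intrinsic', 'opcode']
```

A value derived from a tainted input stays tainted, so a post-storage slot that ends up holding such a value can be recognised as a gas assertion.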

Tests with `tx.error`, `block_exception`, or the `exception_test` marker
are excluded so OOG-expecting tests don't pollute the output. Worker
results are aggregated to the master via `workeroutput` /
`pytest_testnodedown`.
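
A rough sketch of the worker-to-master aggregation pattern mentioned above, using the pytest-xdist `workeroutput` dictionary and the `pytest_testnodedown` hook. The class name `GasCheckAggregator` and the `"gas_check_hits"` key are hypothetical, and the plugin's actual data layout may differ.

```python
class GasCheckAggregator:
    """Illustrative plugin object collecting per-test hits across xdist workers."""

    def __init__(self):
        # Maps node ID -> list of hit kinds; kept to plain builtins so the
        # data can be sent from a worker to the controller.
        self.hits = {}

    def pytest_sessionfinish(self, session):
        # ``workeroutput`` only exists on the config of an xdist worker.
        workeroutput = getattr(session.config, "workeroutput", None)
        if workeroutput is not None:
            workeroutput["gas_check_hits"] = self.hits

    def pytest_testnodedown(self, node, error):
        # Called on the controller when a worker shuts down.
        worker_output = getattr(node, "workeroutput", {})
        self.hits.update(worker_output.get("gas_check_hits", {}))
```

With this shape, a run without xdist needs no special handling: the hits simply stay in the single process that collected them.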

`Block.expected_gas_used` and `BaseTest.expected_benchmark_gas_used`
change from `int` to `HexNumber`. Both are internal assertion-only and
not serialised to fixtures, so the type change has no fixture impact and
brings them under the same constructor path as the other gas sinks.
The benchmark validator widens a local var type to accommodate the
union.

Also adds `scripts/mark_tests.py`, an idempotent script that applies a
configurable `@pytest.mark.<name>` decorator to test functions listed in
a JSON file. Accepts three input shapes (bare mapping, bare list,
wrapped with embedded marker name); `--marker` overrides any embedded
value.

Applies `@pytest.mark.gas_check` to the 54 test functions identified by
running `fill --detect-gas-checks` on the full suite. Produced by
`scripts/mark_tests.py gas_check_report.json`; re-running the script is
idempotent.

Affects 31 files across 16 EIP directories; one decorator each, inserted
at the top of the existing decorator stack.

38 tests covering:

- ``GasTainted`` carrier: construction, arithmetic propagation (incl.
  reflected ops), origin union/dedup, and the NotImplemented forward
  path that fixes ``bytes_literal * GasTainted(n)``.
- ``install_taint`` / ``uninstall_taint``: taint reaches Bytecode and
  fork gas-cost outputs, survives ``HashInt`` / ``HexNumber`` /
  ``ZeroPaddedHexNumber`` and Pydantic ``Storage`` validation, doesn't
  leak onto plain ints, and uninstall reverts.
- ``collect_taint_hits``: storage taint detection, presence-based
  receipt / header / block-level / benchmark detection, the
  BenchmarkTest MRO gate that prevents filler-default false positives,
  and the OOG exclusions (tx.error, block_exception, exception_test).

Replace ``Any`` and ``getattr``-based field access in the sink walker
with direct attribute access against ``BaseTest`` / ``StateTest`` /
``BlockchainTest`` / ``BenchmarkTest`` / ``Alloc`` / ``Transaction`` /
``Block``. Spec-type-specific access (``test.tx``, ``test.blocks``) now
dispatches via ``isinstance`` rather than ``getattr(test, "...", None)``,
so a future rename of ``post`` / ``tx`` / ``expected_receipt`` / etc.
fails type-checking instead of silently returning empty hits.

Also tightens the few remaining ``Any``s where possible: the wrapped
opcode-gas calculator is now typed as ``Callable[[OpcodeBase], int]``,
and ``pytest_testnodedown`` uses ``xdist.workermanage.WorkerController``
(stub added) and ``object | None``, matching the upstream xdist hook.

The unit tests built with ``SimpleNamespace`` fakes no longer satisfy
the new typed signature and will be replaced in a follow-up commit.

Replace the SimpleNamespace-based fakes in the walker tests with real
``StateTest`` / ``BlockchainTest`` / ``BenchmarkTest`` Pydantic instances
so that field renames or retypings on the spec models fail the unit
tests instead of silently masking them.

The storage and mixed-sink tests now use the ``taint_installed`` fixture
because Pydantic ``Storage`` validation strips the ``GasTainted``
subclass without the ``FixedSizeHexNumber.__new__`` patch.

Carrier tests (``TestGasTaintedCarrier``) and install tests
(``TestTaintInstallation``) are unchanged — they already exercised the
real types.

Add four end-to-end tests that load the real fill plugin chain into a
pytester subprocess and exercise the bits the unit tests deliberately
skip:

- ``--detect-gas-checks`` / ``--gas-check-report`` appear in
  ``fill --help``.
- A synthetic test using ``CodeGasMeasure`` is detected and recorded as
  a ``storage`` hit with gas-derived origins.
- A synthetic test marked ``@pytest.mark.exception_test`` is excluded
  from the report.
- Without the flag, no report file is produced.

Co-located with ``test_benchmarking.py`` and added to ``--ignore`` in
``test-tests`` / ``test-tests-pypy`` so unit-test runs stay fast; runs
under ``test-tests-bench`` alongside the existing pytester suite. Needs
``EVM_BIN`` (defaults to ``evm``) on the runner.

Add a negative integration test that exercises a synthetic state test
which stores a literal value (no ``CodeGasMeasure``, no
``expected_receipt``, no header verify) and asserts the resulting
report is empty. Guards against false positives: a test that simply
runs successfully without touching any of the detector's triggers must
not appear in the JSON output.
…ipes

- ``just mark_gas_tests <path-or-selector>`` runs ``fill
  --detect-gas-checks`` against the given target, then pipes the report
  through ``scripts/mark_tests.py`` to apply ``@pytest.mark.gas_check``
  to the detected tests. Accepts any pytest path/selector, so the
  workflow can be scoped to a single file or function.

- ``just mark_gas_tests_test`` runs the pytester end-to-end tests for
  the gas_taint plugin in their own ``[gas check]`` group, instead of
  piggybacking on ``test-tests-bench`` (which is for benchmark
  framework tests, not plugin e2e).

@leolara leolara force-pushed the leolara/detect-oog-by-design-fillers branch from 6f580d8 to f83783c Compare May 29, 2026 11:20

codecov (Bot) commented May 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.49%. Comparing base (55d774b) to head (1ab1fd0).
⚠️ Report is 15 commits behind head on forks/amsterdam.

Additional details and impacted files
@@                 Coverage Diff                 @@
##           forks/amsterdam    #2941      +/-   ##
===================================================
+ Coverage            90.44%   90.49%   +0.05%     
===================================================
  Files                  535      535              
  Lines                32439    32430       -9     
  Branches              3012     3012              
===================================================
+ Hits                 29338    29349      +11     
+ Misses                2573     2563      -10     
+ Partials               528      518      -10     
Flag Coverage Δ
unittests 90.49% <ø> (+0.05%) ⬆️


Add scripts/detect_oog_by_design.py, which scans EELS EIP-3155 traces
under an --evm-dump-dir and flags every test whose execution ran out of
gas by design -- at the top-level transaction or any sub-call depth.
The signal is exact equality on the trace error field == "OutOfGasError"
(the EELS exception class name), giving zero false positives against
other all-gas-consuming halts (StackOverflowError, StackUnderflowError,
InvalidOpcode, Revert). Fill success is the "by design" filter.

Output is the mark_tests.py wrapped form {marker, nodeids}, so the two
chain directly. Sanitized dump-dir names are reconstructed into pytest
node IDs and verified against the source AST (re-adding the test_ prefix
and locating the function, including any enclosing class) before being
emitted; unverifiable names are reported rather than guessed.

Add a `mark_oog_tests` just recipe (detect + mark from pre-produced
traces; the heavy traced fill is documented in the recipe header rather
than run inline, so the fork and worker count stay in the user's hands)
and scripts/fill_for_oog_detection.sh to produce the traces detached.
Register the `out_of_gas` marker so the applied decorator is recognised.

Validated end-to-end against EELS traces: 6 positive OOG variants
detected (top-level, sub-call depth 2-4, ooge/oogm, CREATE-init,
multi-depth in one tx), 5 non-OOG halts correctly ignored, a 1:1 match
with ground truth.

@leolara leolara force-pushed the leolara/detect-oog-by-design-fillers branch from f83783c to 1ab1fd0 Compare June 1, 2026 12:25

leolara (Member, Author) commented Jun 1, 2026

How this was validated

Filled on Osaka with --traces (one test per fill, single worker) and ran the detector over the resulting traces. Ground truth was read straight from each test's trace error field.

Positive — detected (the OOG variant each exercises):

| Test (tests/ported_static/…) | Variant |
| --- | --- |
| stRevertTest/test_revert_prefound_oog | top-level OOG (memory expansion) |
| stTransactionTest/test_contract_store_clears_oog | top-level OOG (SSTORE) |
| stLogTests/test_log_in_oog_call | sub-call OOG, depth 2 |
| stCallCodes/test_callcallcallcode_001_oogm_before | nested depth 3, out-of-gas-memory |
| stCallCodes/test_callcallcall_000_ooge | nested depth 4, out-of-gas-execution |
| stRevertTest/test_revert_depth_create_oog | CREATE-init; OOG at both depth 1 and 2 in one tx |

Negative — correctly not flagged (each is a different all-gas-consuming halt or a non-OOG outcome):

| Test (tests/ported_static/…) | Trace error |
| --- | --- |
| vmTests/test_block_info | (none, success) |
| stStackTests/test_stack_overflow | StackOverflowError |
| stSystemOperationsTest/test_create_with_invalid_opcode | StackUnderflowError |
| stRevertTest/test_revert_opcode_in_create_returns | Revert |
| stRevertTest/test_loop_calls_then_revert | (REVERT handled in-frame) |

The detector flagged exactly the 6 tests whose traces contain OutOfGasError and none of the 5 others — a 1:1 match with ground truth.
