feat(tooling): detect and mark out-of-gas-by-design tests by leolara · Pull Request #2941 · ethereum/execution-specs

leolara · 2026-05-29T11:03:05Z

Important

🛑 Stacked on #2906 — review and merge this PR only after #2906.

This branch is built on top of #2906 (leolara/mark-gas-checking-tests) and
reuses its scripts/mark_tests.py. Until #2906 merges into
forks/amsterdam, the diff below also contains #2906's commits. Once #2906
is merged and this branch is rebased onto the updated base, the diff collapses
to just the out-of-gas detection commit. Please do not merge before #2906.

🗒️ Description

Adds tooling to find tests that run out of gas by design — at the top-level
transaction or any sub-call depth — and mark them, complementing #2906's
gas-assertion detection.

scripts/detect_oog_by_design.py scans EELS EIP-3155 traces under an
--evm-dump-dir and flags every test whose execution ran out of gas. The
signal is exact equality on the trace error field == "OutOfGasError" (the
EELS exception class name written at
src/ethereum_spec_tools/evm_tools/t8n/evm_trace/eip3155.py), which is uniform
across every fork and distinct from all other halt classes. This gives zero
false positives against the other all-gas-consuming halts
(StackOverflowError, StackUnderflowError, InvalidOpcode, Revert). Fill
success is the "by design" filter: if a traced test filled successfully, any OOG
in its trace matched the author's declared post-state.

The output is the mark_tests.py wrapped form {marker, nodeids}, so the two
scripts chain directly. The lossy sanitized dump-dir names are reconstructed
into pytest node IDs and verified against the source AST (re-adding the
test_ prefix and locating the function, including any enclosing class) before
being emitted; names that cannot be verified are reported rather than guessed —
and mark_tests.py independently re-checks every node ID against the
filesystem, so a wrong ID can never silently mis-mark a test.

Also adds:

a mark_oog_tests just recipe that runs detect + mark against
pre-produced traces. The heavy traced fill is documented in the recipe
header rather than run inline, so the fork and worker count stay in the
reviewer's hands (traces are large and memory-hungry; a single fork suffices
because OOG-by-design is fork-invariant in practice);
scripts/fill_for_oog_detection.sh to produce the traces detached;
oog_detector_validation_report.md recording the validation: 6 positive
variants detected (top-level, sub-call depth 2–4, ooge/oogm, CREATE-init,
multi-depth-in-one-tx), 5 non-OOG halts correctly ignored, a 1:1 match with
ground truth read straight from the traces.

The detector is stdlib-only and does not import execution_testing, so it
is unaffected by the substring-match limitation in that package's
_is_out_of_gas_error helper (it tests for "out of gas", which never matches
the EELS class name OutOfGasError).

🔗 Related Issues or PRs

Depends on and must merge after #2906.

✅ Checklist

All: Ran fast static checks to avoid unnecessary CI fails, see also Code Standards and Enabling Pre-commit Checks:
```
just static
```
All: PR title adheres to the repo standard - it will be used as the squash commit message and should start type(scope):.
All: Considered updating the online docs in the ./docs/ directory.
All: Set appropriate labels for the changes (only maintainers can apply labels).

Cute Animal Picture

Adds a pytest plugin enabled with `--detect-gas-checks` (fill mode only) that identifies tests asserting specific gas values. The plugin uses two detection modes, chosen per sink: - **Storage in post** (`post[addr].storage[slot]`) — taint propagation. Wraps `Bytecode.gas_cost`, per-fork `gas_costs` / `opcode_gas_calculator` / `transaction_intrinsic_cost_calculator`, and patches `Number.__new__` + `FixedSizeHexNumber.__new__` so a `GasTainted(int)` carrier survives `HashInt` / `HexNumber` construction. Tests whose post-storage holds a tainted value are flagged. - **Receipt / header / block-level / benchmark fields** — field-presence. The field name (`cumulative_gas_used`, `expected_gas_used`, `blob_gas_used`, etc.) is itself the assertion signal; if set, flag. Tests with `tx.error`, `block_exception`, or the `exception_test` marker are excluded so OOG-expecting tests don't pollute the output. Worker results are aggregated to the master via `workeroutput` / `pytest_testnodedown`. `Block.expected_gas_used` and `BaseTest.expected_benchmark_gas_used` change from `int` to `HexNumber`. Both are internal assertion-only and not serialised to fixtures, so the type change has no fixture impact and brings them under the same constructor path as the other gas sinks. The benchmark validator widens a local var type to accommodate the union. Also adds `scripts/mark_tests.py`, an idempotent script that applies a configurable `@pytest.mark.<name>` decorator to test functions listed in a JSON file. Accepts three input shapes (bare mapping, bare list, wrapped with embedded marker name); `--marker` overrides any embedded value.

Applies `@pytest.mark.gas_check` to the 54 test functions identified by running `fill --detect-gas-checks` on the full suite. Produced by `scripts/mark_tests.py gas_check_report.json`; re-running the script is idempotent. Affects 31 files across 16 EIP directories; one decorator each, inserted at the top of the existing decorator stack.

38 tests covering: - ``GasTainted`` carrier: construction, arithmetic propagation (incl. reflected ops), origin union/dedup, and the NotImplemented forward path that fixes ``bytes_literal * GasTainted(n)``. - ``install_taint`` / ``uninstall_taint``: taint reaches Bytecode and fork gas-cost outputs, survives ``HashInt`` / ``HexNumber`` / ``ZeroPaddedHexNumber`` and Pydantic ``Storage`` validation, doesn't leak onto plain ints, and uninstall reverts. - ``collect_taint_hits``: storage taint detection, presence-based receipt / header / block-level / benchmark detection, the BenchmarkTest MRO gate that prevents filler-default false positives, and the OOG exclusions (tx.error, block_exception, exception_test).

Replace ``Any`` and ``getattr``-based field access in the sink walker with direct attribute access against ``BaseTest`` / ``StateTest`` / ``BlockchainTest`` / ``BenchmarkTest`` / ``Alloc`` / ``Transaction`` / ``Block``. Spec-type-specific access (``test.tx``, ``test.blocks``) now dispatches via ``isinstance`` rather than ``getattr(test, "...", None)``, so a future rename of ``post`` / ``tx`` / ``expected_receipt`` / etc. fails type-checking instead of silently returning empty hits. Also tightens the few remaining ``Any``s where possible: the wrapped opcode-gas calculator is now typed as ``Callable[[OpcodeBase], int]``, and ``pytest_testnodedown`` uses ``xdist.workermanage.WorkerController`` (stub added) and ``object | None``, matching the upstream xdist hook. The unit tests built with ``SimpleNamespace`` fakes no longer satisfy the new typed signature and will be replaced in a follow-up commit.

Replace the SimpleNamespace-based fakes in the walker tests with real ``StateTest`` / ``BlockchainTest`` / ``BenchmarkTest`` Pydantic instances so that field renames or retypings on the spec models fail the unit tests instead of silently masking them. The storage and mixed-sink tests now use the ``taint_installed`` fixture because Pydantic ``Storage`` validation strips the ``GasTainted`` subclass without the ``FixedSizeHexNumber.__new__`` patch. Carrier tests (``TestGasTaintedCarrier``) and install tests (``TestTaintInstallation``) are unchanged — they already exercised the real types.

Add four end-to-end tests that load the real fill plugin chain into a pytester subprocess and exercise the bits the unit tests deliberately skip: - ``--detect-gas-checks`` / ``--gas-check-report`` appear in ``fill --help``. - A synthetic test using ``CodeGasMeasure`` is detected and recorded as a ``storage`` hit with gas-derived origins. - A synthetic test marked ``@pytest.mark.exception_test`` is excluded from the report. - Without the flag, no report file is produced. Co-located with ``test_benchmarking.py`` and added to ``--ignore`` in ``test-tests`` / ``test-tests-pypy`` so unit-test runs stay fast; runs under ``test-tests-bench`` alongside the existing pytester suite. Needs ``EVM_BIN`` (defaults to ``evm``) on the runner.

Add a negative integration test that exercises a synthetic state test which stores a literal value (no ``CodeGasMeasure``, no ``expected_receipt``, no header verify) and asserts the resulting report is empty. Guards against false positives: a test that simply runs successfully without touching any of the detector's triggers must not appear in the JSON output.

…ipes - ``just mark_gas_tests <path-or-selector>`` runs ``fill --detect-gas-checks`` against the given target, then pipes the report through ``scripts/mark_tests.py`` to apply ``@pytest.mark.gas_check`` to the detected tests. Accepts any pytest path/selector, so the workflow can be scoped to a single file or function. - ``just mark_gas_tests_test`` runs the pytester end-to-end tests for the gas_taint plugin in their own ``[gas check]`` group, instead of piggybacking on ``test-tests-bench`` (which is for benchmark framework tests, not plugin e2e).

codecov · 2026-05-29T11:30:08Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.49%. Comparing base (55d774b) to head (1ab1fd0).
⚠️ Report is 15 commits behind head on forks/amsterdam.

Additional details and impacted files

@@                 Coverage Diff                 @@
##           forks/amsterdam    #2941      +/-   ##
===================================================
+ Coverage            90.44%   90.49%   +0.05%     
===================================================
  Files                  535      535              
  Lines                32439    32430       -9     
  Branches              3012     3012              
===================================================
+ Hits                 29338    29349      +11     
+ Misses                2573     2563      -10     
+ Partials               528      518      -10

Flag	Coverage Δ
unittests	`90.49% <ø> (+0.05%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Add scripts/detect_oog_by_design.py, which scans EELS EIP-3155 traces under an --evm-dump-dir and flags every test whose execution ran out of gas by design -- at the top-level transaction or any sub-call depth. The signal is exact equality on the trace error field == "OutOfGasError" (the EELS exception class name), giving zero false positives against other all-gas-consuming halts (StackOverflowError, StackUnderflowError, InvalidOpcode, Revert). Fill success is the "by design" filter. Output is the mark_tests.py wrapped form {marker, nodeids}, so the two chain directly. Sanitized dump-dir names are reconstructed into pytest node IDs and verified against the source AST (re-adding the test_ prefix and locating the function, including any enclosing class) before being emitted; unverifiable names are reported rather than guessed. Add a `mark_oog_tests` just recipe (detect + mark from pre-produced traces; the heavy traced fill is documented in the recipe header rather than run inline, so the fork and worker count stay in the user's hands) and scripts/fill_for_oog_detection.sh to produce the traces detached. Register the `out_of_gas` marker so the applied decorator is recognised. Validated end-to-end against EELS traces: 6 positive OOG variants detected (top-level, sub-call depth 2-4, ooge/oogm, CREATE-init, multi-depth in one tx), 5 non-OOG halts correctly ignored, a 1:1 match with ground truth.

leolara · 2026-06-01T12:38:34Z

How this was validated

Filled on Osaka with --traces (one test per fill, single worker) and ran the detector over the resulting traces. Ground truth was read straight from each test's trace error field.

Positive — detected (the OOG variant each exercises):

Test (`tests/ported_static/…`)	Variant
`stRevertTest/test_revert_prefound_oog`	top-level OOG (memory expansion)
`stTransactionTest/test_contract_store_clears_oog`	top-level OOG (`SSTORE`)
`stLogTests/test_log_in_oog_call`	sub-call OOG, depth 2
`stCallCodes/test_callcallcallcode_001_oogm_before`	nested depth 3, out-of-gas-memory
`stCallCodes/test_callcallcall_000_ooge`	nested depth 4, out-of-gas-execution
`stRevertTest/test_revert_depth_create_oog`	CREATE-init; OOG at both depth 1 and 2 in one tx

Negative — correctly not flagged (each is a different all-gas-consuming halt or a non-OOG outcome):

Test (`tests/ported_static/…`)	Trace `error`
`vmTests/test_block_info`	(none — success)
`stStackTests/test_stack_overflow`	`StackOverflowError`
`stSystemOperationsTest/test_create_with_invalid_opcode`	`StackUnderflowError`
`stRevertTest/test_revert_opcode_in_create_returns`	`Revert`
`stRevertTest/test_loop_calls_then_revert`	(REVERT handled in-frame)

The detector flagged exactly the 6 tests whose traces contain OutOfGasError and none of the 5 others — a 1:1 match with ground truth.

leolara added 8 commits May 23, 2026 11:02

leolara force-pushed the leolara/detect-oog-by-design-fillers branch from 6f580d8 to f83783c Compare May 29, 2026 11:20

leolara force-pushed the leolara/detect-oog-by-design-fillers branch from f83783c to 1ab1fd0 Compare June 1, 2026 12:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(tooling): detect and mark out-of-gas-by-design tests#2941

feat(tooling): detect and mark out-of-gas-by-design tests#2941
leolara wants to merge 9 commits into
ethereum:forks/amsterdamfrom
leolara:leolara/detect-oog-by-design-fillers

leolara commented May 29, 2026

Uh oh!

codecov Bot commented May 29, 2026 •

edited

Loading

Uh oh!

leolara commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

leolara commented May 29, 2026

🗒️ Description

🔗 Related Issues or PRs

✅ Checklist

Cute Animal Picture

Uh oh!

codecov Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

leolara commented Jun 1, 2026

How this was validated

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

codecov Bot commented May 29, 2026 •

edited

Loading