feat(tooling): detect and mark out-of-gas-by-design tests#2941
Conversation
Adds a pytest plugin enabled with `--detect-gas-checks` (fill mode only) that identifies tests asserting specific gas values. The plugin uses two detection modes, chosen per sink: - **Storage in post** (`post[addr].storage[slot]`) — taint propagation. Wraps `Bytecode.gas_cost`, per-fork `gas_costs` / `opcode_gas_calculator` / `transaction_intrinsic_cost_calculator`, and patches `Number.__new__` + `FixedSizeHexNumber.__new__` so a `GasTainted(int)` carrier survives `HashInt` / `HexNumber` construction. Tests whose post-storage holds a tainted value are flagged. - **Receipt / header / block-level / benchmark fields** — field-presence. The field name (`cumulative_gas_used`, `expected_gas_used`, `blob_gas_used`, etc.) is itself the assertion signal; if set, flag. Tests with `tx.error`, `block_exception`, or the `exception_test` marker are excluded so OOG-expecting tests don't pollute the output. Worker results are aggregated to the master via `workeroutput` / `pytest_testnodedown`. `Block.expected_gas_used` and `BaseTest.expected_benchmark_gas_used` change from `int` to `HexNumber`. Both are internal assertion-only and not serialised to fixtures, so the type change has no fixture impact and brings them under the same constructor path as the other gas sinks. The benchmark validator widens a local var type to accommodate the union. Also adds `scripts/mark_tests.py`, an idempotent script that applies a configurable `@pytest.mark.<name>` decorator to test functions listed in a JSON file. Accepts three input shapes (bare mapping, bare list, wrapped with embedded marker name); `--marker` overrides any embedded value.
Applies `@pytest.mark.gas_check` to the 54 test functions identified by running `fill --detect-gas-checks` on the full suite. Produced by `scripts/mark_tests.py gas_check_report.json`; re-running the script is idempotent. Affects 31 files across 16 EIP directories; one decorator each, inserted at the top of the existing decorator stack.
38 tests covering: - ``GasTainted`` carrier: construction, arithmetic propagation (incl. reflected ops), origin union/dedup, and the NotImplemented forward path that fixes ``bytes_literal * GasTainted(n)``. - ``install_taint`` / ``uninstall_taint``: taint reaches Bytecode and fork gas-cost outputs, survives ``HashInt`` / ``HexNumber`` / ``ZeroPaddedHexNumber`` and Pydantic ``Storage`` validation, doesn't leak onto plain ints, and uninstall reverts. - ``collect_taint_hits``: storage taint detection, presence-based receipt / header / block-level / benchmark detection, the BenchmarkTest MRO gate that prevents filler-default false positives, and the OOG exclusions (tx.error, block_exception, exception_test).
Replace ``Any`` and ``getattr``-based field access in the sink walker with direct attribute access against ``BaseTest`` / ``StateTest`` / ``BlockchainTest`` / ``BenchmarkTest`` / ``Alloc`` / ``Transaction`` / ``Block``. Spec-type-specific access (``test.tx``, ``test.blocks``) now dispatches via ``isinstance`` rather than ``getattr(test, "...", None)``, so a future rename of ``post`` / ``tx`` / ``expected_receipt`` / etc. fails type-checking instead of silently returning empty hits. Also tightens the few remaining ``Any``s where possible: the wrapped opcode-gas calculator is now typed as ``Callable[[OpcodeBase], int]``, and ``pytest_testnodedown`` uses ``xdist.workermanage.WorkerController`` (stub added) and ``object | None``, matching the upstream xdist hook. The unit tests built with ``SimpleNamespace`` fakes no longer satisfy the new typed signature and will be replaced in a follow-up commit.
Replace the SimpleNamespace-based fakes in the walker tests with real ``StateTest`` / ``BlockchainTest`` / ``BenchmarkTest`` Pydantic instances so that field renames or retypings on the spec models fail the unit tests instead of silently masking them. The storage and mixed-sink tests now use the ``taint_installed`` fixture because Pydantic ``Storage`` validation strips the ``GasTainted`` subclass without the ``FixedSizeHexNumber.__new__`` patch. Carrier tests (``TestGasTaintedCarrier``) and install tests (``TestTaintInstallation``) are unchanged — they already exercised the real types.
Add four end-to-end tests that load the real fill plugin chain into a pytester subprocess and exercise the bits the unit tests deliberately skip: - ``--detect-gas-checks`` / ``--gas-check-report`` appear in ``fill --help``. - A synthetic test using ``CodeGasMeasure`` is detected and recorded as a ``storage`` hit with gas-derived origins. - A synthetic test marked ``@pytest.mark.exception_test`` is excluded from the report. - Without the flag, no report file is produced. Co-located with ``test_benchmarking.py`` and added to ``--ignore`` in ``test-tests`` / ``test-tests-pypy`` so unit-test runs stay fast; runs under ``test-tests-bench`` alongside the existing pytester suite. Needs ``EVM_BIN`` (defaults to ``evm``) on the runner.
Add a negative integration test that exercises a synthetic state test which stores a literal value (no ``CodeGasMeasure``, no ``expected_receipt``, no header verify) and asserts the resulting report is empty. Guards against false positives: a test that simply runs successfully without touching any of the detector's triggers must not appear in the JSON output.
…ipes - ``just mark_gas_tests <path-or-selector>`` runs ``fill --detect-gas-checks`` against the given target, then pipes the report through ``scripts/mark_tests.py`` to apply ``@pytest.mark.gas_check`` to the detected tests. Accepts any pytest path/selector, so the workflow can be scoped to a single file or function. - ``just mark_gas_tests_test`` runs the pytester end-to-end tests for the gas_taint plugin in their own ``[gas check]`` group, instead of piggybacking on ``test-tests-bench`` (which is for benchmark framework tests, not plugin e2e).
6f580d8 to
f83783c
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## forks/amsterdam #2941 +/- ##
===================================================
+ Coverage 90.44% 90.49% +0.05%
===================================================
Files 535 535
Lines 32439 32430 -9
Branches 3012 3012
===================================================
+ Hits 29338 29349 +11
+ Misses 2573 2563 -10
+ Partials 528 518 -10
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add scripts/detect_oog_by_design.py, which scans EELS EIP-3155 traces
under an --evm-dump-dir and flags every test whose execution ran out of
gas by design -- at the top-level transaction or any sub-call depth.
The signal is exact equality on the trace error field == "OutOfGasError"
(the EELS exception class name), giving zero false positives against
other all-gas-consuming halts (StackOverflowError, StackUnderflowError,
InvalidOpcode, Revert). Fill success is the "by design" filter.
Output is the mark_tests.py wrapped form {marker, nodeids}, so the two
chain directly. Sanitized dump-dir names are reconstructed into pytest
node IDs and verified against the source AST (re-adding the test_ prefix
and locating the function, including any enclosing class) before being
emitted; unverifiable names are reported rather than guessed.
Add a `mark_oog_tests` just recipe (detect + mark from pre-produced
traces; the heavy traced fill is documented in the recipe header rather
than run inline, so the fork and worker count stay in the user's hands)
and scripts/fill_for_oog_detection.sh to produce the traces detached.
Register the `out_of_gas` marker so the applied decorator is recognised.
Validated end-to-end against EELS traces: 6 positive OOG variants
detected (top-level, sub-call depth 2-4, ooge/oogm, CREATE-init,
multi-depth in one tx), 5 non-OOG halts correctly ignored, a 1:1 match
with ground truth.
f83783c to
1ab1fd0
Compare
How this was validatedFilled on Osaka with Positive — detected (the OOG variant each exercises):
Negative — correctly not flagged (each is a different all-gas-consuming halt or a non-OOG outcome):
The detector flagged exactly the 6 tests whose traces contain |
Important
🛑 Stacked on #2906 — review and merge this PR only after #2906.
This branch is built on top of #2906 (
leolara/mark-gas-checking-tests) andreuses its
scripts/mark_tests.py. Until #2906 merges intoforks/amsterdam, the diff below also contains #2906's commits. Once #2906is merged and this branch is rebased onto the updated base, the diff collapses
to just the out-of-gas detection commit. Please do not merge before #2906.
🗒️ Description
Adds tooling to find tests that run out of gas by design — at the top-level
transaction or any sub-call depth — and mark them, complementing #2906's
gas-assertion detection.
scripts/detect_oog_by_design.pyscans EELS EIP-3155 traces under an--evm-dump-dirand flags every test whose execution ran out of gas. Thesignal is exact equality on the trace
errorfield== "OutOfGasError"(theEELS exception class name written at
src/ethereum_spec_tools/evm_tools/t8n/evm_trace/eip3155.py), which is uniformacross every fork and distinct from all other halt classes. This gives zero
false positives against the other all-gas-consuming halts
(
StackOverflowError,StackUnderflowError,InvalidOpcode,Revert). Fillsuccess is the "by design" filter: if a traced test filled successfully, any OOG
in its trace matched the author's declared post-state.
The output is the
mark_tests.pywrapped form{marker, nodeids}, so the twoscripts chain directly. The lossy sanitized dump-dir names are reconstructed
into pytest node IDs and verified against the source AST (re-adding the
test_prefix and locating the function, including any enclosing class) beforebeing emitted; names that cannot be verified are reported rather than guessed —
and
mark_tests.pyindependently re-checks every node ID against thefilesystem, so a wrong ID can never silently mis-mark a test.
Also adds:
mark_oog_testsjust recipe that runs detect + mark againstpre-produced traces. The heavy traced fill is documented in the recipe
header rather than run inline, so the fork and worker count stay in the
reviewer's hands (traces are large and memory-hungry; a single fork suffices
because OOG-by-design is fork-invariant in practice);
scripts/fill_for_oog_detection.shto produce the traces detached;oog_detector_validation_report.mdrecording the validation: 6 positivevariants detected (top-level, sub-call depth 2–4,
ooge/oogm, CREATE-init,multi-depth-in-one-tx), 5 non-OOG halts correctly ignored, a 1:1 match with
ground truth read straight from the traces.
The detector is stdlib-only and does not import
execution_testing, so itis unaffected by the substring-match limitation in that package's
_is_out_of_gas_errorhelper (it tests for"out of gas", which never matchesthe EELS class name
OutOfGasError).🔗 Related Issues or PRs
Depends on and must merge after #2906.
✅ Checklist
just statictype(scope):.Cute Animal Picture