
solidity: bulk memcpy for bcs_deserialize_offset_{bytes,string} #96

Draft
deuszx wants to merge 1 commit into zefchain:main from deuszx:solidity-pr-3-bytes-mcopy



deuszx commented May 14, 2026

Summary

Add EvmVersion config (Shanghai default / Cancun / Latest) to
CodeGeneratorConfig and route it through to the Solidity backend.
The byte-by-byte copy loop in bcs_deserialize_offset_bytes and the
corresponding loop in bcs_deserialize_offset_string are now replaced
with one of:

  • Cancun / Latest: a single MCOPY (EIP-5656).
  • Shanghai: a word-by-word assembly memcpy (mload / mstore in
    32-byte chunks). The trailing partial word writes into padding inside
    the new bytes(len) allocation (which rounds up to 32 bytes), so the
    write stays within bounds.

The identity precompile (0x04) was the original target for Shanghai,
but reaching it requires staticcall, which Solidity does not permit
inside a pure function. Every bcs_deserialize_* function is declared
pure, so that route would not compile unless they were loosened to
view. The word loop keeps the existing pure API and is still far
cheaper than the byte loop.

Both paths are gated on len > 0 so empty copies skip the assembly
block entirely.
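
To make the padding argument concrete, here is a minimal Python model of the Shanghai path. It is not the generated Solidity: memory is a `bytearray`, and `alloc_bytes`, `mload`, `word_copy`, and `copy_out` are hypothetical names invented for this sketch. It assumes, as the summary states, that a `bytes` allocation is rounded up to a whole 32-byte word.

```python
WORD = 32  # EVM word size in bytes


def alloc_bytes(memory: bytearray, length: int) -> int:
    """Model `new bytes(length)`: one length word, then data padded to a whole word."""
    ptr = len(memory)
    padded = (length + WORD - 1) // WORD * WORD
    memory.extend(length.to_bytes(WORD, "big"))
    memory.extend(bytes(padded))
    return ptr


def mload(memory: bytearray, offset: int) -> bytes:
    """Read one word; bytes past the end of memory read as zero."""
    return bytes(memory[offset:offset + WORD]).ljust(WORD, b"\x00")


def word_copy(memory: bytearray, src: int, dst: int, length: int) -> None:
    """Copy `length` bytes from src to dst one 32-byte word at a time."""
    if length == 0:  # the len > 0 gate: empty copies skip the loop entirely
        return
    end = dst + length
    while dst < end:
        # The last iteration may write up to 31 bytes past `end`; those bytes
        # fall in the padding that alloc_bytes reserved.
        memory[dst:dst + WORD] = mload(memory, src)
        src += WORD
        dst += WORD


def copy_out(memory: bytearray, input_ptr: int, pos: int, length: int) -> bytes:
    """Allocate a result, word-copy `length` bytes starting at `pos`, return them."""
    result = alloc_bytes(memory, length)
    size_before = len(memory)
    word_copy(memory, input_ptr + WORD + pos, result + WORD, length)
    assert len(memory) == size_before  # the copy never grew memory
    data = result + WORD
    return bytes(memory[data:data + length])
```

The internal assertion in `copy_out` is the in-bounds claim from the Shanghai bullet: the trailing partial word is absorbed by the allocation's padding, so memory does not grow during the copy.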

Benchmarks

Measured with forge test --gas-report, via_ir = true,
optimizer_runs = 200, evm_version = "cancun" (so MCOPY is
available), solc 0.8.33. Numbers are total transaction gas for an
external call to a harness that delegates to the library; same payload
prep across forms, so the relative Δ between columns is the
copy-implementation cost.

| Payload len (bytes) | Old (byte loop) | Shanghai (word loop) | Cancun (MCOPY) | Δ vs Old (Shanghai) | Δ vs Old (Cancun) |
|---|---|---|---|---|---|
| 0 | 7 985 | 8 017 | 8 090 | +32 | +105 |
| 1 | 8 753 | 8 472 | 8 547 | −281 | −206 |
| 31 | 21 609 | 13 366 | 12 935 | −8 243 | −8 674 |
| 32 | 22 172 | 13 316 | 13 193 | −8 856 | −8 979 |
| 33 | 22 866 | 13 748 | 13 541 | −9 118 | −9 325 |
| 256 | 122 018 | 50 115 | 49 404 | −71 903 | −72 614 |
| 1024 | 461 599 | 174 220 | 172 348 | −287 379 | −289 251 |

Per-byte cost (total gas for the 1 024-byte payload divided by len):

  • Old (byte loop): ~450 gas/byte
  • Shanghai (word loop): ~170 gas/byte
  • Cancun (MCOPY): ~168 gas/byte

So the new forms are roughly 2.7× cheaper per byte than the
byte-by-byte loop (2.65× for Shanghai, 2.68× for Cancun). Shanghai and
Cancun are within ~1 % of each other at 1 024 bytes: the word loop
costs essentially the same gas as the native MCOPY opcode, which
justifies keeping it as the default and only switching to MCOPY when
the user explicitly opts into Cancun.
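
The per-byte figures and ratios can be rechecked from the 1 024-byte row of the table. The short Python sketch below uses only numbers copied from that row; the variable names are illustrative. Because it divides total transaction gas by the length, the fixed call overhead is included in each per-byte figure.

```python
# Total gas for the 1 024-byte payload, copied from the benchmark table.
GAS_1024 = {"old": 461_599, "shanghai": 174_220, "cancun": 172_348}
LEN = 1024

# Average gas per payload byte, fixed overhead included.
per_byte = {form: gas / LEN for form, gas in GAS_1024.items()}

# How many times cheaper each new form is than the byte loop.
speedup = {
    form: GAS_1024["old"] / gas
    for form, gas in GAS_1024.items()
    if form != "old"
}

# Relative gap between the word loop and MCOPY.
gap = GAS_1024["shanghai"] / GAS_1024["cancun"] - 1

for form, value in per_byte.items():
    print(f"{form:>8}: {value:.1f} gas/byte")
for form, value in speedup.items():
    print(f"{form:>8}: {value:.2f}x cheaper than the byte loop")
print(f"shanghai vs cancun gap: {gap:.2%}")
```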

Empty (len = 0) and 1-byte payloads are a wash (within ±300 gas of
the old form). The len > 0 guard skips the assembly block for empty
input, which limits the overhead there to +32 gas (Shanghai) and
+105 gas (Cancun).

Deployed bytecode of the harness contract (which inlines the library
helpers via via_ir):

| Form | Bytes | Deployment gas |
|---|---|---|
| Old (byte loop) | 842 | 229 772 |
| Shanghai | 832 | 227 622 |
| Cancun | 803 | 221 352 |

Both new forms are smaller than the old byte-by-byte loop, despite
the inline assembly. Yul lays out the assembly memcpy more compactly
than the indexed byte-by-byte loop with its per-iteration bounds check.

Practical impact for linera-bridge: every bytes / string field
inside a certificate decodes ~280 gas per byte cheaper (~450 vs ~170).
A 2 KiB certificate saves up to ~570 K gas (2 048 × ~280) just on the
copy side, before counting field-by-field decode savings.
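
As a back-of-the-envelope check of that estimate, the following Python sketch multiplies the rounded per-byte costs from the benchmark by a payload size. It assumes the whole 2 KiB is bytes / string data that goes through the copy path, so it is an upper bound; `copy_savings` is a hypothetical helper name.

```python
OLD_GAS_PER_BYTE = 450  # byte loop, rounded figure from the benchmark
NEW_GAS_PER_BYTE = 170  # Shanghai word loop, rounded figure from the benchmark


def copy_savings(payload_bytes: int) -> int:
    """Estimated gas saved on the copy alone for `payload_bytes` of bytes/string data."""
    return payload_bytes * (OLD_GAS_PER_BYTE - NEW_GAS_PER_BYTE)


two_kib = 2 * 1024
print(f"{two_kib} bytes: ~{copy_savings(two_kib):,} gas saved")
```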

Reproduce the benchmark

  1. Save the three libraries below as Old.sol, Shanghai.sol, and
    Cancun.sol. Each file exports a small harness contract whose
    deser(bytes calldata) method delegates to the library so that
    Foundry can measure it.

  2. Create foundry.toml:

    [profile.default]
    src = "."
    out = "out"
    test = "test"
    via_ir = true
    optimizer = true
    optimizer_runs = 200
    evm_version = "cancun"

  3. Save the test harness as test/Bench.t.sol (builds an
    LEB128-prefixed payload of the requested length, fills bytes with
    i & 0xff, then calls each harness).

  4. Run forge test --gas-report. The per-function gas table for each
    harness shows the per-call cost; the test_{old,sha,can}_<len>
    results show the per-length breakdown. Runtime-bytecode size comes
    from solc --via-ir --optimize --bin-runtime --evm-version cancun
    (or forge inspect <name> deployedBytecode).
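
The payload described in step 3 can be prototyped off-chain before writing the Solidity test. The Python sketch below is not the Bench.t.sol harness, which is not reproduced here. It assumes the prefix is the unsigned LEB128 length that BCS uses for sequences; `uleb128` and `build_payload` are hypothetical helper names.

```python
def uleb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (7 bits per byte, low bits first)."""
    if value < 0:
        raise ValueError("ULEB128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(byte)         # final byte: continuation bit clear
            return bytes(out)


def build_payload(length: int) -> bytes:
    """Length prefix followed by `length` bytes, where byte i holds i & 0xff."""
    return uleb128(length) + bytes(i & 0xFF for i in range(length))


for n in (0, 1, 31, 32, 33, 256, 1024):
    payload = build_payload(n)
    print(n, len(payload), payload[:4].hex())
```

Lengths of 128 and above need a two-byte prefix, so the 256- and 1 024-byte payloads are two bytes longer than their data while the shorter ones are one byte longer.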

Test Plan

Cover both paths with round-trip tests at lengths 0, 1, 31, 32, 33,
and 1024, the boundaries where the word loop and MCOPY diverge from a
naive byte loop:

  • test_bytes_copy_shanghai exercises the word-loop path.
  • test_bytes_copy_cancun exercises the MCOPY path.
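
Why these lengths are the interesting ones shows up in how many 32-byte words the word loop touches and how far its last store runs past the requested length. The Python sketch below is illustrative arithmetic, not part of the test suite; `word_loop_shape` is a hypothetical name.

```python
WORD = 32


def word_loop_shape(length: int) -> tuple[int, int]:
    """Return (words copied, bytes written past `length`) for a 32-byte word loop."""
    words = (length + WORD - 1) // WORD
    overshoot = words * WORD - length
    return words, overshoot


for n in (0, 1, 31, 32, 33, 1024):
    words, overshoot = word_loop_shape(n)
    print(f"len={n:>4}: {words:>2} word(s), {overshoot:>2} byte(s) into padding")
```

Lengths 1, 31, and 33 end in a partial word, 32 and 1024 are exact multiples, and 0 exercises the len > 0 guard.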
