solidity: bulk memcpy for bcs_deserialize_offset_{bytes,string} by deuszx · Pull Request #96 · zefchain/serde-reflection

deuszx · 2026-05-14T13:14:44Z

Summary

Add EvmVersion config (Shanghai default / Cancun / Latest) to
CodeGeneratorConfig and route it through to the Solidity backend.
The byte-by-byte copy loop in bcs_deserialize_offset_bytes and the
corresponding loop in bcs_deserialize_offset_string are now replaced
with one of:

Cancun / Latest: a single MCOPY (EIP-5656).
Shanghai: a word-by-word assembly memcpy (mload / mstore in
32-byte chunks). The trailing partial word writes into the padding of
the new bytes(len) allocation, which is rounded up to a multiple of
32 bytes, so the write stays within bounds.
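
As a rough illustration of the selection described above, the mapping from the configured EVM version to a copy strategy could be sketched as follows. The type and function names (`EvmVersion`, `CopyStrategy`, `copy_strategy`) are assumptions made for this sketch and are not taken from the PR diff; only the variant names and the Shanghai default come from the description.

```rust
/// Hypothetical mirror of the `EvmVersion` setting from the PR description.
/// The real definition in serde-generate may differ.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum EvmVersion {
    /// Default target: MCOPY is not available.
    #[default]
    Shanghai,
    Cancun,
    Latest,
}

/// Copy strategy the Solidity backend would emit (illustrative names).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CopyStrategy {
    /// mload / mstore in 32-byte chunks.
    WordLoop,
    /// A single MCOPY (EIP-5656).
    Mcopy,
}

/// Picks the copy strategy for a given target EVM version.
pub fn copy_strategy(version: EvmVersion) -> CopyStrategy {
    match version {
        EvmVersion::Shanghai => CopyStrategy::WordLoop,
        EvmVersion::Cancun | EvmVersion::Latest => CopyStrategy::Mcopy,
    }
}
```

With this shape, `copy_strategy(EvmVersion::default())` yields `CopyStrategy::WordLoop`, so callers that never touch the setting keep generating Shanghai-compatible code.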

The identity precompile (0x04) was the original target for Shanghai,
but calling it requires staticcall, which Solidity treats as reading
state, so the enclosing function can be at most view. That is
incompatible with the bcs_deserialize_* functions, which are all
declared pure. The word loop keeps the existing pure API and is still
far cheaper than the byte loop.

Both paths are gated on len > 0 so empty copies skip the assembly
block entirely.
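
The bounds argument for the Shanghai word loop, together with the len > 0 guard, can be checked with a small simulation. This is a sketch in Rust rather than the generated Yul: a flat byte slice stands in for EVM memory, and the function name `word_copy` is hypothetical. It assumes, as stated above, that the destination data area is rounded up to whole 32-byte words.

```rust
const WORD: usize = 32;

/// Simulates a word-by-word copy of `len` bytes starting at `src` in `memory`.
/// Returns the destination data area *including* its padding.
pub fn word_copy(memory: &[u8], src: usize, len: usize) -> Vec<u8> {
    // Model of `new bytes(len)`: the data area is rounded up to whole words.
    let padded = (len + WORD - 1) / WORD * WORD;
    let mut dest = vec![0u8; padded];
    // Guard: empty copies skip the loop entirely.
    if len > 0 {
        let mut off = 0;
        while off < len {
            // "mload": read one word; bytes past the end of memory read as zero.
            let mut word = [0u8; WORD];
            for (i, b) in word.iter_mut().enumerate() {
                *b = memory.get(src + off + i).copied().unwrap_or(0);
            }
            // "mstore": write the full word; the last one may spill into padding.
            dest[off..off + WORD].copy_from_slice(&word);
            off += WORD;
        }
    }
    dest
}
```

For a 33-byte copy the loop runs twice and writes 64 bytes, exactly the size of the padded allocation, so no write lands outside it. In this simulation the padding ends up holding whatever followed the source bytes; the first `len` bytes always match the source.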

Benchmarks

Measured with forge test --gas-report, via_ir = true,
optimizer_runs = 200, evm_version = "cancun" (so MCOPY is
available), solc 0.8.33. Numbers are total transaction gas for an
external call to a harness that delegates to the library; same payload
prep across forms, so the relative Δ between columns is the
copy-implementation cost.

Payload len	Old (byte loop)	Shanghai (word loop)	Cancun (MCOPY)	Δ vs Old (Shanghai)	Δ vs Old (Cancun)
0	7 985	8 017	8 090	+32	+105
1	8 753	8 472	8 547	−281	−206
31	21 609	13 366	12 935	−8 243	−8 674
32	22 172	13 316	13 193	−8 856	−8 979
33	22 866	13 748	13 541	−9 118	−9 325
256	122 018	50 115	49 404	−71 903	−72 614
1024	461 599	174 220	172 348	−287 379	−289 251

Per-byte cost (total gas for the 1 024-byte payload divided by len):

Old (byte loop): ~450 gas/byte
Shanghai (word loop): ~170 gas/byte
Cancun (MCOPY): ~168 gas/byte

So the new forms are ~2.7× cheaper per byte than the byte-by-byte
loop. Shanghai and Cancun are within ~1 % of each other at 1 024
bytes: the word loop costs essentially the same gas as the native
MCOPY opcode, which justifies keeping it as the default and only
switching to MCOPY when the user explicitly opts into Cancun.
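
The per-byte figures can be recomputed from the table. The sketch below uses hypothetical helper names and shows two views: the average (total gas divided by len, which still includes the fixed call overhead) and the marginal cost between two rows, which cancels that overhead.

```rust
/// Average cost: total transaction gas divided by payload length.
/// Includes the fixed per-call overhead, so it overstates the pure copy cost.
pub fn avg_gas_per_byte(total_gas: u64, len: u64) -> f64 {
    total_gas as f64 / len as f64
}

/// Marginal cost between two (len, total_gas) measurements.
/// The fixed overhead appears in both rows and cancels out.
pub fn marginal_gas_per_byte(small: (u64, u64), large: (u64, u64)) -> f64 {
    (large.1 - small.1) as f64 / (large.0 - small.0) as f64
}
```

For the 1 024-byte row the averages come out at roughly 450.8 (old), 170.1 (Shanghai), and 168.3 (Cancun) gas per byte, giving ratios of about 2.65× and 2.68×. Between the 256-byte and 1 024-byte rows the marginal costs are roughly 442, 162, and 160 gas per byte, so the conclusion is the same once the fixed overhead is removed.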

Empty (len = 0) and 1-byte payloads are a wash (within ±300 gas of
the old form). The len > 0 guard prevents the assembly setup cost
from hurting the empty case.

Deployed bytecode of the harness contract (which inlines the library
helpers when compiled with via_ir):

Form	Bytes	Deployment gas
Old (byte loop)	842	229 772
Shanghai	832	227 622
Cancun	803	221 352

Both new forms are smaller than the old byte-by-byte loop, despite
the inline assembly. Yul lays out the assembly memcpy more compactly
than the indexed byte-by-byte loop with its per-iteration bounds check.

Practical impact for linera-bridge: every bytes / string field
inside a certificate decodes ~280 gas per byte cheaper. A 2 KiB
certificate saves ~570 K gas just on the copy side, before counting
field-by-field decode savings.
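
The ~570 K estimate is a linear extrapolation from the 1 024-byte benchmark row. A minimal sketch of that arithmetic, assuming savings scale proportionally with payload size and that the whole certificate is bytes / string payload (the function name is hypothetical):

```rust
/// Extrapolates copy savings (old byte loop vs. Shanghai word loop)
/// linearly from the 1 024-byte benchmark row.
pub fn estimated_copy_savings(payload_len: u64) -> u64 {
    const BENCH_LEN: u64 = 1024;
    const OLD_GAS: u64 = 461_599; // byte loop, 1 024-byte payload
    const SHANGHAI_GAS: u64 = 174_220; // word loop, 1 024-byte payload
    (OLD_GAS - SHANGHAI_GAS) * payload_len / BENCH_LEN
}
```

`estimated_copy_savings(2048)` returns 574 758, in line with the rounded figure above. Real certificates mix field types, so this is best read as an upper-end estimate for the copy portion only.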

Reproduce the benchmark

Save the three libraries below as Old.sol, Shanghai.sol, and
Cancun.sol. Each file exports a small harness contract whose
deser(bytes calldata) method delegates to the library so that Foundry
can measure it.

Create foundry.toml:

[profile.default]
src = "."
out = "out"
test = "test"
via_ir = true
optimizer = true
optimizer_runs = 200
evm_version = "cancun"

Save the test harness as test/Bench.t.sol (builds an
LEB128-prefixed payload of the requested length, fills bytes with
i & 0xff, then calls each harness).

Run forge test --gas-report. The per-function gas table for each
harness shows the per-call cost; the test_{old,sha,can}_<len>
results show the per-length breakdown. Runtime-bytecode size comes
from solc --via-ir --optimize --bin-runtime --evm-version cancun
(or forge inspect <name> deployedBytecode).
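
For readers who want to generate the same inputs outside Solidity, the payload shape described for the test harness (a ULEB128 length prefix followed by bytes i & 0xff) can be sketched as below. The function names are hypothetical, and this is not the harness from the PR.

```rust
/// Encodes `value` as ULEB128: 7 bits per byte, low bits first,
/// with the high bit set on every byte except the last.
pub fn uleb128(mut value: u64) -> Vec<u8> {
    let mut out = Vec::new();
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return out;
        }
        out.push(byte | 0x80);
    }
}

/// Builds a length-prefixed payload whose i-th data byte is `i & 0xff`.
pub fn build_payload(len: usize) -> Vec<u8> {
    let mut payload = uleb128(len as u64);
    payload.extend((0..len).map(|i| (i & 0xff) as u8));
    payload
}
```

Lengths up to 127 need a one-byte prefix; 256 encodes as 0x80 0x02 and 1024 as 0x80 0x08, so those payloads are two bytes longer than their data.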

Test Plan

Cover both paths with round-trip tests at lengths 0, 1, 31, 32, 33,
and 1024 (boundaries where the word loop and MCOPY diverge from a
naive byte loop):

test_bytes_copy_shanghai exercises the word-loop path.
test_bytes_copy_cancun exercises the MCOPY path.
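
Why these lengths are the interesting ones follows from how many 32-byte words a word-by-word copy touches and how much of the final word lands in padding. A small sketch with hypothetical helper names:

```rust
const WORD: usize = 32;

/// Number of 32-byte words a word-by-word copy touches for `len` bytes.
pub fn words_touched(len: usize) -> usize {
    (len + WORD - 1) / WORD
}

/// Bytes of the final word that fall past `len`, i.e. land in padding.
pub fn padding_bytes(len: usize) -> usize {
    words_touched(len) * WORD - len
}
```

For lengths 0, 1, 31, 32, 33, and 1024 the word counts are 0, 1, 1, 1, 2, and 32, and the padding sizes are 0, 31, 1, 0, 31, and 0. The set therefore covers the empty case, the largest and smallest partial words, an exact single word, the first spill into a second word, and a long exact multiple.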

deuszx wants to merge 1 commit into zefchain:main from deuszx:solidity-pr-3-bytes-mcopy
