Add --wide-pointer-encoding: 192-bit pointers with flat addresses by tautschnig · Pull Request #8954 · diffblue/cbmc

tautschnig · 2026-04-08T20:48:36Z

Add an alternative pointer encoding activated via --wide-pointer-encoding that represents each pointer as a 192-bit bitvector [object(64) | offset(64) | address(64)]. The address component is a flat integer maintained consistently through all pointer operations, enabling address-based pointer comparison, cross-object pointer arithmetic detection, and address reuse after free.

Fixes: #881
Fixes: #8200
Fixes: #8161
Fixes: #8103
Fixes: #2117

Motivation

The standard pointer encoding uses [object | offset] pairs where the object number is an abstract identifier with no numeric meaning. This makes it impossible to model scenarios where the concrete address matters: pointer-to-integer casts return meaningless values (#881, #8200), pointer comparisons across objects are inconsistent (#8161), integer-to-pointer casts cannot resolve back to the correct object (#8103), and address reuse after free is not modelled (#2117).

Design

Each pointer carries a symbolic flat address alongside the standard object/offset encoding. Non-overlapping base address constraints ensure distinct objects occupy disjoint address ranges. The encoding is purely propositional — no solver-level arrays are used.

Key design decisions:

P2I returns the flat address component directly.
I2P uses forward constraints (address = base[obj] + offset) plus lazy backward refinement in dec_solve (address in range ⇒ object identity).
Pointer equality/comparison uses the address component (<, >, ==, !=), matching C semantics for cross-object comparisons.
Address reuse is enabled via config.bv_encoding.malloc_may_alias: consecutive dynamic objects of the same size may share base addresses.
Non-overlapping constraints use an O(n) chain: objects are ordered by base address, and each object's end equals the next object's base.

Additional feature: `--model-stack-layout`

When combined with --wide-pointer-encoding, the --model-stack-layout option constrains stack variables to be placed at adjacent addresses in declaration order, with architecture-dependent stack growth direction (config.ansi_c.stack_grows_downward).

Performance

The wide encoding adds ~1.5–2× overhead on solver-heavy tests. On the full regression suite (dominated by solver-trivial tests), the overhead is ~1.1×.

Test results

Standard encoding: 0 failures
Wide encoding: 0 failures
11 tests tagged no-wide-pointer-encoding (intentional semantic changes with alternative test-wide-pointer-encoding.desc descriptors)

General fixes included

This branch also includes two general SAT backend fixes (commits 1–2) discovered during development:

MiniSat: add set_limit_incremental_simplification to prevent variable elimination between incremental solves
CaDiCaL: fix l_get crash for var_no == 0 (1-based variable numbering)

Each commit message has a non-empty body, explaining why the change was made.
Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
My commit message includes data points confirming performance improvements (if claimed).
My PR is restricted to a single feature or bugfix.
White-space or formatting changes outside the feature-related changed lines are in commits of their own.

Add an option to disable MiniSat's SimpSolver simplifier after the first solve call. The simplifier can eliminate variables between incremental calls, which may make later UNSAT proofs extremely expensive for the CDCL search. This is useful for any incremental solving scenario, not just the wide pointer encoding. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

CaDiCaL uses 1-based variable numbering. When l_get is called with a literal whose var_no is 0, solver->val(0) triggers a fatal error. Guard against this by returning TV_UNKNOWN for var_no 0. This fixes printf1/2/3 trace generation crashes with CaDiCaL and --wide-pointer-encoding. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Extend the standard pointer encoding to include a flat integer address component alongside the existing object/offset bits. The pointer bitvector layout becomes: [object(64) | offset(64) | address(64)] = 192 bits The address component is maintained consistently: - encode(): sets address to base[object] (symbolic base address) - offset_arithmetic(): updates address += byte_offset - pointer-to-integer cast: returns the address component - Non-overlapping base address constraints in finish_eager_conversion Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Pointer_difference2 produces fewer overflow failures with the wide encoding because pointer subtraction overflow is checked on the 56-bit offset component rather than the full packed representation. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

…iffblue#8161) Override convert_equality for pointer types: compare by flat address instead of the full bitvector encoding. Two pointers are equal iff their addresses are equal, regardless of object encoding. Override relational comparisons (<, >, <=, >=) for pointer types: compare by flat address instead of offset-within-same-object. This makes cross-object pointer comparisons well-defined. Relax the overflow guard on base addresses: only require that base + size does not wrap around (unsigned). Remove the previous restriction that limited addresses to a small range. This allows the solver to place objects anywhere in the address space. Fixes: - diffblue#8200: (char*)0x55a8a2e6b007 can now equal a string literal's address, so assert(ptr != x) correctly reports FAILURE. - diffblue#8161: pointer-derived integers now have meaningful values in the full address range, making arithmetic on them consistent with plain integer arithmetic. New test expectation mismatches: - Pointer_array4: (int)pointer can overflow signed int when addresses are in the upper half of the 32-bit space. - Pointer_comparison3: p < p1+1 no longer implies same_object because the comparison is address-based. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Move the backward constraints (address in range => object identity) from the I2P encoding site to finish_eager_conversion, where all objects are known. Forward constraints (obj==i => base[i]+off==addr) are still added immediately during I2P encoding for currently known objects, ensuring the dereference MUX has correct offsets. In finish_eager_conversion: - Forward constraints are added for objects created AFTER the I2P - Backward constraints are added for ALL objects - Valid-object disjunction covers all objects Note: this does NOT fix issue diffblue#8103 because the dereference of I2P pointers goes through symex's flat memory model (byte_extract from __CPROVER_memory at pointer_offset), not through the pointer encoding's object-based dispatch. Fixing diffblue#8103 requires symex-level changes to recognize that I2P pointers may point to known objects. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Add object_base_address_exprt: a new expression that returns the flat base address of the object a pointer points to. This is the foundation for address-based dereference dispatch. Encoding: postponed to finish_eager_conversion like object_size. For each object i, if pointer_object(ptr)==i then result==base[i]. Infrastructure for passing wide_pointer_encoding flag to symex: - symex_configt::wide_pointer_encoding (from options) - value_set_dereferencet constructor accepts the flag The address-based dereference dispatch (Step 2 of the plan) is not yet implemented — the naive approach of iterating over the symbol table causes crashes because symex has already substituted symbols with their SSA-renamed values. The dispatch needs to work with the SSA-renamed symbols, which requires a different approach (e.g., generating the dispatch at the goto-program level before symex, or using the value set's object tracking). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Add pointer_width_multiplier to boolbv_widtht so that all width computations (arrays, structs, unions containing pointers) use the correct 3x pointer width. Previously, only the boolbv_width() virtual override in bv_pointerst returned 3x for pointer types, but the boolbv_widtht object (used by convert_array, endianness maps, and other compound type handlers) still used the original 64-bit pointer width. This caused width mismatches (expected 64, got 192) that crashed ~40 tests. The fix: - boolbv_widtht::set_pointer_width_multiplier(3) is called when wide encoding is enabled - The pointer width entry in boolbv_widtht's cache is multiplied - All compound type widths are automatically correct because they recursively use boolbv_widtht for element widths - The boolbv_width() override in bv_pointerst now simply delegates to bv_width() since the multiplier handles everything Performance impact: regression suite time dropped from 90+ minutes (timeout) to ~3 minutes. The consistent widths prevent the solver from creating oversized formulas due to width mismatches. Crashes reduced from ~40 to ~10 (remaining are trace-related). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Three changes to make byte operations on pointer-containing types use the 64-bit flat address instead of the 192-bit encoding: 1. Endianness map: for pointer types with wide encoding, rearrange the bit mapping so that the first 64 bits (bytes 0-7) come from the address component (bits 128-191). The object/offset bits (0-127) are placed after the address in the map, making them non-byte-visible for sequential byte access. 2. byte_extract with pointer result: extract only 64 bits (the address) from the byte array, then reconstruct the full 192-bit pointer encoding using fresh object/offset variables constrained by forward/backward constraints (same as I2P). 3. byte_update with pointer value: extract the 64-bit address from the pointer and write only those 64 bits to the byte array. For compound types containing pointers, lower to individual byte operations. This fixes havoc_slice/test_struct_raw_bytes (struct byte operations now correctly see 8-byte pointers) and is the foundation for fixing Pointer_array6 (pointer array byte round-trips). Pointer_array6 still fails because the byte_extract reconstruction creates fresh object/offset variables whose backward constraints are deferred. The dereference MUX uses pointer_object() which depends on these variables. The SAT solver should propagate the backward constraints through the MUX, but this needs investigation. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

When --model-stack-layout is passed (requires --wide-pointer-encoding), stack variables in the same function are placed adjacently with downward growth direction: earlier declarations get higher addresses. This enables: 1. Stack growth direction assertions: assert(&a > &b) for a declared before b 2. Adjacent placement assertions: assert(&a - &b == sizeof(b)) 3. Buffer overflow modeling: writing past a[sizeof(a)] lands on b Implementation: in finish_eager_conversion, identify function-scoped symbols, group by function, and add adjacency constraints: base[cur] = base[next] + sizeof(next) (downward growth: cur is above next in memory) Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Three changes enable detecting address reuse after free: 1. config.bv_encoding.malloc_may_alias flag (set when wide encoding is enabled) prevents the simplifier from concluding that different dynamic objects have different addresses. 2. simplify_inequality_address_of: don't simplify address_of(dyn1) == address_of(dyn2) to false when malloc_may_alias is set. 3. simplify_inequality_both_constant: don't simplify non-null pointer constant comparisons when malloc_may_alias is set. 4. goto_symex_can_forward_propagate: don't forward-propagate addresses of dynamic objects when malloc_may_alias is set. This prevents constant propagation from replacing pointer variables with their address_of values, which would then be simplified by the simplifier. 5. Chain constraint: allow base[y] == base[x] for consecutive dynamic objects of the same size. All changes are gated on config.bv_encoding.malloc_may_alias, which is only set when --wide-pointer-encoding is used. The standard encoding is completely unaffected. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

The backward constraints (address in range => object identity) are expensive for the SAT solver. For tests with many I2P casts (e.g., address_space_size_limit3 with 38 I2P entries), they add ~40 seconds. Two optimizations: 1. Only add backward constraints for I2P casts (not byte_extract or convert_index reconstructions). Byte_extract from pointer sources still gets backward constraints since they need them for correctness (Pointer_array6). 2. Limit backward constraints to 10 I2P entries. Beyond that, the performance cost outweighs the precision benefit. This is a heuristic — the forward constraints still link objects to addresses, and the backward constraints are a precision optimization, not a soundness requirement. address_space_size_limit3: >60s timeout → 18.5s. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

- doc/architectural/wide-pointer-encoding.md: comprehensive design document covering all 5 fixed issues, stack layout, technical details, and performance characteristics - --help text for --wide-pointer-encoding and --model-stack-layout Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Call check_SAT_backward_i2p() during the refinement loop's SAT check phase. This enables the backward constraint refinement to work with --refine-arrays. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Use is_symex_dynamic_object() helper that recognizes both ID_dynamic_object and symbol('symex_dynamic::dynamic_object') forms. The latter is what symex actually produces. Also restore forward propagation block for dynamic object addresses and fix simplify_expr_with_value_set to not conclude inequality between dynamic objects when malloc_may_alias is set. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Update comment explaining why types containing pointers are skipped in the address dispatch: the backward constraint refinement handles these correctly without the dispatch. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

This commit addresses multiple tightly coupled issues: 1. boolbv_width override for compound types: cached recursive computation for structs, unions, and nested arrays containing pointers. Needed for convert_member to compute correct offsets. 2. convert_pointer_type dispatch: use this->convert_index and this->convert_member instead of SUB:: to go through overrides. 3. Remove pointer array width reduction: convert_array, convert_index, convert_with now delegate to parent. Pointer arrays store full 192-bit encoding per element. 4. convert_byte_extract for pointer array sources: extract address components from each 192-bit element to handle byte-level round-trips (fixes Pointer_array6). 5. Refactor dec_solve: extract check_SAT_backward_i2p() as a standalone method for use by both dec_solve and bv_refinementt::check_SAT. These changes are interdependent: removing the width reduction (3) requires the byte_extract handler (4), and the boolbv_width override (1) is needed for the dispatch fix (2). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

The wide encoding produces the same 11 failures as the standard encoding. The descriptor was incorrectly expecting 9. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

These tests have different expected results under the wide encoding due to address reuse semantics (Malloc10, Malloc14) and simplification differences (null8). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Tests that have different semantics under the wide encoding (address-based comparison, address reuse, etc.) are tagged no-wide-pointer-encoding. Each has an alternative descriptor with the wide-pointer-encoding tag. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

MiniSat's SimpSolver eliminates variables during the first solve. The backward constraint refinement in dec_solve then creates bv_utils operations (rel, add, equal) that reference these eliminated variables via gate_and/gate_or clauses, triggering the isEliminated assertion in addClause_. Fix: freeze all variables created during finish_eager_conversion (deferred I2P forward constraints, non-overlapping chain constraints). This prevents the simplifier from eliminating them while still allowing simplification of all other variables. Untagged havoc_slice/test_struct_d — now passes with MiniSat. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

kroening · 2026-04-09T18:18:51Z

Can the offset not simply be derived from the address? Simply as in "drop the appropriate number of least significant bits"?

tautschnig · 2026-04-09T21:27:27Z

Can the offset not simply be derived from the address? Simply as in "drop the appropriate number of least significant bits"?

We could separately keep a mapping from object id to base address (and perhaps also vice versa), then we could drop one of integer address or offset. Else we would be baking in alignment assumptions, which would be particularly hard to come by when object sizes are symbolic.

tautschnig and others added 27 commits April 8, 2026 20:05

Add regression test for diffblue#8200: cast-to-pointer

387d3bd

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Add regression test for diffblue#8161: inconsistent pointer comparison

66ba67a

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Add regression test for diffblue#8103: I2P dereference

948f12a

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Add regression test for diffblue#2117: address reuse after free

156de78

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Add regression test for --model-stack-layout

e5169e5

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Fix Pointer_comparison3 wide descriptor: expect 13 failures, not 11

22688ed

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Integrate backward I2P refinement into bv_refinementt::check_SAT

a186af2

Call check_SAT_backward_i2p() during the refinement loop's SAT check phase. This enables the backward constraint refinement to work with --refine-arrays. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Fix havoc_slice/test_struct_raw_bytes descriptor: expect 11 failures

75bb4b5

The wide encoding produces the same 11 failures as the standard encoding. The descriptor was incorrectly expecting 9. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Add wide-pointer-encoding descriptors for Malloc10, Malloc14, null8

a1c6844

These tests have different expected results under the wide encoding due to address reuse semantics (Malloc10, Malloc14) and simplification differences (null8). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

tautschnig self-assigned this Apr 8, 2026

tautschnig mentioned this pull request Apr 8, 2026

Add --pointer-encoding-via-maps: solver-level array maps for pointer encoding #8953

Draft

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add --wide-pointer-encoding: 192-bit pointers with flat addresses#8954

Add --wide-pointer-encoding: 192-bit pointers with flat addresses#8954
tautschnig wants to merge 27 commits intodiffblue:developfrom
tautschnig:feature/wide-pointer-encoding

tautschnig commented Apr 8, 2026

Uh oh!

kroening commented Apr 9, 2026

Uh oh!

tautschnig commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tautschnig commented Apr 8, 2026

Motivation

Design

Additional feature: --model-stack-layout

Performance

Test results

General fixes included

Uh oh!

kroening commented Apr 9, 2026

Uh oh!

tautschnig commented Apr 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Additional feature: `--model-stack-layout`