Skip to content

Add --wide-pointer-encoding: 192-bit pointers with flat addresses#8954

Draft
tautschnig wants to merge 27 commits intodiffblue:developfrom
tautschnig:feature/wide-pointer-encoding
Draft

Add --wide-pointer-encoding: 192-bit pointers with flat addresses#8954
tautschnig wants to merge 27 commits intodiffblue:developfrom
tautschnig:feature/wide-pointer-encoding

Conversation

@tautschnig
Copy link
Copy Markdown
Collaborator

Add an alternative pointer encoding activated via --wide-pointer-encoding that represents each pointer as a 192-bit bitvector [object(64) | offset(64) | address(64)]. The address component is a flat integer maintained consistently through all pointer operations, enabling address-based pointer comparison, cross-object pointer arithmetic detection, and address reuse after free.

Fixes: #881
Fixes: #8200
Fixes: #8161
Fixes: #8103
Fixes: #2117

Motivation

The standard pointer encoding uses [object | offset] pairs where the object number is an abstract identifier with no numeric meaning. This makes it impossible to model scenarios where the concrete address matters: pointer-to-integer casts return meaningless values (#881, #8200), pointer comparisons across objects are inconsistent (#8161), integer-to-pointer casts cannot resolve back to the correct object (#8103), and address reuse after free is not modelled (#2117).

Design

Each pointer carries a symbolic flat address alongside the standard object/offset encoding. Non-overlapping base address constraints ensure distinct objects occupy disjoint address ranges. The encoding is purely propositional — no solver-level arrays are used.

Key design decisions:

  • P2I returns the flat address component directly.
  • I2P uses forward constraints (address = base[obj] + offset) plus lazy backward refinement in dec_solve (address in range ⇒ object identity).
  • Pointer equality/comparison uses the address component (<, >, ==, !=), matching C semantics for cross-object comparisons.
  • Address reuse is enabled via config.bv_encoding.malloc_may_alias: consecutive dynamic objects of the same size may share base addresses.
  • Non-overlapping constraints use an O(n) chain: objects are ordered by base address, and each object's end equals the next object's base.

Additional feature: --model-stack-layout

When combined with --wide-pointer-encoding, the --model-stack-layout option constrains stack variables to be placed at adjacent addresses in declaration order, with architecture-dependent stack growth direction (config.ansi_c.stack_grows_downward).

Performance

The wide encoding adds ~1.5–2× overhead on solver-heavy tests. On the full regression suite (dominated by solver-trivial tests), the overhead is ~1.1×.

Test results

  • Standard encoding: 0 failures
  • Wide encoding: 0 failures
  • 11 tests tagged no-wide-pointer-encoding (intentional semantic changes with alternative test-wide-pointer-encoding.desc descriptors)

General fixes included

This branch also includes two general SAT backend fixes (commits 1–2) discovered during development:

  • MiniSat: add set_limit_incremental_simplification to prevent variable elimination between incremental solves
  • CaDiCaL: fix l_get crash for var_no == 0 (1-based variable numbering)
  • Each commit message has a non-empty body, explaining why the change was made.
  • Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
  • The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
  • Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
  • My commit message includes data points confirming performance improvements (if claimed).
  • My PR is restricted to a single feature or bugfix.
  • White-space or formatting changes outside the feature-related changed lines are in commits of their own.

tautschnig and others added 27 commits April 8, 2026 20:05
Add an option to disable MiniSat's SimpSolver simplifier after
the first solve call. The simplifier can eliminate variables
between incremental calls, which may make later UNSAT proofs
extremely expensive for the CDCL search. This is useful for any
incremental solving scenario, not just the wide pointer encoding.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
CaDiCaL uses 1-based variable numbering. When l_get is called
with a literal whose var_no is 0, solver->val(0) triggers a
fatal error. Guard against this by returning TV_UNKNOWN for
var_no 0.

This fixes printf1/2/3 trace generation crashes with CaDiCaL
and --wide-pointer-encoding.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Extend the standard pointer encoding to include a flat integer
address component alongside the existing object/offset bits.
The pointer bitvector layout becomes:
  [object(64) | offset(64) | address(64)] = 192 bits

The address component is maintained consistently:
- encode(): sets address to base[object] (symbolic base address)
- offset_arithmetic(): updates address += byte_offset
- pointer-to-integer cast: returns the address component
- Non-overlapping base address constraints in finish_eager_conversion

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Pointer_difference2 produces fewer overflow failures with the wide
encoding because pointer subtraction overflow is checked on the
56-bit offset component rather than the full packed representation.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…iffblue#8161)

Override convert_equality for pointer types: compare by flat address
instead of the full bitvector encoding. Two pointers are equal iff
their addresses are equal, regardless of object encoding.

Override relational comparisons (<, >, <=, >=) for pointer types:
compare by flat address instead of offset-within-same-object. This
makes cross-object pointer comparisons well-defined.

Relax the overflow guard on base addresses: only require that
base + size does not wrap around (unsigned). Remove the previous
restriction that limited addresses to a small range. This allows
the solver to place objects anywhere in the address space.

Fixes:
- diffblue#8200: (char*)0x55a8a2e6b007 can now equal a string literal's
  address, so assert(ptr != x) correctly reports FAILURE.
- diffblue#8161: pointer-derived integers now have meaningful values in
  the full address range, making arithmetic on them consistent
  with plain integer arithmetic.

New test expectation mismatches:
- Pointer_array4: (int)pointer can overflow signed int when
  addresses are in the upper half of the 32-bit space.
- Pointer_comparison3: p < p1+1 no longer implies same_object
  because the comparison is address-based.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Move the backward constraints (address in range => object identity)
from the I2P encoding site to finish_eager_conversion, where all
objects are known. Forward constraints (obj==i => base[i]+off==addr)
are still added immediately during I2P encoding for currently known
objects, ensuring the dereference MUX has correct offsets.

In finish_eager_conversion:
- Forward constraints are added for objects created AFTER the I2P
- Backward constraints are added for ALL objects
- Valid-object disjunction covers all objects

Note: this does NOT fix issue diffblue#8103 because the dereference of
I2P pointers goes through symex's flat memory model
(byte_extract from __CPROVER_memory at pointer_offset), not
through the pointer encoding's object-based dispatch. Fixing
diffblue#8103 requires symex-level changes to recognize that I2P pointers
may point to known objects.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Add object_base_address_exprt: a new expression that returns the
flat base address of the object a pointer points to. This is the
foundation for address-based dereference dispatch.

Encoding: postponed to finish_eager_conversion like object_size.
For each object i, if pointer_object(ptr)==i then result==base[i].

Infrastructure for passing wide_pointer_encoding flag to symex:
- symex_configt::wide_pointer_encoding (from options)
- value_set_dereferencet constructor accepts the flag

The address-based dereference dispatch (Step 2 of the plan) is
not yet implemented — the naive approach of iterating over the
symbol table causes crashes because symex has already substituted
symbols with their SSA-renamed values. The dispatch needs to work
with the SSA-renamed symbols, which requires a different approach
(e.g., generating the dispatch at the goto-program level before
symex, or using the value set's object tracking).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Add pointer_width_multiplier to boolbv_widtht so that all width
computations (arrays, structs, unions containing pointers) use the
correct 3x pointer width. Previously, only the boolbv_width()
virtual override in bv_pointerst returned 3x for pointer types,
but the boolbv_widtht object (used by convert_array, endianness
maps, and other compound type handlers) still used the original
64-bit pointer width. This caused width mismatches (expected 64,
got 192) that crashed ~40 tests.

The fix:
- boolbv_widtht::set_pointer_width_multiplier(3) is called when
  wide encoding is enabled
- The pointer width entry in boolbv_widtht's cache is multiplied
- All compound type widths are automatically correct because they
  recursively use boolbv_widtht for element widths
- The boolbv_width() override in bv_pointerst now simply delegates
  to bv_width() since the multiplier handles everything

Performance impact: regression suite time dropped from 90+ minutes
(timeout) to ~3 minutes. The consistent widths prevent the solver
from creating oversized formulas due to width mismatches.

Crashes reduced from ~40 to ~10 (remaining are trace-related).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Three changes to make byte operations on pointer-containing types
use the 64-bit flat address instead of the 192-bit encoding:

1. Endianness map: for pointer types with wide encoding, rearrange
   the bit mapping so that the first 64 bits (bytes 0-7) come from
   the address component (bits 128-191). The object/offset bits
   (0-127) are placed after the address in the map, making them
   non-byte-visible for sequential byte access.

2. byte_extract with pointer result: extract only 64 bits (the
   address) from the byte array, then reconstruct the full 192-bit
   pointer encoding using fresh object/offset variables constrained
   by forward/backward constraints (same as I2P).

3. byte_update with pointer value: extract the 64-bit address from
   the pointer and write only those 64 bits to the byte array.
   For compound types containing pointers, lower to individual
   byte operations.

This fixes havoc_slice/test_struct_raw_bytes (struct byte operations
now correctly see 8-byte pointers) and is the foundation for fixing
Pointer_array6 (pointer array byte round-trips).

Pointer_array6 still fails because the byte_extract reconstruction
creates fresh object/offset variables whose backward constraints
are deferred. The dereference MUX uses pointer_object() which
depends on these variables. The SAT solver should propagate the
backward constraints through the MUX, but this needs investigation.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
When --model-stack-layout is passed (requires --wide-pointer-encoding),
stack variables in the same function are placed adjacently with
downward growth direction: earlier declarations get higher addresses.

This enables:
1. Stack growth direction assertions:
   assert(&a > &b) for a declared before b
2. Adjacent placement assertions:
   assert(&a - &b == sizeof(b))
3. Buffer overflow modeling:
   writing past a[sizeof(a)] lands on b

Implementation: in finish_eager_conversion, identify function-scoped
symbols, group by function, and add adjacency constraints:
  base[cur] = base[next] + sizeof(next)
(downward growth: cur is above next in memory)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Three changes enable detecting address reuse after free:

1. config.bv_encoding.malloc_may_alias flag (set when wide encoding
   is enabled) prevents the simplifier from concluding that different
   dynamic objects have different addresses.

2. simplify_inequality_address_of: don't simplify
   address_of(dyn1) == address_of(dyn2) to false when
   malloc_may_alias is set.

3. simplify_inequality_both_constant: don't simplify non-null
   pointer constant comparisons when malloc_may_alias is set.

4. goto_symex_can_forward_propagate: don't forward-propagate
   addresses of dynamic objects when malloc_may_alias is set.
   This prevents constant propagation from replacing pointer
   variables with their address_of values, which would then be
   simplified by the simplifier.

5. Chain constraint: allow base[y] == base[x] for consecutive
   dynamic objects of the same size.

All changes are gated on config.bv_encoding.malloc_may_alias,
which is only set when --wide-pointer-encoding is used. The
standard encoding is completely unaffected.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The backward constraints (address in range => object identity) are
expensive for the SAT solver. For tests with many I2P casts (e.g.,
address_space_size_limit3 with 38 I2P entries), they add ~40 seconds.

Two optimizations:
1. Only add backward constraints for I2P casts (not byte_extract
   or convert_index reconstructions). Byte_extract from pointer
   sources still gets backward constraints since they need them
   for correctness (Pointer_array6).
2. Limit backward constraints to 10 I2P entries. Beyond that,
   the performance cost outweighs the precision benefit. This is
   a heuristic — the forward constraints still link objects to
   addresses, and the backward constraints are a precision
   optimization, not a soundness requirement.

address_space_size_limit3: >60s timeout → 18.5s.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
- doc/architectural/wide-pointer-encoding.md: comprehensive design
  document covering all 5 fixed issues, stack layout, technical
  details, and performance characteristics
- --help text for --wide-pointer-encoding and --model-stack-layout

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Call check_SAT_backward_i2p() during the refinement loop's SAT
check phase. This enables the backward constraint refinement to
work with --refine-arrays.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Use is_symex_dynamic_object() helper that recognizes both
ID_dynamic_object and symbol('symex_dynamic::dynamic_object')
forms. The latter is what symex actually produces.

Also restore forward propagation block for dynamic object
addresses and fix simplify_expr_with_value_set to not conclude
inequality between dynamic objects when malloc_may_alias is set.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Update comment explaining why types containing pointers are
skipped in the address dispatch: the backward constraint
refinement handles these correctly without the dispatch.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
This commit addresses multiple tightly coupled issues:

1. boolbv_width override for compound types: cached recursive
   computation for structs, unions, and nested arrays containing
   pointers. Needed for convert_member to compute correct offsets.

2. convert_pointer_type dispatch: use this->convert_index and
   this->convert_member instead of SUB:: to go through overrides.

3. Remove pointer array width reduction: convert_array,
   convert_index, convert_with now delegate to parent. Pointer
   arrays store full 192-bit encoding per element.

4. convert_byte_extract for pointer array sources: extract
   address components from each 192-bit element to handle
   byte-level round-trips (fixes Pointer_array6).

5. Refactor dec_solve: extract check_SAT_backward_i2p() as a
   standalone method for use by both dec_solve and
   bv_refinementt::check_SAT.

These changes are interdependent: removing the width reduction (3)
requires the byte_extract handler (4), and the boolbv_width
override (1) is needed for the dispatch fix (2).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The wide encoding produces the same 11 failures as the standard
encoding. The descriptor was incorrectly expecting 9.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
These tests have different expected results under the wide
encoding due to address reuse semantics (Malloc10, Malloc14)
and simplification differences (null8).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Tests that have different semantics under the wide encoding
(address-based comparison, address reuse, etc.) are tagged
no-wide-pointer-encoding. Each has an alternative descriptor
with the wide-pointer-encoding tag.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
MiniSat's SimpSolver eliminates variables during the first solve.
The backward constraint refinement in dec_solve then creates
bv_utils operations (rel, add, equal) that reference these
eliminated variables via gate_and/gate_or clauses, triggering
the isEliminated assertion in addClause_.

Fix: freeze all variables created during finish_eager_conversion
(deferred I2P forward constraints, non-overlapping chain
constraints). This prevents the simplifier from eliminating them
while still allowing simplification of all other variables.

Untagged havoc_slice/test_struct_d — now passes with MiniSat.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
@kroening
Copy link
Copy Markdown
Collaborator

kroening commented Apr 9, 2026

Can the offset not simply be derived from the address? Simply as in "drop the appropriate number of least significant bits"?

@tautschnig
Copy link
Copy Markdown
Collaborator Author

Can the offset not simply be derived from the address? Simply as in "drop the appropriate number of least significant bits"?

We could separately keep a mapping from object id to base address (and perhaps also vice versa), then we could drop one of integer address or offset. Else we would be baking in alignment assumptions, which would be particularly hard to come by when object sizes are symbolic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants