Skip to content

Add --pointer-encoding-via-maps: solver-level array maps for pointer encoding#8953

Draft
tautschnig wants to merge 30 commits intodiffblue:developfrom
tautschnig:feature/pointer-encoding-via-maps
Draft

Add --pointer-encoding-via-maps: solver-level array maps for pointer encoding#8953
tautschnig wants to merge 30 commits intodiffblue:developfrom
tautschnig:feature/pointer-encoding-via-maps

Conversation

@tautschnig
Copy link
Copy Markdown
Collaborator

@tautschnig tautschnig commented Apr 8, 2026

Add an alternative pointer encoding activated via --pointer-encoding-via-maps that represents each pointer as a 64-bit index into two global solver-level arrays: object_map[index] → object number and offset_map[index] → byte offset. A third array base_address_map[object] → base address supports pointer-to-integer casts.

This is an experimental encoding that explores using the solver's array theory for pointer representation instead of bit-packing. It serves as a complement to the --wide-pointer-encoding approach (PR #8954).

Design

  • Each pointer-typed expression gets a unique 64-bit index.
  • object_map and offset_map are constrained via solver-level array read/write operations, leveraging the array theory's axiom instantiation.
  • Pointer equality compares indices after normalising through the maps.
  • P2I computes base_address_map[object] + offset.
  • I2P uses forward constraints plus backward refinement (address in range ⇒ object identity).
  • bv_refinementt is templatized to work with both bv_pointerst and bv_pointers_widet.

Current status

  • Standard encoding: 0 failures
  • Maps encoding: 0 failures (6 KNOWNBUG, 8 tests tagged no-pointer-encoding-via-maps)
  • Performance: ~3.4× overhead on the full regression suite (the array theory axiom instantiation is the bottleneck)

Known encoding gaps (KNOWNBUG)

Test Issue
Pointer_array6 Byte-level pointer array round-trip
argc-and-argv/argv1 argv[argc]==NULL not propagated through array theory
address_space_size_limit1 Object-bits limit not enforced
String_Abstraction7 String abstraction incompatibility
memory_allocation1 __CPROVER_allocated_memory handling
mm_io1 Memory-mapped I/O dispatch

Foundation: array theory improvements

This branch depends on #8905, which contains array theory improvements that benefit both this encoding and the standard encoding:

  • Skip Ackermann constraints for derived arrays (weak equivalence)
  • Fix --arrays-uf-always soundness for member-of-index expressions
  • Handle array typecasts with different element sizes
  • Use array comprehension for large byte_extract lowering
  • Make with/if array constraints lazy for --refine-arrays
  • Fix array theory with-constraints for SSA-renamed indices
  • Templatize bv_refinementt on pointer encoding base class
  • Each commit message has a non-empty body, explaining why the change was made.
  • Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
  • The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
  • Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
  • My commit message includes data points confirming performance improvements (if claimed).
  • My PR is restricted to a single feature or bugfix.
  • White-space or formatting changes outside the feature-related changed lines are in commits of their own.

tautschnig and others added 30 commits April 8, 2026 07:49
Skip generating Ackermann constraints for arrays that are derived from
other arrays via with (store), if, array_of, array constants,
array_comprehension, typecast, or let expressions.

For a derived array such as x = y with [k := v], the Ackermann
constraint i1 = i2 => x[i1] = x[i2] is already implied by:
  (1) the with constraint: k != j => x[j] = y[j], and
  (2) the Ackermann constraint on the base array y.
This is the read-over-weakeq optimisation from the theory of weakly
equivalent arrays (Christ & Hoenicke, 2014).

The same reasoning applies to if, array_of, and other derived array
expressions, all of which already have constraints connecting them
element-wise to their constituent arrays.

With 5 stores to the same unbounded array the Ackermann constraint
count drops from 110 to 60; with 40 stores it drops from 63180 to
31980 (approximately 50% reduction in all cases).

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
collect_arrays() and add_array_constraints() restricted member
expressions to only those whose struct operand was a symbol or
nondet_symbol. With --arrays-uf-always, member expressions can have
more complex struct operands (index, nested member, struct literal)
when arrays are embedded in structs. These are base array expressions
that need no special constraint generation, just like symbols.

The overly strict invariant caused crashes on tests like
Array_operations2, array-bug-6230, byte_update18, and bounds_check1
when run with --arrays-uf-always.

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
When lower_byte_operators() is applied to array equalities and index
expressions, the result can contain member-of-struct-literal
expressions (e.g., member(struct{...}, A)) that the array theory
treats as opaque base arrays. These expressions are trivially
simplifiable to the actual array value by simplify_expr.

Without simplification, the array theory registers these as
unconstrained arrays, leading to spurious counterexamples. With
simplification, member(struct_literal, field) is reduced to the
field value, allowing the array theory to properly constrain it.

Also handle the case where simplification fully resolves an index
expression (e.g., index into a constant array with constant index),
so the result is no longer an index_exprt.

This fixes byte_update18 and array-bug-6230 when run with
--arrays-uf-always.

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
When arrays are cast between types with different element sizes (e.g.,
SIMD reinterpretation int32[4] <-> int64[2]), the array theory's
element-wise constraint a[i]=b[i] is incorrect because elements don't
correspond one-to-one. The bitvector-level conversion handles these
casts correctly as bitwise copies.

Three changes:
1. collect_arrays: don't unify typecast arrays when element types
   differ, since they have different index domains.
2. add_array_constraints: skip generating typecast constraints when
   element types differ.
3. convert_index: when indexing into a typecast with different element
   types, lower to byte_extract on the source array instead of using
   the array decision procedure.

This fixes SIMD1 when run with --arrays-uf-always.

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
…-always

Add a KNOWNBUG test descriptor that runs Array_operations2 with
--arrays-uf-always, documenting the root cause: the array theory's
union-find transitively unifies structurally different arrays that
share the same initial value literal.

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
After lower_byte_operators and simplify_expr, an array equality may
simplify to a non-equal expression (e.g., true/false). Check the
result before calling to_equal_expr to avoid an invariant violation.

This fixes crashes in address_space_size_limit3, byte_update14, and
array-cell-sensitivity7 when run with --arrays-uf-always.

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
Add a cbmc-arrays-uf-always CMake test profile that runs all CORE
regression tests with --arrays-uf-always, following the same pattern
as the existing cbmc-cprover-smt2 and cbmc-new-smt-backend profiles.

Introduce two new test tags:
- broken-arrays-uf-always: tests with known incorrect results
  (Array_operations2 union-find conflation, trace-values format)
- thorough-arrays-uf-always: tests too slow for CI
  (bounds_check1, union/union_large_array)

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
Change the with-expression and if-expression array constraints from
eager to lazy (deferred to refinement). Previously only Ackermann
constraints were lazy, making --refine-arrays largely ineffective
since the bulk of array constraints were still added eagerly.

With this change, --refine-arrays defers with, if, and Ackermann
constraints, adding them only when the refinement loop detects they
are violated by the current model. This can dramatically reduce the
number of constraints passed to the SAT solver.

Benchmark on bounds_check1 with --arrays-uf-always:
  Without --refine-arrays: 83s, 287MB
  With --refine-arrays:     2s, 233MB

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
Add a test descriptor that runs bounds_check1 with both flags,
documenting the performance improvement from lazy with/if/Ackermann
constraints (~2s vs ~85s without --refine-arrays).

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
Add a cbmc-arrays-uf-always-refine-arrays CMake test profile that
runs all CORE tests with both flags. This documents the performance
improvement from lazy with/if/Ackermann constraints.

Introduce broken-refine-arrays tag for tests with pre-existing
--refine-arrays issues (MiniSat eliminated-variable crashes):
  Multi_Dimensional_Array6, address_space_size_limit3,
  array_of_bool_as_bitvec (SMT output), Array_UF23 (constraint counts)

Total test execution time comparison:
  --arrays-uf-always:                  84s (1093 tests, 71 skipped)
  --arrays-uf-always --refine-arrays:  75s (1090 tests, 75 skipped)

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
MiniSat's SimpSolver eliminates propositional variables during
preprocessing. When --refine-arrays adds clauses incrementally in
the refinement loop, it can reference variables that were eliminated,
causing an assertion failure in addClause_.

This is the same issue that --refine-strings already works around.
Disable the simplifier when --refine-arrays is active, matching the
existing pattern for --refine-strings.

Fixes the crash in Multi_Dimensional_Array6 and address_space_size_limit3
when run with --arrays-uf-always --refine-arrays. Remove the
broken-refine-arrays tag from address_space_size_limit3 since it now
passes.

The remaining broken-refine-arrays tests (Multi_Dimensional_Array6,
Array_UF23, array_of_bool_as_bitvec) have pre-existing --refine-arrays
issues unrelated to the simplifier:
- Multi_Dimensional_Array6/Array_UF23: refinement loop does not converge
  (insufficient constraint activation in arrays_overapproximated)
- array_of_bool_as_bitvec: --refine-arrays overrides --smt2 --outfile

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
Making with/if constraints lazy for --refine-arrays caused the
refinement loop to not converge: the arrays_overapproximated() check
evaluates each lazy constraint against the current SAT model, but
with/if constraints can appear satisfied by coincidence when the
solver freely assigns array element values. This led to spurious
counterexamples (Multi_Dimensional_Array6) and infinite loops
(Array_UF23).

Only Ackermann constraints are suitable for lazy refinement because
their activation depends on index equality, which is correctly
determined from the bitvector encoding in the SAT model.

The main performance benefit of --refine-arrays comes from lazy
Ackermann constraints (e.g., bounds_check1: 83s -> 2s), so keeping
with/if eager does not significantly impact the speedup.

Remove broken-refine-arrays tag from Multi_Dimensional_Array6 since
it now passes correctly.

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
lower_byte_extract_array_vector expands byte_extract of an array
type into an element-by-element array_exprt. For large arrays
(e.g., char[100000] in a union), this creates N sub-expressions
that are each recursively lowered and simplified, resulting in
O(N^2) behavior.

When the array size exceeds MAX_FLATTENED_ARRAY_SIZE (1000), use
array_comprehension_exprt instead. This path already exists for
arrays with non-constant size; now it is also used for large
constant-size arrays. This reduces the lowering from O(N)
expressions to O(1).

Performance on union_large_array (char[100000] in a union):
  Before: >120s with --arrays-uf-always
  After:  2.3s with --arrays-uf-always

Remove thorough-arrays-uf-always tag from union_large_array.desc
since the test now completes in 2 seconds.

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
The invariant violation for member expressions with index struct operands
was already fixed in commit db83e74d86 ("Accept member expressions with
arbitrary struct operands in array solver"). Promote the test to CORE
and update the description to reflect that this is a passing test.

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
When indexing an array of structs containing array members through a
nondeterministic index (e.g., a[i].d[0]), the expression
member(index(outer_array, i), field) is an array that the array theory
treats as an opaque base array, generating no constraints for it. This
causes the bitvector solver to leave the element values unconstrained,
producing spurious counterexamples.

Fix: in convert_index, when the array operand is a member expression
whose struct operand is not a symbol/nondet_symbol and the array type
has a known finite size, bypass the array decision procedure and fall
through to the bounded-array encoding. This lets the bitvector solver
directly extract the member field bits from the struct bitvector
(which is properly constrained by the outer array's theory), then
select the element based on the index.

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
… arrays

Add a KNOWNBUG test for a spurious counterexample when --arrays-uf-always
is used with an array of structs containing array members with 65+
elements, accessed through a nondeterministic index. Structs with array
members of 64 or fewer elements work correctly (covered by the
arrays-uf-always-member-soundness test).

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
When --arrays-uf-always is used, symbol arrays have two disconnected
representations in the bitvector solver:
1. Map literals (from convert_symbol) - used when the full bitvector
   is needed, e.g., in struct literals
2. Element-wise free variables (from convert_index via array theory) -
   used for individual element access

The array theory constrains the element-wise free variables (via with,
if, Ackermann constraints) but not the map literals. When a symbol
array appears inside a struct that is an element of another array, the
struct's bitvector uses the unconstrained map literals, causing
spurious counterexamples.

Fix: in boolbv_set_equality_to_true, when processing an equality for
a symbol of unbounded array type with known finite size (up to
MAX_FLATTENED_ARRAY_SIZE elements), connect the symbol's map literals
to the element-wise bitvectors by calling convert_bv on index
expressions for each element and using map.set_literals to equate
them. This ensures that the array theory's element-wise constraints
propagate to the map literals used in struct contexts.

The equality is still passed to the array theory for element-wise
constraint generation.

Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>
Root cause: when a with-expression stores a value at index I, and
a read uses index J where J is a different SSA symbol connected to
I through an equality constraint (e.g., main_argc = argc), the
array theory's read-over-write axiom only generates the constraint
for index I, not for index J. The solver needs to propagate through
the SSA equality to determine I == J, but the typecast wrapping
(cast(argc, signedbv[64]) vs cast(main_argc, signedbv[64])) creates
different bitvectors that aren't directly connected.

Fix: in add_array_constraints_with, for each index in the index set
that may equal the write index (indices_equal is not constant false),
add an explicit implication: idx_eq => with_expr[other_idx] = value.
This gives the solver the direct connection it needs without
requiring propagation through SSA equalities and typecasts.

This is a general improvement to the array theory that benefits any
use case where array indices are SSA-renamed versions of the same
variable. It specifically fixes the argv[argc]==NULL pattern that
the map-based pointer encoding relies on.

Reduces map encoding regression failures from 11 to 10.
Standard encoding is unaffected (all tests pass).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The with-constraints fix for SSA-renamed indices causes the
array refinement loop to not converge when --show-array-constraints
is combined with solving. The constraint count check (60 vs 110)
is correct, but the subsequent solve hangs.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Introduce bv_pointers_widet, a new pointer encoding class that uses
solver-level arrays (maps) instead of bit-packing object-ids and offsets
within the pointer bit-width. Each pointer-typed expression maps to an
integer index, with two global array symbols (object_map and offset_map)
providing the object number and byte offset for that index.

This removes the limitation where object_bits reduce the available offset
range and provides a foundation for more sound handling of address-of
expressions and pointer-to-integer casts.

The new encoding is activated with the --pointer-encoding-via-maps flag.
Default behavior (bv_pointerst) is unchanged.

Co-authored-by: Michael Tautschnig <mt@debian.org>
Three fundamental fixes to the map-based pointer encoding:

1. Pointer equality now compares (object, offset) pairs instead of
   raw bitvector indices. Two pointers with different indices but
   the same object and offset are now correctly considered equal.
   This fixes spurious failures in pointer comparison tests.

2. Pointer-to-integer casts now return base_address[object] + offset
   instead of just the offset. Each object gets a symbolic base
   address that is constrained to be non-overlapping with all other
   objects' address ranges. This fixes the unsoundness where
   (uintptr_t)&x == (uintptr_t)&y was satisfiable for distinct
   objects x and y.

3. Integer-to-pointer casts create a fresh pointer whose object and
   offset are constrained so that base_address[object] + offset
   equals the integer value. This correctly models the relationship
   between integer addresses and pointer semantics.

The non-overlapping constraints use both range-based separation
(base[i] + size[i] <= base[j] OR base[j] + size[j] <= base[i])
and direct inequality (base[i] != base[j]) as a redundant helper
for the SAT solver.

Reduces regression test failures from 33 to 21 (with
--pointer-encoding-via-maps). Standard encoding is unaffected.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Lower byte_extract and byte_update operations when pointer types
are involved. The abstract pointer index is not meaningful as bytes,
so these operations must be decomposed into individual field accesses
before reaching the solver. This fixes havoc_slice and similar tests
that operate on structs containing pointer fields at the byte level.

Cache encode() results per object number so that the same object
always gets the same pointer index. This is critical for NULL:
without caching, each NULL pointer gets a different index, causing
array store/load to fail (the stored NULL index differs from the
compared NULL index).

Add propagation hints to pointer equality: when indices are equal,
force the semantic (object, offset) comparison to also be true.
This helps the SAT solver propagate through the array theory.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…alities

Root cause analysis: the SAT-based array theory cannot derive that
base[i] + k != base[j] from the range constraint
base[i] + size[i] <= base[j]. This is because the range constraint
involves 64-bit unsigned addition and comparison, which creates a
complex chain of implications that the SAT solver cannot propagate.

Fix 1: Direct lookup for constant pointer indices.
When read_object() or read_offset() is called with a constant
pointer index, look up the (object, offset) bitvectors directly
from the index_to_bv_object_offset map instead of creating array
reads. This eliminates the array theory indirection for the common
case where pointer indices are known at encoding time.

Fix 2: Explicit pairwise offset inequalities.
For each pair of objects with base addresses, add explicit
constraints base[i] + k != base[j] for all offsets k within the
larger object. The SAT solver can handle these direct inequality
constraints but cannot derive them from the range constraint.

Also populate index_to_bv_object_offset in encode() (not just
encode_fresh) so that the P2I eager path can find all constant
indices.

Fixes Pointer_comparison5 and the minimal ptr-to-int distinctness
test. Reduces the 'fundamental SAT limitation' category from 3
failures to 2 (Pointer_array4 and struct7 remain due to overflow
checks on pointer-to-integer arithmetic).

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Pointer subtraction overflow: override the overflow check for
pointer-typed operands to use offsets from the maps instead of
raw bitvector indices. For same-object pointers, check if the
offset difference overflows. For cross-object pointers, compute
flat addresses (base + offset) and check if their difference
overflows.

Address range: constrain base addresses to fit in the positive
range of a 32-bit signed integer (base + size <= 2^31 - 1).
This ensures that pointer-to-integer casts to 32-bit int don't
overflow, matching real hardware where user-space addresses are
in the lower portion of the address space.

Fixes Pointer_array4 (overflow on (int)ptr2 - (int)ptr1) and
struct7 (overflow on pointer subtraction p - &s.array[0]).
Reduces regression failures from 13 to 12.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Pointer equality: use a hybrid approach that avoids array reads
when possible. For constant indices that are different, use direct
lookup from index_to_bv_object_offset to compare (object, offset)
pairs without creating array reads. For symbolic indices, fall back
to array reads with propagation hints.

Pointer subtraction: for cross-object subtraction, compute the
difference using flat addresses (base + offset) via a MUX chain
over known objects. This correctly handles integer-address objects
like (char*)20 - (char*)10 = 10.

Key finding: the map encoding works correctly with --refine-arrays
(refinement-based array solver). The remaining failures with the
default Ackermann array solver are caused by its inability to
propagate through nested array reads (e.g., argv array + object_map
array simultaneously). A future improvement would be to make
bv_refinementt work with bv_pointers_widet as its base class.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Make bv_refinementt a class template parameterized on the pointer
encoding base class (default: bv_pointerst). This allows the
refinement solver to work with bv_pointers_widet for the map-based
pointer encoding.

Changes:
- bv_refinement.h: templatize class, extract config/info structs
- bv_refinement_loop.cpp: templatize all methods, add explicit
  instantiations for bv_pointerst and bv_pointers_widet
- refine_arrays.cpp: templatize, fix dependent type names
- refine_arithmetic.cpp: templatize, add this-> for dependent names
- string_refinement.h: use bv_refinementt<> (default template arg)
- solver_factory.cpp: use bv_refinementt<bv_pointers_widet> when
  --pointer-encoding-via-maps is set with --refine-arrays

Note: the map encoding's array theory issues (nested reads on
object_map/offset_map interfering with other arrays) are NOT
resolved by the refinement solver alone, because the refinement
only applies to user-level arrays, not the internal pointer maps.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Add test-pointer-encoding-via-maps.desc for tests where the map
encoding correctly produces different results than the standard
encoding:

- Pointer_difference2: 5 failures instead of 7-9 (no overflow
  artifacts from small base addresses)
- r_w_ok10: 8 failures instead of 9 (invalid pointer constant
  is a valid integer-address object in the map encoding)

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
For symbolic integer-to-pointer casts, constrain the nondeterministic
object number to be one of the known objects (null, invalid, or
program objects). Without this, the solver can choose an arbitrary
object number not in the pointer maps, making pointer safety checks
(null, invalid, bounds) unprovable.

The constraint is a disjunction: obj == 0 OR obj == 1 OR ... OR
obj == N, where N is the number of known objects. This is only
applied to symbolic I2P casts (not constant ones, which already
have dedicated integer-address objects).

Fixes havoc_slice/test_struct_raw_bytes: the partial havoc of a
struct with a pointer field now correctly preserves the unchanged
bytes because the I2P result is constrained to a known object.

Reduces regression failures from 9 to 8.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
For symbolic integer-to-pointer casts, add a backward constraint:
if the integer value falls within object i's address range
[base[i], base[i] + size[i]), then the I2P result's object must
be i. This is the converse of the existing forward constraint
(obj == i => base[i] + offset == int_value).

The backward constraint allows the solver to determine which object
an I2P result points to, based on the integer value's relationship
to the known objects' base addresses. Without this, the solver
can't prove pointer safety checks (null, invalid, bounds) because
the I2P object is nondeterministic.

Fixes Pointer2 (pointer round-trip through bit manipulation),
memory_allocation1, mm_io1, r_w_ok10, and the simple I2P
round-trip test (int*)(uintptr_t)&x == &x.

Reduces regression failures from 8 to 4.

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Tag tests that have different expectations under the maps encoding
with no-pointer-encoding-via-maps. Add KNOWNBUG descriptors with
pointer-encoding-via-maps tag for known encoding gaps:
- Pointer_array6: byte-level pointer array round-trip
- address_space_size_limit1: object-bits limit not enforced
- String_Abstraction7: string abstraction incompatibility
- memory_allocation1: __CPROVER_allocated_memory handling
- mm_io1: memory-mapped I/O dispatch

Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
@tautschnig tautschnig self-assigned this Apr 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant