Add --pointer-encoding-via-maps: solver-level array maps for pointer encoding by tautschnig · Pull Request #8953 · diffblue/cbmc

tautschnig · 2026-04-08T20:46:30Z

Add an alternative pointer encoding activated via --pointer-encoding-via-maps that represents each pointer as a 64-bit index into two global solver-level arrays: object_map[index] → object number and offset_map[index] → byte offset. A third array base_address_map[object] → base address supports pointer-to-integer casts.

This is an experimental encoding that explores using the solver's array theory for pointer representation instead of bit-packing. It serves as a complement to the --wide-pointer-encoding approach (PR #8954).

Design

Each pointer-typed expression gets a unique 64-bit index.
object_map and offset_map are constrained via solver-level array read/write operations, leveraging the array theory's axiom instantiation.
Pointer equality compares indices after normalising through the maps.
P2I computes base_address_map[object] + offset.
I2P uses forward constraints plus backward refinement (address in range ⇒ object identity).
bv_refinementt is templatized to work with both bv_pointerst and bv_pointers_widet.

Current status

Standard encoding: 0 failures
Maps encoding: 0 failures (6 KNOWNBUG, 8 tests tagged no-pointer-encoding-via-maps)
Performance: ~3.4× overhead on the full regression suite (the array theory axiom instantiation is the bottleneck)

Known encoding gaps (KNOWNBUG)

Test	Issue
Pointer_array6	Byte-level pointer array round-trip
argc-and-argv/argv1	argv[argc]==NULL not propagated through array theory
address_space_size_limit1	Object-bits limit not enforced
String_Abstraction7	String abstraction incompatibility
memory_allocation1	`__CPROVER_allocated_memory` handling
mm_io1	Memory-mapped I/O dispatch

Foundation: array theory improvements

This branch depends on #8905, which contains array theory improvements that benefit both this encoding and the standard encoding:

Skip Ackermann constraints for derived arrays (weak equivalence)
Fix --arrays-uf-always soundness for member-of-index expressions
Handle array typecasts with different element sizes
Use array comprehension for large byte_extract lowering
Make with/if array constraints lazy for --refine-arrays
Fix array theory with-constraints for SSA-renamed indices
Templatize bv_refinementt on pointer encoding base class

Each commit message has a non-empty body, explaining why the change was made.
Methods or procedures I have added are documented, following the guidelines provided in CODING_STANDARD.md.
The feature or user visible behaviour I have added or modified has been documented in the User Guide in doc/cprover-manual/
Regression or unit tests are included, or existing tests cover the modified code (in this case I have detailed which ones those are in the commit message).
My commit message includes data points confirming performance improvements (if claimed).
My PR is restricted to a single feature or bugfix.
White-space or formatting changes outside the feature-related changed lines are in commits of their own.

Skip generating Ackermann constraints for arrays that are derived from other arrays via with (store), if, array_of, array constants, array_comprehension, typecast, or let expressions. For a derived array such as x = y with [k := v], the Ackermann constraint i1 = i2 => x[i1] = x[i2] is already implied by: (1) the with constraint: k != j => x[j] = y[j], and (2) the Ackermann constraint on the base array y. This is the read-over-weakeq optimisation from the theory of weakly equivalent arrays (Christ & Hoenicke, 2014). The same reasoning applies to if, array_of, and other derived array expressions, all of which already have constraints connecting them element-wise to their constituent arrays. With 5 stores to the same unbounded array the Ackermann constraint count drops from 110 to 60; with 40 stores it drops from 63180 to 31980 (approximately 50% reduction in all cases). Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

collect_arrays() and add_array_constraints() restricted member expressions to only those whose struct operand was a symbol or nondet_symbol. With --arrays-uf-always, member expressions can have more complex struct operands (index, nested member, struct literal) when arrays are embedded in structs. These are base array expressions that need no special constraint generation, just like symbols. The overly strict invariant caused crashes on tests like Array_operations2, array-bug-6230, byte_update18, and bounds_check1 when run with --arrays-uf-always. Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

When lower_byte_operators() is applied to array equalities and index expressions, the result can contain member-of-struct-literal expressions (e.g., member(struct{...}, A)) that the array theory treats as opaque base arrays. These expressions are trivially simplifiable to the actual array value by simplify_expr. Without simplification, the array theory registers these as unconstrained arrays, leading to spurious counterexamples. With simplification, member(struct_literal, field) is reduced to the field value, allowing the array theory to properly constrain it. Also handle the case where simplification fully resolves an index expression (e.g., index into a constant array with constant index), so the result is no longer an index_exprt. This fixes byte_update18 and array-bug-6230 when run with --arrays-uf-always. Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

When arrays are cast between types with different element sizes (e.g., SIMD reinterpretation int32[4] <-> int64[2]), the array theory's element-wise constraint a[i]=b[i] is incorrect because elements don't correspond one-to-one. The bitvector-level conversion handles these casts correctly as bitwise copies. Three changes: 1. collect_arrays: don't unify typecast arrays when element types differ, since they have different index domains. 2. add_array_constraints: skip generating typecast constraints when element types differ. 3. convert_index: when indexing into a typecast with different element types, lower to byte_extract on the source array instead of using the array decision procedure. This fixes SIMD1 when run with --arrays-uf-always. Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

…-always Add a KNOWNBUG test descriptor that runs Array_operations2 with --arrays-uf-always, documenting the root cause: the array theory's union-find transitively unifies structurally different arrays that share the same initial value literal. Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

After lower_byte_operators and simplify_expr, an array equality may simplify to a non-equal expression (e.g., true/false). Check the result before calling to_equal_expr to avoid an invariant violation. This fixes crashes in address_space_size_limit3, byte_update14, and array-cell-sensitivity7 when run with --arrays-uf-always. Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

Add a cbmc-arrays-uf-always CMake test profile that runs all CORE regression tests with --arrays-uf-always, following the same pattern as the existing cbmc-cprover-smt2 and cbmc-new-smt-backend profiles. Introduce two new test tags: - broken-arrays-uf-always: tests with known incorrect results (Array_operations2 union-find conflation, trace-values format) - thorough-arrays-uf-always: tests too slow for CI (bounds_check1, union/union_large_array) Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

Change the with-expression and if-expression array constraints from eager to lazy (deferred to refinement). Previously only Ackermann constraints were lazy, making --refine-arrays largely ineffective since the bulk of array constraints were still added eagerly. With this change, --refine-arrays defers with, if, and Ackermann constraints, adding them only when the refinement loop detects they are violated by the current model. This can dramatically reduce the number of constraints passed to the SAT solver. Benchmark on bounds_check1 with --arrays-uf-always: Without --refine-arrays: 83s, 287MB With --refine-arrays: 2s, 233MB Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

Add a test descriptor that runs bounds_check1 with both flags, documenting the performance improvement from lazy with/if/Ackermann constraints (~2s vs ~85s without --refine-arrays). Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

Add a cbmc-arrays-uf-always-refine-arrays CMake test profile that runs all CORE tests with both flags. This documents the performance improvement from lazy with/if/Ackermann constraints. Introduce broken-refine-arrays tag for tests with pre-existing --refine-arrays issues (MiniSat eliminated-variable crashes): Multi_Dimensional_Array6, address_space_size_limit3, array_of_bool_as_bitvec (SMT output), Array_UF23 (constraint counts) Total test execution time comparison: --arrays-uf-always: 84s (1093 tests, 71 skipped) --arrays-uf-always --refine-arrays: 75s (1090 tests, 75 skipped) Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

MiniSat's SimpSolver eliminates propositional variables during preprocessing. When --refine-arrays adds clauses incrementally in the refinement loop, it can reference variables that were eliminated, causing an assertion failure in addClause_. This is the same issue that --refine-strings already works around. Disable the simplifier when --refine-arrays is active, matching the existing pattern for --refine-strings. Fixes the crash in Multi_Dimensional_Array6 and address_space_size_limit3 when run with --arrays-uf-always --refine-arrays. Remove the broken-refine-arrays tag from address_space_size_limit3 since it now passes. The remaining broken-refine-arrays tests (Multi_Dimensional_Array6, Array_UF23, array_of_bool_as_bitvec) have pre-existing --refine-arrays issues unrelated to the simplifier: - Multi_Dimensional_Array6/Array_UF23: refinement loop does not converge (insufficient constraint activation in arrays_overapproximated) - array_of_bool_as_bitvec: --refine-arrays overrides --smt2 --outfile Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

Making with/if constraints lazy for --refine-arrays caused the refinement loop to not converge: the arrays_overapproximated() check evaluates each lazy constraint against the current SAT model, but with/if constraints can appear satisfied by coincidence when the solver freely assigns array element values. This led to spurious counterexamples (Multi_Dimensional_Array6) and infinite loops (Array_UF23). Only Ackermann constraints are suitable for lazy refinement because their activation depends on index equality, which is correctly determined from the bitvector encoding in the SAT model. The main performance benefit of --refine-arrays comes from lazy Ackermann constraints (e.g., bounds_check1: 83s -> 2s), so keeping with/if eager does not significantly impact the speedup. Remove broken-refine-arrays tag from Multi_Dimensional_Array6 since it now passes correctly. Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

lower_byte_extract_array_vector expands byte_extract of an array type into an element-by-element array_exprt. For large arrays (e.g., char[100000] in a union), this creates N sub-expressions that are each recursively lowered and simplified, resulting in O(N^2) behavior. When the array size exceeds MAX_FLATTENED_ARRAY_SIZE (1000), use array_comprehension_exprt instead. This path already exists for arrays with non-constant size; now it is also used for large constant-size arrays. This reduces the lowering from O(N) expressions to O(1). Performance on union_large_array (char[100000] in a union): Before: >120s with --arrays-uf-always After: 2.3s with --arrays-uf-always Remove thorough-arrays-uf-always tag from union_large_array.desc since the test now completes in 2 seconds. Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

The invariant violation for member expressions with index struct operands was already fixed in commit db83e74d86 ("Accept member expressions with arbitrary struct operands in array solver"). Promote the test to CORE and update the description to reflect that this is a passing test. Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

When indexing an array of structs containing array members through a nondeterministic index (e.g., a[i].d[0]), the expression member(index(outer_array, i), field) is an array that the array theory treats as an opaque base array, generating no constraints for it. This causes the bitvector solver to leave the element values unconstrained, producing spurious counterexamples. Fix: in convert_index, when the array operand is a member expression whose struct operand is not a symbol/nondet_symbol and the array type has a known finite size, bypass the array decision procedure and fall through to the bounded-array encoding. This lets the bitvector solver directly extract the member field bits from the struct bitvector (which is properly constrained by the outer array's theory), then select the element based on the index. Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

… arrays Add a KNOWNBUG test for a spurious counterexample when --arrays-uf-always is used with an array of structs containing array members with 65+ elements, accessed through a nondeterministic index. Structs with array members of 64 or fewer elements work correctly (covered by the arrays-uf-always-member-soundness test). Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

When --arrays-uf-always is used, symbol arrays have two disconnected representations in the bitvector solver: 1. Map literals (from convert_symbol) - used when the full bitvector is needed, e.g., in struct literals 2. Element-wise free variables (from convert_index via array theory) - used for individual element access The array theory constrains the element-wise free variables (via with, if, Ackermann constraints) but not the map literals. When a symbol array appears inside a struct that is an element of another array, the struct's bitvector uses the unconstrained map literals, causing spurious counterexamples. Fix: in boolbv_set_equality_to_true, when processing an equality for a symbol of unbounded array type with known finite size (up to MAX_FLATTENED_ARRAY_SIZE elements), connect the symbol's map literals to the element-wise bitvectors by calling convert_bv on index expressions for each element and using map.set_literals to equate them. This ensures that the array theory's element-wise constraints propagate to the map literals used in struct contexts. The equality is still passed to the array theory for element-wise constraint generation. Co-authored-by: Kiro (autonomous agent) <kiro-agent@users.noreply.github.com>

Root cause: when a with-expression stores a value at index I, and a read uses index J where J is a different SSA symbol connected to I through an equality constraint (e.g., main_argc = argc), the array theory's read-over-write axiom only generates the constraint for index I, not for index J. The solver needs to propagate through the SSA equality to determine I == J, but the typecast wrapping (cast(argc, signedbv[64]) vs cast(main_argc, signedbv[64])) creates different bitvectors that aren't directly connected. Fix: in add_array_constraints_with, for each index in the index set that may equal the write index (indices_equal is not constant false), add an explicit implication: idx_eq => with_expr[other_idx] = value. This gives the solver the direct connection it needs without requiring propagation through SSA equalities and typecasts. This is a general improvement to the array theory that benefits any use case where array indices are SSA-renamed versions of the same variable. It specifically fixes the argv[argc]==NULL pattern that the map-based pointer encoding relies on. Reduces map encoding regression failures from 11 to 10. Standard encoding is unaffected (all tests pass). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

The with-constraints fix for SSA-renamed indices causes the array refinement loop to not converge when --show-array-constraints is combined with solving. The constraint count check (60 vs 110) is correct, but the subsequent solve hangs. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Introduce bv_pointers_widet, a new pointer encoding class that uses solver-level arrays (maps) instead of bit-packing object-ids and offsets within the pointer bit-width. Each pointer-typed expression maps to an integer index, with two global array symbols (object_map and offset_map) providing the object number and byte offset for that index. This removes the limitation where object_bits reduce the available offset range and provides a foundation for more sound handling of address-of expressions and pointer-to-integer casts. The new encoding is activated with the --pointer-encoding-via-maps flag. Default behavior (bv_pointerst) is unchanged. Co-authored-by: Michael Tautschnig <mt@debian.org>

Three fundamental fixes to the map-based pointer encoding: 1. Pointer equality now compares (object, offset) pairs instead of raw bitvector indices. Two pointers with different indices but the same object and offset are now correctly considered equal. This fixes spurious failures in pointer comparison tests. 2. Pointer-to-integer casts now return base_address[object] + offset instead of just the offset. Each object gets a symbolic base address that is constrained to be non-overlapping with all other objects' address ranges. This fixes the unsoundness where (uintptr_t)&x == (uintptr_t)&y was satisfiable for distinct objects x and y. 3. Integer-to-pointer casts create a fresh pointer whose object and offset are constrained so that base_address[object] + offset equals the integer value. This correctly models the relationship between integer addresses and pointer semantics. The non-overlapping constraints use both range-based separation (base[i] + size[i] <= base[j] OR base[j] + size[j] <= base[i]) and direct inequality (base[i] != base[j]) as a redundant helper for the SAT solver. Reduces regression test failures from 33 to 21 (with --pointer-encoding-via-maps). Standard encoding is unaffected. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Lower byte_extract and byte_update operations when pointer types are involved. The abstract pointer index is not meaningful as bytes, so these operations must be decomposed into individual field accesses before reaching the solver. This fixes havoc_slice and similar tests that operate on structs containing pointer fields at the byte level. Cache encode() results per object number so that the same object always gets the same pointer index. This is critical for NULL: without caching, each NULL pointer gets a different index, causing array store/load to fail (the stored NULL index differs from the compared NULL index). Add propagation hints to pointer equality: when indices are equal, force the semantic (object, offset) comparison to also be true. This helps the SAT solver propagate through the array theory. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

…alities Root cause analysis: the SAT-based array theory cannot derive that base[i] + k != base[j] from the range constraint base[i] + size[i] <= base[j]. This is because the range constraint involves 64-bit unsigned addition and comparison, which creates a complex chain of implications that the SAT solver cannot propagate. Fix 1: Direct lookup for constant pointer indices. When read_object() or read_offset() is called with a constant pointer index, look up the (object, offset) bitvectors directly from the index_to_bv_object_offset map instead of creating array reads. This eliminates the array theory indirection for the common case where pointer indices are known at encoding time. Fix 2: Explicit pairwise offset inequalities. For each pair of objects with base addresses, add explicit constraints base[i] + k != base[j] for all offsets k within the larger object. The SAT solver can handle these direct inequality constraints but cannot derive them from the range constraint. Also populate index_to_bv_object_offset in encode() (not just encode_fresh) so that the P2I eager path can find all constant indices. Fixes Pointer_comparison5 and the minimal ptr-to-int distinctness test. Reduces the 'fundamental SAT limitation' category from 3 failures to 2 (Pointer_array4 and struct7 remain due to overflow checks on pointer-to-integer arithmetic). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Pointer subtraction overflow: override the overflow check for pointer-typed operands to use offsets from the maps instead of raw bitvector indices. For same-object pointers, check if the offset difference overflows. For cross-object pointers, compute flat addresses (base + offset) and check if their difference overflows. Address range: constrain base addresses to fit in the positive range of a 32-bit signed integer (base + size <= 2^31 - 1). This ensures that pointer-to-integer casts to 32-bit int don't overflow, matching real hardware where user-space addresses are in the lower portion of the address space. Fixes Pointer_array4 (overflow on (int)ptr2 - (int)ptr1) and struct7 (overflow on pointer subtraction p - &s.array[0]). Reduces regression failures from 13 to 12. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Pointer equality: use a hybrid approach that avoids array reads when possible. For constant indices that are different, use direct lookup from index_to_bv_object_offset to compare (object, offset) pairs without creating array reads. For symbolic indices, fall back to array reads with propagation hints. Pointer subtraction: for cross-object subtraction, compute the difference using flat addresses (base + offset) via a MUX chain over known objects. This correctly handles integer-address objects like (char*)20 - (char*)10 = 10. Key finding: the map encoding works correctly with --refine-arrays (refinement-based array solver). The remaining failures with the default Ackermann array solver are caused by its inability to propagate through nested array reads (e.g., argv array + object_map array simultaneously). A future improvement would be to make bv_refinementt work with bv_pointers_widet as its base class. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Make bv_refinementt a class template parameterized on the pointer encoding base class (default: bv_pointerst). This allows the refinement solver to work with bv_pointers_widet for the map-based pointer encoding. Changes: - bv_refinement.h: templatize class, extract config/info structs - bv_refinement_loop.cpp: templatize all methods, add explicit instantiations for bv_pointerst and bv_pointers_widet - refine_arrays.cpp: templatize, fix dependent type names - refine_arithmetic.cpp: templatize, add this-> for dependent names - string_refinement.h: use bv_refinementt<> (default template arg) - solver_factory.cpp: use bv_refinementt<bv_pointers_widet> when --pointer-encoding-via-maps is set with --refine-arrays Note: the map encoding's array theory issues (nested reads on object_map/offset_map interfering with other arrays) are NOT resolved by the refinement solver alone, because the refinement only applies to user-level arrays, not the internal pointer maps. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Add test-pointer-encoding-via-maps.desc for tests where the map encoding correctly produces different results than the standard encoding: - Pointer_difference2: 5 failures instead of 7-9 (no overflow artifacts from small base addresses) - r_w_ok10: 8 failures instead of 9 (invalid pointer constant is a valid integer-address object in the map encoding) Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

For symbolic integer-to-pointer casts, constrain the nondeterministic object number to be one of the known objects (null, invalid, or program objects). Without this, the solver can choose an arbitrary object number not in the pointer maps, making pointer safety checks (null, invalid, bounds) unprovable. The constraint is a disjunction: obj == 0 OR obj == 1 OR ... OR obj == N, where N is the number of known objects. This is only applied to symbolic I2P casts (not constant ones, which already have dedicated integer-address objects). Fixes havoc_slice/test_struct_raw_bytes: the partial havoc of a struct with a pointer field now correctly preserves the unchanged bytes because the I2P result is constrained to a known object. Reduces regression failures from 9 to 8. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

For symbolic integer-to-pointer casts, add a backward constraint: if the integer value falls within object i's address range [base[i], base[i] + size[i]), then the I2P result's object must be i. This is the converse of the existing forward constraint (obj == i => base[i] + offset == int_value). The backward constraint allows the solver to determine which object an I2P result points to, based on the integer value's relationship to the known objects' base addresses. Without this, the solver can't prove pointer safety checks (null, invalid, bounds) because the I2P object is nondeterministic. Fixes Pointer2 (pointer round-trip through bit manipulation), memory_allocation1, mm_io1, r_w_ok10, and the simple I2P round-trip test (int*)(uintptr_t)&x == &x. Reduces regression failures from 8 to 4. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

Tag tests that have different expectations under the maps encoding with no-pointer-encoding-via-maps. Add KNOWNBUG descriptors with pointer-encoding-via-maps tag for known encoding gaps: - Pointer_array6: byte-level pointer array round-trip - address_space_size_limit1: object-bits limit not enforced - String_Abstraction7: string abstraction incompatibility - memory_allocation1: __CPROVER_allocated_memory handling - mm_io1: memory-mapped I/O dispatch Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>

tautschnig and others added 30 commits April 8, 2026 07:49

tautschnig self-assigned this Apr 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add --pointer-encoding-via-maps: solver-level array maps for pointer encoding#8953

Add --pointer-encoding-via-maps: solver-level array maps for pointer encoding#8953
tautschnig wants to merge 30 commits intodiffblue:developfrom
tautschnig:feature/pointer-encoding-via-maps

tautschnig commented Apr 8, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

tautschnig commented Apr 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Design

Current status

Known encoding gaps (KNOWNBUG)

Foundation: array theory improvements

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tautschnig commented Apr 8, 2026 •

edited

Loading