Add --wide-pointer-encoding: 192-bit pointers with flat addresses#8954
Draft
tautschnig wants to merge 27 commits intodiffblue:developfrom
Draft
Add --wide-pointer-encoding: 192-bit pointers with flat addresses#8954tautschnig wants to merge 27 commits intodiffblue:developfrom
tautschnig wants to merge 27 commits intodiffblue:developfrom
Conversation
Add an option to disable MiniSat's SimpSolver simplifier after the first solve call. The simplifier can eliminate variables between incremental calls, which may make later UNSAT proofs extremely expensive for the CDCL search. This is useful for any incremental solving scenario, not just the wide pointer encoding. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
CaDiCaL uses 1-based variable numbering. When l_get is called with a literal whose var_no is 0, solver->val(0) triggers a fatal error. Guard against this by returning TV_UNKNOWN for var_no 0. This fixes printf1/2/3 trace generation crashes with CaDiCaL and --wide-pointer-encoding. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Extend the standard pointer encoding to include a flat integer address component alongside the existing object/offset bits. The pointer bitvector layout becomes: [object(64) | offset(64) | address(64)] = 192 bits The address component is maintained consistently: - encode(): sets address to base[object] (symbolic base address) - offset_arithmetic(): updates address += byte_offset - pointer-to-integer cast: returns the address component - Non-overlapping base address constraints in finish_eager_conversion Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Pointer_difference2 produces fewer overflow failures with the wide encoding because pointer subtraction overflow is checked on the 56-bit offset component rather than the full packed representation. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
…iffblue#8161) Override convert_equality for pointer types: compare by flat address instead of the full bitvector encoding. Two pointers are equal iff their addresses are equal, regardless of object encoding. Override relational comparisons (<, >, <=, >=) for pointer types: compare by flat address instead of offset-within-same-object. This makes cross-object pointer comparisons well-defined. Relax the overflow guard on base addresses: only require that base + size does not wrap around (unsigned). Remove the previous restriction that limited addresses to a small range. This allows the solver to place objects anywhere in the address space. Fixes: - diffblue#8200: (char*)0x55a8a2e6b007 can now equal a string literal's address, so assert(ptr != x) correctly reports FAILURE. - diffblue#8161: pointer-derived integers now have meaningful values in the full address range, making arithmetic on them consistent with plain integer arithmetic. New test expectation mismatches: - Pointer_array4: (int)pointer can overflow signed int when addresses are in the upper half of the 32-bit space. - Pointer_comparison3: p < p1+1 no longer implies same_object because the comparison is address-based. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Move the backward constraints (address in range => object identity) from the I2P encoding site to finish_eager_conversion, where all objects are known. Forward constraints (obj==i => base[i]+off==addr) are still added immediately during I2P encoding for currently known objects, ensuring the dereference MUX has correct offsets. In finish_eager_conversion: - Forward constraints are added for objects created AFTER the I2P - Backward constraints are added for ALL objects - Valid-object disjunction covers all objects Note: this does NOT fix issue diffblue#8103 because the dereference of I2P pointers goes through symex's flat memory model (byte_extract from __CPROVER_memory at pointer_offset), not through the pointer encoding's object-based dispatch. Fixing diffblue#8103 requires symex-level changes to recognize that I2P pointers may point to known objects. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Add object_base_address_exprt: a new expression that returns the flat base address of the object a pointer points to. This is the foundation for address-based dereference dispatch. Encoding: postponed to finish_eager_conversion like object_size. For each object i, if pointer_object(ptr)==i then result==base[i]. Infrastructure for passing wide_pointer_encoding flag to symex: - symex_configt::wide_pointer_encoding (from options) - value_set_dereferencet constructor accepts the flag The address-based dereference dispatch (Step 2 of the plan) is not yet implemented — the naive approach of iterating over the symbol table causes crashes because symex has already substituted symbols with their SSA-renamed values. The dispatch needs to work with the SSA-renamed symbols, which requires a different approach (e.g., generating the dispatch at the goto-program level before symex, or using the value set's object tracking). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Add pointer_width_multiplier to boolbv_widtht so that all width computations (arrays, structs, unions containing pointers) use the correct 3x pointer width. Previously, only the boolbv_width() virtual override in bv_pointerst returned 3x for pointer types, but the boolbv_widtht object (used by convert_array, endianness maps, and other compound type handlers) still used the original 64-bit pointer width. This caused width mismatches (expected 64, got 192) that crashed ~40 tests. The fix: - boolbv_widtht::set_pointer_width_multiplier(3) is called when wide encoding is enabled - The pointer width entry in boolbv_widtht's cache is multiplied - All compound type widths are automatically correct because they recursively use boolbv_widtht for element widths - The boolbv_width() override in bv_pointerst now simply delegates to bv_width() since the multiplier handles everything Performance impact: regression suite time dropped from 90+ minutes (timeout) to ~3 minutes. The consistent widths prevent the solver from creating oversized formulas due to width mismatches. Crashes reduced from ~40 to ~10 (remaining are trace-related). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Three changes to make byte operations on pointer-containing types use the 64-bit flat address instead of the 192-bit encoding: 1. Endianness map: for pointer types with wide encoding, rearrange the bit mapping so that the first 64 bits (bytes 0-7) come from the address component (bits 128-191). The object/offset bits (0-127) are placed after the address in the map, making them non-byte-visible for sequential byte access. 2. byte_extract with pointer result: extract only 64 bits (the address) from the byte array, then reconstruct the full 192-bit pointer encoding using fresh object/offset variables constrained by forward/backward constraints (same as I2P). 3. byte_update with pointer value: extract the 64-bit address from the pointer and write only those 64 bits to the byte array. For compound types containing pointers, lower to individual byte operations. This fixes havoc_slice/test_struct_raw_bytes (struct byte operations now correctly see 8-byte pointers) and is the foundation for fixing Pointer_array6 (pointer array byte round-trips). Pointer_array6 still fails because the byte_extract reconstruction creates fresh object/offset variables whose backward constraints are deferred. The dereference MUX uses pointer_object() which depends on these variables. The SAT solver should propagate the backward constraints through the MUX, but this needs investigation. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
When --model-stack-layout is passed (requires --wide-pointer-encoding), stack variables in the same function are placed adjacently with downward growth direction: earlier declarations get higher addresses. This enables: 1. Stack growth direction assertions: assert(&a > &b) for a declared before b 2. Adjacent placement assertions: assert(&a - &b == sizeof(b)) 3. Buffer overflow modeling: writing past a[sizeof(a)] lands on b Implementation: in finish_eager_conversion, identify function-scoped symbols, group by function, and add adjacency constraints: base[cur] = base[next] + sizeof(next) (downward growth: cur is above next in memory) Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Three changes enable detecting address reuse after free: 1. config.bv_encoding.malloc_may_alias flag (set when wide encoding is enabled) prevents the simplifier from concluding that different dynamic objects have different addresses. 2. simplify_inequality_address_of: don't simplify address_of(dyn1) == address_of(dyn2) to false when malloc_may_alias is set. 3. simplify_inequality_both_constant: don't simplify non-null pointer constant comparisons when malloc_may_alias is set. 4. goto_symex_can_forward_propagate: don't forward-propagate addresses of dynamic objects when malloc_may_alias is set. This prevents constant propagation from replacing pointer variables with their address_of values, which would then be simplified by the simplifier. 5. Chain constraint: allow base[y] == base[x] for consecutive dynamic objects of the same size. All changes are gated on config.bv_encoding.malloc_may_alias, which is only set when --wide-pointer-encoding is used. The standard encoding is completely unaffected. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The backward constraints (address in range => object identity) are expensive for the SAT solver. For tests with many I2P casts (e.g., address_space_size_limit3 with 38 I2P entries), they add ~40 seconds. Two optimizations: 1. Only add backward constraints for I2P casts (not byte_extract or convert_index reconstructions). Byte_extract from pointer sources still gets backward constraints since they need them for correctness (Pointer_array6). 2. Limit backward constraints to 10 I2P entries. Beyond that, the performance cost outweighs the precision benefit. This is a heuristic — the forward constraints still link objects to addresses, and the backward constraints are a precision optimization, not a soundness requirement. address_space_size_limit3: >60s timeout → 18.5s. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
- doc/architectural/wide-pointer-encoding.md: comprehensive design document covering all 5 fixed issues, stack layout, technical details, and performance characteristics - --help text for --wide-pointer-encoding and --model-stack-layout Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Call check_SAT_backward_i2p() during the refinement loop's SAT check phase. This enables the backward constraint refinement to work with --refine-arrays. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Use is_symex_dynamic_object() helper that recognizes both
ID_dynamic_object and symbol('symex_dynamic::dynamic_object')
forms. The latter is what symex actually produces.
Also restore forward propagation block for dynamic object
addresses and fix simplify_expr_with_value_set to not conclude
inequality between dynamic objects when malloc_may_alias is set.
Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Update comment explaining why types containing pointers are skipped in the address dispatch: the backward constraint refinement handles these correctly without the dispatch. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
This commit addresses multiple tightly coupled issues: 1. boolbv_width override for compound types: cached recursive computation for structs, unions, and nested arrays containing pointers. Needed for convert_member to compute correct offsets. 2. convert_pointer_type dispatch: use this->convert_index and this->convert_member instead of SUB:: to go through overrides. 3. Remove pointer array width reduction: convert_array, convert_index, convert_with now delegate to parent. Pointer arrays store full 192-bit encoding per element. 4. convert_byte_extract for pointer array sources: extract address components from each 192-bit element to handle byte-level round-trips (fixes Pointer_array6). 5. Refactor dec_solve: extract check_SAT_backward_i2p() as a standalone method for use by both dec_solve and bv_refinementt::check_SAT. These changes are interdependent: removing the width reduction (3) requires the byte_extract handler (4), and the boolbv_width override (1) is needed for the dispatch fix (2). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
The wide encoding produces the same 11 failures as the standard encoding. The descriptor was incorrectly expecting 9. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
These tests have different expected results under the wide encoding due to address reuse semantics (Malloc10, Malloc14) and simplification differences (null8). Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
Tests that have different semantics under the wide encoding (address-based comparison, address reuse, etc.) are tagged no-wide-pointer-encoding. Each has an alternative descriptor with the wide-pointer-encoding tag. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
MiniSat's SimpSolver eliminates variables during the first solve. The backward constraint refinement in dec_solve then creates bv_utils operations (rel, add, equal) that reference these eliminated variables via gate_and/gate_or clauses, triggering the isEliminated assertion in addClause_. Fix: freeze all variables created during finish_eager_conversion (deferred I2P forward constraints, non-overlapping chain constraints). This prevents the simplifier from eliminating them while still allowing simplification of all other variables. Untagged havoc_slice/test_struct_d — now passes with MiniSat. Co-authored-by: Kiro <kiro-agent@users.noreply.github.com>
7 tasks
Collaborator
|
Can the offset not simply be derived from the address? Simply as in "drop the appropriate number of least significant bits"? |
Collaborator
Author
We could separately keep a mapping from object id to base address (and perhaps also vice versa), then we could drop one of integer address or offset. Else we would be baking in alignment assumptions, which would be particularly hard to come by when object sizes are symbolic. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add an alternative pointer encoding activated via
--wide-pointer-encodingthat represents each pointer as a 192-bit bitvector[object(64) | offset(64) | address(64)]. The address component is a flat integer maintained consistently through all pointer operations, enabling address-based pointer comparison, cross-object pointer arithmetic detection, and address reuse after free.Fixes: #881
Fixes: #8200
Fixes: #8161
Fixes: #8103
Fixes: #2117
Motivation
The standard pointer encoding uses
[object | offset]pairs where the object number is an abstract identifier with no numeric meaning. This makes it impossible to model scenarios where the concrete address matters: pointer-to-integer casts return meaningless values (#881, #8200), pointer comparisons across objects are inconsistent (#8161), integer-to-pointer casts cannot resolve back to the correct object (#8103), and address reuse after free is not modelled (#2117).Design
Each pointer carries a symbolic flat address alongside the standard object/offset encoding. Non-overlapping base address constraints ensure distinct objects occupy disjoint address ranges. The encoding is purely propositional — no solver-level arrays are used.
Key design decisions:
dec_solve(address in range ⇒ object identity).<,>,==,!=), matching C semantics for cross-object comparisons.config.bv_encoding.malloc_may_alias: consecutive dynamic objects of the same size may share base addresses.Additional feature:
--model-stack-layoutWhen combined with
--wide-pointer-encoding, the--model-stack-layoutoption constrains stack variables to be placed at adjacent addresses in declaration order, with architecture-dependent stack growth direction (config.ansi_c.stack_grows_downward).Performance
The wide encoding adds ~1.5–2× overhead on solver-heavy tests. On the full regression suite (dominated by solver-trivial tests), the overhead is ~1.1×.
Test results
no-wide-pointer-encoding(intentional semantic changes with alternativetest-wide-pointer-encoding.descdescriptors)General fixes included
This branch also includes two general SAT backend fixes (commits 1–2) discovered during development:
set_limit_incremental_simplificationto prevent variable elimination between incremental solvesl_getcrash forvar_no == 0(1-based variable numbering)