release: AIngle v0.6.3 — total data integrity hardening by ApiliumDevTeam · Pull Request #83 · ApiliumCode/aingle

ApiliumDevTeam · 2026-03-19T10:46:55Z

Summary

Complete data integrity audit and hardening across the entire AIngle pipeline. Fixes 18 bugs found during exhaustive input/output testing of all subsystems.

Data integrity (7 fixes)

Persistent ProofStore — was in-memory only, all proofs lost on restart. New ProofBackend trait with Sled implementation
Proofs in Raft snapshots — new nodes joining a cluster got 0 proofs. Added ProofSnapshotProvider trait + checksum coverage
Periodic auto-flush — Ineru only saved on explicit shutdown; crash = total data loss. Now flushes every 300s (configurable via --flush-interval)
Audit log fsync — writeln! without sync_all(), errors silently ignored with let _
Atomic batch insert — insert_batch() did sequential puts; partial failure = inconsistency. Now uses sled::Batch
P2P DAG sync — gossip only synced triples, DAG actions were never replicated between peers. Added tip-based sync protocol
Sled lock contention — ProofStore and GraphDB shared same sled path causing WouldBlock. Fixed with separate proofs.sled directory

fsync hardening (5 fixes)

DAG signing key write — let _ = write_all replaced with error handling + fsync
Kaneru agent state + ML weights — fsync after save
Ineru memory snapshot — std::fs::write (never fsyncs) replaced with File::create + sync_all
P2P node identity key — fsync on both Unix and non-Unix
Peer store JSON — fsync after write

Panic elimination (5 fixes)

WAL writer: lock().unwrap() → lock().map_err() (4 sites)
Rule engine: poisoned lock recovery via unwrap_or_else(|p| p.into_inner()) (9 sites)
Kaneru agent: unwrap() on Option → graceful early return with log::warn
P2P REST endpoints: to_value().unwrap() → match with HTTP 500
ProofStore init: blocking_write removed entirely

New features

POST /api/v1/triples/batch — atomic bulk triple insert endpoint with duplicate detection
--flush-interval <SECS> CLI flag for configurable periodic flush (0 = disabled)

Bug fixes

GET /api/v1/proofs/:id/verify returned 422 on malformed proof data — now returns 200 with valid: false + error details

Testing

8 new cross-subsystem data integrity tests (ProofStore persistence, Graph+DAG consistency, batch atomicity, AppState flush/restore, Raft snapshot with proofs, audit log integrity)
1747+ tests passing across all core crates, 0 regressions
Full E2E server test: 15 endpoints verified

Test plan

🤖 Generated with Claude Code

- GraphDB.flush() now flushes DAG store alongside triple store - DAG persistent init failure is now fatal (no silent in-memory fallback) - DAG action failures on triple insert/delete return errors instead of being silently swallowed — prevents triples existing without audit trail - GraphQL mutations now record DAG actions (previously bypassed DAG/Raft entirely, causing split-brain in cluster mode) - DagStore.put() validates parent hashes exist before accepting actions, preventing orphaned entries that break traversal and time-travel queries - Corrupted actions in DAG backend are now logged during index rebuild instead of being silently skipped Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ProofStore was purely in-memory — all proofs lost on restart. This adds a ProofBackend trait (mirroring the DagBackend pattern) with Memory and Sled implementations. - New `proofs/backend.rs` with ProofBackend trait, MemoryProofBackend, and SledProofBackend (tree "proofs" in a dedicated sled DB) - Refactored ProofStore to use backend trait instead of HashMap - `ProofStore::with_sled(path)` constructor for persistent storage - `ProofStore::flush()` for durable writes - `AppState::with_db_path` creates Sled-backed ProofStore (uses `proofs.sled` sibling directory to avoid sled lock contention with the graph DB) - `AppState::flush()` now flushes proof store alongside graph - Stats rebuilt from backend on startup (no tokio lock needed)

ClusterSnapshot did not include proofs — new nodes joining a cluster started with zero proofs. This adds proof snapshot export/import to the Raft state machine. - Added ProofSnapshot struct and ProofSnapshotProvider trait - ClusterSnapshot now has `proofs: Vec<ProofSnapshot>` field (backward-compatible via serde(default)) - Blake3 checksum now covers proofs alongside triples + ineru_ltm - CortexSnapshotBuilder exports proofs via provider during build - install_snapshot() imports proofs when present - ProofStore implements ProofSnapshotProvider (sync methods for export/import via backend.list_all) - Wired into cluster_init: proof provider set on CortexStateMachine

Ineru only saved on explicit shutdown — a crash meant data loss. This adds a configurable periodic flush task. - Added `flush_interval_secs` to CortexConfig (default: 300s) - Added `--flush-interval <SECS>` CLI argument (0 = disabled) - Spawns tokio task that calls state.flush() at the configured interval - Flushes graph DB, proof store, and Ineru snapshot atomically

AuditLog::record() used `let _ = writeln!()` which silently ignored write failures, and never called sync_all() — data could be lost on crash. - writeln! errors now logged via log::error! - file.sync_all() called after each write (log::warn on failure) - OpenOptions::open failures logged instead of silently ignored

insert_batch() performed individual puts — a failure mid-batch left the store in an inconsistent state with some triples written and indexes partially updated. - Added apply_batch() to StorageBackend trait (default: sequential puts) - SledBackend overrides with sled::Batch for atomic writes - Refactored GraphStore::insert_batch() into 3 phases: 1. Collect non-duplicate triples 2. Atomic backend batch write 3. Update indexes only on success

Gossip only synced triples — DAG actions were not replicated between peers. This adds tip-based DAG sync to the P2P layer. - New p2p/dag_sync.rs module with tip collection, missing action computation, serialized action fetch, and batch ingestion - Added DagTipSync, RequestDagActions, SendDagActions message variants (feature-gated under "dag") - Gossip loop (Task 2) now sends DagTipSync alongside BloomSync - Message handler (Task 3) handles all 3 DAG message types: - DagTipSync: compute missing actions via DagStore::compute_missing() - RequestDagActions: fetch and send actions by hash - SendDagActions: ingest received actions via DagStore::ingest() - Reuses existing DagStore::compute_missing() and ingest() APIs

End-to-end tests verifying data flows correctly across all AIngle subsystems — these caught the sled lock contention bug where ProofStore and GraphDB shared the same sled path. - ProofStore Sled round-trip (20 proofs write/reopen/delete/verify) - Graph+DAG triple materialization consistency (50 triples + deletes) - Batch insert index consistency (100 triples, duplicate handling) - AppState flush/restore full cycle (triples + proofs survive restart) - Raft snapshot with proofs serialization round-trip - Snapshot checksum changes when proofs are included - Graph Sled persistence with float precision verification - Audit log fsync integrity (50 entries write/reopen/query filters)

Audit found 5 locations where important data was written to disk without sync_all(), meaning a crash or power loss could lose the data even after a "successful" write returned. - main.rs: DAG signing key — `let _ = write_all` replaced with proper error handling + fsync (key loss broke all future DAG signatures) - kaneru/persistence.rs: Agent state + LearningEngine saves now fsync after write_all (ML weights/Q-values could be lost) - ineru/lib.rs: Memory snapshot now uses File::create + write_all + sync_all instead of std::fs::write (which never fsyncs) - p2p/identity.rs: Node Ed25519 key now fsynced on both Unix and non-Unix (identity loss = can't rejoin P2P mesh) - p2p/peer_store.rs: Known peers JSON now fsynced (peer list loss = must rediscover all network peers)

Mutex lock poisoning in the WAL writer caused panics that crashed the entire node. The WAL is the most critical data path — a panic here takes down all Raft consensus operations. All 4 .lock().unwrap() calls replaced with .lock().map_err() that returns io::Error, propagating the failure gracefully instead of aborting the process.

RuleEngine used .write().unwrap() and .read().unwrap() on 9 lock acquisitions. A panic in any thread holding these locks would cascade to crash all subsequent validation/inference operations. Replaced all 9 occurrences with .unwrap_or_else(|p| p.into_inner()) which recovers the data from a poisoned lock and continues operating. Stats and inferred triples are non-critical — crashing the server over a stats counter is disproportionate.

KaneruAgent::learn() called .unwrap() on current_state (Option) and observation_history.back() (Option) — both panic if called before any observation is recorded. Replaced with early returns + log::warn for graceful degradation. The agent now safely skips learning when called in an invalid state instead of crashing the entire process.

p2p_status() and list_peers() called serde_json::to_value().unwrap() which would panic the server if serialization ever failed. Replaced with match that returns 500 Internal Server Error with error details.

…line This release fixes 16 bugs found during an exhaustive input/output audit of the entire AIngle data pipeline: ## Data integrity (6 fixes) - Persistent ProofStore with Sled backend (was in-memory only) - Proofs included in Raft cluster snapshots (new nodes got 0 proofs) - Periodic auto-flush every 300s (crash = data loss window reduced) - Audit log fsync + error reporting (was silently dropping writes) - Atomic batch insert via sled::Batch (partial writes impossible) - P2P DAG action sync via tip exchange (DAG wasn't replicated) ## fsync hardening (5 fixes) - DAG signing key write — error handling + fsync - Kaneru agent state + ML weights — fsync after save - Ineru memory snapshot — fsync after write - P2P node identity key — fsync on all platforms - Peer store JSON — fsync after write ## Panic elimination (5 fixes) - WAL writer: lock().unwrap() → lock().map_err() (4 sites) - Rule engine: poisoned lock recovery (9 sites) - Kaneru agent: unwrap on Option → graceful early return - P2P REST endpoints: unwrap → HTTP 500 error response - ProofStore init: blocking_write removed entirely ## Testing - 8 new cross-subsystem data integrity tests - 1092+ tests passing across all core crates, 0

Exposes GraphStore::insert_batch() via REST API for efficient bulk data loading. Uses sled::Batch for atomic writes when using Sled backend. - POST /api/v1/triples/batch with JSON body {"triples": [...]} - Returns 201 with inserted IDs, total count, and duplicate count - Validates all inputs before writing (empty subject/predicate → 400) - Namespace scoping enforced per triple - Duplicates silently skipped (reported in response) - Audit log records batch_create with insert/duplicate counts - Events broadcast for each new triple

GET /api/v1/proofs/:id/verify returned 422 when the stored proof_data didn't match the expected ZkProof structure (e.g. user submitted arbitrary JSON without the required commitment/challenge/response fields). Now: - Malformed proof data → 200 with valid:false + error details - Proof not found → 404 - Valid proof → 200 with valid:true This matches the semantic contract: verification tells you whether a proof is valid, it shouldn't fail with a server error just because the proof data is garbage.

ApiliumDevTeam and others added 16 commits March 16, 2026 19:26

fix: P2P REST endpoints — replace unwrap with proper HTTP error response

8418d6a

p2p_status() and list_peers() called serde_json::to_value().unwrap() which would panic the server if serialization ever failed. Replaced with match that returns 500 Internal Server Error with error details.

ApiliumDevTeam merged commit 22b45ae into main Mar 19, 2026
21 of 22 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release: AIngle v0.6.3 — total data integrity hardening#83

release: AIngle v0.6.3 — total data integrity hardening#83
ApiliumDevTeam merged 16 commits intomainfrom
dev

ApiliumDevTeam commented Mar 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ApiliumDevTeam commented Mar 19, 2026

Summary

Data integrity (7 fixes)

fsync hardening (5 fixes)

Panic elimination (5 fixes)

New features

Bug fixes

Testing

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant