Skip to content

release: AIngle v0.6.3 — total data integrity hardening#83

Merged
ApiliumDevTeam merged 16 commits intomainfrom
dev
Mar 19, 2026
Merged

release: AIngle v0.6.3 — total data integrity hardening#83
ApiliumDevTeam merged 16 commits intomainfrom
dev

Conversation

@ApiliumDevTeam
Copy link
Contributor

Summary

Complete data integrity audit and hardening across the entire AIngle pipeline. Fixes 18 bugs found during exhaustive input/output testing of all subsystems.

Data integrity (7 fixes)

  • Persistent ProofStore — was in-memory only, all proofs lost on restart. New ProofBackend trait with Sled implementation
  • Proofs in Raft snapshots — new nodes joining a cluster got 0 proofs. Added ProofSnapshotProvider trait + checksum coverage
  • Periodic auto-flush — Ineru only saved on explicit shutdown; crash = total data loss. Now flushes every 300s (configurable via --flush-interval)
  • Audit log fsyncwriteln! without sync_all(), errors silently ignored with let _
  • Atomic batch insertinsert_batch() did sequential puts; partial failure = inconsistency. Now uses sled::Batch
  • P2P DAG sync — gossip only synced triples, DAG actions were never replicated between peers. Added tip-based sync protocol
  • Sled lock contention — ProofStore and GraphDB shared same sled path causing WouldBlock. Fixed with separate proofs.sled directory

fsync hardening (5 fixes)

  • DAG signing key write — let _ = write_all replaced with error handling + fsync
  • Kaneru agent state + ML weights — fsync after save
  • Ineru memory snapshot — std::fs::write (never fsyncs) replaced with File::create + sync_all
  • P2P node identity key — fsync on both Unix and non-Unix
  • Peer store JSON — fsync after write

Panic elimination (5 fixes)

  • WAL writer: lock().unwrap()lock().map_err() (4 sites)
  • Rule engine: poisoned lock recovery via unwrap_or_else(|p| p.into_inner()) (9 sites)
  • Kaneru agent: unwrap() on Option → graceful early return with log::warn
  • P2P REST endpoints: to_value().unwrap() → match with HTTP 500
  • ProofStore init: blocking_write removed entirely

New features

  • POST /api/v1/triples/batch — atomic bulk triple insert endpoint with duplicate detection
  • --flush-interval <SECS> CLI flag for configurable periodic flush (0 = disabled)

Bug fixes

  • GET /api/v1/proofs/:id/verify returned 422 on malformed proof data — now returns 200 with valid: false + error details

Testing

  • 8 new cross-subsystem data integrity tests (ProofStore persistence, Graph+DAG consistency, batch atomicity, AppState flush/restore, Raft snapshot with proofs, audit log integrity)
  • 1747+ tests passing across all core crates, 0 regressions
  • Full E2E server test: 15 endpoints verified

Test plan

  • cargo check --workspace — clean compilation
  • cargo test -p aingle_graph --features dag — 244 passed
  • cargo test -p aingle_cortex --lib — 153 passed
  • cargo test -p aingle_cortex --test data_integrity_test — 8 passed
  • cargo test -p aingle_cortex --test proof_system_test — 14 passed
  • cargo test -p aingle_cortex --test rate_limiting_test — 16 passed
  • cargo test -p aingle_raft — 33 passed
  • cargo test -p aingle_wal — 20 passed
  • cargo test -p aingle_logic — 33 passed
  • cargo test -p ineru — 57 passed
  • cargo test -p kaneru — 444 passed
  • E2E server test — all 15 endpoints return correct status codes and data
  • aingle-cortex --versionv0.6.3

🤖 Generated with Claude Code

ApiliumDevTeam and others added 16 commits March 16, 2026 19:26
- GraphDB.flush() now flushes DAG store alongside triple store
- DAG persistent init failure is now fatal (no silent in-memory fallback)
- DAG action failures on triple insert/delete return errors instead of
  being silently swallowed — prevents triples existing without audit trail
- GraphQL mutations now record DAG actions (previously bypassed DAG/Raft
  entirely, causing split-brain in cluster mode)
- DagStore.put() validates parent hashes exist before accepting actions,
  preventing orphaned entries that break traversal and time-travel queries
- Corrupted actions in DAG backend are now logged during index rebuild
  instead of being silently skipped

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ProofStore was purely in-memory — all proofs lost on restart. This adds
a ProofBackend trait (mirroring the DagBackend pattern) with Memory and
Sled implementations.

- New `proofs/backend.rs` with ProofBackend trait, MemoryProofBackend,
  and SledProofBackend (tree "proofs" in a dedicated sled DB)
- Refactored ProofStore to use backend trait instead of HashMap
- `ProofStore::with_sled(path)` constructor for persistent storage
- `ProofStore::flush()` for durable writes
- `AppState::with_db_path` creates Sled-backed ProofStore (uses
  `proofs.sled` sibling directory to avoid sled lock contention with
  the graph DB)
- `AppState::flush()` now flushes proof store alongside graph
- Stats rebuilt from backend on startup (no tokio lock needed)
ClusterSnapshot did not include proofs — new nodes joining a cluster
started with zero proofs. This adds proof snapshot export/import to the
Raft state machine.

- Added ProofSnapshot struct and ProofSnapshotProvider trait
- ClusterSnapshot now has `proofs: Vec<ProofSnapshot>` field
  (backward-compatible via serde(default))
- Blake3 checksum now covers proofs alongside triples + ineru_ltm
- CortexSnapshotBuilder exports proofs via provider during build
- install_snapshot() imports proofs when present
- ProofStore implements ProofSnapshotProvider (sync methods for
  export/import via backend.list_all)
- Wired into cluster_init: proof provider set on CortexStateMachine
Ineru only saved on explicit shutdown — a crash meant data loss. This
adds a configurable periodic flush task.

- Added `flush_interval_secs` to CortexConfig (default: 300s)
- Added `--flush-interval <SECS>` CLI argument (0 = disabled)
- Spawns tokio task that calls state.flush() at the configured interval
- Flushes graph DB, proof store, and Ineru snapshot atomically
AuditLog::record() used `let _ = writeln!()` which silently ignored
write failures, and never called sync_all() — data could be lost on
crash.

- writeln! errors now logged via log::error!
- file.sync_all() called after each write (log::warn on failure)
- OpenOptions::open failures logged instead of silently ignored
insert_batch() performed individual puts — a failure mid-batch left
the store in an inconsistent state with some triples written and
indexes partially updated.

- Added apply_batch() to StorageBackend trait (default: sequential puts)
- SledBackend overrides with sled::Batch for atomic writes
- Refactored GraphStore::insert_batch() into 3 phases:
  1. Collect non-duplicate triples
  2. Atomic backend batch write
  3. Update indexes only on success
Gossip only synced triples — DAG actions were not replicated between
peers. This adds tip-based DAG sync to the P2P layer.

- New p2p/dag_sync.rs module with tip collection, missing action
  computation, serialized action fetch, and batch ingestion
- Added DagTipSync, RequestDagActions, SendDagActions message variants
  (feature-gated under "dag")
- Gossip loop (Task 2) now sends DagTipSync alongside BloomSync
- Message handler (Task 3) handles all 3 DAG message types:
  - DagTipSync: compute missing actions via DagStore::compute_missing()
  - RequestDagActions: fetch and send actions by hash
  - SendDagActions: ingest received actions via DagStore::ingest()
- Reuses existing DagStore::compute_missing() and ingest() APIs
End-to-end tests verifying data flows correctly across all AIngle
subsystems — these caught the sled lock contention bug where ProofStore
and GraphDB shared the same sled path.

- ProofStore Sled round-trip (20 proofs write/reopen/delete/verify)
- Graph+DAG triple materialization consistency (50 triples + deletes)
- Batch insert index consistency (100 triples, duplicate handling)
- AppState flush/restore full cycle (triples + proofs survive restart)
- Raft snapshot with proofs serialization round-trip
- Snapshot checksum changes when proofs are included
- Graph Sled persistence with float precision verification
- Audit log fsync integrity (50 entries write/reopen/query filters)
Audit found 5 locations where important data was written to disk without
sync_all(), meaning a crash or power loss could lose the data even after
a "successful" write returned.

- main.rs: DAG signing key — `let _ = write_all` replaced with proper
  error handling + fsync (key loss broke all future DAG signatures)
- kaneru/persistence.rs: Agent state + LearningEngine saves now fsync
  after write_all (ML weights/Q-values could be lost)
- ineru/lib.rs: Memory snapshot now uses File::create + write_all +
  sync_all instead of std::fs::write (which never fsyncs)
- p2p/identity.rs: Node Ed25519 key now fsynced on both Unix and
  non-Unix (identity loss = can't rejoin P2P mesh)
- p2p/peer_store.rs: Known peers JSON now fsynced (peer list loss =
  must rediscover all network peers)
Mutex lock poisoning in the WAL writer caused panics that crashed the
entire node. The WAL is the most critical data path — a panic here
takes down all Raft consensus operations.

All 4 .lock().unwrap() calls replaced with .lock().map_err() that
returns io::Error, propagating the failure gracefully instead of
aborting the process.
RuleEngine used .write().unwrap() and .read().unwrap() on 9 lock
acquisitions. A panic in any thread holding these locks would cascade
to crash all subsequent validation/inference operations.

Replaced all 9 occurrences with .unwrap_or_else(|p| p.into_inner())
which recovers the data from a poisoned lock and continues operating.
Stats and inferred triples are non-critical — crashing the server over
a stats counter is disproportionate.
KaneruAgent::learn() called .unwrap() on current_state (Option) and
observation_history.back() (Option) — both panic if called before any
observation is recorded.

Replaced with early returns + log::warn for graceful degradation.
The agent now safely skips learning when called in an invalid state
instead of crashing the entire process.
p2p_status() and list_peers() called serde_json::to_value().unwrap()
which would panic the server if serialization ever failed. Replaced
with match that returns 500 Internal Server Error with error details.
…line

This release fixes 16 bugs found during an exhaustive input/output
audit of the entire AIngle data pipeline:

## Data integrity (6 fixes)
- Persistent ProofStore with Sled backend (was in-memory only)
- Proofs included in Raft cluster snapshots (new nodes got 0 proofs)
- Periodic auto-flush every 300s (crash = data loss window reduced)
- Audit log fsync + error reporting (was silently dropping writes)
- Atomic batch insert via sled::Batch (partial writes impossible)
- P2P DAG action sync via tip exchange (DAG wasn't replicated)

## fsync hardening (5 fixes)
- DAG signing key write — error handling + fsync
- Kaneru agent state + ML weights — fsync after save
- Ineru memory snapshot — fsync after write
- P2P node identity key — fsync on all platforms
- Peer store JSON — fsync after write

## Panic elimination (5 fixes)
- WAL writer: lock().unwrap() → lock().map_err() (4 sites)
- Rule engine: poisoned lock recovery (9 sites)
- Kaneru agent: unwrap on Option → graceful early return
- P2P REST endpoints: unwrap → HTTP 500 error response
- ProofStore init: blocking_write removed entirely

## Testing
- 8 new cross-subsystem data integrity tests
- 1092+ tests passing across all core crates, 0
Exposes GraphStore::insert_batch() via REST API for efficient bulk
data loading. Uses sled::Batch for atomic writes when using Sled backend.

- POST /api/v1/triples/batch with JSON body {"triples": [...]}
- Returns 201 with inserted IDs, total count, and duplicate count
- Validates all inputs before writing (empty subject/predicate → 400)
- Namespace scoping enforced per triple
- Duplicates silently skipped (reported in response)
- Audit log records batch_create with insert/duplicate counts
- Events broadcast for each new triple
GET /api/v1/proofs/:id/verify returned 422 when the stored proof_data
didn't match the expected ZkProof structure (e.g. user submitted
arbitrary JSON without the required commitment/challenge/response
fields).

Now:
- Malformed proof data → 200 with valid:false + error details
- Proof not found → 404
- Valid proof → 200 with valid:true

This matches the semantic contract: verification tells you whether a
proof is valid, it shouldn't fail with a server error just because
the proof data is garbage.
@ApiliumDevTeam ApiliumDevTeam merged commit 22b45ae into main Mar 19, 2026
21 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant