dariuszduszynski/ELSA

ELSA - Easy Log Search Archival

Cold layer archival search for SOC teams. Built on the same philosophy as DES: S3 as the only source of truth, fully stateless compute nodes, zero mandatory external databases.

"Let it go" — once data reaches the archive, it stays there. Immutable, verifiable, searchable.


Problem Statement

Every SOC team faces the same tension: logs must be kept for years, but storage systems designed for search were never designed for retention — and storage systems designed for retention were never designed for search.

The typical architecture that emerges:

  • Wazuh or Splunk for real-time alerting — fast, expensive, short retention
  • Elasticsearch for hot search — fast, very expensive, short retention
  • Cold storage (S3, tape, NFS) for compliance — cheap, but unsearchable

When a SOC analyst needs to investigate an incident from 6 months ago — a compromised IP address, a lateral movement trail, a suspicious username — they face a wall: the hot systems have already purged the data, and the cold archive has no index. The answer to "what did 185.220.101.42 do in February?" requires hours of manual log retrieval, decompression, and grep.

ELSA solves this. It is the missing layer: cheap S3 storage with a searchable index, compliance-grade immutability, and a query model built for SOC workflows.


What ELSA Is (and Is Not)

ELSA IS:

  • An archival log storage system with entity-centric search (IP, username, hostname, session ID)
  • A multi-stream query system — SOC analysts work across log streams simultaneously, and ELSA is designed for this
  • A compliance-grade archive (S3 Object Lock / WORM, GDPR tombstone deletion, audit trail)
  • A natural extension of logRotate semantics — data flows in when hot systems are done with it
  • A complement to Wazuh/Elastic, not a competitor

ELSA IS NOT:

  • A real-time alerting or correlation engine (that's Wazuh's job)
  • A full-text search engine (that's Elasticsearch's job)
  • A replacement for your hot storage layer
  • A streaming analytics platform

The Ecosystem

T+0 → T+72h              T+72h → T+30d          T+30d → ∞
┌────────────────────┐   ┌────────────────────┐  ┌────────────────────┐
│  WAZUH + Elastic   │   │  Wazuh hot storage │  │  ELSA              │
│                    │   │  Fast DBs          │  │                    │
│  Real-time alerts  │   │  Fast search       │  │  Archive + Search  │
│  Correlation rules │   │  Last 30 days      │  │  Compliance/WORM   │
│  Active response   │   │                    │  │  Entity lookup     │
└────────────────────┘   └────────────────────┘  └────────────────────┘
                                                          ↑
                                              logRotate feeds data here

Core Architecture

Philosophical Foundation

ELSA inherits the core principle from DES: S3 is the only source of truth. Every other component is either a cache or a compute node — destroyable and reconstructible at any time.

The implication: Redis (the metadata cache) can be wiped and fully rebuilt from S3 manifests in minutes. No data lives exclusively in Redis. No PostgreSQL is required.

The D+1 Model

CONTINUOUS WRITE (no locking, no conflicts)
  Ingestors → micro-splits → s3://staging/{stream}/{today}/

NIGHTLY COMPACTION (single writer, zero race conditions)
  Nightly Job → merge → index → promote to archive → rebuild Redis

QUERY (stateless, any node can serve any query)
  Redis metastore → identify candidate splits → S3 Range-GET → results

Multi-Stream Query Model

A core design requirement: SOC analysts pivot across streams simultaneously. An entity timeline for IP 185.220.101.42 must return events from audit_logs, firewall_logs, and app_logs in a single request, merged and sorted by timestamp.

ELSA implements this through a cluster-level Redis namespace that maintains a cross-stream entity index, and a query planner that fans out to all relevant streams in parallel, merging results before returning to the caller. This is a first-class design concern, not an afterthought.
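The fan-out and merge step can be illustrated with a minimal sketch. It assumes each stream returns its matches already sorted by timestamp; the `STREAMS` data and the `search_stream` and `entity_timeline` names are hypothetical and stand in for the real per-stream lookup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for per-stream results.
# Each event is (timestamp, stream, message); each list is sorted by timestamp.
STREAMS = {
    "audit_logs": [(1707000000, "audit_logs", "login failed"),
                   (1707000300, "audit_logs", "login ok")],
    "firewall_logs": [(1707000100, "firewall_logs", "deny tcp/22")],
    "app_logs": [(1707000200, "app_logs", "GET /admin")],
}


def search_stream(stream, entity):
    """Placeholder for a per-stream entity lookup (assumed sorted output)."""
    return STREAMS.get(stream, [])


def entity_timeline(entity, streams):
    """Query all streams in parallel, then k-way merge by timestamp."""
    with ThreadPoolExecutor(max_workers=max(1, len(streams))) as pool:
        per_stream = list(pool.map(lambda s: search_stream(s, entity), streams))
    # heapq.merge keeps the merge streaming: inputs are never re-sorted.
    return list(heapq.merge(*per_stream, key=lambda event: event[0]))


timeline = entity_timeline("185.220.101.42", list(STREAMS))
```

Because each input is already ordered, a k-way merge avoids sorting the combined result set, which matters when one stream returns far more events than the others.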


Security Model — Ingestor Authentication

All ingestion endpoints require authentication. Unauthenticated writes to staging are a data poisoning vector — data ingested today becomes WORM-locked archive tomorrow.

  • Syslog over TLS: mutual TLS with per-source certificates
  • HTTP POST: Bearer token (HMAC, same model as DES)
  • Kafka: SASL/SCRAM per consumer group
  • Source identity is embedded in the split metadata and audit trail
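As an illustration of the HTTP POST case, the sketch below verifies an HMAC-signed bearer token. The token layout (`source_id.expires_at.signature`) and the function names are assumptions made for this example; the source only states that the token is HMAC-based.

```python
import hashlib
import hmac
import time
from typing import Optional


def _sign(secret: bytes, payload: str) -> str:
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()


def issue_token(secret: bytes, source_id: str, expires_at: int) -> str:
    """Build a token of the assumed form source_id.expires_at.signature."""
    payload = f"{source_id}.{expires_at}"
    return f"{payload}.{_sign(secret, payload)}"


def verify_token(secret: bytes, token: str, now: Optional[int] = None) -> Optional[str]:
    """Return the source identity if the token is valid, otherwise None."""
    try:
        source_id, expires_raw, signature = token.rsplit(".", 2)
        expires_at = int(expires_raw)
    except ValueError:
        return None  # malformed token
    expected = _sign(secret, f"{source_id}.{expires_at}")
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    current = int(time.time()) if now is None else now
    if current >= expires_at:
        return None  # expired
    return source_id
```

The returned source identity is what would be written into the split metadata and audit trail.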

Compliance Design

WORM and the logRotate Boundary

  • Staging: mutable — ingestors can correct errors before compaction
  • Archive: immutable — S3 Object Lock applied immediately after promotion

GDPR vs WORM Tension

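ELSA resolves this tension through tombstone-based deletion: records subject to an erasure request are marked with tombstones rather than rewritten in place. Physical deletion via repack is possible only when S3 Object Lock runs in GOVERNANCE mode, because COMPLIANCE-mode retention cannot be bypassed by any user.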

Chain Hash Continuity After Repack

When a repack operation physically removes tombstoned records, it creates new splits with new hashes — which would break the cryptographic chain for all subsequent splits. ELSA resolves this with repack anchor entries in the audit trail:

REPACK_ANCHOR entry:
  old_split_id    → hash of original split
  new_split_id    → hash of repacked split
  tombstone_ids   → list of removed doc_ids
  anchor_hash     → sha256(previous_chain_hash + old_split_hash + new_split_hash)

Auditors verify integrity using the repack-aware chain verifier, which treats anchor points as valid chain continuations. The chain is unbroken — it has documented mutations.
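A minimal sketch of how an anchor entry could be built and checked. It assumes the `+` in the formula means concatenation of hex digest strings and that the entry is stored as a JSON-like dict; both are assumptions, as are the function names.

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


def make_repack_anchor(prev_chain_hash, old_split_hash, new_split_hash, tombstone_ids):
    """Build a REPACK_ANCHOR entry linking the old split hash to the new one."""
    return {
        "type": "REPACK_ANCHOR",
        "old_split_id": old_split_hash,
        "new_split_id": new_split_hash,
        "tombstone_ids": list(tombstone_ids),
        # Assumed: '+' in the formula is string concatenation of hex digests.
        "anchor_hash": sha256_hex(prev_chain_hash + old_split_hash + new_split_hash),
    }


def verify_repack_anchor(prev_chain_hash, entry):
    """Recompute the anchor hash and compare it with the recorded value."""
    expected = sha256_hex(prev_chain_hash + entry["old_split_id"] + entry["new_split_id"])
    return expected == entry["anchor_hash"]


prev = sha256_hex("chain state before repack")
anchor = make_repack_anchor(prev, sha256_hex("old split"), sha256_hex("new split"), ["t1"])
```

A repack-aware verifier would accept the anchor as the next chain link only if `verify_repack_anchor` returns True for the chain hash that precedes it.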

BypassGovernanceRetention — Privileged Operation

The s3:BypassGovernanceRetention IAM permission required for GDPR repack is treated as a privileged, break-glass operation:

  • Held by a dedicated IAM role, not by any service account used in normal operations
  • Requires MFA authentication before assumption
  • Every use is recorded in CloudTrail and triggers an alert to the CISO
  • Credentials for the role are never embedded in application configuration; they are retrieved from OpenBao at repack time as a short-lived token

Index Architecture

Three-Layer Index

QUERY: src_ip = 1.2.3.4, time: last 3 weeks, streams: all

LAYER 0: Redis cluster entity index (0 S3 GETs)
  → "which streams × weekly segments contain this IP?" → cross-stream map

LAYER 1: Bloom filter per split (1 Range-GET per split, hotcache section)
  → probabilistic elimination, ~1% FPR

LAYER 2: Inverted index posting list (1 Range-GET per qualifying split)
  → exact list of doc_ids → fetch records → merge across streams → return
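The Layer 1 elimination step can be sketched with a small Bloom filter sized for a target false-positive rate. This is an illustrative implementation using double hashing over SHA-256, not ELSA's binary format; the class and function names are hypothetical.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, fpr=0.01):
        n = max(1, expected_items)
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.m = max(8, math.ceil(-n * math.log(fpr) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, value):
        digest = hashlib.sha256(value.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def candidate_splits(split_filters, entity):
    """Keep only splits whose filter cannot rule the entity out."""
    return [sid for sid, bf in split_filters.items() if bf.might_contain(entity)]
```

A split that fails the filter check never needs its inverted index fetched, so only splits that pass proceed to Layer 2.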

The Split Format (.elsa)

The format was previously named .ldes and was renamed to .elsa for consistency with the project name.

Each archive unit is a self-contained binary file stored on S3. The format is versioned from v1, with a version byte in the magic header enabling forward-compatible readers.

Split file (.elsa) — versioned binary format
┌───────────────────────────────────────┐
│  MAGIC (4B: "ELSA") + VERSION (1B)    │  ← version enables format evolution
│  HEADER: stream, time_range, schema   │
├───────────────────────────────────────┤
│  HOTCACHE SECTION (~50–200KB)         │
│  Bloom filters (versioned, own format)│  ← NOT Java serialization
│  Sparse index (every 256th record)    │
│  Column min/max statistics            │
├───────────────────────────────────────┤
│  COLUMNAR DATA SECTION (zstd)         │
├───────────────────────────────────────┤
│  INVERTED INDEX SECTION               │
│  entity_value → posting list          │
├───────────────────────────────────────┤
│  FOOTER (last 32B)                    │
│  Section offsets + CRC32              │
└───────────────────────────────────────┘

Bloom filter serialization: ELSA uses its own binary Bloom filter format (not Guava's Java serialization) to ensure library-version independence. The format is documented in docs/bloom-filter-format.md.
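To make the header and footer concrete, here is a sketch of a reader for both. The footer field layout is an assumption for illustration only (three little-endian 64-bit section offsets, a CRC32 over those offsets, and 4 padding bytes); the source specifies only the 32-byte size, the section offsets, and the CRC32.

```python
import struct
import zlib

MAGIC = b"ELSA"
# Assumed layout: 3 x uint64 offsets + uint32 CRC32 + 4 pad bytes = 32 bytes.
FOOTER = struct.Struct("<QQQI4x")


def build_footer(hotcache_off, columnar_off, index_off):
    offsets = struct.pack("<QQQ", hotcache_off, columnar_off, index_off)
    return FOOTER.pack(hotcache_off, columnar_off, index_off, zlib.crc32(offsets))


def read_version(head: bytes) -> int:
    """Check the 4-byte magic and return the version byte that follows it."""
    if len(head) < 5 or head[:4] != MAGIC:
        raise ValueError("not an .elsa split")
    return head[4]


def parse_footer(tail: bytes) -> dict:
    """Validate the CRC32 and return the section offsets."""
    if len(tail) != FOOTER.size:
        raise ValueError("footer must be exactly 32 bytes")
    hotcache_off, columnar_off, index_off, crc = FOOTER.unpack(tail)
    if zlib.crc32(tail[:24]) != crc:
        raise ValueError("footer CRC mismatch")
    return {"hotcache": hotcache_off, "columnar": columnar_off,
            "inverted_index": index_off}
```

With a fixed-size footer, a reader can fetch the last 32 bytes of the object with a suffix range request, then issue targeted Range-GETs for only the sections it needs.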


S3 Layout

s3://logs-bucket/
├── _catalog/
│   ├── cluster.json                         ← cluster-wide stream registry + cross-stream entity index roots
│   └── {stream}/
│       ├── current                          ← pointer to active snapshot (updated with S3 conditional PUT)
│       ├── snap_{N}.json
│       └── manifests/
│           └── man_{YYYY-WNN}.json
├── splits/
│   └── {stream}/{YYYY}/{WNN}/
│       └── {split_id}.elsa                  ← archive splits (WORM-locked)
├── indexes/
│   ├── {stream}/{YYYY-WNN}/
│   │   └── ip_index.idx                     ← per-stream weekly index
│   └── _cross_stream/{YYYY-WNN}/
│       └── entity_index.idx                 ← cross-stream entity index (NEW)
├── staging/
│   └── {stream}/{YYYY}/{MM}/{DD}/
│       └── {micro_split_id}.elsa
├── tombstones/
│   └── {stream}/{request_id}.json           ← doc_ids stored as opaque hashes, not raw IDs
└── audit-trail/
    └── {stream}/{YYYY}/{MM}/
        └── audit_{DD}.jsonl
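The key templates above can be turned into small helper functions. This sketch assumes that `WNN` denotes an ISO 8601 week number written with a literal `W` prefix and that `YYYY` in weekly paths is the ISO week-numbering year; the layout does not state either, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def iso_year_week(ts: datetime):
    """Return (ISO year, ISO week); the ISO year can differ from ts.year."""
    iso = ts.isocalendar()
    return iso[0], iso[1]


def split_key(stream: str, ts: datetime, split_id: str) -> str:
    year, week = iso_year_week(ts)
    return f"splits/{stream}/{year:04d}/W{week:02d}/{split_id}.elsa"


def manifest_key(stream: str, ts: datetime) -> str:
    year, week = iso_year_week(ts)
    return f"_catalog/{stream}/manifests/man_{year:04d}-W{week:02d}.json"


def staging_key(stream: str, ts: datetime, micro_split_id: str) -> str:
    # Staging is partitioned by calendar day, not by week.
    return f"staging/{stream}/{ts:%Y}/{ts:%m}/{ts:%d}/{micro_split_id}.elsa"


ts = datetime(2026, 2, 14, tzinfo=timezone.utc)
key = split_key("firewall_logs", ts, "abc123")
```

Using the ISO year matters around New Year: under this assumption, 29 December 2025 belongs to week 1 of 2026, so its split lands under `2026/W01`.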

Epics

| Epic | Title | Key changes vs original design |
|---|---|---|
| EPIC-01 | Core Architecture & Split Format | Format versioning, Bloom filter own binary format |
| EPIC-02 | Ingestor & Schema Normalization | Ingestor auth (mTLS/HMAC/SASL), schema evolution strategy |
| EPIC-03 | Nightly Compaction & Index Build | Job overlap guard, cross-stream index build, S3 conditional PUT |
| EPIC-04 | Redis Metastore | Cluster namespace, eviction policy for high-cardinality, DR runbook |
| EPIC-05 | Query Engine | Cross-stream fan-out, rate limiting, staging scan cap |
| EPIC-06 | Compliance Layer | Chain hash with repack anchors, BypassGovernanceRetention hardening, tombstone privacy, entity-scoped legal hold |
| EPIC-07 | logRotate Integration | Authenticated import, format unchanged |
| EPIC-08 | SOC API & Query Interface | Cross-stream timeline, rate limiting headers |

Estimated scope: ~380 story points (additional ~37 SP for risk mitigations).

MVP (v0.2.0): EPIC-01 through EPIC-04 plus EPIC-07.


Technology Stack

| Component | Technology | Rationale |
|---|---|---|
| Runtime | Java 21 + Quarkus | Aligned with DES 2.0 |
| Object storage | S3-compatible (AWS, MinIO, Ceph/RGW) | Same as DES |
| Metadata cache | Redis 7.x | Atomic Lua scripts; sorted sets for time-range |
| Compression | zstd level 3 | Columnar data benefits significantly |
| Bloom filter | Custom binary format (inspired by Guava) | Library-version independence |
| Posting list codec | VByte delta encoding | 1–2 bytes per doc_id typical |
| Secrets management | OpenBao (Vault fork) | BypassGovernanceRetention token management |
| Container orchestration | Kubernetes | CronJob for nightly compaction |
| Observability | Prometheus + Grafana | Nightly job success/failure alerts included |
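The posting list codec named in the table can be sketched as follows. This is one common VByte variant (7 payload bits per byte, high bit set on every byte except the last of each number); ELSA's actual byte layout is not specified in the source, so treat the details as assumptions.

```python
def vbyte_encode(n: int) -> bytes:
    """Encode a non-negative integer, least significant 7 bits first."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)  # continuation bit set
        n >>= 7
    out.append(n)                      # final byte, continuation bit clear
    return bytes(out)


def encode_postings(doc_ids) -> bytes:
    """Delta-encode ascending doc_ids, then VByte-encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        if doc_id < prev:
            raise ValueError("doc_ids must be non-negative and ascending")
        out += vbyte_encode(doc_id - prev)
        prev = doc_id
    return bytes(out)


def decode_postings(data: bytes) -> list:
    """Invert encode_postings: rebuild gaps, then prefix-sum to doc_ids."""
    doc_ids, current, shift, prev = [], 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += current
            doc_ids.append(prev)
            current, shift = 0, 0
    return doc_ids
```

Gaps below 128 take one byte and gaps below 16,384 take two, which is consistent with the "1–2 bytes per doc_id typical" figure for dense posting lists.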

Key Design Decisions — Summary

  1. S3 is the only source of truth. Redis can be wiped and rebuilt at any time.
  2. D+1 model eliminates race conditions. One writer per manifest per night.
  3. Multi-stream query is first class. SOC analysts do not work with one stream at a time.
  4. Ingestor authentication is mandatory. Unauthenticated writes are a data poisoning vector.
  5. Bloom filters use their own binary format. Not Guava Java serialization, so the format is library-version independent.
  6. Chain hash survives repack via anchor entries. GDPR deletion does not break audit trail integrity.
  7. BypassGovernanceRetention is break-glass. MFA, short-lived token, CloudTrail alert.
  8. Tombstones store hashed doc_ids. No raw personal data identifiers in audit trail.
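Decision 8 can be illustrated with a short sketch of how hashed doc_ids might be written and later matched at query time. The hashing scheme (unkeyed SHA-256 over `stream:doc_id`) and the tombstone field names are assumptions for this example; the source says only that doc_ids are stored as opaque hashes.

```python
import hashlib


def opaque_id(stream: str, doc_id: str) -> str:
    """Derive an opaque identifier so the raw doc_id is never stored."""
    return hashlib.sha256(f"{stream}:{doc_id}".encode()).hexdigest()


def build_tombstone(stream: str, request_id: str, doc_ids) -> dict:
    """Hypothetical tombstone document for one erasure request."""
    return {
        "request_id": request_id,
        "stream": stream,
        "doc_hashes": sorted(opaque_id(stream, d) for d in doc_ids),
    }


def filter_tombstoned(stream: str, records, tombstone: dict) -> list:
    """Drop records whose hashed doc_id appears in the tombstone."""
    hidden = set(tombstone["doc_hashes"])
    return [r for r in records if opaque_id(stream, r["doc_id"]) not in hidden]


tombstone = build_tombstone("app_logs", "req-001", ["doc-17", "doc-42"])
```

If doc_ids are low-entropy or guessable, a keyed hash (for example HMAC with a per-cluster secret) would be a safer choice than the unkeyed hash used here.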

Relationship to DES

ELSA is a sibling project to DES, developed by the same team at Datavision.pl.


License

Apache License 2.0


Datavision.pl — data science consultancy and infrastructure tooling. Project status: design phase. Implementation begins Q2 2026.
