GitHub - dariuszduszynski/ELSA: Easy Log Storage Archive

ELSA - Easy Log Search Archival

Cold layer archival search for SOC teams. Built on the same philosophy as DES: S3 as the only source of truth, fully stateless compute nodes, zero mandatory external databases.

"Let it go" — once data reaches the archive, it stays there. Immutable, verifiable, searchable.

Problem Statement

Every SOC team faces the same tension: logs must be kept for years, but storage systems designed for search were never designed for retention — and storage systems designed for retention were never designed for search.

The typical architecture that emerges:

Wazuh or Splunk for real-time alerting — fast, expensive, short retention
Elasticsearch for hot search — fast, very expensive, short retention
Cold storage (S3, tape, NFS) for compliance — cheap, but unsearchable

When a SOC analyst needs to investigate an incident from 6 months ago — a compromised IP address, a lateral movement trail, a suspicious username — they face a wall: the hot systems have already purged the data, and the cold archive has no index. The answer to "what did 185.220.101.42 do in February?" requires hours of manual log retrieval, decompression, and grep.

ELSA solves this. It is the missing layer: cheap S3 storage with a searchable index, compliance-grade immutability, and a query model built for SOC workflows.

What ELSA Is (and Is Not)

ELSA IS:

An archival log storage system with entity-centric search (IP, username, hostname, session ID)
A multi-stream query system — SOC analysts work across log streams simultaneously, and ELSA is designed for this
A compliance-grade archive (S3 Object Lock / WORM, GDPR tombstone deletion, audit trail)
A natural extension of logRotate semantics — data flows in when hot systems are done with it
A complement to Wazuh/Elastic, not a competitor

ELSA IS NOT:

A real-time alerting or correlation engine (that's Wazuh's job)
A full-text search engine (that's Elasticsearch's job)
A replacement for your hot storage layer
A streaming analytics platform

The Ecosystem

T+0 → T+72h              T+72h → T+30d          T+30d → ∞
┌────────────────────┐   ┌────────────────────┐  ┌────────────────────┐
│  WAZUH + Elastic   │   │  Wazuh hot storage │  │  ELSA              │
│                    │   │  Fast DBs          │  │                    │
│  Real-time alerts  │   │  Fast search       │  │  Archive + Search  │
│  Correlation rules │   │  Last 30 days      │  │  Compliance/WORM   │
│  Active response   │   │                    │  │  Entity lookup     │
└────────────────────┘   └────────────────────┘  └────────────────────┘
                                                          ↑
                                              logRotate feeds data here

Core Architecture

Philosophical Foundation

ELSA inherits the core principle from DES: S3 is the only source of truth. Every other component is either a cache or a compute node — destroyable and reconstructible at any time.

The implication: Redis (the metadata cache) can be wiped and fully rebuilt from S3 manifests in minutes. No data lives exclusively in Redis. No PostgreSQL is required.

The D+1 Model

CONTINUOUS WRITE (no locking, no conflicts)
  Ingestors → micro-splits → s3://staging/{stream}/{today}/

NIGHTLY COMPACTION (single writer, zero race conditions)
  Nightly Job → merge → index → promote to archive → rebuild Redis

QUERY (stateless, any node can serve any query)
  Redis metastore → identify candidate splits → S3 Range-GET → results

Multi-Stream Query Model

A core design requirement: SOC analysts pivot across streams simultaneously. An entity timeline for IP 185.220.101.42 must return events from audit_logs, firewall_logs, and app_logs in a single request, merged and sorted by timestamp.

ELSA implements this through a cluster-level Redis namespace that maintains a cross-stream entity index, and a query planner that fans out to all relevant streams in parallel, merging results before returning to the caller. This is a first-class design concern, not an afterthought.
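
The merge step of that fan-out can be sketched as a k-way merge over per-stream result lists. This is a minimal illustration, not ELSA's query planner: the `LogEvent` record and the `TimelineMerger` class are hypothetical names, and the sketch assumes each stream already returns its events sorted by timestamp.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical event shape; the real ELSA record schema is not specified here.
record LogEvent(Instant timestamp, String stream, String message) {}

final class TimelineMerger {
    // One cursor per stream: the current head event plus the rest of that stream's results.
    private record Cursor(LogEvent head, Iterator<LogEvent> rest) {}

    /** Merges per-stream result lists (each already sorted by timestamp) into one timeline. */
    static List<LogEvent> merge(List<List<LogEvent>> perStream) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>(Comparator.comparing((Cursor c) -> c.head().timestamp()));
        for (List<LogEvent> events : perStream) {
            Iterator<LogEvent> it = events.iterator();
            if (it.hasNext()) {
                heap.add(new Cursor(it.next(), it));
            }
        }
        List<LogEvent> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();          // earliest head across all streams
            merged.add(c.head());
            if (c.rest().hasNext()) {
                heap.add(new Cursor(c.rest().next(), c.rest()));
            }
        }
        return merged;
    }
}
```

With k streams and n events in total the merge costs O(n log k), so adding streams to a query stays cheap compared with the S3 reads that produce the inputs.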

Security Model — Ingestor Authentication

All ingestion endpoints require authentication. Unauthenticated writes to staging are a data poisoning vector — data ingested today becomes WORM-locked archive tomorrow.

Syslog over TLS: mutual TLS with per-source certificates
HTTP POST: Bearer token (HMAC, same model as DES)
Kafka: SASL/SCRAM per consumer group
Source identity is embedded in the split metadata and audit trail
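
As an illustration of the HTTP POST path, the sketch below verifies an HMAC-signed bearer token. The token layout (`<sourceId>.<hex HMAC-SHA256 of sourceId>`) and the `IngestTokenVerifier` class are assumptions made for this example; the actual DES token model is not described in this document and may differ.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.HexFormat;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

final class IngestTokenVerifier {
    private final byte[] secret;

    IngestTokenVerifier(byte[] secret) {
        this.secret = secret.clone();
    }

    /** Issues a token of the assumed form "<sourceId>.<hex HMAC-SHA256(sourceId)>". */
    String issue(String sourceId) {
        return sourceId + "." + HexFormat.of().formatHex(sign(sourceId));
    }

    /** Returns true only if the signature part matches the source id part. */
    boolean verify(String token) {
        int dot = token.lastIndexOf('.');
        if (dot <= 0) {
            return false; // no source id or no signature
        }
        byte[] presented;
        try {
            presented = HexFormat.of().parseHex(token.substring(dot + 1));
        } catch (IllegalArgumentException e) {
            return false; // signature part is not valid hex
        }
        // MessageDigest.isEqual compares in constant time, which avoids timing leaks.
        return MessageDigest.isEqual(sign(token.substring(0, dot)), presented);
    }

    private byte[] sign(String sourceId) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return mac.doFinal(sourceId.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable", e);
        }
    }
}
```

A production token would normally also carry an expiry and a key identifier; both are omitted here to keep the signature check itself visible.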

Compliance Design

WORM and the logRotate Boundary

Staging: mutable — ingestors can correct errors before compaction
Archive: immutable — S3 Object Lock applied immediately after promotion

GDPR vs WORM Tension

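The tension is resolved through tombstone-based deletion: an erasure request first records a tombstone, and a later repack physically removes the tombstoned records. Physical deletion via repack is only possible when S3 Object Lock runs in GOVERNANCE mode, because retention in COMPLIANCE mode cannot be bypassed by any user.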

Chain Hash Continuity After Repack

When a repack operation physically removes tombstoned records, it creates new splits with new hashes — which would break the cryptographic chain for all subsequent splits. ELSA resolves this with repack anchor entries in the audit trail:

REPACK_ANCHOR entry:
  old_split_id    → hash of original split
  new_split_id    → hash of repacked split
  tombstone_ids   → list of removed doc_ids
  anchor_hash     → sha256(previous_chain_hash + old_split_hash + new_split_hash)

Auditors verify integrity using the repack-aware chain verifier, which treats anchor points as valid chain continuations. The chain is unbroken — it has documented mutations.
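
The anchor computation can be sketched with the JDK's built-in SHA-256. The formula comes from the REPACK_ANCHOR entry above; everything else is an assumption for illustration: the `RepackAnchor` class name, the reading of `+` as concatenation of hex strings, and the UTF-8 encoding of the concatenated input.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

final class RepackAnchor {
    /** anchor_hash = sha256(previous_chain_hash + old_split_hash + new_split_hash). */
    static String anchorHash(String previousChainHash, String oldSplitHash, String newSplitHash) {
        // Assumption: "+" means concatenation of the hex strings, encoded as UTF-8.
        String input = previousChainHash + oldSplitHash + newSplitHash;
        return sha256Hex(input.getBytes(StandardCharsets.UTF_8));
    }

    static String sha256Hex(byte[] data) {
        try {
            return HexFormat.of().formatHex(MessageDigest.getInstance("SHA-256").digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    /** Verifier side: recompute the anchor and compare it with the recorded value. */
    static boolean verify(String previousChainHash, String oldSplitHash,
                          String newSplitHash, String recordedAnchorHash) {
        return anchorHash(previousChainHash, oldSplitHash, newSplitHash)
                .equals(recordedAnchorHash);
    }
}
```

A repack-aware verifier would walk the audit trail, and on reaching a REPACK_ANCHOR entry call `verify` with the chain hash it has accumulated so far, then continue the chain from the anchor hash.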

BypassGovernanceRetention — Privileged Operation

Use of the s3:BypassGovernanceRetention IAM permission, which GDPR repack requires, is treated as a privileged, break-glass operation:

Held by a dedicated IAM role, not by any service account used in normal operations
Requires MFA authentication before assumption
Every usage generates a CloudTrail alert to the CISO
Credentials for this role are never embedded in application configuration; a short-lived token is retrieved from OpenBao at repack time

Index Architecture

Three-Layer Index

QUERY: src_ip = 1.2.3.4, time: last 3 weeks, streams: all

LAYER 0: Redis cluster entity index (0 S3 GETs)
  → "which streams × weekly segments contain this IP?" → cross-stream map

LAYER 1: Bloom filter per split (1 Range-GET per split, hotcache section)
  → probabilistic elimination, ~1% FPR

LAYER 2: Inverted index posting list (1 Range-GET per qualifying split)
  → exact list of doc_ids → fetch records → merge across streams → return
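
To make Layer 1 concrete, here is a minimal Bloom filter sized for a target false-positive rate such as the ~1% quoted above. It is a sketch only: the `SplitBloomFilter` class, the SHA-256-based double hashing, and the in-memory `BitSet` are assumptions for this example and say nothing about ELSA's on-disk Bloom filter format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;

final class SplitBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SplitBloomFilter(int expectedEntries, double falsePositiveRate) {
        // Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hash functions.
        double ln2 = Math.log(2);
        this.numBits = (int) Math.ceil(-expectedEntries * Math.log(falsePositiveRate) / (ln2 * ln2));
        this.numHashes = Math.max(1, (int) Math.round((double) numBits / expectedEntries * ln2));
        this.bits = new BitSet(numBits);
    }

    void add(String entity) {
        long[] h = baseHashes(entity);
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(h, i));
        }
    }

    /** false means "definitely absent"; true means "possibly present". */
    boolean mightContain(String entity) {
        long[] h = baseHashes(entity);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(h, i))) {
                return false;
            }
        }
        return true;
    }

    int numHashes() {
        return numHashes;
    }

    // Double hashing: the i-th probe position is (h1 + i * h2) mod m.
    private int index(long[] h, int i) {
        return (int) Math.floorMod(h[0] + (long) i * h[1], (long) numBits);
    }

    // Derives two 64-bit base hashes from the first 16 bytes of a SHA-256 digest.
    private static long[] baseHashes(String entity) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(entity.getBytes(StandardCharsets.UTF_8));
            ByteBuffer buf = ByteBuffer.wrap(digest);
            return new long[] {buf.getLong(), buf.getLong()};
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

For 1,000 expected entities at a 1% target rate the sizing formulas give roughly 9.6 bits per entity and 7 hash functions, which is why a per-split filter fits comfortably in a hotcache section of tens of kilobytes.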

The Split Format (.elsa)

Previously .ldes. Renamed to .elsa for consistency with the project name.

Each archive unit is a self-contained binary file stored on S3. The format is versioned from v1, with a version byte in the magic header enabling forward-compatible readers.

Split file (.elsa) — versioned binary format
┌───────────────────────────────────────┐
│  MAGIC (4B: "ELSA") + VERSION (1B)    │  ← version enables format evolution
│  HEADER: stream, time_range, schema   │
├───────────────────────────────────────┤
│  HOTCACHE SECTION (~50–200KB)         │
│  Bloom filters (versioned, own format)│  ← NOT Java serialization
│  Sparse index (every 256th record)    │
│  Column min/max statistics            │
├───────────────────────────────────────┤
│  COLUMNAR DATA SECTION (zstd)         │
├───────────────────────────────────────┤
│  INVERTED INDEX SECTION               │
│  entity_value → posting list          │
├───────────────────────────────────────┤
│  FOOTER (last 32B)                    │
│  Section offsets + CRC32              │
└───────────────────────────────────────┘

Bloom filter serialization: ELSA uses its own binary Bloom filter format (not Guava's Java serialization) to ensure library-version independence. Format is documented in docs/bloom-filter-format.md.
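
A reader can reject foreign or too-new files by checking the first five bytes before touching any other section. The sketch below assumes only what the diagram states (4-byte `ELSA` magic followed by a 1-byte version, versions starting at 1); the `SplitHeader` class and its `MAX_SUPPORTED_VERSION` constant are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class SplitHeader {
    static final byte[] MAGIC = "ELSA".getBytes(StandardCharsets.US_ASCII);
    static final int MAX_SUPPORTED_VERSION = 1; // assumption: this reader knows v1 only

    /** Reads the 5-byte prefix (4-byte magic + 1-byte version) and returns the version. */
    static int readVersion(ByteBuffer file) {
        if (file.remaining() < 5) {
            throw new IllegalArgumentException("truncated split: no magic header");
        }
        byte[] magic = new byte[4];
        file.get(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IllegalArgumentException("not an .elsa split");
        }
        int version = Byte.toUnsignedInt(file.get());
        if (version == 0 || version > MAX_SUPPORTED_VERSION) {
            throw new IllegalArgumentException("unsupported .elsa format version " + version);
        }
        return version;
    }
}
```

Because the version byte sits at a fixed offset, the same 5-byte check works on the first bytes of an S3 Range-GET without downloading the whole split.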

S3 Layout

s3://logs-bucket/
├── _catalog/
│   ├── cluster.json                         ← cluster-wide stream registry + cross-stream entity index roots
│   └── {stream}/
│       ├── current                          ← pointer to active snapshot (updated with S3 conditional PUT)
│       ├── snap_{N}.json
│       └── manifests/
│           └── man_{YYYY-WNN}.json
├── splits/
│   └── {stream}/{YYYY}/{WNN}/
│       └── {split_id}.elsa                  ← archive splits (WORM-locked)
├── indexes/
│   ├── {stream}/{YYYY-WNN}/
│   │   └── ip_index.idx                     ← per-stream weekly index
│   └── _cross_stream/{YYYY-WNN}/
│       └── entity_index.idx                 ← cross-stream entity index (NEW)
├── staging/
│   └── {stream}/{YYYY}/{MM}/{DD}/
│       └── {micro_split_id}.elsa
├── tombstones/
│   └── {stream}/{request_id}.json           ← doc_ids stored as opaque hashes, not raw IDs
└── audit-trail/
    └── {stream}/{YYYY}/{MM}/
        └── audit_{DD}.jsonl
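
The key templates in this layout can be generated with `java.time`. This sketch assumes that `WNN` denotes the ISO-8601 week rendered as `W` plus two digits and that `YYYY` in weekly paths is the matching week-based year; neither is stated explicitly above. The `S3Keys` class and its method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.temporal.IsoFields;
import java.util.Locale;

final class S3Keys {
    /** splits/{stream}/{YYYY}/{WNN}/{split_id}.elsa */
    static String archiveSplit(String stream, LocalDate day, String splitId) {
        return String.format(Locale.ROOT, "splits/%s/%04d/W%02d/%s.elsa",
                stream, isoYear(day), isoWeek(day), splitId);
    }

    /** staging/{stream}/{YYYY}/{MM}/{DD}/{micro_split_id}.elsa */
    static String stagingSplit(String stream, LocalDate day, String microSplitId) {
        return String.format(Locale.ROOT, "staging/%s/%04d/%02d/%02d/%s.elsa",
                stream, day.getYear(), day.getMonthValue(), day.getDayOfMonth(), microSplitId);
    }

    /** _catalog/{stream}/manifests/man_{YYYY-WNN}.json */
    static String weeklyManifest(String stream, LocalDate day) {
        return String.format(Locale.ROOT, "_catalog/%s/manifests/man_%04d-W%02d.json",
                stream, isoYear(day), isoWeek(day));
    }

    // The week-based year can differ from the calendar year around New Year.
    private static int isoYear(LocalDate day) {
        return day.get(IsoFields.WEEK_BASED_YEAR);
    }

    private static int isoWeek(LocalDate day) {
        return day.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }
}
```

Using the week-based year matters at year boundaries: 29 December 2025 belongs to ISO week 1 of 2026, so pairing the calendar year with the ISO week would file that day under a week that does not exist in 2025's sequence.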

Epics

Epic	Title	Key changes vs original design
EPIC-01	Core Architecture & Split Format	Format versioning, Bloom filter own binary format
EPIC-02	Ingestor & Schema Normalization	Ingestor auth (mTLS/HMAC/SASL), schema evolution strategy
EPIC-03	Nightly Compaction & Index Build	Job overlap guard, cross-stream index build, S3 conditional PUT
EPIC-04	Redis Metastore	Cluster namespace, eviction policy for high-cardinality, DR runbook
EPIC-05	Query Engine	Cross-stream fan-out, rate limiting, staging scan cap
EPIC-06	Compliance Layer	Chain hash with repack anchors, BypassGovernanceRetention hardening, tombstone privacy, entity-scoped legal hold
EPIC-07	logRotate Integration	Authenticated import, format unchanged
EPIC-08	SOC API & Query Interface	Cross-stream timeline, rate limiting headers

Estimated scope: ~380 story points (additional ~37 SP for risk mitigations).

MVP (v0.2.0): EPIC-01 through EPIC-04 plus EPIC-07.

Technology Stack

Component	Technology	Rationale
Runtime	Java 21 + Quarkus	Aligned with DES 2.0
Object storage	S3-compatible (AWS, MinIO, Ceph/RGW)	Same as DES
Metadata cache	Redis 7.x	Atomic Lua scripts; sorted sets for time-range
Compression	zstd level 3	Columnar data benefits significantly
Bloom filter	Custom binary format (inspired by Guava)	Library-version independence
Posting list codec	VByte delta encoding	1–2 bytes per doc_id typical
Secrets management	OpenBao (Vault fork)	BypassGovernanceRetention token management
Container orchestration	Kubernetes	CronJob for nightly compaction
Observability	Prometheus + Grafana	Nightly job success/failure alerts included
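
The posting list codec row can be illustrated with a small VByte delta encoder. This is a sketch under stated assumptions, not the ELSA codec: doc_ids are taken to be non-negative, ascending `int` values, the first id is stored as a gap from zero, and the `PostingListCodec` class is a hypothetical name.

```java
import java.io.ByteArrayOutputStream;

final class PostingListCodec {
    /** Encodes ascending doc_ids as VByte-coded gaps. */
    static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            int gap = docId - previous; // delta step: store the gap, not the absolute id
            if (gap < 0) {
                throw new IllegalArgumentException("doc_ids must be ascending and non-negative");
            }
            previous = docId;
            while (gap >= 0x80) {       // 7 payload bits per byte, high bit means "more follows"
                out.write((gap & 0x7F) | 0x80);
                gap >>>= 7;
            }
            out.write(gap);             // final byte of this gap has the high bit clear
        }
        return out.toByteArray();
    }

    /** Decodes `count` doc_ids from a buffer produced by encode. */
    static int[] decode(byte[] encoded, int count) {
        int[] docIds = new int[count];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int gap = 0;
            int shift = 0;
            int b;
            do {
                b = encoded[pos++] & 0xFF;
                gap |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += gap;            // undo the delta step
            docIds[i] = previous;
        }
        return docIds;
    }
}
```

Gaps below 128 take one byte and gaps below 16,384 take two, which is where a figure of 1 to 2 bytes per doc_id comes from when matching records sit close together within a split.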

Key Design Decisions — Summary

S3 is the only source of truth. Redis can be wiped and rebuilt at any time.
D+1 model eliminates race conditions. One writer per manifest per night.
Multi-stream query is first class. SOC analysts do not work with one stream at a time.
Ingestor authentication is mandatory. Unauthenticated writes are a data poisoning vector.
Bloom filters use own binary format. Not Guava Java serialization — library-version independent.
Chain hash survives repack via anchor entries. GDPR deletion does not break audit trail integrity.
BypassGovernanceRetention is break-glass. MFA, short-lived token, CloudTrail alert.
Tombstones store hashed doc_ids. No raw personal data identifiers in audit trail.

Relationship to DES

ELSA is a sibling project to DES, developed by the same team at Datavision.pl.

License

Apache License 2.0

Datavision.pl — data science consultancy and infrastructure tooling. Project status: design phase. Implementation begins Q2 2026.
