-
Notifications
You must be signed in to change notification settings - Fork 0
Storage Layout
This document describes the storage layout and organization of data in Springtail's storage engine, from the low-level extent structure up through tables and indexes.
Springtail uses a multi-version concurrency control (MVCC) storage system with copy-on-write semantics. Data is organized hierarchically:
- Extents: Fixed-size blocks containing rows of data with a fixed-width schema
- Fields: Type-specific accessors for reading/writing column values within rows
- BTrees: B+ tree indexes composed of extents, used for both primary and secondary indexes
- Tables: Logical collections of data organized by a primary BTree and zero or more secondary indexes
- Cache: Unified page cache managing access to extents across all tables
An extent is the fundamental storage unit in Springtail. Each extent consists of three components stored sequentially on disk:
The header (serialized via ExtentHeader::pack()) contains metadata about the extent:
Offset Size Field
------ ---- -----
0 8 xid - Transaction ID when extent was written
8 8 prev_offset - Location of previous extent version (for MVCC)
16 4 row_size - Fixed size of each row in bytes
20 4 field_count - Number of fields in schema
24 N field_types - Array of field type bytes (1 byte per field)
24+N 1 type - Extent type (leaf/branch, root flag)
Field Type Encoding: Each field type byte encodes both the Springtail type (lower 7 bits) and nullable flag (top bit).
Extent Types:
-
LEAF(0x00): Contains actual data rows -
BRANCH(0x01): Contains index keys and child extent pointers - Root flag (0x02): Indicates this extent is a tree root
The fixed data section contains a contiguous array of fixed-width rows. Each row has exactly row_size bytes and stores:
- Fixed-width values: Integers, floats stored inline at specific byte offsets (at the beginning of the row)
- Variable-data offsets: 4-byte offsets into the variable data section for TEXT, BINARY, NUMERIC, and EXTENSION types
- Boolean fields: Packed as bits after all fixed-width fields
- Null bitmaps: Packed bits indicating which nullable fields are NULL (after boolean fields)
- Undefined bitmaps: Packed bits for replication tracking (indicates unchanged fields). Note: undefined bitmaps are only present in write cache extents, not in on-disk table extents.
Row layout is determined by the ExtentSchema (see src/storage/schema.cc:79-162), which places fixed-width fields first, then boolean bits, null bits, and finally undefined bits (if applicable) at the end of the row.
Example row structure for on-disk extents (simplified):
[int64_field: 8 bytes] [text_offset: 4 bytes] [int32_field: 4 bytes] [bool_field: 1 bit] [null_bits: N bits]
Example row structure for write cache extents (simplified):
[int64_field: 8 bytes] [text_offset: 4 bytes] [int32_field: 4 bytes] [bool_field: 1 bit] [null_bits: N bits] [undefined_bits: M bits]
Variable-length values (TEXT, BINARY, NUMERIC, EXTENSION types) are stored in a separate variable data section managed by the VariableData class. This provides:
-
Length-prefixed storage: Each value stored as
[4-byte length][data] - Automatic deduplication: Hash table detects duplicate values within an extent
- Chunked allocation: Data organized in 4KB chunks to avoid vector reallocation
- Global offset addressing: Fixed data stores 4-byte offsets into this global space
The deduplication hash (in Extent::_variable_hash) maps string_view → offset, ensuring identical strings share storage.
The Field class hierarchy provides type-safe access to column values within logical rows (often extent rows, but also replication log messages, constant values, etc.):
ExtentField (include/storage/field.hh:508): Accesses data in an Extent::Row by:
- Storing byte offset within the fixed row data
- For booleans: storing bit position and bitmask
- For nullable fields: tracking null bitmap offset and bit
- For variable data: reading offset from fixed data, then dereferencing into variable section
ConstField: Represents constant values (for query predicates, default values)
PgLogField: Reads data from PostgreSQL replication log messages
MutableField (include/storage/field.hh:325): Extends Field with set_* methods for writing values
Fields support:
- Type-specific
get_*andset_*methods (e.g.,get_int64(),set_text()) - Comparison operations (
less_than(),equal()) used for index searches - Null and undefined checks via bitmask operations
- Copying values between different row sources
Tables and indexes are organized as B+ trees composed of extents:
Read-Only BTree:
- Immutable snapshot at a specific XID
- Root extent identified by offset
- Branch extents contain:
[sort_key_fields...][child_extent_offset: uint64] - Leaf extents contain: actual table data or index entries
- Binary search within extents for O(log n) lookups
Mutable BTree (include/storage/mutable_btree.hh):
- Copy-on-write modifications at target XID
- In-memory page cache with configurable size
- Dirty pages flushed hierarchically (children before parents)
- Supports insert, remove, upsert operations
- Auto-splits extents when they exceed size threshold
Branch Extents:
- Schema: sort key columns +
child_offsetfield - Each row represents separator key + pointer to child extent
- Used for navigation during tree traversal
Leaf Extents:
- For primary index: full table rows with all columns
- For secondary index: index columns + extent_id + row_id for lookup
The BTree::Iterator performs in-order traversal:
- Maintains path from root to current leaf (via
Nodeobjects) -
++operator: advances within extent or navigates to next leaf via parent -
--operator: similar backward traversal - Supports binary search via
lower_bound(),upper_bound(),find()
A Table represents a logical collection of rows with:
Primary Index (_primary_index):
- BTree storing full row data, sorted by primary key
- Schema includes all table columns
- Branch extents use primary key fields for navigation
- Leaf extents contain complete rows
Secondary Indexes (_secondary_indexes):
- Map from index_id → (BTree, column_positions)
- Schema: index columns + internal_row_id
- Used for queries with non-primary-key predicates
- Maintained alongside primary index
Look-Aside Index (_look_aside_index):
- Maps internal_row_id → (extent_id, row_position)
- Enables secondary index lookups to find actual rows
- Schema:
[internal_row_id][extent_id][row_offset]
Table::Iterator supports multiple iteration modes:
Primary iteration: Traverses BTree, loads data extents from extent_id pointers Secondary iteration: Uses secondary index + look-aside for data access Index-only iteration: Returns index entries without fetching full rows
Tables maintain metadata in a "roots" file:
Schema: [index_id: uint64][extent_id: uint64][last_internal_row_id: uint64]
One row per index tracks the root extent for each snapshot.
Provides write operations on tables:
Operations:
-
insert(): Add new row (append or positioned by primary key) -
update(): Modify existing row (copy-on-write) -
remove(): Delete row from extent and indexes -
upsert(): Insert or update based on primary key existence
Write Strategies:
- Empty table: Use
_empty_pagefor first extent - Append: Add to last extent if sorted correctly
- Lookup: Use primary index to find target extent
- Direct: Write to specified extent_id (from write cache)
Index Maintenance:
- Dirty pages trigger
_flush_handler()callback - Invalidates old index entries via
_invalidate_indexes() - Populates new entries via
_flush_and_populate_indexes() - Maintains consistency between primary and secondary indexes
MutableTable::finalize():
- Flushes all dirty pages (primary data + all index BTrees)
- Returns
TableMetadatawith new root extent offsets - Optionally calls
sync_data_and_indexes()for durability
The StorageCache provides unified page management:
Page States:
-
CLEAN: Read from disk, unmodified, can be evicted -
DIRTY: Modified by MutableTable, needs flush -
MUTABLE: Actively being modified -
FLUSHING: Asynchronous write in progress -
INVALID: Evicted or replaced
Page Structure:
- Wraps one or more
Extentobjects - Tracks extent_id, cache_id, file path
- Reference counting for multi-user access
- LRU eviction for clean pages
Cache Operations:
-
get(): Retrieve page by (database_id, file, extent_id, xid) -
get_or_create(): Get existing or create new mutable page - Copy-on-write: CLEAN pages copied to MUTABLE on first write
- Hierarchical flushing: Dirty pages written with updated child pointers
RAII wrapper managing page lifecycle:
- Acquires page from cache on construction
- Releases back to cache on destruction
- Supports read (shared) and write (exclusive) locks
- Transparent access to underlying
Pageobject
All modifications create new extent versions:
- Read extent at access_xid
- Copy to new MUTABLE extent
- Apply modifications
- Flush with target_xid in header
- Update parent extent with new child pointer
Extents form version chains via prev_offset:
extent@xid=100 → extent@xid=50 → extent@xid=20 → ...
Old versions remain for queries at earlier XIDs (cleaned by Vacuumer).
-
Tableobjects snapshot at specific XID - BTree navigation uses extent headers to find correct version
- Time-travel queries access historical extent versions
- Write transactions create new snapshot at target_xid
Schemas are versioned alongside data:
ExtentSchema (include/storage/schema.hh):
- Defines column order, types, nullability
- Calculates fixed row layout and field offsets
- Generates Field accessors for reading/writing
Schema Changes:
- New columns: Added with default values or NULL
- Dropped columns: Marked as non-existent in schema
- Type changes: Handled via schema conversion during page reads
- Schema metadata tracked per XID range
Tables stored in directories:
{table_base}/{db_id}/{table_id}/{snapshot_xid}/
├── raw - Primary table data extents
├── {index_id}.idx - Secondary index extents (one file per index)
├── look_aside - Look-aside index for row lookup
└── roots - Index root metadata (system tables only)
Each file is an append-only sequence of extents written at increasing offsets.
Springtail's storage layer provides:
- Extent-level: Compact fixed+variable row storage with deduplication
- Field-level: Type-safe column access with null/undefined tracking
- Index-level: B+ tree organization with copy-on-write modifications
- Table-level: Primary + secondary indexes with look-aside support
- Cache-level: Unified page management with MVCC snapshots
This architecture enables efficient read-only OLTP-style queries while supporting MVCC for time-travel and concurrent access.