This document provides a deeper dive into the IMLAB database engine architecture for developers and researchers.
IMLAB follows a classic database engine architecture with clear separation of concerns:
┌─────────────────────────────────────────────────────────────────┐
│ Interactive Shell (imlabdb) │
└──────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────────┐
│ Query Processing Pipeline │
├─────────────────────────────────────────────────────────────────┤
│ 1. Lexical Analysis (Flex Scanner) │
│ 2. Syntax Analysis (Bison Parser) → AST │
│ 3. Semantic Analysis → Statement Objects │
│ 4. Query Optimization → Optimized Operator Tree │
│ 5. Code Generation & Execution │
└──────────────────────────┬──────────────────────────────────────┘
│
┌──────────────────────────▼──────────────────────────────────────┐
│ Database Storage & Execution Engine │
├─────────────────────────────────────────────────────────────────┤
│ • Table Storage (Row-based with indexing) │
│ • Type System (Type checking & conversion) │
│ • Relational Algebra Operators (Scan, Filter, Join) │
│ • Expression Evaluation │
└─────────────────────────────────────────────────────────────────┘
-
Parse Phase: SQL text → AST
- Flex lexer tokenizes input
- Bison parser builds AST according to grammar
- Error handling at parse time
-
Analysis Phase: AST → Semantic-Correct Statement
- Semantic analyzer validates query
- Type checking and inference
- Scope resolution (table and column references)
- Builds Statement object with Operator tree
-
Optimization Phase: Statement → Optimized Statement
- Currently implements predicate pushdown
- Moves WHERE clause filters closer to table scans
- Extensible framework for future passes
-
Execution Phase: Statement → Results
- Operator tree traversal
- Pull-based iterator model
- Each operator produces/consumes tuples
IMLAB implements a classic volcano-style operator model:
class Operator {
virtual void prepare(...); // Initialize operator
virtual void produce(...); // Generate code/signals
virtual void consume(...); // Consume input
virtual std::set<IU*> collectIUs(); // Collect columns
};- TableScan: Scan table
- Unary Operators:
- Select: Filter rows (WHERE clause)
- Print: Output results
- Binary Operators:
- InnerJoin: Join two tables
Register (variant-based container)
├── Bool (true/false)
├── Integer (32-bit signed)
├── Numeric (fixed-point decimal)
├── Text (variable-length string)
└── Timestamp (date/time)
The Register class acts as a universal value container using std::variant:
using Value = std::variant<Bool, Integer, Numeric, Text, Timestamp>;
class Register {
Type type; // Type descriptor
Value value; // Actual value
// Operations
std::string toString() const;
uint64_t hash() const;
bool operator==(const Register& other) const;
};This design enables:
- Type-safe value handling
- Efficient polymorphism without vtable overhead
- Easy serialization and comparison
Automatic type coercion follows SQL semantics:
- Integer → Numeric (precision preserved)
- Text → Integer/Numeric (parsing with error handling)
- Timestamp → Custom conversion rules
The parser follows standard SQL DML/DDL grammar:
DDL:
CREATE TABLE name (
column_def [, column_def]*
[, PRIMARY KEY (column_id [, column_id]*)]
);
COPY TABLE name FROM 'file' DELIMITER 'delim';
DML:
SELECT [*|column_ref [, column_ref]*]
FROM table_ref [, table_ref]*
[WHERE condition [AND condition]*];
- Case-insensitive: Keywords and identifiers normalized to lowercase
- Batch mode: Optimized for stream parsing
- Input streams: Reads from std::istream instead of files
- Location tracking: Maintains line/column for error reporting
- LALR(1): Standard lookahead parser
- Symbol types: Enables type-safe token/rule actions
- Error recovery: Attempts to continue after syntax errors
- Location tracking: Integrated with scanner locations
The semantic analyzer maintains scope information:
class Scope {
std::unordered_map<std::string, ColumnInfo> columns;
std::unordered_map<std::string, TableInfo> tables;
};- Table Validation: Verify table and column existence
- Type Checking: Ensure type compatibility in expressions
- Column Resolution: Map column references to actual columns
- Ambiguity Detection: Detect ambiguous column references in joins
Converts AST to executable Statement objects:
std::unique_ptr<Statement> buildStatement(AST& ast) {
if (ast.type() == CreateTableType) {
return analyzeCreateTable(...);
} else if (ast.type() == QueryType) {
return analyzeSelectStmt(...);
}
// ...
}The primary optimization pass moves filter predicates down the operator tree:
Before:
Print
└── InnerJoin (c.id = o.customer_id)
├── Select (name = 'John') ← Filter on warehouse table
│ └── TableScan (warehouse)
└── TableScan (customer)
After Pushdown:
Print
└── InnerJoin (c.id = o.customer_id)
├── TableScan (warehouse)
└── Select (name = 'John') ← Pushed closer to table
└── TableScan (customer)
Benefits:
- Reduces input rows to downstream operators
- Leverages index lookups (if available in future)
- Significant performance improvement for selective predicates
The optimizer framework supports multiple passes:
enum class OptimizerPass {
PredicatePushdown, // Current
// Future passes:
// - ProjectionPushdown
// - JoinOrdering
// - ConstantFolding
};Tables are stored in-memory with row-based layout:
class Table {
std::vector<std::vector<Register>> rows; // Row storage
std::unordered_map<std::string, size_t> indexMap; // Primary key index
std::unordered_map<std::string, size_t> columnMap; // Column positions
};Primary keys are stringified for efficient indexing:
std::string makeKeyFromValues(const std::vector<Register>& keyValues) {
std::string key;
for (const auto& val : keyValues) {
key += val.toString() + "|";
}
return key;
}This enables:
- O(1) row lookup by primary key
- Composite key support
- String-based hashing
test/
├── infra/ # Infrastructure components
├── types/ # Data type operations
├── parser/ # Parser correctness
├── optimizer/ # Optimization passes
├── semana/ # Semantic analysis
└── tpcc/ # Integration tests with TPC-C data
- Unit Tests: Individual component functionality
- Integration Tests: End-to-end query execution
- Regression Tests: TPC-C workload validation
- Performance Tests: Benchmark queries
Full TPC-C schema and sample data enables:
- Real-world query validation
- Performance benchmarking
- Complex join scenarios
std::unique_ptr: Operator trees, statements (clear ownership)std::shared_ptr: Register references in expressions (optional)std::vector: Contiguous storage for rows
All resources managed through RAII:
- Temporary file handles automatically close
- Parser state cleaned up in destructors
- No manual delete or free calls
IMLAB includes a compile-time Defer utility for cleanup:
Defer cleanup([&]() {
scanner.endScan();
});- Type Variant: std::variant provides zero-overhead abstraction
- String Interning: Common strings cached (future enhancement)
- Hash-based Join: O(1) average case join performance
- Native int32: Direct integer operations without indirection
Database supports profiling mode:
db.profilingEnabled = true;
// Execution times reported for each operation- Structured bindings:
auto [key, value] = ... - Ranges library:
std::views::filter,std::views::keys - Concepts: Type constraints (where applicable)
- Three-way comparison:
<=>operator
-march=native: CPU-specific optimizations-O3: Maximum optimization in release builds-Wall -Wextra -Wpedantic: Strict warnings- Clang-tidy: Static analysis integration
- Clear namespace hierarchy:
imlab::algebra::,imlab::parser:: - Header-only components where appropriate
- Const correctness throughout
- Comprehensive documentation
- Index Structures: B-tree indexes for faster lookups
- Query Caching: Cache frequently-run queries
- Better Error Messages: More detailed parse error reporting
- Extended Operators: Group By, Order By
- Cost-based Optimizer: Join ordering, selectivity estimation
- Disk Persistence: BerkeleyDB or custom format
- Multi-threading: Parallel operator execution
- Prepared Statements: Query reuse optimization
- MVCC: Multi-version concurrency control
- Transactions: ACID guarantees
- Query Statistics: Runtime statistics collection
- Distributed Execution: Nodes and network communication
- Database Systems: The Complete Book (2nd edition) - Ullman & Widom
- Modern C++ Design - Andrei Alexandrescu
- Flex & Bison Documentation
- TPC-C Benchmark Specification