Skip to content

Latest commit

 

History

History
395 lines (295 loc) · 11.7 KB

File metadata and controls

395 lines (295 loc) · 11.7 KB

Architecture Overview

This document provides a deeper dive into the IMLAB database engine architecture for developers and researchers.

System Design Philosophy

IMLAB follows a classic database engine architecture with clear separation of concerns:

┌─────────────────────────────────────────────────────────────────┐
│                      Interactive Shell (imlabdb)                 │
└──────────────────────────┬──────────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────────┐
│                   Query Processing Pipeline                      │
├─────────────────────────────────────────────────────────────────┤
│  1. Lexical Analysis (Flex Scanner)                              │
│  2. Syntax Analysis (Bison Parser) → AST                         │
│  3. Semantic Analysis → Statement Objects                        │
│  4. Query Optimization → Optimized Operator Tree                 │
│  5. Code Generation & Execution                                  │
└──────────────────────────┬──────────────────────────────────────┘
                           │
┌──────────────────────────▼──────────────────────────────────────┐
│              Database Storage & Execution Engine                 │
├─────────────────────────────────────────────────────────────────┤
│  • Table Storage (Row-based with indexing)                       │
│  • Type System (Type checking & conversion)                      │
│  • Relational Algebra Operators (Scan, Filter, Join)            │
│  • Expression Evaluation                                         │
└─────────────────────────────────────────────────────────────────┘

Query Execution Model

Pipeline Overview

  1. Parse Phase: SQL text → AST

    • Flex lexer tokenizes input
    • Bison parser builds AST according to grammar
    • Error handling at parse time
  2. Analysis Phase: AST → Semantic-Correct Statement

    • Semantic analyzer validates query
    • Type checking and inference
    • Scope resolution (table and column references)
    • Builds Statement object with Operator tree
  3. Optimization Phase: Statement → Optimized Statement

    • Currently implements predicate pushdown
    • Moves WHERE clause filters closer to table scans
    • Extensible framework for future passes
  4. Execution Phase: Statement → Results

    • Operator tree traversal
    • Pull-based iterator model
    • Each operator produces/consumes tuples

Relational Algebra Operators

IMLAB implements a classic volcano-style operator model:

class Operator {
    virtual void prepare(...);     // Initialize operator
    virtual void produce(...);     // Generate code/signals
    virtual void consume(...);     // Consume input
    virtual std::set<IU*> collectIUs();  // Collect columns
};

Operator Hierarchy

  • TableScan: Scan table
  • Unary Operators:
    • Select: Filter rows (WHERE clause)
    • Print: Output results
  • Binary Operators:
    • InnerJoin: Join two tables

Data Type System

Type Hierarchy

Register (variant-based container)
├── Bool (true/false)
├── Integer (32-bit signed)
├── Numeric (fixed-point decimal)
├── Text (variable-length string)
└── Timestamp (date/time)

Register Class

The Register class acts as a universal value container using std::variant:

using Value = std::variant<Bool, Integer, Numeric, Text, Timestamp>;

class Register {
    Type type;           // Type descriptor
    Value value;         // Actual value
    
    // Operations
    std::string toString() const;
    uint64_t hash() const;
    bool operator==(const Register& other) const;
};

This design enables:

  • Type-safe value handling
  • Efficient polymorphism without vtable overhead
  • Easy serialization and comparison

Type Conversion

Automatic type coercion follows SQL semantics:

  • Integer → Numeric (precision preserved)
  • Text → Integer/Numeric (parsing with error handling)
  • Timestamp → Custom conversion rules

Parser Implementation

Grammar Structure

The parser follows standard SQL DML/DDL grammar:

DDL:

CREATE TABLE name (
    column_def [, column_def]*
    [, PRIMARY KEY (column_id [, column_id]*)]
);

COPY TABLE name FROM 'file' DELIMITER 'delim';

DML:

SELECT [*|column_ref [, column_ref]*]
FROM table_ref [, table_ref]*
[WHERE condition [AND condition]*];

Flex Scanner Configuration

  • Case-insensitive: Keywords and identifiers normalized to lowercase
  • Batch mode: Optimized for stream parsing
  • Input streams: Reads from std::istream instead of files
  • Location tracking: Maintains line/column for error reporting

Bison Parser Configuration

  • LALR(1): Standard lookahead parser
  • Symbol types: Enables type-safe token/rule actions
  • Error recovery: Attempts to continue after syntax errors
  • Location tracking: Integrated with scanner locations

Semantic Analysis

Scope Management

The semantic analyzer maintains scope information:

class Scope {
    std::unordered_map<std::string, ColumnInfo> columns;
    std::unordered_map<std::string, TableInfo> tables;
};

Analysis Phases

  1. Table Validation: Verify table and column existence
  2. Type Checking: Ensure type compatibility in expressions
  3. Column Resolution: Map column references to actual columns
  4. Ambiguity Detection: Detect ambiguous column references in joins

Statement Building

Converts AST to executable Statement objects:

std::unique_ptr<Statement> buildStatement(AST& ast) {
    if (ast.type() == CreateTableType) {
        return analyzeCreateTable(...);
    } else if (ast.type() == QueryType) {
        return analyzeSelectStmt(...);
    }
    // ...
}

Query Optimization

Predicate Pushdown

The primary optimization pass moves filter predicates down the operator tree:

Before:

Print
└── InnerJoin (c.id = o.customer_id)
    ├── Select (name = 'John')   ← Filter on warehouse table
    │   └── TableScan (warehouse)
    └── TableScan (customer)

After Pushdown:

Print
└── InnerJoin (c.id = o.customer_id)
    ├── TableScan (warehouse)
    └── Select (name = 'John')   ← Pushed closer to table
        └── TableScan (customer)

Benefits:

  • Reduces input rows to downstream operators
  • Leverages index lookups (if available in future)
  • Significant performance improvement for selective predicates

Extensibility

The optimizer framework supports multiple passes:

enum class OptimizerPass {
    PredicatePushdown,  // Current
    // Future passes:
    // - ProjectionPushdown
    // - JoinOrdering
    // - ConstantFolding
};

Storage Model

In-Memory Table Storage

Tables are stored in-memory with row-based layout:

class Table {
    std::vector<std::vector<Register>> rows;     // Row storage
    std::unordered_map<std::string, size_t> indexMap;  // Primary key index
    std::unordered_map<std::string, size_t> columnMap; // Column positions
};

Key Storage

Primary keys are stringified for efficient indexing:

std::string makeKeyFromValues(const std::vector<Register>& keyValues) {
    std::string key;
    for (const auto& val : keyValues) {
        key += val.toString() + "|";
    }
    return key;
}

This enables:

  • O(1) row lookup by primary key
  • Composite key support
  • String-based hashing

Testing Strategy

Test Organization

test/
├── infra/        # Infrastructure components
├── types/        # Data type operations
├── parser/       # Parser correctness
├── optimizer/    # Optimization passes
├── semana/       # Semantic analysis
└── tpcc/         # Integration tests with TPC-C data

Test Categories

  1. Unit Tests: Individual component functionality
  2. Integration Tests: End-to-end query execution
  3. Regression Tests: TPC-C workload validation
  4. Performance Tests: Benchmark queries

TPC-C Integration

Full TPC-C schema and sample data enables:

  • Real-world query validation
  • Performance benchmarking
  • Complex join scenarios

Memory Management

Smart Pointers

  • std::unique_ptr: Operator trees, statements (clear ownership)
  • std::shared_ptr: Register references in expressions (optional)
  • std::vector: Contiguous storage for rows

RAII Principles

All resources managed through RAII:

  • Temporary file handles automatically close
  • Parser state cleaned up in destructors
  • No manual delete or free calls

Defer Utility

IMLAB includes a compile-time Defer utility for cleanup:

Defer cleanup([&]() {
    scanner.endScan();
});

Performance Considerations

Optimization Techniques

  1. Type Variant: std::variant provides zero-overhead abstraction
  2. String Interning: Common strings cached (future enhancement)
  3. Hash-based Join: O(1) average case join performance
  4. Native int32: Direct integer operations without indirection

Profiling

Database supports profiling mode:

db.profilingEnabled = true;
// Execution times reported for each operation

Code Quality Standards

Modern C++ (C++23)

  • Structured bindings: auto [key, value] = ...
  • Ranges library: std::views::filter, std::views::keys
  • Concepts: Type constraints (where applicable)
  • Three-way comparison: <=> operator

Compiler Flags

  • -march=native: CPU-specific optimizations
  • -O3: Maximum optimization in release builds
  • -Wall -Wextra -Wpedantic: Strict warnings
  • Clang-tidy: Static analysis integration

Code Organization

  • Clear namespace hierarchy: imlab::algebra::, imlab::parser::
  • Header-only components where appropriate
  • Const correctness throughout
  • Comprehensive documentation

Future Enhancements

Short-term

  1. Index Structures: B-tree indexes for faster lookups
  2. Query Caching: Cache frequently-run queries
  3. Better Error Messages: More detailed parse error reporting
  4. Extended Operators: Group By, Order By

Medium-term

  1. Cost-based Optimizer: Join ordering, selectivity estimation
  2. Disk Persistence: BerkeleyDB or custom format
  3. Multi-threading: Parallel operator execution
  4. Prepared Statements: Query reuse optimization

Long-term

  1. MVCC: Multi-version concurrency control
  2. Transactions: ACID guarantees
  3. Query Statistics: Runtime statistics collection
  4. Distributed Execution: Nodes and network communication

References

  • Database Systems: The Complete Book (2nd edition) - Ullman & Widom
  • Modern C++ Design - Andrei Alexandrescu
  • Flex & Bison Documentation
  • TPC-C Benchmark Specification