mathis-k/codegen-db-engine

IMLAB — Codegen Database Engine

A research-grade, in-memory SQL database engine written in modern C++23 that executes queries via runtime code generation. When a SELECT query is submitted, IMLAB generates C++ source for it, compiles that source into a native shared library, loads the library dynamically, and runs it. In the example session below, the whole cycle takes a few hundred milliseconds, almost all of it compile time.


How It Works

SQL Input
   │
   ▼
┌──────────────────────────────────────────────┐
│  Lexer (Flex) + Parser (Bison)  → AST        │
└──────────────────────────┬───────────────────┘
                           │
┌──────────────────────────▼───────────────────┐
│  Semantic Analysis  → Operator Tree          │
└──────────────────────────┬───────────────────┘
                           │
┌──────────────────────────▼───────────────────┐
│  Optimizer  (Predicate Pushdown)             │
└──────────────────────────┬───────────────────┘
                           │
┌──────────────────────────▼───────────────────┐
│  Code Generation  → query_N.cc               │
│  Compilation      → query_N.so  (c++ -O2)    │
│  dlopen() + dlsym("runQuery") + execute      │
└──────────────────────────────────────────────┘

Each SELECT statement is lowered to a Volcano-style operator tree (TableScan → Select → InnerJoin → Print). The tree then generates a self-contained C++ translation unit, which is compiled and loaded at runtime using dlopen/dlsym. Profiling timings (codegen / compile / execute) are printed after every query.
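The lowering step above can be illustrated with a minimal sketch of the produce/consume pattern named in Operator.h. The class layout, the member names, and the shape of the emitted code are all assumptions made for illustration; the real operators also emit TBB loops and typed column accesses, which this sketch omits.

```cpp
#include <cassert>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical, simplified operator interface: produce() emits the code that
// generates tuples, consume() emits the code that handles one tuple.
struct Operator {
    Operator* parent = nullptr;
    virtual ~Operator() = default;
    virtual void produce(std::ostream& out) = 0;
    virtual void consume(std::ostream& out) = 0;
};

// Leaf: emits a loop over the table and lets the parent fill in the body.
struct TableScan : Operator {
    std::string table;
    explicit TableScan(std::string t) : table(std::move(t)) {}
    void produce(std::ostream& out) override {
        assert(parent != nullptr);
        out << "for (auto& row : db." << table << ") {\n";
        parent->consume(out);
        out << "}\n";
    }
    void consume(std::ostream&) override {}  // a scan has no child to consume from
};

// Filter: wraps the parent's per-tuple code in an if-statement.
struct Select : Operator {
    std::unique_ptr<Operator> child;
    std::string predicate;
    Select(std::unique_ptr<Operator> c, std::string p)
        : child(std::move(c)), predicate(std::move(p)) { child->parent = this; }
    void produce(std::ostream& out) override { child->produce(out); }
    void consume(std::ostream& out) override {
        assert(parent != nullptr);
        out << "if (" << predicate << ") {\n";
        parent->consume(out);
        out << "}\n";
    }
};

// Root: emits the statement that writes one column to the output stream.
struct Print : Operator {
    std::unique_ptr<Operator> child;
    std::string column;
    Print(std::unique_ptr<Operator> c, std::string col)
        : child(std::move(c)), column(std::move(col)) { child->parent = this; }
    void produce(std::ostream& out) override { child->produce(out); }
    void consume(std::ostream& out) override {
        out << "out << " << column << " << '\\n';\n";
    }
};

// Generates source for: select no_w_id from neworder where no_d_id = 1;
std::string generateQuery() {
    auto scan = std::make_unique<TableScan>("neworder");
    auto select = std::make_unique<Select>(std::move(scan), "row.no_d_id == 1");
    Print root(std::move(select), "row.no_w_id");
    std::ostringstream out;
    out << "void runQuery(Database& db, std::ostream& out) {\n";
    root.produce(out);
    out << "}\n";
    return out.str();
}
```

Calling generateQuery() returns a function body in which the scan loop encloses the filter and the filter encloses the print statement. Control flows down the tree through produce() and back up through consume(), so each operator contributes only its own fragment.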


Features

  • Runtime Query Compilation — queries are JIT-compiled to native shared objects; generated sources are logged to codegen/queries/ and logs/generated_queries.log
  • Parallel Execution — generated code uses Intel TBB (tbb::parallel_for, enumerable_thread_specific) for data-parallel table scans and join probing
  • Relational Operators — TableScan, Select (filter), InnerJoin (hash join), Print
  • Predicate Pushdown — optimizer moves WHERE filters as close to table scans as possible before codegen
  • Full SQL Parser — Flex/Bison grammar covering CREATE TABLE, COPY TABLE FROM, and SELECT … FROM … WHERE … AND …
  • Rich Type System — Bool, Integer, Numeric(p,s), Text/varchar, Timestamp; stored in a std::variant-based Register
  • TPC-C Dataset — full TPC-C schema (warehouse, district, customer, orders, …) with CSV loaders and sample data
  • Interactive Shell — imlabdb REPL with meta-commands (.help, .tpcc, .tables, .schema, .count, .logs)
  • Test Suite — Google Test unit and integration tests for every subsystem
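To make the std::variant-based Register from the type-system bullet more concrete, here is a minimal sketch of such a value container. The class below is hypothetical: it covers only NULL, Bool, Integer, and Text, and the factory and accessor names are invented for illustration. The real Register also has to carry Numeric and Timestamp values.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>

// Hypothetical value container; the actual imlab Register may differ.
class Register {
public:
    Register() = default;  // holds std::monostate, i.e. SQL NULL

    // Named factories avoid overload traps such as a string literal
    // converting to bool instead of std::string.
    static Register boolean(bool b) { Register r; r.value_.emplace<bool>(b); return r; }
    static Register integer(std::int64_t i) { Register r; r.value_.emplace<std::int64_t>(i); return r; }
    static Register text(std::string s) { Register r; r.value_.emplace<std::string>(std::move(s)); return r; }

    bool isNull() const { return std::holds_alternative<std::monostate>(value_); }

    // Typed access; the caller must know the stored alternative.
    std::int64_t asInteger() const {
        assert(std::holds_alternative<std::int64_t>(value_));
        return std::get<std::int64_t>(value_);
    }

    // Registers holding different alternatives never compare equal.
    friend bool operator==(const Register& a, const Register& b) { return a.value_ == b.value_; }

    // Dispatches on the stored alternative at compile time via std::visit.
    std::string toString() const {
        return std::visit([](const auto& v) -> std::string {
            using T = std::decay_t<decltype(v)>;
            if constexpr (std::is_same_v<T, std::monostate>) return "NULL";
            else if constexpr (std::is_same_v<T, bool>) return v ? "true" : "false";
            else if constexpr (std::is_same_v<T, std::int64_t>) return std::to_string(v);
            else return v;
        }, value_);
    }

private:
    std::variant<std::monostate, bool, std::int64_t, std::string> value_;
};
```

For example, Register::integer(42).toString() yields "42", and a default-constructed Register reports isNull(). A tagged union like this lets operators pass values around without knowing column types in advance, at the cost of a type check on each access.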

Project Structure

codegen-db-engine/
├── imlab/
│   ├── algebra/          # Relational operators & expression nodes
│   │   ├── Operator.h    # Abstract base; prepare/produce/consume/optimize
│   │   ├── TableScan     # Full table scan; emits parallel_for loop
│   │   ├── Select        # Filter predicate
│   │   ├── InnerJoin     # Hash-join build/probe
│   │   ├── Print         # Output operator; collects TBB thread buffers
│   │   ├── IU / IURef    # Information Unit (column reference) nodes
│   │   ├── Expression    # Abstract expression base
│   │   ├── BinaryExpression  # AND / comparison operators
│   │   └── Const         # Literal constant node
│   ├── infra/
│   │   ├── Register.h/cc # Universal value container (std::variant)
│   │   ├── HashTable.h   # Open-addressing hash table used in joins
│   │   ├── Hash.h        # Hashing utilities
│   │   ├── Bits.h        # Bit manipulation helpers
│   │   ├── Helper.h      # Name generator for codegen symbols
│   │   ├── Template.h    # Template utilities
│   │   ├── Defer.h       # RAII defer utility
│   │   └── Types.h/cc    # Type descriptors
│   ├── types/            # Concrete value types (Bool, Integer, Numeric, Text, Timestamp)
│   ├── parser/
│   │   ├── scanner.l     # Flex lexer (case-insensitive, location-tracked)
│   │   ├── parser.y      # Bison LALR(1) grammar
│   │   ├── AST.h/cc      # AST node definitions
│   │   └── ParseContext  # Entry point: Parse(istream) → vector<AST>
│   ├── semana/
│   │   └── SemanticAnalysis  # Validates & converts AST → Statement + Operator tree
│   ├── optimizer/
│   │   └── OptimizerPass.h   # Enum of optimization passes
│   ├── statement/
│   │   ├── Statement.h        # Abstract base (run, optimize, setSQL)
│   │   ├── QueryStatement     # SELECT: codegen → compile → dlopen → execute
│   │   ├── CreateTableStatement
│   │   └── CopyTableStatement
│   ├── runtime/
│   │   └── RuntimeException   # Runtime error type
│   └── Database.h/cc         # In-memory store: Table (rows + PK index) + Database
├── tools/
│   ├── imlabdb.cc            # Interactive SQL shell (mathiSQL prompt)
│   └── statictpcc/TPCC.cc/h  # TPC-C bulk loader
├── codegen/
│   └── queries/              # Generated query_N.cc / query_N.so (ephemeral)
├── data/
│   ├── schema.sql            # Full TPC-C DDL
│   └── tpcc/                 # TPC-C CSV data files (*.tbl)
├── logs/
│   └── generated_queries.log # Append-only log of every compiled query
├── test/                     # Google Test suite
│   ├── infra/                # HashTable tests
│   ├── types/                # Bool / Integer / Numeric / Text / Timestamp tests
│   ├── parser/               # Schema & query parser tests
│   ├── semana/               # Semantic analysis tests
│   ├── optimizer/            # Optimizer tests
│   └── tpcc/                 # End-to-end TPC-C integration tests
├── ARCHITECTURE.md           # In-depth architectural documentation
└── CMakeLists.txt
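The tree lists HashTable.h as an open-addressing hash table used in joins, and InnerJoin as a hash-join build/probe. The sketch below shows how those two pieces could fit together. ProbeTable, hashJoin, linear probing, and the restriction to unique integer build keys are all assumptions for illustration; the real table may use another probing scheme and support duplicate keys.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical open-addressing table with linear probing.
class ProbeTable {
public:
    // Capacity is a power of two of at least 2 * expected, so the load factor
    // stays at or below 0.5 while no more than `expected` keys are inserted.
    explicit ProbeTable(std::size_t expected) {
        std::size_t cap = 2;
        while (cap < 2 * expected) cap *= 2;
        slots_.resize(cap);
    }

    // Inserts or overwrites; at least one slot is kept empty so probing terminates.
    void insert(std::int64_t key, std::int64_t value) {
        assert(size_ < slots_.size() - 1 && "sketch does not grow the table");
        std::size_t i = index(key);
        while (slots_[i].has_value() && slots_[i]->first != key) i = next(i);
        if (!slots_[i].has_value()) ++size_;
        slots_[i] = std::make_pair(key, value);
    }

    // Walks the probe sequence until the key or an empty slot is found.
    std::optional<std::int64_t> find(std::int64_t key) const {
        std::size_t i = index(key);
        while (slots_[i].has_value()) {
            if (slots_[i]->first == key) return slots_[i]->second;
            i = next(i);
        }
        return std::nullopt;
    }

private:
    std::size_t index(std::int64_t key) const {
        return std::hash<std::int64_t>{}(key) & (slots_.size() - 1);
    }
    std::size_t next(std::size_t i) const { return (i + 1) & (slots_.size() - 1); }

    std::vector<std::optional<std::pair<std::int64_t, std::int64_t>>> slots_;
    std::size_t size_ = 0;
};

using Row = std::pair<std::int64_t, std::int64_t>;  // (join key, payload)

// Build phase fills the table from one input; probe phase streams the other.
// Returns (build payload, probe payload) for every probe row with a match.
std::vector<Row> hashJoin(const std::vector<Row>& build, const std::vector<Row>& probe) {
    ProbeTable table(build.size());
    for (const auto& [key, payload] : build) table.insert(key, payload);
    std::vector<Row> result;
    for (const auto& [key, payload] : probe) {
        if (auto match = table.find(key)) result.emplace_back(*match, payload);
    }
    return result;
}
```

Building on the smaller input keeps the table compact. The probe side only reads the table, which is what makes probing easy to run in parallel once the build phase has finished.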

Requirements

Tool / Library                             Notes
C++23 compiler (GCC ≥ 13 or Clang ≥ 17)    Must be on $PATH as c++ or set via $CXX
CMake ≥ 3.10
Bison ≥ 3.0                                macOS: brew install bison
Flex                                       macOS: brew install flex
Intel TBB                                  libtbb-dev on Debian/Ubuntu
pthread                                    Usually pre-installed

Vendored (fetched automatically by CMake):

  • fmtlib — string formatting
  • gflags — command-line flags
  • GoogleTest / GoogleMock — test framework

Build

# Configure (debug)
cmake -S . -B cmake-build-debug -DCMAKE_BUILD_TYPE=Debug

# Configure (release — required for the REPL and codegen path)
cmake -S . -B cmake-build-release -DCMAKE_BUILD_TYPE=Release

# Compile (nproc is Linux-specific; on macOS use $(sysctl -n hw.ncpu) instead)
cmake --build cmake-build-release -j$(nproc)

This produces:

  • cmake-build-release/libimlab.so — core engine shared library
  • cmake-build-release/imlabdb — interactive shell
  • cmake-build-release/tester — test runner

Running the Shell

cd cmake-build-release
./imlabdb
Type .help for commands or any SQL statement ending with ';'
mathiSQL> .help
Available commands:
  .help                Show this help message
  .tpcc                Load TPC-C dataset
  .tables              List all tables
  .schema  <table>     Show table schema
  .count   <table>     Show row count of a table
  .logs [n]            Show last n generated-query log entries (default: 5)
  .exit                Exit shell

Example session

mathiSQL> .tpcc
✅ Loaded TPC-C dataset in 312.45 ms

mathiSQL> select * from neworder;
no_o_id, no_d_id, no_w_id
2101, 1, 1
2101, 2, 1
...
(900 rows)

⏱  Codegen: 0.12 ms | Compile: 423.50 ms | Execute: 1.83 ms | Total: 425.45 ms

mathiSQL> select no_w_id from neworder where no_d_id = 1;
...

mathiSQL> .logs 1
=== Generated Query 1 ===
timestamp: 2026-05-28 13:20:52
source: .../codegen/queries/query_1.cc
shared_object: .../codegen/queries/query_1.so
sql:
select * from neworder;

Supported SQL syntax

-- DDL
CREATE TABLE name (col type [NOT NULL] [, ...] [, PRIMARY KEY (col [, ...])]);
COPY TABLE name FROM 'path' DELIMITER 'delim';

-- DML
SELECT * FROM table [, table]* [WHERE expr [AND expr]*];
SELECT col [, col]* FROM table [, table]* [WHERE expr [AND expr]*];

Multi-table FROM clauses are handled as cross-joins narrowed by WHERE predicates; an explicit InnerJoin operator is generated when join predicates are detected.
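One way to picture the distinction drawn above is to split the WHERE conjuncts by the number of tables each one references. A predicate on a single table can be attached to that table's scan, while a predicate spanning several tables is a join predicate. The Predicate and Plan types and the classify function are hypothetical; the engine's semantic analysis and optimizer may organize this differently.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical representation of one conjunct of a WHERE clause.
struct Predicate {
    std::string text;              // e.g. "no_d_id = 1"
    std::set<std::string> tables;  // tables whose columns the predicate references
};

struct Plan {
    std::map<std::string, std::vector<std::string>> scanFilters;  // filters per table
    std::vector<std::string> joinPredicates;                      // multi-table predicates
};

// Single-table conjuncts become scan filters (the pushdown case);
// conjuncts that reference several tables are kept as join predicates.
Plan classify(const std::vector<Predicate>& conjuncts) {
    Plan plan;
    for (const auto& p : conjuncts) {
        assert(!p.tables.empty() && "constant-only predicates are not handled in this sketch");
        if (p.tables.size() == 1) {
            plan.scanFilters[*p.tables.begin()].push_back(p.text);
        } else {
            plan.joinPredicates.push_back(p.text);
        }
    }
    return plan;
}
```

For a query such as select no_o_id from neworder, orders where no_d_id = 1 and no_o_id = o_id (o_id is the standard TPC-C order id column, used here only as an illustration), the first conjunct lands in scanFilters["neworder"] and the second in joinPredicates.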


Running Tests

cd cmake-build-release
ctest --output-on-failure
# or directly:
./tester

Code Generation Deep Dive

When a SELECT is executed, QueryStatement::run():

  1. Calls tree->prepare() / tree->produce() on the operator tree, which writes C++ source into an std::ostringstream.
  2. Writes the source to codegen/queries/query_N.cc.
  3. Invokes the system C++ compiler:
    c++ -std=c++23 -O2 -fPIC -shared -I<project_root> query_N.cc -o query_N.so -L<build_dir> libimlab.so -ltbb
    
  4. dlopens the resulting .so, resolves runQuery(Database&, ostream&) via dlsym, and calls it.
  5. Unloads the library and removes the .so after execution; the .cc source is kept for inspection.
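Steps 3 to 5 can be sketched with the POSIX dynamic-loading API. The helper names below are hypothetical, the functions are written header-style (inline), and the sketch assumes that the generated code exports runQuery with C linkage so that dlsym can find it under its plain name. Whether the engine does this or looks up a mangled name is not stated here. Paths are not shell-quoted, so this is unsuitable for untrusted input.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <dlfcn.h>
#include <ostream>
#include <stdexcept>
#include <string>

class Database;  // opaque in this sketch; defined by the engine

// Assembles the compiler invocation in the same shape as the command listed above.
inline std::string buildCompileCommand(const std::string& projectRoot,
                                       const std::string& buildDir,
                                       const std::string& source,
                                       const std::string& sharedObject) {
    assert(!source.empty() && !sharedObject.empty());
    return "c++ -std=c++23 -O2 -fPIC -shared -I" + projectRoot + " " + source +
           " -o " + sharedObject + " -L" + buildDir + " libimlab.so -ltbb";
}

using QueryFn = void (*)(Database&, std::ostream&);

// Compiles, loads, runs, and cleans up one generated query.
// sharedObject should contain a '/', otherwise dlopen searches the library path.
inline void compileLoadAndRun(const std::string& command,
                              const std::string& sharedObject,
                              Database& db, std::ostream& out) {
    if (std::system(command.c_str()) != 0) {
        throw std::runtime_error("compilation failed");
    }
    void* handle = dlopen(sharedObject.c_str(), RTLD_NOW);
    if (handle == nullptr) {
        const char* err = dlerror();
        throw std::runtime_error(err != nullptr ? err : "dlopen failed");
    }
    // Assumption: the symbol is exported as extern "C" and therefore unmangled.
    void* symbol = dlsym(handle, "runQuery");
    if (symbol == nullptr) {
        dlclose(handle);
        throw std::runtime_error("runQuery not found in " + sharedObject);
    }
    reinterpret_cast<QueryFn>(symbol)(db, out);
    dlclose(handle);                    // unload the library
    std::remove(sharedObject.c_str());  // the .cc source is kept, the .so is not
}
```

If runQuery throws, this sketch leaks the handle. A scope guard, such as the RAII defer utility listed under infra/, would be a natural way to guarantee the dlclose call.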

Generated code uses TBB parallel scans and thread-local output buffers for cache-friendly, lock-free printing.
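The idea behind thread-local output buffers can be shown without TBB. The sketch below deliberately swaps tbb::parallel_for and enumerable_thread_specific for plain std::thread and one buffer per worker, so it illustrates the technique only and is not the code the engine generates. The function name and the int-only rows are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Scans `rows` with `workers` threads. Each thread filters its own contiguous
// chunk and writes the indices of matching rows into a private buffer, so no
// lock is needed during the scan. Buffers are merged afterwards on one thread.
std::string parallelScan(const std::vector<int>& rows, int wanted, std::size_t workers) {
    assert(workers > 0);
    std::vector<std::ostringstream> buffers(workers);
    std::vector<std::thread> threads;
    const std::size_t chunk = (rows.size() + workers - 1) / workers;  // ceiling division

    for (std::size_t w = 0; w < workers; ++w) {
        threads.emplace_back([&rows, &buffers, wanted, chunk, w] {
            const std::size_t begin = std::min(w * chunk, rows.size());
            const std::size_t end = std::min(begin + chunk, rows.size());
            for (std::size_t i = begin; i < end; ++i) {
                if (rows[i] == wanted) buffers[w] << i << '\n';
            }
        });
    }
    for (auto& t : threads) t.join();

    // Merging in worker order preserves row order because chunks are contiguous.
    std::string out;
    for (const auto& b : buffers) out += b.str();
    return out;
}
```

With TBB, the work distribution is dynamic and the buffers are created lazily per thread, so the merged output order is generally not deterministic. The static chunking here keeps the result reproducible for the sake of the example.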


Architecture

See ARCHITECTURE.md for a full deep-dive covering:

  • Volcano-style operator model
  • Register / type-variant design
  • Parser (Flex + Bison) internals
  • Semantic analysis & scope management
  • Optimizer extensibility
  • Storage model & primary-key indexing
  • Memory management & RAII conventions
  • Future roadmap (cost-based optimizer, disk persistence, MVCC, …)

License

This project is a research/educational database engine. No license file is included — contact the author for usage terms.
