A research-grade, in-memory SQL database engine written in modern C++23 that executes queries via runtime code generation. When a SELECT query is submitted, IMLAB generates C++ source for it, compiles that source into a native shared library, loads the library dynamically, and runs it. In the sample session further down, the full round trip takes a few hundred milliseconds, nearly all of it spent in the compiler.
SQL Input
│
▼
┌──────────────────────────────────────────────┐
│ Lexer (Flex) + Parser (Bison) → AST │
└──────────────────────────┬───────────────────┘
│
┌──────────────────────────▼───────────────────┐
│ Semantic Analysis → Operator Tree │
└──────────────────────────┬───────────────────┘
│
┌──────────────────────────▼───────────────────┐
│ Optimizer (Predicate Pushdown) │
└──────────────────────────┬───────────────────┘
│
┌──────────────────────────▼───────────────────┐
│ Code Generation → query_N.cc │
│ Compilation → query_N.so (c++ -O2) │
│ dlopen() + dlsym("runQuery") + execute │
└──────────────────────────────────────────────┘
Each SELECT statement is lowered to a volcano-style operator tree (TableScan → Select → InnerJoin → Print), which then generates a self-contained C++ translation unit that is compiled and loaded at runtime using dlopen/dlsym. Profiling timings (codegen / compile / execute) are printed after every query.
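The way an operator tree turns into source text can be illustrated with a small sketch. The classes below (`Op`, `Scan`, `Filter`, `Emit`, `generateExample`) are hypothetical stand-ins and not the engine's actual `Operator.h` interface; they only show the produce/consume pattern, in which a parent asks its child to produce and the child calls back into the parent's consume for each tuple.

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical mini operator interface; prepare() and optimize() are omitted.
struct Op {
    Op* parent = nullptr;
    virtual ~Op() = default;
    virtual void produce(std::ostringstream& out) = 0;  // emit code that yields tuples
    virtual void consume(std::ostringstream& out) = 0;  // emit code that handles one tuple
};

// Leaf operator: emits the loop over a table.
struct Scan : Op {
    std::string table;
    explicit Scan(std::string t) : table(std::move(t)) {}
    void produce(std::ostringstream& out) override {
        out << "for (const auto& row : db." << table << ") {\n";
        parent->consume(out);
        out << "}\n";
    }
    void consume(std::ostringstream&) override {}  // a leaf has no child to consume from
};

// Filter operator: wraps the parent's code in an if statement.
struct Filter : Op {
    std::unique_ptr<Op> child;
    std::string predicate;
    Filter(std::unique_ptr<Op> c, std::string p)
        : child(std::move(c)), predicate(std::move(p)) { child->parent = this; }
    void produce(std::ostringstream& out) override { child->produce(out); }
    void consume(std::ostringstream& out) override {
        out << "  if (" << predicate << ") {\n";
        parent->consume(out);
        out << "  }\n";
    }
};

// Root operator: emits the statement that prints one column.
struct Emit : Op {
    std::unique_ptr<Op> child;
    std::string column;
    Emit(std::unique_ptr<Op> c, std::string col)
        : child(std::move(c)), column(std::move(col)) { child->parent = this; }
    void produce(std::ostringstream& out) override { child->produce(out); }
    void consume(std::ostringstream& out) override {
        out << "    out << row." << column << " << '\\n';\n";
    }
};

// Builds Scan -> Filter -> Emit and returns the generated source text.
std::string generateExample() {
    auto scan = std::make_unique<Scan>("neworder");
    auto filter = std::make_unique<Filter>(std::move(scan), "row.no_d_id == 1");
    Emit root(std::move(filter), "no_w_id");
    std::ostringstream out;
    root.produce(out);
    return out.str();
}
```

Calling `generateExample()` yields a nested loop-and-if fragment as a string. The real generator presumably also emits includes, the `runQuery` entry point, and the TBB loop structure, none of which are modeled here.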
- Runtime Query Compilation: queries are JIT-compiled to native shared objects; generated sources are logged to codegen/queries/ and logs/generated_queries.log
- Parallel Execution: generated code uses Intel TBB (tbb::parallel_for, enumerable_thread_specific) for data-parallel table scans and join probing
- Relational Operators: TableScan, Select (filter), InnerJoin (hash join), Print
- Predicate Pushdown: the optimizer moves WHERE filters as close to table scans as possible before codegen
- Full SQL Parser: Flex/Bison grammar covering CREATE TABLE, COPY TABLE FROM, and SELECT … FROM … WHERE … AND …
- Rich Type System: Bool, Integer, Numeric(p,s), Text/varchar, Timestamp; stored in a std::variant-based Register
- TPC-C Dataset: full TPC-C schema (warehouse, district, customer, orders, …) with CSV loaders and sample data
- Interactive Shell: imlabdb REPL with meta-commands (.help, .tpcc, .tables, .schema, .count, .logs)
- Test Suite: Google Test unit and integration tests for every subsystem
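To give a feel for what a std::variant-based value container looks like, here is a minimal sketch. `Value`, `equals`, and `toString` are hypothetical names chosen for illustration; the engine's actual Register class is not shown in this README and presumably also covers Numeric and Timestamp.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>
#include <variant>

// Hypothetical stand-in for a register: one slot that can hold any supported type.
using Value = std::variant<bool, std::int64_t, std::string>;

// Two values are equal only if they hold the same alternative and the same content;
// std::variant's operator== already implements exactly that.
inline bool equals(const Value& a, const Value& b) {
    return a == b;
}

// Renders a value as text by dispatching on the alternative currently stored.
inline std::string toString(const Value& v) {
    return std::visit(
        [](const auto& x) -> std::string {
            using T = std::decay_t<decltype(x)>;
            if constexpr (std::is_same_v<T, bool>) {
                return x ? "true" : "false";
            } else if constexpr (std::is_same_v<T, std::int64_t>) {
                return std::to_string(x);
            } else {
                return x;  // already a std::string
            }
        },
        v);
}
```

A variant keeps every column value in one uniform slot, at the cost of a type tag check on each access. That trade-off is commonly accepted in interpreted paths such as loading and printing.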
codegen-db-engine/
├── imlab/
│ ├── algebra/ # Relational operators & expression nodes
│ │ ├── Operator.h # Abstract base; prepare/produce/consume/optimize
│ │ ├── TableScan # Full table scan; emits parallel_for loop
│ │ ├── Select # Filter predicate
│ │ ├── InnerJoin # Hash-join build/probe
│ │ ├── Print # Output operator; collects TBB thread buffers
│ │ ├── IU / IURef # Information Unit (column reference) nodes
│ │ ├── Expression # Abstract expression base
│ │ ├── BinaryExpression # AND / comparison operators
│ │ └── Const # Literal constant node
│ ├── infra/
│ │ ├── Register.h/cc # Universal value container (std::variant)
│ │ ├── HashTable.h # Open-addressing hash table used in joins
│ │ ├── Hash.h # Hashing utilities
│ │ ├── Bits.h # Bit manipulation helpers
│ │ ├── Helper.h # Name generator for codegen symbols
│ │ ├── Template.h # Template utilities
│ │ ├── Defer.h # RAII defer utility
│ │ └── Types.h/cc # Type descriptors
│ ├── types/ # Concrete value types (Bool, Integer, Numeric, Text, Timestamp)
│ ├── parser/
│ │ ├── scanner.l # Flex lexer (case-insensitive, location-tracked)
│ │ ├── parser.y # Bison LALR(1) grammar
│ │ ├── AST.h/cc # AST node definitions
│ │ └── ParseContext # Entry point: Parse(istream) → vector<AST>
│ ├── semana/
│ │ └── SemanticAnalysis # Validates & converts AST → Statement + Operator tree
│ ├── optimizer/
│ │ └── OptimizerPass.h # Enum of optimization passes
│ ├── statement/
│ │ ├── Statement.h # Abstract base (run, optimize, setSQL)
│ │ ├── QueryStatement # SELECT: codegen → compile → dlopen → execute
│ │ ├── CreateTableStatement
│ │ └── CopyTableStatement
│ ├── runtime/
│ │ └── RuntimeException # Runtime error type
│ └── Database.h/cc # In-memory store: Table (rows + PK index) + Database
├── tools/
│ ├── imlabdb.cc # Interactive SQL shell (mathiSQL prompt)
│ └── statictpcc/TPCC.cc/h # TPC-C bulk loader
├── codegen/
│ └── queries/ # Generated query_N.cc / query_N.so (ephemeral)
├── data/
│ ├── schema.sql # Full TPC-C DDL
│ └── tpcc/ # TPC-C CSV data files (*.tbl)
├── logs/
│ └── generated_queries.log # Append-only log of every compiled query
├── test/ # Google Test suite
│ ├── infra/ # HashTable tests
│ ├── types/ # Bool / Integer / Numeric / Text / Timestamp tests
│ ├── parser/ # Schema & query parser tests
│ ├── semana/ # Semantic analysis tests
│ ├── optimizer/ # Optimizer tests
│ └── tpcc/ # End-to-end TPC-C integration tests
├── ARCHITECTURE.md # In-depth architectural documentation
└── CMakeLists.txt
| Tool / Library | Notes |
|---|---|
| C++23 compiler (GCC ≥ 13 or Clang ≥ 17) | Must be on $PATH as c++ or set via $CXX |
| CMake ≥ 3.10 | |
| Bison ≥ 3.0 | macOS: brew install bison |
| Flex | macOS: brew install flex |
| Intel TBB | libtbb-dev on Debian/Ubuntu |
| pthread | Usually pre-installed |
Vendored (fetched automatically by CMake):
- fmtlib — string formatting
- gflags — command-line flags
- GoogleTest / GoogleMock — test framework
# Configure (debug)
cmake -S . -B cmake-build-debug -DCMAKE_BUILD_TYPE=Debug
# Configure (release — required for the REPL and codegen path)
cmake -S . -B cmake-build-release -DCMAKE_BUILD_TYPE=Release
# Compile
cmake --build cmake-build-release -j$(nproc)
This produces:
- cmake-build-release/libimlab.so: core engine shared library
- cmake-build-release/imlabdb: interactive shell
- cmake-build-release/tester: test runner
cd cmake-build-release
./imlabdb
Type .help for commands or any SQL statement ending with ';'
mathiSQL> .help
Available commands:
.help Show this help message
.tpcc Load TPC-C dataset
.tables List all tables
.schema <table> Show table schema
.count <table> Show row count of a table
.logs [n] Show last n generated-query log entries (default: 5)
.exit Exit shell
mathiSQL> .tpcc
✅ Loaded TPC-C dataset in 312.45 ms
mathiSQL> select * from neworder;
no_o_id, no_d_id, no_w_id
2101, 1, 1
2101, 2, 1
...
(900 rows)
⏱ Codegen: 0.12 ms | Compile: 423.50 ms | Execute: 1.83 ms | Total: 425.45 ms
mathiSQL> select no_w_id from neworder where no_d_id = 1;
...
mathiSQL> .logs 1
=== Generated Query 1 ===
timestamp: 2026-05-28 13:20:52
source: .../codegen/queries/query_1.cc
shared_object: .../codegen/queries/query_1.so
sql:
select * from neworder;
-- DDL
CREATE TABLE name (col type [NOT NULL] [, ...] [, PRIMARY KEY (col [, ...])]);
COPY TABLE name FROM 'path' DELIMITER 'delim';
-- DML
SELECT * FROM table [, table]* [WHERE expr [AND expr]*];
SELECT col [, col]* FROM table [, table]* [WHERE expr [AND expr]*];
Multi-table FROM clauses are handled as cross-joins narrowed by WHERE predicates; an explicit InnerJoin operator is generated when join predicates are detected.
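One plausible way to separate join predicates from plain filters is to look at which tables each conjunct of the WHERE clause touches. The sketch below is an assumption about how such a classification could work, not the engine's actual semantic analysis; `Predicate`, `Plan`, `classify`, and `examplePlan` are hypothetical names, and the `orders.o_id` column is taken from the standard TPC-C schema for illustration.

```cpp
#include <map>
#include <optional>
#include <string>
#include <vector>

// One conjunct of the WHERE clause, already resolved to the tables it references.
struct Predicate {
    std::string text;                       // e.g. "no_d_id = 1"
    std::string leftTable;                  // table that owns the left-hand column
    std::optional<std::string> rightTable;  // empty when the right side is a constant
};

struct Plan {
    std::map<std::string, std::vector<std::string>> filters;  // pushed down per table
    std::vector<std::string> joinPredicates;                  // drive InnerJoin creation
};

// Conjuncts that compare columns of two different tables become join predicates;
// everything else is attached to the single table it references.
Plan classify(const std::vector<Predicate>& conjuncts) {
    Plan plan;
    for (const auto& p : conjuncts) {
        if (p.rightTable && *p.rightTable != p.leftTable) {
            plan.joinPredicates.push_back(p.text);
        } else {
            plan.filters[p.leftTable].push_back(p.text);
        }
    }
    return plan;
}

// Example: SELECT ... FROM neworder, orders WHERE no_d_id = 1 AND o_id = no_o_id
Plan examplePlan() {
    std::vector<Predicate> conjuncts;
    conjuncts.push_back({"no_d_id = 1", "neworder", std::nullopt});
    conjuncts.push_back({"o_id = no_o_id", "orders", std::string("neworder")});
    return classify(conjuncts);
}
```

In this example the constant comparison ends up as a filter on neworder, which is what predicate pushdown would place directly above the scan, while the column-to-column comparison is the one that would justify a hash join.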
cd cmake-build-release
ctest --output-on-failure
# or directly:
./tester
When a SELECT is executed, QueryStatement::run():
- Calls tree->prepare() / tree->produce() on the operator tree, which writes C++ source into an std::ostringstream.
- Writes the source to codegen/queries/query_N.cc.
- Invokes the system C++ compiler: c++ -std=c++23 -O2 -fPIC -shared -I<project_root> query_N.cc -o query_N.so -L<build_dir> libimlab.so -ltbb
- dlopens the resulting .so, resolves runQuery(Database&, ostream&) via dlsym, and calls it.
- Unloads the library and removes the .so after execution; the .cc source is kept for inspection.
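The compile and load steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in QueryStatement: `compileCommand` and `resolveRunQuery` are hypothetical helpers, and looking up the plain name "runQuery" assumes the generated function is declared with C linkage. On older glibc versions the program must be linked with -ldl.

```cpp
#include <dlfcn.h>

#include <ostream>
#include <string>

struct Database;  // defined by the engine; only referenced here

// Signature described above: runQuery(Database&, ostream&).
using RunQueryFn = void (*)(Database&, std::ostream&);

// Assembles the compiler invocation listed above for query number n.
std::string compileCommand(const std::string& projectRoot,
                           const std::string& buildDir, int n) {
    const std::string base = "query_" + std::to_string(n);
    return "c++ -std=c++23 -O2 -fPIC -shared -I" + projectRoot + " " + base +
           ".cc -o " + base + ".so -L" + buildDir + " libimlab.so -ltbb";
}

// Opens the shared object and looks up runQuery. On success the caller invokes the
// returned function and then passes `handle` to dlclose(). On failure the function
// returns nullptr, leaves `handle` null, and stores a message in `error`.
RunQueryFn resolveRunQuery(const std::string& soPath, void*& handle,
                           std::string& error) {
    handle = dlopen(soPath.c_str(), RTLD_NOW);
    if (handle == nullptr) {
        const char* msg = dlerror();
        error = msg ? msg : "dlopen failed";
        return nullptr;
    }
    dlerror();  // clear any stale error before dlsym
    void* sym = dlsym(handle, "runQuery");
    if (sym == nullptr) {
        const char* msg = dlerror();
        error = msg ? msg : "runQuery not found";
        dlclose(handle);
        handle = nullptr;
        return nullptr;
    }
    return reinterpret_cast<RunQueryFn>(sym);
}
```

A caller would pass the string from `compileCommand` to something like `std::system`, check its exit status, and only then call `resolveRunQuery` on the resulting .so.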
Generated code uses TBB parallel scans and thread-local output buffers for cache-friendly, lock-free printing.
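The idea behind thread-local output buffers can be shown without TBB. The sketch below deliberately swaps in plain std::thread and one buffer per worker in place of tbb::parallel_for and enumerable_thread_specific, so it is an analogy for the pattern rather than the generated code itself; `parallelFilter` is a hypothetical name. Depending on the toolchain, it may need to be built with -pthread.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Scans `rows` with `workers` threads. Each thread appends matches to its own
// buffer, so no lock is needed during the scan; the buffers are concatenated once
// all threads have finished.
std::vector<int> parallelFilter(const std::vector<int>& rows, int threshold,
                                std::size_t workers) {
    workers = std::max<std::size_t>(workers, 1);
    std::vector<std::vector<int>> buffers(workers);  // one private buffer per worker
    std::vector<std::thread> pool;
    const std::size_t chunk = (rows.size() + workers - 1) / workers;

    for (std::size_t w = 0; w < workers; ++w) {
        pool.emplace_back([&rows, &buffers, threshold, chunk, w] {
            const std::size_t begin = std::min(w * chunk, rows.size());
            const std::size_t end = std::min(begin + chunk, rows.size());
            for (std::size_t i = begin; i < end; ++i) {
                if (rows[i] > threshold) {
                    buffers[w].push_back(rows[i]);
                }
            }
        });
    }
    for (auto& t : pool) {
        t.join();
    }

    // Single-threaded merge, in worker order, after the parallel phase.
    std::vector<int> result;
    for (const auto& b : buffers) {
        result.insert(result.end(), b.begin(), b.end());
    }
    return result;
}
```

Because every worker writes only to its own buffer, the scan phase needs no synchronization beyond the final join. The merge step is the only place where the results meet.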
See ARCHITECTURE.md for a full deep-dive covering:
- Volcano-style operator model
- Register / type-variant design
- Parser (Flex + Bison) internals
- Semantic analysis & scope management
- Optimizer extensibility
- Storage model & primary-key indexing
- Memory management & RAII conventions
- Future roadmap (cost-based optimizer, disk persistence, MVCC, …)
This project is a research/educational database engine. No license file is included — contact the author for usage terms.