TIMBER
Classical ML Inference Compiler
Technical Product Documentation | v0.1
Timber is a compiler that takes trained classical ML model artifacts — XGBoost, LightGBM, scikit-learn, and others — and emits optimized, self-contained inference binaries targeting specific hardware. No Python runtime at inference time. No generic runtime overhead. Just fast, auditable, portable code.
This document defines the end-to-end technical architecture, compiler pipeline, supported formats, optimization strategy, output targets, and integration surface of Timber.
Version: 0.1 — Initial Architecture Draft
Status: Pre-implementation design specification
The modern MLOps ecosystem was built almost entirely around deep learning. Frameworks like TensorFlow Serving, Triton Inference Server, and TorchServe were designed to serve neural networks — and they show it. Classical ML models, which still represent the majority of production workloads in finance, insurance, healthcare, ad tech, and industrial applications, are treated as second-class citizens.
The typical production path for an XGBoost or scikit-learn model today looks like this:
-
Train model in Python, serialize to .pkl or framework-native format
-
Wrap in a Flask or FastAPI endpoint
-
Deploy behind a load balancer with the full Python runtime
-
Pay the overhead of CPython, scikit-learn, and pandas on every inference call
This approach carries significant costs:
| Latency | Python function call overhead, interpreter dispatch, and memory allocation add microseconds to milliseconds on every request — unacceptable at high throughput |
|---|---|
| Memory | A Python process serving a 2MB XGBoost model may consume 200MB+ of RAM due to runtime dependencies |
| Portability | Deployment requires matching Python versions, dependency trees, and platform-specific wheels — brittle and painful |
| Auditability | A .pkl file is not human-readable, not certifiable, and not inspectable by safety reviewers |
| Edge / Embedded | Shipping a Python runtime to a microcontroller or ECU is simply not possible |
Timber solves this by treating classical ML inference as a compilation problem, not a runtime problem.
Timber is an ahead-of-time (AOT) compiler for classical machine learning models. It ingests a trained model artifact and a target hardware specification, and emits an optimized inference artifact — a shared library, static library, WebAssembly module, or standalone executable — that performs inference with zero runtime dependencies beyond the C standard library.
Timber is not a serving framework, a training library, or an MLOps platform. It occupies the layer between model training and model deployment, analogous to what LLVM does for programming languages — it is the backend that turns high-level model representations into efficient machine code.
-
Ahead-of-time over just-in-time. Models are compiled fully before deployment. There is no JIT warmup, no interpreter, no runtime graph rewriting.
-
Static shape exploitation. Classical ML models have fixed input dimensionality. Timber exploits this fully — loops are unrolled, strides are hardcoded, bounds checks are eliminated.
-
Target-aware code generation. The compiler knows the deployment target at compile time and emits code specifically tuned for that ISA, cache topology, and SIMD width.
-
Whole-pipeline compilation. Preprocessing stages — scaling, encoding, imputation — are compiled together with the model into a single fused artifact. No boundary overhead.
-
Auditability by design. Every compiled artifact is accompanied by a deterministic compilation report documenting every optimization applied, every tree pruned, and every precision decision made.
Timber. Decision trees are made of branches and leaves. The compiler turns forests into lumber — raw, optimized material ready to be built with.
Timber defines a canonical internal representation (the Timber IR) and provides front-end parsers for each supported framework format. Adding a new framework means writing a new parser that emits Timber IR — the optimizer and backends are framework-agnostic.
| Format | Framework | Model Types |
|---|---|---|
| XGBoost JSON / binary | XGBoost | GBT Classifier, GBT Regressor, GBT Ranker |
| LightGBM model.txt | LightGBM | GBT Classifier, GBT Regressor |
| scikit-learn Pipeline (.pkl) | scikit-learn | RandomForest, GradientBoosting, LogisticRegression, LinearSVC, plus preprocessing stages |
| ONNX (.onnx) | ONNX ML opset | TreeEnsemble operators, Linear operators, preprocessing ops |
-
CatBoost model artifacts (.cbm)
-
Spark MLlib exported models (via MLeap intermediate format)
-
H2O MOJO format
-
Custom rule-based models via Timber Rule DSL
All front-end parsers lower their input to the Timber IR before optimization. The IR is a typed, hierarchical representation of a full inference pipeline. It is serializable, diffable, and inspectable independently of any framework.
A Timber IR document consists of three top-level sections:
-
Pipeline — an ordered sequence of stages. Each stage is either a preprocessing transform or a model.
-
Schema — the input and output schema of the full pipeline: field names, types, shapes, and value constraints.
-
Metadata — provenance, training framework version, feature importance, and compilation hints.
| Stage Type | Description | Example |
|---|---|---|
| Scaler | Elementwise affine transform per feature | StandardScaler, MinMaxScaler |
| Encoder | Categorical to numeric transform | OneHotEncoder, OrdinalEncoder, TargetEncoder |
| Imputer | Missing value fill | SimpleImputer (mean/median/mode/constant) |
| TreeEnsemble | Gradient boosted or bagged tree ensemble | XGBoost, LightGBM, RandomForest |
| Linear | Dot product + bias with optional nonlinearity | LogisticRegression, LinearSVC |
| Aggregator | Combines multiple stage outputs | Voting ensemble, Stacking |
Tree ensembles are represented in the IR as a flat array of nodes. Each node carries:
-
A feature index (into the input vector)
-
A split threshold (float32 or float16 depending on precision mode)
-
Left and right child offsets (relative, enabling SIMD-friendly vectorization)
-
A leaf value or leaf distribution (for multi-class classification)
-
A depth tag (used for cache-aware traversal scheduling)
This flat layout is different from the pointer-linked tree structures used by XGBoost and LightGBM internally. The conversion is done at parse time and enables the vectorized traversal strategies described in Section 6.
Input model artifact → Front-end parser → Timber IR → Optimizer passes → Code generator → Target artifact
Each supported framework has a dedicated parser module. The parser is responsible for:
-
Deserializing the model artifact in a format-safe way (handling version differences, missing fields, non-standard encodings)
-
Validating model integrity and flagging unsupported constructs
-
Lowering the framework-native structure to Timber IR
-
Attaching metadata: original framework version, objective function, feature names if available
The optimizer operates on the Timber IR and applies a sequence of passes. Passes are ordered and may be repeated until a fixpoint is reached. Each pass is independently testable and produces a diff against the previous IR state for auditability.
Prune leaves whose contribution to the final prediction falls below a configurable threshold relative to the largest leaf value in the ensemble. This reduces tree size without meaningful accuracy loss. The threshold is configurable (default: 0.001 relative contribution).
Features with zero variance in the training data (as recorded in model metadata) are identified. Split nodes gating exclusively on constant features are folded to their dominant branch. The corresponding input column is flagged as ignorable — the code generator will not read it.
Split thresholds are analyzed for precision requirements. Where float32 precision is unnecessary (e.g., features representing integer counts, binary flags, bounded categorical ordinals), thresholds are quantized to float16 or int8. This halves or quarters the size of the threshold array, improving cache utilization.
This pass requires a calibration dataset — a representative sample of real inference inputs (minimum 1,000 rows, recommended 10,000+). The pass profiles split node outcomes across the calibration set and reorders each node's children so the more frequently taken branch is evaluated first. On modern CPUs with static branch prediction defaulting to fall-through, this meaningfully reduces mispredictions.
Preprocessing stages (Scalers, Encoders, Imputers) are analyzed for opportunities to fuse with downstream stages. Where a scaler is followed directly by a tree ensemble, the scaler's per-feature affine transforms are absorbed into the split thresholds — the input is compared to a pre-adjusted threshold, eliminating the scaling step at runtime entirely.
The flat node array is rearranged into a Structure-of-Arrays layout optimized for the target SIMD width. For AVX-512 targets, 16 tree traversals are interleaved — 16 data points traverse all trees in lockstep, with comparisons and conditional moves executed as vector operations. For ARM NEON (128-bit), the interleave width is 4.
The code generator lowers optimized Timber IR to target-specific code. Timber supports multiple code generation backends:
| Backend | Output | Use Case |
|---|---|---|
| C99 emitter | Portable C source (.c + .h) | Embedded targets, safety-certified environments, human readability |
| LLVM IR emitter | LLVM bitcode (.bc) | Maximum optimization via LLVM backend, x86/ARM/RISC-V targets |
| WASM emitter | WebAssembly module (.wasm) | Browser-side inference, edge workers, serverless |
| Assembly emitter | Hand-tuned x86 ASM (.asm) | Ultra-low-latency paths requiring direct register control |
The LLVM IR emitter is the primary backend for cloud and server deployments. It emits LLVM IR with target feature annotations (e.g., +avx512f, +avx512bw) and delegates to LLVM for register allocation, instruction scheduling, and final machine code emission.
The C99 emitter is the primary backend for embedded and safety-critical deployments. It emits strict C99 with no dynamic allocation, no recursion, and no floating-point unless the target explicitly supports it. The output is readable, auditable, and compilable with any C99-compliant toolchain.
For latency-optimized single-sample inference (batch size 1), Timber compiles the entire forward pass as a straight-line sequence of operations — no loops over trees, no dynamic dispatch. Each tree is unrolled into a nested conditional structure. The compiler statically determines the maximum depth across all trees and pads shallower trees to match, enabling the code generator to emit uniform basic blocks with no variable-length control flow.
The compiled artifact for a 500-tree XGBoost model with maximum depth 6 executing a single inference consists of approximately 500 × 64 = 32,000 comparisons and conditional moves. At modern CPU clock speeds this executes in under 10 microseconds on x86.
For throughput-optimized batched inference, Timber uses the vectorized traversal strategy developed in Pass 6 of the optimizer. Rather than executing trees one sample at a time, multiple samples traverse each tree in parallel using SIMD comparison instructions.
On AVX-512 hardware, 16 float32 comparisons are executed per instruction. A batch of 16 samples traverses a depth-6 tree in 6 vector comparison steps plus 6 conditional selection steps — 12 total instructions instead of 16 × 6 = 96 scalar instructions. This is roughly an 8x throughput improvement before accounting for cache effects.
The flat node array layout (Section 4.3) is designed for sequential memory access. During batched traversal, the 16 samples simultaneously access the same node — their divergence only appears in which child to follow. The node array is sized to fit in L1 or L2 cache for models of typical depth. For large ensembles, the code generator emits prefetch instructions for the probable next node based on the frequency data from Pass 4.
Timber supports three precision modes, selectable at compile time:
-
float32 (default) — Full single-precision throughout. Bit-identical to framework reference implementation.
-
float16 — Half-precision thresholds and leaf values. Acceptable accuracy for most tabular tasks. 2x reduction in data working set, improving cache hit rates.
-
mixed — float32 arithmetic, float16 storage. Thresholds promoted to float32 at comparison time. Balance of accuracy and cache efficiency.
The primary output for server-side deployments. The compiled inference function is exposed through a stable C ABI:
int timber_infer(
const float\* inputs, // input feature vector, length \= n\_features
int n\_samples, // number of samples in batch
float\* outputs, // output buffer, pre-allocated by caller
TimberCtx\* ctx // opaque context, zero-copy thread safe
);
The context object holds compiled model state (the node array, threshold array, leaf values) and is initialized once at process startup. It is read-only after initialization and safe to use concurrently from multiple threads without locking.
For embedding directly into a host application binary. Same C ABI as shared library. Useful for mobile applications, command-line tools, and environments where dynamic linking is unavailable or undesirable.
For browser-side inference, edge worker deployment, and sandboxed server environments. The WASM module exposes the same logical interface via a WASM-compatible calling convention. Input and output buffers are passed as linear memory offsets.
A self-contained directory containing:
-
model.c — the compiled inference logic
-
model.h — the public header
-
model_data.c — the node array and threshold data as static const arrays
-
CMakeLists.txt — build configuration for the target
-
Makefile — fallback build for environments without CMake
-
audit_report.json — compilation report (see Section 9)
This output is the primary format for embedded and safety-critical targets. It is human-readable, grep-able, diffable between model versions, and compilable with any conforming C99 toolchain.
A thin Python package containing only the compiled shared library and a ctypes wrapper. No scikit-learn, no XGBoost, no pandas dependency. Import and call:
from timber_runtime import TimberModel
model = TimberModel.load('model.timber')
predictions = model.predict(X_numpy) # X is a numpy float32 array
This is the migration path for teams who want the performance benefits of compiled inference without changing their serving infrastructure.
| Target | SIMD Width | Key Features |
|---|---|---|
| x86-64 generic | SSE2 (128-bit) | Baseline, broadly compatible |
| x86-64 AVX2 | AVX2 (256-bit) | 8-wide float32 vectorization |
| x86-64 AVX-512 | AVX-512 (512-bit) | 16-wide float32 vectorization, masking, preferred target |
| ARM64 (Graviton2/3) | NEON (128-bit) | 4-wide float32, AWS Graviton optimization |
| ARM64 with SVE | SVE (scalable) | Scalable vectorization for Ampere Altra, future Graviton |
| Target | ISA | Notes |
|---|---|---|
| ARM Cortex-M4/M7 | Thumb-2 + FPU | IoT, wearables, industrial sensors — integer-only mode available |
| ARM Cortex-A55 | AArch64 + NEON | Mobile, automotive infotainment |
| RISC-V RV32GC | RISC-V 32-bit | Emerging embedded targets, open ISA |
| x86 (i686) | x86 SSE2 | Legacy industrial systems |
The target is specified as a TOML file passed to the compiler. Example:
[target]
arch = "x86_64"
features = ["avx512f", "avx512bw", "avx512vl"]
os = "linux"
abi = "systemv"
[precision]
mode = "float32"
[output]
format = "shared_library"
strip_symbols = true
The primary user interface is the timber CLI. It provides commands for compilation, inspection, benchmarking, and validation.
timber compile \
--model model.json \ # input model artifact
--format xgboost \ # input format hint (auto-detected if omitted)
--target targets/avx512.toml \ # hardware target spec
--out ./dist/ \ # output directory
--calibration calib.csv \ # optional: calibration data for freq-ordering
--pipeline pipeline.pkl \ # optional: scikit-learn Pipeline for preprocessing
timber inspect model.json
Prints a summary of the model: framework, number of trees, maximum depth, number of features, objective, estimated compiled size. No compilation is performed.
timber bench \
--artifact ./dist/model.so \
--data bench_data.csv \
--batch-sizes 1,16,64,256 \
--warmup-iters 1000
Runs the compiled artifact against a data file and reports latency (P50/P95/P99), throughput (samples/second), and memory usage. Compares against a baseline (original framework) if --baseline is provided.
timber validate \
--artifact ./dist/model.so \
--reference model.json \
--data validation.csv \
--tolerance 1e-5
Runs the compiled artifact and the reference framework side-by-side on a validation dataset. Reports maximum absolute error, mean absolute error, and any samples where the outputs diverge beyond the tolerance threshold.
Every compilation produces an audit_report.json alongside the artifact. The report documents the complete compilation history and is intended to satisfy regulatory review requirements in financial services, healthcare, and automotive contexts.
-
Input artifact hash (SHA-256) — ties the compiled artifact to the exact model version
-
Timber version and target spec — full reproducibility record
-
Optimizer pass log — every pass that ran, what it changed, and what it left unchanged
-
Pruning summary — how many leaves were pruned by dead leaf elimination, and the estimated accuracy impact
-
Quantization decisions — which features were quantized, to what precision, and why
-
Calibration data statistics — if calibration data was provided, summary statistics of the input distribution
-
Output artifact hash (SHA-256) — the deterministic fingerprint of the compiled binary
Determinism guarantee: given the same input artifact, target spec, and optimizer configuration, Timber will always produce the bit-identical output artifact. The audit report is reproducible.
Teams migrating from scikit-learn or XGBoost prediction in Python can adopt Timber with minimal code changes by using the timber-runtime wheel:
# Before
predictions = model.predict(X)
# After
from timber_runtime import TimberModel
tmodel = TimberModel.load('model.timber')
predictions = tmodel.predict(X) # returns numpy array, same shape
The timber-runtime package has no dependency on the training framework. It imports in under 50ms and adds no startup penalty.
For teams who want a ready-made HTTP inference endpoint without writing serving code:
timber serve --artifact ./dist/model.so --port 8080
This starts a minimal HTTP server (written in Rust, not Python) that accepts JSON or binary-encoded feature vectors and returns predictions. The server handles batching, concurrency, and health checks. It is not intended to replace a production API gateway but is sufficient for development and internal tooling.
For embedded targets, the C source package output (Section 7.4) is the integration path. The generated code has the following guarantees:
-
No dynamic memory allocation — all working memory is statically allocated at compile time
-
No recursion — the traversal is iterative with a statically bounded stack
-
No floating-point if target specifies integer-only mode
-
No standard library dependencies beyond <stdint.h> and <string.h>
-
Deterministic execution time — no data-dependent branching in the hot path (branch counts are fixed by the unrolled structure)
These properties make the output suitable for MISRA-C compliance review and IEC 61508 / ISO 26262 safety certification processes.
The following targets represent the design goals for Phase 1. They are validated by the benchmark suite in the repository.
| Scenario | Target | Baseline (Python) |
|---|---|---|
| XGBoost 500 trees, depth 6, batch=1, AVX-512 | < 50 microseconds | ~2-5 milliseconds |
| XGBoost 500 trees, depth 6, batch=256, AVX-512 | < 5ms total (>50K samples/sec) | ~100-300ms |
| scikit-learn Pipeline (StandardScaler + RF 200 trees), batch=1 | < 100 microseconds | ~5ms |
| LightGBM 1000 trees, depth 8, batch=1, AVX2 | < 200 microseconds | ~8ms |
| XGBoost, batch=1, ARM Cortex-M7 (no SIMD) | < 5ms | N/A (no Python) |
Target latency improvements of 10-100x over the Python reference implementation. Memory footprint improvements of 50-200x (removing runtime dependencies).
-
XGBoost and LightGBM front-end parsers
-
Timber IR definition and serialization
-
Optimizer passes 1–5 (dead leaf, constant feature, quantization, branch sorting, pipeline fusion)
-
LLVM IR emitter for x86-64 (generic, AVX2, AVX-512)
-
C99 emitter
-
Shared library output format
-
timber compile, inspect, validate CLI commands
-
Benchmark suite with reference comparisons
-
Audit report generation
-
scikit-learn Pipeline parser
-
ONNX ML opset parser
-
ARM64 NEON and SVE targets
-
WebAssembly emitter
-
timber serve HTTP endpoint
-
timber-runtime Python wheel
-
Vectorization pass (Pass 6) for batched inference
-
Embedded target profiles (Cortex-M, RISC-V)
-
CatBoost and H2O MOJO parsers
-
Stacking and voting ensemble support
-
Multi-output model support (multi-class, multi-label, multi-target regression)
-
Differential compilation — detect what changed between model versions and recompile only affected trees
-
MISRA-C compliance mode for safety-critical C output
-
GUI for audit report visualization
-
Cloud SaaS: compile via API, serve via Timber-hosted endpoint
To maintain focus, the following are explicitly out of scope for Timber:
-
Training. Timber does not train models. It consumes artifacts from existing training frameworks.
-
Neural network inference. Deep learning is a solved (and crowded) problem. Timber does not compete with TensorRT, ONNX Runtime's neural net path, or TVM for this workload.
-
Model monitoring and drift detection. Timber is an inference compiler, not an MLOps platform.
-
Feature engineering at training time. Timber may compile preprocessing transformers that were fitted at training time, but it does not perform feature engineering on raw data.
-
GPU inference for classical ML. The tree traversal problem is not well-suited to GPUs for batch sizes typical of classical ML deployments. This may be revisited if customer demand warrants it.
Timber — Classical ML Inference Compiler | Technical Product Documentation v0.1