A polished PyTorch implementation of the current state-of-the-art (SOTA) Transformer. Designed for clarity, reproducibility, and interoperability with HuggingFace Transformers, this repository provides a robust, fully configurable baseline for research and engineering. The codebase emphasizes readable, well-documented components so you can iterate on feed-forward, attention, and normalization blocks and other architectural variants with minimal friction.
- Fully configurable architecture (layers, heads, model dimensions, dropout, etc.)
- API aligned with HuggingFace Transformers conventions.
- Compact, easily extensible design for rapid prototyping and research experiments.
- Clear, well-documented modules to facilitate experimentation with attention, FFNs, and more.
```bash
git clone --depth=1 https://github.com/lof310/transformer
cd transformer

# Install dependencies
pip install -r requirements.txt

# Install in development mode (recommended)
pip install -e .

# Or install normally
pip install .
```

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

from transformer import Transformer, TransformerConfig

# Configure the model
config = TransformerConfig(
    n_layers = 12,
    n_heads = 32,
    d_model = 1536,
    attn_qk_norm = False,
    tied_weights = False,
    seq_len = 1024,
    max_seq_len = 4096,
)

# Initialize the model
model = Transformer(config)

# Forward pass
B, N = 16, 1024
input_ids = torch.randint(low=0, high=config.vocab_size, size=(B, N))
output = model(input_ids, return_states=False)
```

The default configuration implements the latest SOTA Transformer design.
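The forward pass in the quick-start above produces next-token logits. As a hedged illustration of how a training step might be built on top of it — this README does not document the output format, so the shape `(B, N, vocab_size)` is an assumption, and a random tensor stands in for the model output to keep the snippet self-contained:

```python
import torch
import torch.nn.functional as F

B, N, vocab_size = 2, 16, 50000

# Stand-in for the model's output; with the real model this would be
# logits = model(input_ids) -- shape assumed to be (B, N, vocab_size).
logits = torch.randn(B, N, vocab_size)
input_ids = torch.randint(0, vocab_size, (B, N))

# Next-token prediction: logits at position t predict the token at t + 1,
# so drop the last logit and the first label before computing the loss.
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = input_ids[:, 1:].reshape(-1)
loss = F.cross_entropy(shift_logits, shift_labels)
```

With random logits the loss sits near `ln(vocab_size)`; in a real loop you would call `loss.backward()` and step an optimizer.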
```python
from transformer import TransformerConfig

TransformerConfig(
    n_layers = 12,
    d_model = 1536,
    n_heads = 32,
    n_kv_heads = None,   # GQA disabled
    vocab_size = 50000,
    d_ff = None,         # Chosen automatically, ratio 8/3 ≈ 2.67
    norm_design = "pre_norm",
    norm_class = "rms_norm",
    ffn_class = "SwiGLU",
    attn_class = "MHA",
    block_class = None,  # transformer.TransformerBlock
    attn_bias = False,
    ffn_bias = True,
    lm_head_bias = False,
    attn_qk_norm = True,
    attn_dropout = 0.0,
    tied_weights = False,
    seq_len = 1024,
    pos_encoding = "RoPE",
    rope_base = 10000.0,
    max_seq_len = 4096,
)
```

Full documentation is available at This Page.
Contributions are welcome!
Distributed under the Apache License 2.0. See LICENSE for more information.
If you use transformer in your research, please cite:
```bibtex
@software{transformer2026,
  author    = {Leinier Orama},
  title     = {transformer: PyTorch implementation of the current State-Of-The-Art (SOTA) Transformer},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/lof310/transformer}
}
```