A Complete End-to-End Tutorial for Building, Training, and Deploying a Production-Ready Small Language Model
This repository contains a complete, end-to-end blueprint for building, training, and deploying a custom ~400M parameter Small Language Model (SLM) from scratch. The model is trained to handle natural language instructions and execute Python function calls, with the final artifact being a production-ready ONNX model run by a lightweight Python agent.
- 🏗️ Build from Scratch: Define and initialize a ~400M parameter Llama-style model with random weights using transformers
- 📚 Two-Phase Training: Complete foundational pre-training (TinyStories) + Supervised Fine-Tuning (Oasst, Alpaca, Hermes-Function-Calling)
- 🔧 Function-Calling: Trained to use the Hermes/ChatML format for tool use (<tools>, <function_call>, <tool_response>)
- ⚡ ONNX Export: Convert the final trained PyTorch model to ONNX with proper KV cache handling using Hugging Face Optimum
- 🤖 Local Agent: Production-ready agent.py script using onnxruntime-genai for ReAct-style function execution
- 🚀 End-to-End: Complete pipeline from random weights to local, deployable function-calling AI
- Python 3.8+
- CUDA-compatible GPU (recommended for training)
- 16GB+ RAM
- At least 50GB free disk space
# Clone the repository
git clone <repository-url>
cd SLM-training-ONNX-and-functional-calling
# Install dependencies
pip install -r requirements.txt

This tutorial is divided into 5 sequential phases. Each phase builds upon the previous one and includes complete code implementations:
Duration: 5-10 minutes
Output: ./slm_from_scratch/ directory
Build the ~400M-parameter model configuration and initialize it with random weights.
# Run the initialization script
python 01_initialize_model.py

What you'll learn:
- Custom Llama-style architecture design
- Parameter count calculation and optimization
- Random weight initialization using AutoConfig
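The core of this phase can be sketched in a few lines: build a Llama-style config via AutoConfig and instantiate it with random weights. The hyperparameters below are taken from the architecture table later in this README; vocab_size is a hypothetical value (the script's actual vocabulary may be smaller, which is one reason the stated total can land near ~400M rather than the ~488M this exact config produces with untied embeddings).

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical config mirroring the architecture table in this README;
# vocab_size is an assumption, not the script's actual value.
config = AutoConfig.for_model(
    "llama",
    hidden_size=1280,
    intermediate_size=3584,
    num_hidden_layers=20,
    num_attention_heads=16,
    num_key_value_heads=16,
    vocab_size=32_000,
)

# from_config builds the model with random weights -- nothing is downloaded.
model = AutoModelForCausalLM.from_config(config)
print(f"{model.num_parameters():,} parameters")
```

Because the config is created through `AutoConfig`, the resulting model plugs straight into the `Trainer` API and Optimum export later in the pipeline.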
Duration: 10-15 minutes
Output: ./unified_sft_dataset/ directory
Download, merge, and format the Supervised Fine-Tuning datasets.
# Create unified SFT dataset
python 02_create_sft_dataset.py

What you'll learn:
- Dataset curation and merging strategies
- Hermes function-calling format implementation
- Data preprocessing for conversational AI
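The formatting step can be sketched as follows. The tag layout follows the Hermes/ChatML convention used throughout this README, but the helper name (`format_hermes`) and the exact system-prompt wording are illustrative, not the dataset script's actual implementation.

```python
import json

# Minimal sketch: render one SFT example in Hermes/ChatML style.
# Role markers (<|im_start|>/<|im_end|>) and tool tags follow the
# Hermes convention; details may differ from 02_create_sft_dataset.py.
def format_hermes(tools, user_msg, call, result, answer):
    return "\n".join([
        "<|im_start|>system",
        f"You have access to these tools: <tools>{json.dumps(tools)}</tools><|im_end|>",
        "<|im_start|>user",
        f"{user_msg}<|im_end|>",
        "<|im_start|>assistant",
        f"<function_call>{json.dumps(call)}</function_call><|im_end|>",
        "<|im_start|>tool",
        f"<tool_response>{result}</tool_response><|im_end|>",
        "<|im_start|>assistant",
        f"{answer}<|im_end|>",
    ])

sample = format_hermes(
    tools=[{"name": "get_weather", "parameters": {"location": "str"}}],
    user_msg="What's the weather like in Boston?",
    call={"name": "get_weather", "arguments": {"location": "Boston"}},
    result="72°F, sunny",
    answer="It's 72°F and sunny in Boston.",
)
print(sample)
```

Every example in the unified dataset is rendered into one such flat string before tokenization, so the model sees a single consistent schema across all three source datasets.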
Duration: 6-24 hours (depending on hardware)
Output: ./slm_final_trained/ directory
Execute the complete training pipeline with both pre-training and SFT phases.
# Launch Jupyter notebook for training
jupyter notebook notebook/the_SLM.ipynb

What you'll learn:
- Two-phase training methodology
- PyTorch Trainer API usage
- Data collator and tokenization strategies
- Training optimization and monitoring
Duration: 10-20 minutes
Output: ./slm_onnx/ directory
Convert the trained PyTorch model to production-ready ONNX format.
# Export model to ONNX
python 04_export_to_onnx.py

What you'll learn:
- Hugging Face Optimum integration
- KV cache handling for generative models
- Production deployment strategies
Duration: 5 minutes
Output: Interactive function-calling agent
Start the local, ONNX-powered agent and interact with function calls.
# Launch the agent
python 05_run_agent.py

Example Interaction:
User: What's the weather like in Boston?
Agent: Let me check the weather for you.
<function_call>get_weather("Boston")</function_call>
Tool Response: Current weather in Boston: 72°F, sunny
Agent: The current weather in Boston is 72°F with sunny skies.
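At its core, the agent loop parses `<function_call>` tags out of the model's output and dispatches to a tool registry. A minimal sketch of that parse-and-dispatch step, using the tag shown in the interaction above (the regex, tool registry, and `dispatch` helper are illustrative, not the exact agent.py implementation):

```python
import re

# Hypothetical tool registry; agent.py's real tools will differ.
TOOLS = {"get_weather": lambda location: f"Current weather in {location}: 72°F, sunny"}

CALL_RE = re.compile(r"<function_call>(\w+)\((.*?)\)</function_call>", re.DOTALL)

def dispatch(model_output):
    """Run the first function call found in the model's output, if any."""
    match = CALL_RE.search(model_output)
    if not match:
        return None  # plain text reply, no tool needed
    name, raw_args = match.groups()
    args = [a.strip().strip('"\'') for a in raw_args.split(",") if a.strip()]
    return TOOLS[name](*args)

reply = 'Let me check. <function_call>get_weather("Boston")</function_call>'
print(dispatch(reply))  # -> Current weather in Boston: 72°F, sunny
```

The tool's return value is then fed back to the model inside a `<tool_response>` tag, closing the ReAct loop so the model can compose its final answer.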
This project follows a "golden path" of technical decisions ensuring compatibility across all stages:
- Architecture: Custom Llama-style config via AutoConfig (ecosystem compatibility)
- Training: Two-phase approach (foundation → alignment)
- Export: Hugging Face Optimum (proper KV cache handling)
- Inference: onnxruntime-genai (specialized generative AI runtime)
| Parameter | Value | Rationale |
|---|---|---|
| Model Type | Llama-style Transformer | Modern architecture with RoPE and SwiGLU |
| Parameters | ~400M | Balance of capability vs. resource requirements |
| Layers | 20 | Depth/width balance typical at this scale |
| Hidden Size | 1280 | Model width (16 heads × 80 dims per head) |
| Heads | 16 | Multi-head attention |
| FFN Size | 3584 | SwiGLU feed-forward (intermediate) dimension |
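A back-of-the-envelope parameter count from the table above, assuming a hypothetical 32,000-token vocabulary with tied input/output embeddings (neither is specified in the table) and ignoring the small norm terms. Depending on vocabulary size and embedding tying, the total lands in the 400-500M range:

```python
# Hypothetical vocab size; the other values come from the table above.
vocab, hidden, layers, ffn = 32_000, 1280, 20, 3584

embeddings = vocab * hidden       # token embedding matrix (tied with output head)
attention  = 4 * hidden * hidden  # Q, K, V, O projections
ffn_params = 3 * hidden * ffn     # SwiGLU: gate, up, down projections
per_layer  = attention + ffn_params

total = embeddings + layers * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # -> ~447M parameters
```

Untying the embeddings adds another `vocab * hidden` (~41M here), which is why small vocabulary and tying choices move the headline number noticeably at this scale.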
- Dataset: TinyStories (clean, simple language)
- Objective: Learn grammar, syntax, basic world knowledge
- Duration: until the model reliably forms coherent text
- Datasets:
- OpenAssistant/oasst1 (conversational)
- tatsu-lab/alpaca (instruction-following)
- NousResearch/hermes-function-calling-v1 (tool use)
- Objective: Align for assistance + function calling
- Format: Hermes/ChatML with structured tool calls
The model learns to use this structured format:
System: <tools>get_weather(location) -> str</tools>
User: What's the weather like in Boston?
Agent: I'll check the weather for you.
<function_call>get_weather("Boston")</function_call>
Tool: <tool_response>Current weather in Boston: 72°F, sunny</tool_response>
Agent: The current weather in Boston is 72°F with sunny skies.
SLM-training-ONNX-and-functional-calling/
├── README.md # This file
├── requirements.txt # Dependencies
├── 01_initialize_model.py # Phase 1: Model initialization
├── 02_create_sft_dataset.py # Phase 2: Dataset preparation
├── 03_training_pipeline.ipynb # Phase 3: Training notebook
├── 04_export_to_onnx.py # Phase 4: ONNX export
├── 05_run_agent.py # Phase 5: Agent deployment
├── readme info.md # Original technical documentation
└── notebook/
└── the_SLM.ipynb # Interactive training notebook
By completing this tutorial, you will understand:
- Model Architecture: How transformer models are constructed and parameterized
- Training Methodology: Two-phase training from scratch to specialized assistant
- Data Engineering: Curating and formatting datasets for different training phases
- Production Deployment: Converting models to portable, high-performance formats
- Function Calling: Implementing tool use in language models
- Agent Design: Building reasoning loops for autonomous function execution
- Ecosystem Compatibility: Using AutoConfig/AutoModel ensures compatibility with Trainer API and Optimum
- Proper Training Order: Foundation training before instruction following is essential
- ONNX Optimization: Optimum handles the complex KV cache export automatically
- Agent Efficiency: onnxruntime-genai provides high-performance generative inference
- Parameter Calculation: Total params ≈ embeddings + layers × (attention + feed-forward)
- KV Cache: Critical for efficient autoregressive generation
- Format Consistency: Hermes format provides structured tool interaction
- Export Complexity: Dynamic computation graphs require specialized tools
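The KV cache insight above can be made concrete with a toy count of key/value projections. Without a cache, every generation step re-projects the whole growing sequence, so total work grows quadratically in the number of generated tokens; with a cache, each step only projects the single new token:

```python
def keys_computed(prompt_len, new_tokens, use_cache):
    """Count key/value projections performed while generating new_tokens."""
    if use_cache:
        # Prompt is projected once; each step adds exactly one new token.
        return prompt_len + new_tokens
    # Without a cache, step t re-projects all prompt_len + t tokens.
    return sum(prompt_len + t for t in range(1, new_tokens + 1))

print(keys_computed(100, 50, use_cache=True))   # -> 150
print(keys_computed(100, 50, use_cache=False))  # -> 6275
```

This ~40x gap at only 50 tokens is why Optimum's "with-past" export variant, which exposes the cache as explicit ONNX inputs/outputs, is essential for usable inference speed.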
# Out of memory errors
- Reduce batch size in training config
- Use gradient accumulation
- Enable model parallelism if available
# Slow training
- Verify CUDA is available: torch.cuda.is_available()
- Check GPU utilization: nvidia-smi
- Optimize data loading with num_workers
# Unsupported operations
- Verify all operations are exportable to ONNX
- Check for dynamic shapes issues
- Ensure proper model configuration
# KV Cache errors
- Verify Optimum version compatibility
- Check model architecture compatibility
- Validate input/output shapes
# Function call parsing errors
- Verify Hermes format compliance
- Check JSON structure in function calls
- Ensure proper tool definitions
# Performance issues
- Enable ONNX optimizations
- Check memory usage
- Verify GPU acceleration if available
- Constrained Decoding: Implement schema-constrained generation for reliable function calling
- Quantization: Add INT8 quantization for faster inference
- Model Scaling: Extend to 1B+ parameters using the provided architecture tables
- Multi-modal Support: Extend to vision-language tasks
- Distributed Training: Implement multi-GPU/multi-node training
- Web Interface: Create Gradio/Streamlit UI for the agent
- API Server: Build REST API for model serving
- Memory Optimization: Implement attention optimization techniques
- Custom Functions: Add domain-specific tool libraries
- Safety Filters: Implement content filtering and safety measures
- Evaluation Suite: Comprehensive benchmarking framework
- Attention Is All You Need - Original transformer paper
- Llama 2 Paper - Modern architecture insights
- Function Calling in LLMs - Theoretical foundations
- Hugging Face Transformers - Model implementation
- Hugging Face Optimum - ONNX export
- ONNX Runtime GenAI - Inference engine
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- Hugging Face Team - For the transformers and Optimum libraries
- Microsoft - For ONNX Runtime and GenAI
- EleutherAI - For training methodologies and datasets
- The SLM Community - For insights and feedback
If you encounter any issues or have questions:
- Check the troubleshooting section above
- Search existing issues in the repository
- Create a new issue with detailed information
- Join the community discussions
Happy Learning! 🎉
This tutorial represents a complete, production-ready workflow for building custom language models. Each phase has been tested and optimized for reliability and educational value.