A cutting-edge testbed demonstrating advanced vector database capabilities with TiDB: an end-to-end, LLM-powered retrieval pipeline built on remote API embeddings, including support for self-hosted models like Qwen.
This project showcases expertise in:
- Vector Databases & AI Integration: Implementing TiDB's vector search with LangChain for semantic retrieval
- Full-Stack Data Engineering: From document ingestion to evaluation metrics
- Modern Python Development: Clean, modular code with comprehensive testing
- Performance Benchmarking: Latency analysis and relevance scoring
- Knowledge Base Systems: Processing and querying large document collections
- Remote API Integration: Working with cloud-hosted and self-hosted embedding models
- Custom AI Implementations: Building custom embedding classes for specialized use cases
- 🔗 Seamless TiDB Integration: Direct connection to TiDB Cloud or self-hosted clusters
- 🧠 Remote API Embeddings: Support for OpenAI-compatible remote embedding models, including custom implementations for self-hosted models like Qwen
- 📚 Rich Document Processing: Markdown-based knowledge base with intelligent chunking
- ⚡ High-Performance Retrieval: Optimized vector indexing and similarity search
- 📊 Comprehensive Evaluation: Precision, Recall, NDCG, MRR, and latency metrics
- 🛠️ Modular Architecture: Clean, extensible codebase for easy customization
- 🔧 Flexible Configuration: Environment-based setup with sensible defaults
- 📈 Benchmarking Suite: Automated performance testing and reporting
- Database: TiDB Vector Database
- AI/ML: LangChain, Custom OpenAI-compatible embeddings for self-hosted models
- Backend: Python 3.12+, SQLAlchemy, PyMySQL
- Data Processing: Pandas, NumPy, Scikit-learn
- Development: Modern Python packaging (pyproject.toml)
```
tidb-vector-llm-testbed/
├── 📄 benchmark.py          # Main orchestration script
├── ⚙️ config.py             # Environment configuration
├── 🗄️ db_connection.py      # TiDB connection & schema management
├── 🧠 embedding_models.py   # Custom OpenAI-compatible embedding model loader
├── 🔍 vector_store.py       # LangChain-compatible vector store
├── 📊 evaluation.py         # Retrieval metrics & benchmarking
├── 📚 sample_data.py        # Document loading & preprocessing
├── 📋 scspedia/             # Knowledge base documents (Sarawak/Malaysia)
├── 📦 pyproject.toml        # Modern Python packaging
├── 📋 requirements.txt      # Dependencies
├── 📝 CHANGELOG.md          # Change history
├── 🔐 .env.example          # Configuration template
└── 📖 README.md             # This file
```
Get up and running in minutes:
```bash
# Clone and setup
git clone https://github.com/haja-k/tidb-vector-llm-testbed.git
cd tidb-vector-llm-testbed

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your TiDB credentials

# Run the complete benchmark
python benchmark.py
```

**Prerequisites:**

- Python 3.12 or higher
- TiDB cluster (Cloud or self-hosted)
- An API key for your remote embedding provider
**Installation:**

1. Clone the repository

   ```bash
   git clone https://github.com/haja-k/tidb-vector-llm-testbed.git
   cd tidb-vector-llm-testbed
   ```

2. Create a virtual environment (recommended)

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. Install dependencies

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment variables

   ```bash
   cp .env.example .env
   # Edit .env with your settings
   ```

5. Set up TiDB
   - Create a TiDB Cloud account or set up self-hosted TiDB
   - Create a database for testing (see the sketch after this list)
   - Update `.env` with connection details
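If you'd rather create the test database from Python, here is a minimal sketch using PyMySQL (already a project dependency); the host, credentials, and `vector_testbed` name are placeholders for your own `.env` values:

```python
import pymysql

# Connect without selecting a database so we can create one.
conn = pymysql.connect(
    host="your-tidb-host.com",  # placeholder: TIDB_HOST
    port=4000,                  # placeholder: TIDB_PORT
    user="your-username",       # placeholder: TIDB_USER
    password="your-password",   # placeholder: TIDB_PASSWORD
)
try:
    with conn.cursor() as cur:
        # The name must match TIDB_DATABASE in .env.
        cur.execute("CREATE DATABASE IF NOT EXISTS vector_testbed")
    conn.commit()
finally:
    conn.close()
```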
The `.env` file supports remote API embedding models:

```bash
# TiDB Connection
TIDB_HOST=your-tidb-host.com
TIDB_PORT=4000
TIDB_USER=your-username
TIDB_PASSWORD=your-password
TIDB_DATABASE=vector_testbed

# Remote API Settings (for Qwen and other OpenAI-compatible models)
REMOTE_EMBEDDING_BASE_URL=http://your-embedding-api.com/v1/embeddings
REMOTE_EMBEDDING_API_KEY=your-embedding-api-key
REMOTE_EMBEDDING_MODEL=Qwen/Qwen3-Embedding-8B
REMOTE_LLM_BASE_URL=https://ai-service.sains.com.my/llm/v1
REMOTE_LLM_API_KEY=your-llm-api-key
REMOTE_LLM_MODEL=Infermatic/Llama-3.3-70B-Instruct-FP8-Dynamic

# Vector dimensions (must match your model)
VECTOR_DIMENSION=4096
TABLE_NAME=documents_vector
```
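`config.py` isn't reproduced in this README; a minimal sketch of environment-based configuration along these lines, assuming `python-dotenv` handles the `.env` file:

```python
import os

from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # read .env from the project root

TIDB_HOST = os.getenv("TIDB_HOST", "localhost")
TIDB_PORT = int(os.getenv("TIDB_PORT", "4000"))
VECTOR_DIMENSION = int(os.getenv("VECTOR_DIMENSION", "4096"))

# Fail fast if required settings are missing.
for name in ("TIDB_HOST", "TIDB_USER", "TIDB_PASSWORD", "TIDB_DATABASE"):
    if not os.getenv(name):
        raise RuntimeError(f"Missing required environment variable: {name}")
```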
Run the full 6-step workflow:

```bash
python benchmark.py
```

This executes:
- ✅ Validate configuration
- ✅ Load embedding model
- ✅ Create vector tables and indexes
- ✅ Ingest and embed documents
- ✅ Set up LangChain retriever
- ✅ Evaluate performance metrics
```bash
# Fresh start (drop existing data)
python benchmark.py --drop-existing

# Skip ingestion (reuse existing embeddings)
python benchmark.py --skip-ingest

# Use full documents instead of chunks
python benchmark.py --markdown
```

Use components in your own applications:
```python
from vector_store import TiDBVectorStoreManager
from sample_data import get_documents

# Initialize vector store
manager = TiDBVectorStoreManager()
manager.initialize()

# Load and ingest documents
documents = get_documents()
manager.ingest_documents(documents)

# Create retriever for queries
retriever = manager.get_retriever(k=5)
results = retriever.get_relevant_documents("What is Sarawak?")
for doc in results:
    print(f"Content: {doc.page_content[:200]}...")
```

The testbed includes a comprehensive knowledge base of 13 documents about Sarawak, Malaysia:
- Federal Constitution
- State Constitution of Sarawak
- Cabinet and Premier information
- Economic development plans (PCDS 2030)
- Digital economy blueprint
- Cultural and geographical facts
- Government orders and policies
Documents are automatically chunked for optimal retrieval performance.
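The chunking parameters used by the testbed aren't listed in this README; as a rough sketch of the kind of splitting LangChain provides (the sizes and file name below are illustrative assumptions):

```python
from pathlib import Path

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Illustrative parameters; the repo's actual settings may differ.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=100,  # overlap preserves context across chunk boundaries
)

# "scspedia/federal_constitution.md" is a hypothetical file name.
text = Path("scspedia/federal_constitution.md").read_text(encoding="utf-8")
chunks = splitter.split_text(text)
print(f"Split into {len(chunks)} chunks")
```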
Comprehensive benchmarking includes:
- Precision@K & Recall@K: Relevance accuracy
- F1 Score: Balanced precision/recall metric
- NDCG@K: Ranking quality assessment
- MRR: Mean Reciprocal Rank
- Latency Analysis: Response time statistics
This project highlights proficiency in:
- Database Engineering: Vector database design, indexing, and optimization
- AI/ML Integration: Embedding models, semantic search, and LLM workflows
- Software Architecture: Modular design, dependency injection, and clean code
- Data Pipeline Development: ETL processes, document processing, and chunking strategies
- Performance Engineering: Benchmarking, latency optimization, and metrics analysis
- DevOps Practices: Environment configuration, dependency management, and deployment
- API Integration: Working with remote AI services and cloud APIs
Contributions welcome! Areas for improvement:
- Additional embedding model support
- Custom evaluation metrics
- UI dashboard for results visualization
- Multi-language document support
- Distributed deployment patterns
MIT License - see LICENSE for details.
Built with ❤️ for demonstrating cutting-edge AI database technologies
## Usage
### Running the Complete Benchmark
Run the full benchmark pipeline (all 6 steps):
```bash
python benchmark.py
```

This will execute:
- ✓ Connect to TiDB cluster
- ✓ Load embedding model
- ✓ Create vector index table
- ✓ Ingest and embed sample documents
- ✓ Query through LangChain Retriever
- ✓ Evaluate precision/recall and relevance
```bash
# Drop existing table and recreate
python benchmark.py --drop-existing

# Skip document ingestion (use existing data)
python benchmark.py --skip-ingest

# Use markdown format instead of FAQ format
python benchmark.py --markdown

# Combine options
python benchmark.py --drop-existing --markdown
```

You can also use individual components in your own scripts:
**Database connection:**

```python
from db_connection import TiDBConnection
from config import Config

# Connect and create the vector table
db = TiDBConnection()
engine = db.connect()
db.create_vector_table()
```

**Embedding model:**

```python
from embedding_models import EmbeddingModelLoader

embeddings = EmbeddingModelLoader.load_model()
```

**Document ingestion:**

```python
from vector_store import TiDBVectorStoreManager
from sample_data import get_documents

manager = TiDBVectorStoreManager()
manager.initialize()
documents = get_documents()
manager.ingest_documents(documents)
```

**Retrieval:**

```python
from vector_store import TiDBVectorStoreManager

manager = TiDBVectorStoreManager()
manager.initialize()
retriever = manager.get_retriever(k=5)
results = retriever.get_relevant_documents("What is TiDB vector search?")
for doc in results:
    print(doc.page_content)
```

**Evaluation:**

```python
from evaluation import RetrievalEvaluator

evaluator = RetrievalEvaluator()

# Evaluate a single query against known relevant document IDs
metrics = evaluator.evaluate_query(
    query="What is TiDB?",
    retrieved_docs=results,
    relevant_ids=[0, 1, 2],
    k_values=[1, 3, 5],
)

# Measure latency
latency_metrics = evaluator.evaluate_retrieval_latency(
    retriever,
    "test query",
    num_runs=5,
)
```

The benchmark evaluates retrieval performance using:
- Precision@K: Fraction of retrieved documents that are relevant
- Recall@K: Fraction of relevant documents that are retrieved
- F1@K: Harmonic mean of precision and recall
- NDCG@K: Normalized Discounted Cumulative Gain
- MRR: Mean Reciprocal Rank
- Latency: Mean, median, min, max, and standard deviation
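As a concrete, hand-checkable illustration of these definitions (independent of the repo's `RetrievalEvaluator`):

```python
# Toy example: 5 retrieved doc IDs, 3 known-relevant IDs.
retrieved = [7, 2, 9, 4, 1]
relevant = {2, 4, 5}

k = 3
top_k = retrieved[:k]
hits = [d for d in top_k if d in relevant]

precision_at_k = len(hits) / k                # 1/3 ≈ 0.333
recall_at_k = len(hits) / len(relevant)       # 1/3 ≈ 0.333
f1_at_k = (2 * precision_at_k * recall_at_k
           / (precision_at_k + recall_at_k))  # ≈ 0.333

# Reciprocal rank: 1 / rank of the first relevant result (doc 2 at rank 2).
rr = next((1 / (i + 1) for i, d in enumerate(retrieved) if d in relevant), 0.0)
print(precision_at_k, recall_at_k, f1_at_k, rr)  # 0.333… 0.333… 0.333… 0.5
```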
Example output from a benchmark run:

```text
================================================================================
TiDB Vector LLM Testbed - Benchmark Suite
================================================================================
STEP 1: Connecting to TiDB Cluster
================================================================================
✓ Configuration validated
- Host: your-tidb-host.com:6000
- Database: vector_testbed
- Embedding Model: Qwen/Qwen3-Embedding-8B
STEP 2: Loading Embedding Model
================================================================================
Loading remote API embedding model...
Remote API embeddings loaded successfully: Qwen/Qwen3-Embedding-8B
- API URL: http://your-embedding-api.com/v1/embeddings
✓ Embedding model loaded successfully
STEP 3: Creating Vector Index Table
================================================================================
Initializing TiDB connection and vector table...
Connecting to TiDB at your-tidb-host.com:6000...
Successfully connected to TiDB version: 8.0.11-TiDB-v8.5.2
Vector table documents_vector already exists.
Setting up TiFlash replica for vector index...
TiFlash replica set successfully. Waiting for replica to be available...
Vector table setup completed successfully.
TiDB Vector Store initialized successfully.
✓ Vector index table created: documents_vector
- Vector dimension: 4096
- Distance metric: Cosine
STEP 4: Ingesting and Embedding Documents
================================================================================
Loaded 8127 sample documents (FAQ dataset)
✓ Successfully ingested 8127 documents with embeddings
STEP 5: Querying Through LangChain Retriever
================================================================================
Retriever created: similarity search with k=5
Running 20 test queries...
✓ Completed 20 queries
STEP 6: Evaluating Retrieval Performance
================================================================================
Mean latency: 71.38 ms
Median latency: 71.59 ms
RETRIEVAL EVALUATION REPORT
================================================================================
Total Queries Evaluated: 20
Average Metrics:
K = 1:
Precision@1: 1.0000
Recall@1: 0.3333
F1@1: 0.5000
NDCG@1: 1.0000
K = 3:
Precision@3: 1.0000
Recall@3: 1.0000
F1@3: 1.0000
NDCG@3: 1.0000
K = 5:
Precision@5: 0.6000
Recall@5: 1.0000
F1@5: 0.7500
NDCG@5: 1.0000
Mean Reciprocal Rank (MRR): 1.0000
================================================================================
✓ BENCHMARK COMPLETED SUCCESSFULLY
```
The testbed supports both FAQ format and markdown-based facts:
**FAQ Format (default):**

```bash
python benchmark.py
```

**Markdown Format:**

```bash
python benchmark.py --markdown
```

The markdown format is ideal for storing knowledge bases, documentation, or factual content. Each document can include:
- Headings and subheadings
- Code blocks
- Lists and structured content
- Full markdown syntax
1. Create your document list:

```python
custom_documents = [
    {
        "content": "Your document text here",
        "metadata": {"id": 0, "category": "custom"}
    },
    # ... more documents
]
```

2. For markdown content:
````python
markdown_documents = [
    {
        "content": """# Title

Your markdown content here with **formatting**

- List item 1
- List item 2

```code
Example code
```
""",
        "metadata": {"id": 0, "title": "Title", "format": "markdown"}
    }
]
````
3. Ingest them:
```python
manager.ingest_documents(custom_documents)
```
The testbed supports OpenAI-compatible remote API models, including self-hosted models. You can use any embedding service that follows the OpenAI API format, such as:
- Self-hosted Qwen Models: Qwen/Qwen3-Embedding-8B (as demonstrated)
- Llama Models: Various Llama-based embedding models
- Other OpenAI-compatible APIs: Any service with OpenAI-compatible endpoints
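The actual loader lives in `embedding_models.py` and isn't reproduced here; as a rough sketch, a LangChain-compatible wrapper around an OpenAI-style `/v1/embeddings` endpoint might look like this (the class name, error handling, and timeout are illustrative assumptions):

```python
from typing import List

import requests
from langchain_core.embeddings import Embeddings


class RemoteAPIEmbeddings(Embeddings):
    """Hypothetical wrapper for an OpenAI-compatible embeddings endpoint."""

    def __init__(self, base_url: str, api_key: str, model: str):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key
        self.model = model

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # OpenAI-style request body: {"model": ..., "input": [...]}
        resp = requests.post(
            self.base_url,
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"model": self.model, "input": texts},
            timeout=60,
        )
        resp.raise_for_status()
        return [item["embedding"] for item in resp.json()["data"]]

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]
```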
Configure your API provider in `.env`:

```bash
REMOTE_EMBEDDING_BASE_URL=https://your-api-provider.com/v1
REMOTE_EMBEDDING_API_KEY=your-api-key
REMOTE_EMBEDDING_MODEL=your-model-name
VECTOR_DIMENSION=4096  # Check your model's dimension
```

Implement your own relevance judgments:
```python
from evaluation import RetrievalEvaluator

evaluator = RetrievalEvaluator()

# Your ground truth data
ground_truth = {
    "query1": [0, 1, 5],  # Relevant document IDs
    "query2": [2, 3, 4],
}

for query, relevant_ids in ground_truth.items():
    results = retriever.get_relevant_documents(query)
    metrics = evaluator.evaluate_query(query, results, relevant_ids)
```

**Connection issues:**

- Verify TiDB host and port in `.env`
- Check network connectivity and firewall rules
- Ensure TiDB user has appropriate permissions
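For a quick connectivity check outside the testbed, a short PyMySQL probe (host and credentials are placeholders for your `.env` values):

```python
import pymysql

try:
    conn = pymysql.connect(
        host="your-tidb-host.com",  # TIDB_HOST
        port=4000,                  # TIDB_PORT
        user="your-username",       # TIDB_USER
        password="your-password",   # TIDB_PASSWORD
        database="vector_testbed",  # TIDB_DATABASE
        connect_timeout=5,
    )
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print("Connected:", cur.fetchone()[0])
    conn.close()
except pymysql.MySQLError as exc:
    print("Connection failed:", exc)
```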
**Embedding API issues:**

- Verify `REMOTE_EMBEDDING_API_KEY` and `REMOTE_EMBEDDING_BASE_URL` are set correctly
- Ensure your API provider supports OpenAI-compatible endpoints
- Check `VECTOR_DIMENSION` matches your embedding model
- Verify API connectivity and authentication
**Performance tips:**

- Create vector indexes for better search performance
- Consider batch ingestion for large datasets (see the sketch below)
- Use appropriate k values for your use case
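A minimal batching loop over the `ingest_documents` API shown earlier; the batch size is an illustrative assumption to tune for your API rate limits and memory:

```python
from sample_data import get_documents
from vector_store import TiDBVectorStoreManager

manager = TiDBVectorStoreManager()
manager.initialize()

documents = get_documents()
BATCH_SIZE = 100  # illustrative; adjust for your workload

# Ingest in slices so one failed batch doesn't lose all progress.
for start in range(0, len(documents), BATCH_SIZE):
    batch = documents[start:start + BATCH_SIZE]
    manager.ingest_documents(batch)
    print(f"Ingested {start + len(batch)}/{len(documents)} documents")
```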
Contributions are welcome! Please feel free to submit issues or pull requests.
This project is licensed under the MIT License; see the LICENSE file for details.
For issues, questions, or contributions, please open an issue on GitHub.