Experimental framework for evaluating TiDB’s vector search capabilities with LangChain-based LLM retrieval workflows. Includes setup scripts, indexing pipelines, and retrieval benchmarks to test hybrid query performance and relevance scoring on TiDB’s vector database engine.


TiDB Vector LLM Testbed 🚀


A testbed demonstrating TiDB's vector database capabilities in an end-to-end LLM-powered retrieval system, using remote API embeddings with support for self-hosted models such as Qwen.

🌟 What This Project Demonstrates

This project showcases expertise in:

  • Vector Databases & AI Integration: Implementing TiDB's vector search with LangChain for semantic retrieval
  • Full-Stack Data Engineering: From document ingestion to evaluation metrics
  • Modern Python Development: Clean, modular code with comprehensive testing
  • Performance Benchmarking: Latency analysis and relevance scoring
  • Knowledge Base Systems: Processing and querying large document collections
  • Remote API Integration: Working with cloud-hosted and self-hosted embedding models
  • Custom AI Implementations: Building custom embedding classes for specialized use cases

✨ Key Features

  • 🔗 Seamless TiDB Integration: Direct connection to TiDB Cloud or self-hosted clusters
  • 🧠 Remote API Embeddings: Support for OpenAI-compatible remote embedding models, including custom implementations for self-hosted models like Qwen
  • 📚 Rich Document Processing: Markdown-based knowledge base with intelligent chunking
  • ⚡ High-Performance Retrieval: Optimized vector indexing and similarity search
  • 📊 Comprehensive Evaluation: Precision, Recall, NDCG, MRR, and latency metrics
  • 🛠️ Modular Architecture: Clean, extensible codebase for easy customization
  • 🔧 Flexible Configuration: Environment-based setup with sensible defaults
  • 📈 Benchmarking Suite: Automated performance testing and reporting

🛠️ Tech Stack

  • Database: TiDB Vector Database
  • AI/ML: LangChain, Custom OpenAI-compatible embeddings for self-hosted models
  • Backend: Python 3.12+, SQLAlchemy, PyMySQL
  • Data Processing: Pandas, NumPy, Scikit-learn
  • Development: Modern Python packaging (pyproject.toml)

📁 Project Structure

tidb-vector-llm-testbed/
├── 📄 benchmark.py           # Main orchestration script
├── ⚙️ config.py              # Environment configuration
├── 🗄️ db_connection.py       # TiDB connection & schema management
├── 🧠 embedding_models.py    # Custom OpenAI-compatible embedding model loader
├── 🔍 vector_store.py        # LangChain-compatible vector store
├── 📊 evaluation.py          # Retrieval metrics & benchmarking
├── 📚 sample_data.py         # Document loading & preprocessing
├── 📋 scspedia/              # Knowledge base documents (Sarawak/Malaysia)
├── 📦 pyproject.toml         # Modern Python packaging
├── 📋 requirements.txt       # Dependencies
├── 📝 CHANGELOG.md           # Change history
├── 🔐 .env.example           # Configuration template
└── 📖 README.md              # This file

🚀 Quick Start

Get up and running in minutes:

# Clone and setup
git clone https://github.com/haja-k/tidb-vector-llm-testbed.git
cd tidb-vector-llm-testbed

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp .env.example .env
# Edit .env with your TiDB credentials

# Run the complete benchmark
python benchmark.py

📖 Installation

Prerequisites

  • Python 3.12 or higher
  • TiDB cluster (Cloud or self-hosted)
  • API keys for remote embedding provider

Detailed Setup

  1. Clone the repository

    git clone https://github.com/haja-k/tidb-vector-llm-testbed.git
    cd tidb-vector-llm-testbed
  2. Create virtual environment (recommended)

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies

    pip install -r requirements.txt
  4. Configure environment variables

    cp .env.example .env
    # Edit .env with your settings
  5. Set up TiDB

    • Create a TiDB Cloud account or set up self-hosted TiDB
    • Create a database for testing
    • Update .env with connection details
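
If you do not already have a test database, one can be created over the MySQL protocol before running the benchmark. The snippet below is a minimal sketch using PyMySQL (already part of the stack); the host, credentials, and database name are placeholders that should match your .env.

```python
import pymysql

# Placeholder connection details -- use the same values you put in .env
conn = pymysql.connect(
    host="your-tidb-host.com",
    port=4000,
    user="your-username",
    password="your-password",
)
try:
    with conn.cursor() as cur:
        # Create the benchmark database if it does not exist yet
        cur.execute("CREATE DATABASE IF NOT EXISTS vector_testbed")
    conn.commit()
finally:
    conn.close()
```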

⚙️ Configuration

The .env file configures the TiDB connection and the remote API embedding and LLM models:

# TiDB Connection
TIDB_HOST=your-tidb-host.com
TIDB_PORT=4000
TIDB_USER=your-username
TIDB_PASSWORD=your-password
TIDB_DATABASE=vector_testbed

# Remote API Settings (for Qwen and other OpenAI-compatible models)
REMOTE_EMBEDDING_BASE_URL=http://your-embedding-api.com/v1/embeddings
REMOTE_EMBEDDING_API_KEY=your-embedding-api-key
REMOTE_EMBEDDING_MODEL=Qwen/Qwen3-Embedding-8B

REMOTE_LLM_BASE_URL=https://ai-service.sains.com.my/llm/v1
REMOTE_LLM_API_KEY=your-llm-api-key
REMOTE_LLM_MODEL=Infermatic/Llama-3.3-70B-Instruct-FP8-Dynamic

# Vector dimensions (must match your model)
VECTOR_DIMENSION=4096
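
Because the table's vector dimension must match the model, it is worth confirming that the remote model actually returns vectors of that size before ingesting anything. The snippet below is a minimal sanity-check sketch that assumes the endpoint follows the OpenAI embeddings API format; the URL, key, and model name are the placeholders from the example above.

```python
import requests

BASE_URL = "http://your-embedding-api.com/v1/embeddings"  # REMOTE_EMBEDDING_BASE_URL
API_KEY = "your-embedding-api-key"                        # REMOTE_EMBEDDING_API_KEY
MODEL = "Qwen/Qwen3-Embedding-8B"                         # REMOTE_EMBEDDING_MODEL
EXPECTED_DIM = 4096                                       # VECTOR_DIMENSION

resp = requests.post(
    BASE_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"model": MODEL, "input": ["dimension check"]},
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
assert len(embedding) == EXPECTED_DIM, (
    f"Model returns {len(embedding)}-dimensional vectors, expected {EXPECTED_DIM}"
)
```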

🎯 Usage

Complete Benchmark Pipeline

Run the full 6-step workflow:

python benchmark.py

This executes:

  1. ✅ Validate configuration
  2. ✅ Load embedding model
  3. ✅ Create vector tables and indexes
  4. ✅ Ingest and embed documents
  5. ✅ Set up LangChain retriever
  6. ✅ Evaluate performance metrics

Command Options

# Fresh start (drop existing data)
python benchmark.py --drop-existing

# Skip ingestion (reuse existing embeddings)
python benchmark.py --skip-ingest

# Use full documents instead of chunks
python benchmark.py --markdown

Programmatic Usage

Use components in your own applications:

from vector_store import TiDBVectorStoreManager
from sample_data import get_documents

# Initialize vector store
manager = TiDBVectorStoreManager()
manager.initialize()

# Load and ingest documents
documents = get_documents()
manager.ingest_documents(documents)

# Create retriever for queries
retriever = manager.get_retriever(k=5)
results = retriever.get_relevant_documents("What is Sarawak?")

for doc in results:
    print(f"Content: {doc.page_content[:200]}...")

📊 Sample Dataset

The testbed includes a comprehensive knowledge base of 13 documents about Sarawak, Malaysia:

  • Federal Constitution
  • State Constitution of Sarawak
  • Cabinet and Premier information
  • Economic development plans (PCDS 2030)
  • Digital economy blueprint
  • Cultural and geographical facts
  • Government orders and policies

Documents are automatically chunked for optimal retrieval performance.
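
The chunking parameters live in the project's preprocessing code; as a rough illustration of the idea, a recursive character splitter such as LangChain's can break each markdown document into overlapping chunks before embedding. The chunk size, overlap, and file name below are illustrative assumptions, not the testbed's actual settings.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Illustrative settings -- the testbed's own defaults may differ
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)

# Hypothetical knowledge-base file
with open("scspedia/example.md", encoding="utf-8") as f:
    chunks = splitter.split_text(f.read())

print(f"Produced {len(chunks)} chunks")
```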

📈 Evaluation Metrics

Comprehensive benchmarking includes:

  • Precision@K & Recall@K: Relevance accuracy
  • F1 Score: Balanced precision/recall metric
  • NDCG@K: Ranking quality assessment
  • MRR: Mean Reciprocal Rank
  • Latency Analysis: Response time statistics

🏆 Skills Demonstrated

This project highlights proficiency in:

  • Database Engineering: Vector database design, indexing, and optimization
  • AI/ML Integration: Embedding models, semantic search, and LLM workflows
  • Software Architecture: Modular design, dependency injection, and clean code
  • Data Pipeline Development: ETL processes, document processing, and chunking strategies
  • Performance Engineering: Benchmarking, latency optimization, and metrics analysis
  • DevOps Practices: Environment configuration, dependency management, and deployment
  • API Integration: Working with remote AI services and cloud APIs

🤝 Contributing

Contributions welcome! Areas for improvement:

  • Additional embedding model support
  • Custom evaluation metrics
  • UI dashboard for results visualization
  • Multi-language document support
  • Distributed deployment patterns

📄 License

MIT License - see LICENSE for details.

🔗 Resources


Built with ❤️ for demonstrating cutting-edge AI database technologies

Vector Configuration

VECTOR_DIMENSION=4096  # Dimension of your remote embedding model
TABLE_NAME=documents_vector

## Usage

### Running the Complete Benchmark

Run the full benchmark pipeline (all 6 steps):

```bash
python benchmark.py
```

This will execute:

  1. ✓ Connect to TiDB cluster
  2. ✓ Load embedding model
  3. ✓ Create vector index table
  4. ✓ Ingest and embed sample documents
  5. ✓ Query through LangChain Retriever
  6. ✓ Evaluate precision/recall and relevance

Command-Line Options

# Drop existing table and recreate
python benchmark.py --drop-existing

# Skip document ingestion (use existing data)
python benchmark.py --skip-ingest

# Use markdown format instead of FAQ format
python benchmark.py --markdown

# Combine options
python benchmark.py --drop-existing --markdown

Using Individual Modules

You can also use individual components in your own scripts:

Connect to TiDB

from db_connection import TiDBConnection
from config import Config

db = TiDBConnection()
engine = db.connect()
db.create_vector_table()
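
create_vector_table() encapsulates the schema setup; for orientation, a hand-rolled equivalent would look roughly like the sketch below. It assumes TiDB's VECTOR column type, a TiFlash replica, and VEC_COSINE_DISTANCE-based vector index syntax (available in recent TiDB releases, matching the cosine metric in the sample output); the column names are illustrative rather than the testbed's actual schema.

```python
from sqlalchemy import create_engine, text

# Placeholder DSN -- build it from the same values as .env
engine = create_engine("mysql+pymysql://user:password@your-tidb-host.com:4000/vector_testbed")

with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS documents_vector (
            id BIGINT PRIMARY KEY AUTO_INCREMENT,
            document TEXT,
            meta JSON,
            embedding VECTOR(4096)
        )
    """))
    # A TiFlash replica is required before a vector index can be added
    conn.execute(text("ALTER TABLE documents_vector SET TIFLASH REPLICA 1"))
    conn.execute(text(
        "ALTER TABLE documents_vector "
        "ADD VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding)))"
    ))
```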

Load Embedding Model

from embedding_models import EmbeddingModelLoader

embeddings = EmbeddingModelLoader.load_model()

Ingest Documents

from vector_store import TiDBVectorStoreManager
from sample_data import get_documents

manager = TiDBVectorStoreManager()
manager.initialize()

documents = get_documents()
manager.ingest_documents(documents)

Query with Retriever

from vector_store import TiDBVectorStoreManager

manager = TiDBVectorStoreManager()
manager.initialize()

retriever = manager.get_retriever(k=5)
results = retriever.get_relevant_documents("What is TiDB vector search?")

for doc in results:
    print(doc.page_content)

Evaluate Performance

from evaluation import RetrievalEvaluator

evaluator = RetrievalEvaluator()

# Evaluate query
metrics = evaluator.evaluate_query(
    query="What is TiDB?",
    retrieved_docs=results,
    relevant_ids=[0, 1, 2],
    k_values=[1, 3, 5]
)

# Measure latency
latency_metrics = evaluator.evaluate_retrieval_latency(
    retriever, 
    "test query", 
    num_runs=5
)

Evaluation Metrics

The benchmark evaluates retrieval performance using:

  • Precision@K: Fraction of retrieved documents that are relevant
  • Recall@K: Fraction of relevant documents that are retrieved
  • F1@K: Harmonic mean of precision and recall
  • NDCG@K: Normalized Discounted Cumulative Gain
  • MRR: Mean Reciprocal Rank
  • Latency: Mean, median, min, max, and standard deviation
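
As a point of reference, Precision@K, Recall@K, F1@K, and MRR follow directly from the retrieved document IDs and a set of relevant IDs. The self-contained sketch below shows the standard definitions; it is illustrative, not the evaluation.py implementation.

```python
def precision_recall_f1_at_k(retrieved_ids, relevant_ids, k):
    """Standard Precision@K, Recall@K and F1@K over document IDs."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1/rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Example: relevant docs {0, 1, 2}, retriever returned [1, 7, 0, 9, 4]
print(precision_recall_f1_at_k([1, 7, 0, 9, 4], {0, 1, 2}, k=3))  # ~(0.67, 0.67, 0.67)
print(reciprocal_rank([1, 7, 0, 9, 4], {0, 1, 2}))                # 1.0
```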

Sample Output

================================================================================
TiDB Vector LLM Testbed - Benchmark Suite
================================================================================

STEP 1: Connecting to TiDB Cluster
================================================================================
✓ Configuration validated
  - Host: your-tidb-host.com:6000
  - Database: vector_testbed
  - Embedding Model: Qwen/Qwen3-Embedding-8B

STEP 2: Loading Embedding Model
================================================================================
Loading remote API embedding model...
Remote API embeddings loaded successfully: Qwen/Qwen3-Embedding-8B
  - API URL: http://your-embedding-api.com/v1/embeddings
✓ Embedding model loaded successfully

STEP 3: Creating Vector Index Table
================================================================================
Initializing TiDB connection and vector table...
Connecting to TiDB at your-tidb-host.com:6000...
Successfully connected to TiDB version: 8.0.11-TiDB-v8.5.2
Vector table documents_vector already exists.
Setting up TiFlash replica for vector index...
TiFlash replica set successfully. Waiting for replica to be available...
Vector table setup completed successfully.
TiDB Vector Store initialized successfully.
✓ Vector index table created: documents_vector
  - Vector dimension: 4096
  - Distance metric: Cosine

STEP 4: Ingesting and Embedding Documents
================================================================================
Loaded 8127 sample documents (FAQ dataset)
✓ Successfully ingested 8127 documents with embeddings

STEP 5: Querying Through LangChain Retriever
================================================================================
Retriever created: similarity search with k=5
Running 20 test queries...
✓ Completed 20 queries

STEP 6: Evaluating Retrieval Performance
================================================================================
Mean latency: 71.38 ms
Median latency: 71.59 ms

RETRIEVAL EVALUATION REPORT
================================================================================
Total Queries Evaluated: 20

Average Metrics:
K = 1:
  Precision@1: 1.0000
  Recall@1:    0.3333
  F1@1:        0.5000
  NDCG@1:      1.0000

K = 3:
  Precision@3: 1.0000
  Recall@3:    1.0000
  F1@3:        1.0000
  NDCG@3:      1.0000

K = 5:
  Precision@5: 0.6000
  Recall@5:    1.0000
  F1@5:        0.7500
  NDCG@5:      1.0000

Mean Reciprocal Rank (MRR): 1.0000
================================================================================

✓ BENCHMARK COMPLETED SUCCESSFULLY

Extending the Testbed

Using Markdown Format for Documents

The testbed supports both FAQ format and markdown-based facts:

FAQ Format (default):

python benchmark.py

Markdown Format:

python benchmark.py --markdown

The markdown format is ideal for storing knowledge bases, documentation, or factual content. Each document can include:

  • Headings and subheadings
  • Code blocks
  • Lists and structured content
  • Full markdown syntax

Adding Custom Datasets

  1. Create your document list:
custom_documents = [
    {
        "content": "Your document text here",
        "metadata": {"id": 0, "category": "custom"}
    },
    # ... more documents
]
  2. For markdown content:
markdown_documents = [
    {
        "content": """# Title
Your markdown content here with **formatting**

- List item 1
- List item 2

```code
Example code
```
""",
        "metadata": {"id": 0, "title": "Title", "format": "markdown"}
    }
]

  3. Ingest them:
```python
manager.ingest_documents(custom_documents)
```

Using Remote API Embedding Models

The testbed supports OpenAI-compatible remote API models, including self-hosted models. You can use any embedding service that follows the OpenAI API format, such as:

  • Self-hosted Qwen Models: Qwen/Qwen3-Embedding-8B (as demonstrated)
  • Llama Models: Various Llama-based embedding models
  • Other OpenAI-compatible APIs: Any service with OpenAI-compatible endpoints

Configure your API provider in .env:

REMOTE_EMBEDDING_BASE_URL=https://your-api-provider.com/v1
REMOTE_EMBEDDING_API_KEY=your-api-key
REMOTE_EMBEDDING_MODEL=your-model-name
VECTOR_DIMENSION=4096  # Check your model's dimension
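
Under the hood, pointing LangChain at such a provider amounts to implementing the two methods of its Embeddings interface. The class below is a minimal illustration of that pattern, not the testbed's embedding_models.py; it assumes the full OpenAI-compatible embeddings endpoint URL is passed in and that each response item carries an index field.

```python
from typing import List

import requests
from langchain_core.embeddings import Embeddings


class RemoteAPIEmbeddings(Embeddings):
    """Hypothetical OpenAI-compatible embeddings client (illustrative only)."""

    def __init__(self, endpoint: str, api_key: str, model: str):
        self.endpoint = endpoint  # e.g. "<REMOTE_EMBEDDING_BASE_URL>/embeddings"
        self.api_key = api_key
        self.model = model

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        resp = requests.post(
            self.endpoint,
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"model": self.model, "input": texts},
            timeout=60,
        )
        resp.raise_for_status()
        # Keep embeddings aligned with the input order
        data = sorted(resp.json()["data"], key=lambda item: item["index"])
        return [item["embedding"] for item in data]

    def embed_query(self, text: str) -> List[float]:
        return self.embed_documents([text])[0]
```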

Custom Evaluation

Implement your own relevance judgments:

from evaluation import RetrievalEvaluator

evaluator = RetrievalEvaluator()

# Your ground truth data
ground_truth = {
    "query1": [0, 1, 5],  # Relevant document IDs
    "query2": [2, 3, 4],
}

for query, relevant_ids in ground_truth.items():
    results = retriever.get_relevant_documents(query)
    metrics = evaluator.evaluate_query(query, results, relevant_ids)

Troubleshooting

Connection Issues

  • Verify TiDB host and port in .env
  • Check network connectivity and firewall rules
  • Ensure TiDB user has appropriate permissions

Embedding Issues

  • Verify REMOTE_EMBEDDING_API_KEY and REMOTE_EMBEDDING_BASE_URL are set correctly
  • Ensure your API provider supports OpenAI-compatible endpoints
  • Check VECTOR_DIMENSION matches your embedding model
  • Verify API connectivity and authentication

Performance Issues

  • Create vector indexes for better search performance
  • Consider batch ingestion for large datasets
  • Use appropriate k values for your use case
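
On the batch-ingestion point above: rather than passing the whole corpus to ingest_documents() in one call, a loop like the following keeps each embedding request and insert small. The batch size is arbitrary, and the sketch assumes ingest_documents() accepts any slice of the document list.

```python
# Hypothetical batching wrapper around the manager shown earlier
BATCH_SIZE = 100  # arbitrary; tune to your API and cluster limits

for start in range(0, len(documents), BATCH_SIZE):
    batch = documents[start:start + BATCH_SIZE]
    manager.ingest_documents(batch)
    print(f"Ingested documents {start}..{start + len(batch) - 1}")
```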

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

This project is licensed under the terms of the LICENSE file.

Support

For issues, questions, or contributions, please open an issue on GitHub.
