Light Chat - Character AI Chatbot

A character AI chatbot system that uses ChromaDB for RAG (Retrieval-Augmented Generation), with text analysis and collection management tools.

Features

RAG-powered conversations: Uses ChromaDB vector database for context retrieval
Character-specific knowledge: Manages character data and message examples
Comprehensive analysis tools: Extract metadata and topics from text
Collection management: Full CRUD operations for ChromaDB collections
Metadata enrichment: Automatic metadata tagging for improved retrieval
Flexible configuration: Configurable chunking, embedding, and processing

Quick Start

Installation

# Using uv (recommended)
uv sync

# Or using pip
pip install -r requirements.txt

Basic Usage

Prepare RAG data (batch processing):

python prepare_rag.py

Analyze text and extract metadata:

python analyze_rag_text.py analyze rag_data/shodan.txt -v

Push single file to ChromaDB:

python push_rag_data.py rag_data/shodan.txt -c shodan

Manage collections:

# List all collections
python manage_collections.py list-collections -v

# Test a collection
python manage_collections.py test shodan -q "SHODAN artificial intelligence"

# Delete a collection
python manage_collections.py delete old_collection

RAG Scripts

1. analyze_rag_text.py

Analyze text files to extract metadata and topics:

Extract named entities (capitalized phrases, dates, numbers)
Identify key phrases (frequently occurring terms)
Generate metadata JSON files
Validate existing metadata files
Scan directories for missing metadata

Commands: analyze, validate, scan
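
The extraction steps listed above can be illustrated with a small sketch. This is not the code in analyze_rag_text.py; the function names, the regular expressions, and the restriction of "dates" to four-digit years are assumptions made for illustration only.

```python
import re
from collections import Counter


def extract_entities(text: str) -> dict[str, list[str]]:
    """Naive entity extraction (hypothetical helper, not the script's API)."""
    # Runs of capitalized words, e.g. "Citadel Station". Sentence-initial
    # words such as "The" are also caught; a real tool would filter them.
    capitalized = re.findall(r"\b[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*\b", text)
    # Assumption: treat four-digit years (1800-2099) as dates.
    dates = re.findall(r"\b(?:1[89]|20)\d{2}\b", text)
    # Integers and simple decimals.
    numbers = re.findall(r"\b\d+(?:\.\d+)?\b", text)
    return {
        "capitalized": sorted(set(capitalized)),
        "dates": sorted(set(dates)),
        "numbers": sorted(set(numbers)),
    }


def key_phrases(text: str, top_n: int = 5, min_length: int = 4) -> list[str]:
    """Return the most frequent lowercase words of at least min_length letters."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if len(word) >= min_length)
    return [word for word, _ in counts.most_common(top_n)]


sample = (
    "SHODAN controls Citadel Station. Citadel Station orbits Saturn in 2072. "
    "The station has 9 decks; the station is vast."
)
entities = extract_entities(sample)
top_terms = key_phrases(sample, top_n=1)
```

Frequency counting like this favors repeated common words, so a stop-word list is usually the first refinement.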

2. push_rag_data.py

Push individual text files to ChromaDB with enhanced features:

Single file processing with explicit collection naming
Dry-run mode for testing
Overwrite protection
Custom metadata file selection
Detailed progress tracking

3. manage_collections.py

Comprehensive collection management:

List collections with statistics
Delete single or multiple collections
Test collections with similarity search
Export collection data to JSON
Show detailed collection information

4. prepare_rag.py

Original batch processing script:

Process all text files in a directory
Create collections for base files and message examples
Parallel metadata enrichment
Automatic collection naming

5. collection_helper.py

Original collection helper:

List, delete, and test collections
Metadata-based filtering
Simple command-line interface

Documentation

See docs/RAG_SCRIPTS_GUIDE.md for comprehensive documentation including:

Detailed command reference
Configuration options
Common workflows
Troubleshooting guide
Best practices

Project Structure

light-chat/
├── rag_data/              # RAG text files and metadata
│   ├── shodan.txt         # Character context data
│   ├── shodan_message_examples.txt
│   └── shodan.json        # Metadata keys
├── configs/               # Configuration files
│   └── appconf.json       # Application configuration
├── character_storage/     # ChromaDB persistent storage
├── docs/                  # Documentation
│   └── RAG_SCRIPTS_GUIDE.md
├── analyze_rag_text.py    # Text analysis and metadata extraction
├── push_rag_data.py       # Single file upload to ChromaDB
├── manage_collections.py  # Collection management
├── prepare_rag.py         # Batch processing script
├── collection_helper.py   # Original collection helper
├── context_manager.py     # Runtime RAG retrieval
├── conversation_manager.py # Conversation handling
└── main.py                # Main application

Configuration

Edit configs/appconf.json to customize:

{
  "PERSIST_DIRECTORY": "./character_storage/",
  "KEY_STORAGE": "./rag_data/",
  "DOCUMENTS_DIRECTORY": "./rag_data/",
  "CHUNK_SIZE": 2048,
  "CHUNK_OVERLAP": 1024,
  "THREADS": 6,
  "EMBEDDING_DEVICE": "cpu",
  "RAG_COLLECTION": "shodan",
  "RAG_K": 2
}
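
A configuration like the one above can be checked before a long ingestion run. The sketch below is a minimal example and not the project's loader: validate_config, load_config, and the rule that CHUNK_OVERLAP must be strictly smaller than CHUNK_SIZE are assumptions chosen for illustration.

```python
import json
from pathlib import Path

# Keys from the sample configuration that are expected to hold integers.
INT_KEYS = ("CHUNK_SIZE", "CHUNK_OVERLAP", "THREADS", "RAG_K")


def validate_config(config: dict) -> dict:
    """Raise ValueError on obviously inconsistent values (hypothetical helper)."""
    for key in INT_KEYS:
        value = config.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{key} must be a non-negative integer, got {value!r}")
    if config["CHUNK_SIZE"] == 0:
        raise ValueError("CHUNK_SIZE must be greater than zero")
    # Assumption of this sketch: an overlap as large as the chunk itself
    # would make consecutive chunks identical, so it is rejected.
    if config["CHUNK_OVERLAP"] >= config["CHUNK_SIZE"]:
        raise ValueError("CHUNK_OVERLAP must be smaller than CHUNK_SIZE")
    return config


def load_config(path: str = "configs/appconf.json") -> dict:
    """Read the JSON file and validate it."""
    return validate_config(json.loads(Path(path).read_text(encoding="utf-8")))
```

With the sample values, each 2048-character chunk shares 1024 characters with its predecessor, so roughly half of every chunk is repeated context.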

Metadata Format

Metadata files should follow this structure:

[
  {
    "uuid": "unique-identifier",
    "text": "Searchable content"
  }
]

Supported text field names: text, content, value, text_field, text_fields
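
The structure above can be checked with a few lines of Python. This is a hedged sketch, not the logic behind the validate command: validate_metadata is a hypothetical name, and treating "uuid" as required is an assumption based on the example entry.

```python
# Field names listed above as accepted carriers of the searchable text.
TEXT_FIELDS = ("text", "content", "value", "text_field", "text_fields")


def validate_metadata(entries: object) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(entries, list):
        return ["top-level value must be a list"]
    errors = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            errors.append(f"entry {index}: expected an object")
            continue
        # Assumption: every entry needs a non-empty uuid.
        if not entry.get("uuid"):
            errors.append(f"entry {index}: missing 'uuid'")
        if not any(name in entry for name in TEXT_FIELDS):
            errors.append(f"entry {index}: no text field ({', '.join(TEXT_FIELDS)})")
    return errors


problems = validate_metadata([{"uuid": "unique-identifier", "text": "Searchable content"}])
```

Returning a list of problems instead of raising on the first one lets a caller report every broken entry in a single pass.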

Common Workflows

Adding New Character Data

# 1. Analyze and extract metadata
python analyze_rag_text.py analyze new_character.txt -o new_character.json

# 2. Validate metadata
python analyze_rag_text.py validate new_character.json

# 3. Push to ChromaDB
python push_rag_data.py new_character.txt -c new_character

# 4. Test the collection
python manage_collections.py test new_character -q "test query"

Updating Existing Collection

# 1. Backup
python manage_collections.py export shodan -o shodan_backup.json

# 2. Update with overwrite
python push_rag_data.py rag_data/shodan.txt -c shodan -w

# 3. Test
python manage_collections.py test shodan -q "verification query"

Batch Processing

# Scan and auto-generate missing metadata
python analyze_rag_text.py scan rag_data/ --auto-generate

# Process all files
python prepare_rag.py

Advanced Features

Dry-Run Mode

Test configuration without making changes:

python push_rag_data.py file.txt -c collection -d

Custom Chunk Sizes

Optimize for your use case:

python push_rag_data.py file.txt -c collection -cs 1024 -co 512
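
To see how chunk size and overlap interact, the sketch below splits text with a plain character-based sliding window. It is an illustration only: the project's scripts may split on separators or tokens instead, and chunk_text is a hypothetical helper.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window reaches the end, to avoid trailing fragments
        # that are fully contained in the previous chunk.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
```

Smaller chunks produce more, narrower retrieval hits; a larger overlap lowers the chance that a sentence is cut across a boundary, at the cost of storing more duplicated text.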

Metadata Filtering

Search with metadata filters:

python manage_collections.py test collection -q "query with metadata"

Bulk Operations

Delete multiple collections:

python manage_collections.py delete-multiple --pattern "test_*" -y
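
Before running a bulk delete, it can help to preview which names a pattern would match. The sketch below assumes that --pattern uses shell-style wildcards, which the source does not confirm; select_collections is a hypothetical helper and the collection names are made up.

```python
from fnmatch import fnmatchcase


def select_collections(names: list[str], pattern: str) -> list[str]:
    """Return the names matching a shell-style pattern, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]


existing = ["test_a", "shodan", "test_b", "latest_x"]
to_delete = select_collections(existing, "test_*")
```

fnmatchcase is used rather than fnmatch so that matching does not depend on the operating system's case rules.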

Dependencies

chromadb: Vector database for embeddings
langchain: RAG orchestration and document processing
langchain-chroma: ChromaDB integration
langchain-huggingface: Embedding models
sentence-transformers: Embedding backend
click: CLI framework
loguru: Logging

Development

Running Tests

# Test scripts with sample data
python analyze_rag_text.py analyze rag_data/shodan.txt -v
python manage_collections.py list-collections -v

Linting

# Using ruff
ruff check .
ruff format .

Troubleshooting

No output from scripts?

Set "SHOW_LOGS": true in configs/appconf.json (add the key if it is missing; the sample configuration above does not include it).

Collection already exists?

Pass the --overwrite flag to push_rag_data.py, or delete the collection first: python manage_collections.py delete <name> -y

Out of memory?

Reduce chunk size: --chunk-size 1024
Reduce threads: --threads 2

Metadata not applied?

Validate: python analyze_rag_text.py validate <file.json>
Check filename matches (e.g., shodan.txt → shodan.json)

Contributing

Contributions are welcome! Please:

Follow the existing code style (ruff configuration)
Add tests for new features
Update documentation
Submit pull requests

License

See LICENSE for details.

Support

For issues or questions:

Check docs/RAG_SCRIPTS_GUIDE.md
Open an issue on GitHub
