Key Features • Supported Benchmarks • Getting Started • Reporting • Project Ethos
eka-eval is the official evaluation pipeline for the EKA project (eka.soket.ai), designed to provide comprehensive, fair, and transparent benchmarking for large language models (LLMs). Our framework supports both global and India-centric evaluations, with special emphasis on multilingual capabilities across Indian languages.
- Global + India-First: Combines international benchmarks with India-specific evaluations
- Rigorous & Reproducible: Standardized evaluation protocols with detailed logging
- Production-Ready: Optimized for efficiency with quantization and multi-GPU support
- Extensible: Easy integration of custom benchmarks and evaluation logic
- Transparent: Comprehensive reporting with detailed error analysis
- 17+ English Benchmarks: MMLU, GSM8K, HumanEval, ARC-Challenge, and more
- 12+ Indic Benchmarks: MMLU-IN, BoolQ-IN, ARC-Challenge-IN, MILU, and others
- Specialized Tasks: Code generation, mathematical reasoning, long-context understanding
- Multi-modal Support: Text, code, and multilingual evaluation capabilities
- 11 Indian Languages: Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu
- Smart Language Handling: Automatic script recognition and Hindi-English letter mapping
- Per-Language Metrics: Detailed breakdown of performance across languages
- Multi-GPU Support: Distributed evaluation across multiple GPUs
- Quantization Ready: 4-bit/8-bit quantization for efficient large model evaluation
- Batched Inference: Optimized throughput with configurable batch sizes
- Memory Management: Smart resource cleanup and CUDA cache management (see the sketch after this list)
- Modular Architecture: Clean separation of concerns with extensible design
- Prompt System: Template-based prompts with language-specific customization
- Rich Configuration: JSON-based benchmark configs with validation
- Detailed Logging: Comprehensive debug information and progress tracking
- Multiple Output Formats: CSV summaries, JSONL details, console tables
- Error Analysis: Per-instance results for debugging and improvement
- Reproducibility: Timestamped results with full configuration tracking
- Flexible Metrics: Accuracy, F1, BLEU, pass@k, and custom metrics
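The memory-management point above typically comes down to explicit cleanup between model runs. A minimal illustrative pattern (an assumption about the general approach, not eka-eval's exact code):

```python
# Illustrative cleanup between evaluation runs; hypothetical helper, not eka-eval's API.
import gc
import torch

def cleanup_after_run():
    """Reclaim memory between evaluations: run GC, then clear the CUDA cache."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

# Typical usage: delete model/pipeline references first, then call cleanup_after_run().
```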
Install from source:

```bash
git clone https://github.com/your-org/eka-eval.git
cd eka-eval

# Create virtual environment
python3 -m venv eka-env
source eka-env/bin/activate  # On Windows: eka-env\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

Core dependencies (from `requirements.txt`):

```text
torch>=2.0.0
transformers>=4.35.0
datasets>=2.14.0
evaluate>=0.4.0
accelerate>=0.24.0
bitsandbytes>=0.41.0  # For quantization
pandas>=1.5.0
tqdm>=4.64.0
numpy>=1.24.0
```

For private models or gated datasets:

```bash
huggingface-cli login
# OR
export HF_TOKEN="your_hf_token_here"
```
```bash
# Run interactive evaluation
python3 scripts/run_benchmarks.py

# Evaluate specific model on math benchmarks
python3 scripts/run_benchmarks.py \
  --model "google/gemma-2b" \
  --task_groups "MATH AND REASONING" \
  --benchmarks "GSM8K,MATH"

# Multi-language evaluation
python3 scripts/run_benchmarks.py \
  --model "sarvamai/sarvam-1" \
  --task_groups "INDIC BENCHMARKS" \
  --languages "hi,bn,gu"

# Code generation evaluation
python3 scripts/run_benchmarks.py \
  --model "microsoft/CodeT5-large" \
  --task_groups "CODE GENERATION" \
  --pass_k "1,5,10"
```
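The `--pass_k` flag selects which pass@k values are reported. For reference, pass@k on code benchmarks such as HumanEval is normally computed with the unbiased estimator from Chen et al. (2021); a standalone sketch of that formula (illustrative, not eka-eval's internal code):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples generated per problem, c of them passed the tests."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), evaluated as a numerically stable product
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(round(pass_at_k(n=10, c=3, k=1), 3))  # 0.3
```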
Individual benchmark scripts can also be run directly:

```bash
python eka_eval/benchmarks/tasks/math/gsm8k.py --model_name_test gpt2
python eka_eval/benchmarks/tasks/indic/boolq_in.py --target_languages_test hi en
python eka_eval/benchmarks/tasks/long_context/infinitebench.py --dataset_split_test longdialogue_qa_eng
```

Project layout:

```text
eka-eval/
├── eka_eval/                     # Core library
│   ├── benchmarks/               # Evaluation logic
│   │   ├── tasks/
│   │   │   ├── code/             # HumanEval, MBPP, etc.
│   │   │   ├── math/             # GSM8K, MATH, etc.
│   │   │   ├── indic/            # Indic language benchmarks
│   │   │   ├── reasoning/        # ARC, HellaSwag, etc.
│   │   │   ├── long_context/     # InfiniteBench, etc.
│   │   │   └── general/          # MMLU, AGIEval, etc.
│   │   └── benchmark_registry.py
│   ├── core/                     # Model loading & evaluation
│   ├── utils/                    # Utilities & helpers
│   └── config/                   # Benchmark configurations
├── scripts/                      # Execution scripts
│   ├── run_benchmarks.py         # Main orchestrator
│   └── evaluation_worker.py      # Worker process logic
├── results_output/               # Evaluation results
├── prompts/                      # Prompt templates
│   ├── math/                     # Math benchmark prompts
│   ├── indic/                    # Indic benchmark prompts
│   ├── general/                  # General benchmark prompts
│   └── long_context/             # Long context prompts
└── requirements.txt
```
| Category | Benchmarks | Languages | Metrics |
|---|---|---|---|
| Knowledge | MMLU, MMLU-Pro, TriviaQA, NaturalQuestions | English | Accuracy |
| Mathematics | GSM8K, MATH, GPQA, ARC-Challenge | English | Accuracy |
| Code Generation | HumanEval, MBPP, HumanEval+, MBPP+ | Python, Multi-PL | pass@1, pass@k |
| Reasoning | BBH, AGIEval, HellaSwag, WinoGrande | English | Accuracy |
| Reading | SQuAD, QuAC, BoolQ, XQuAD | English + Others | F1, EM, Accuracy |
| Long Context | InfiniteBench, ZeroSCROLLS, Needle-in-Haystack | English | Task-specific |
| Benchmark | Languages | Description | Metrics |
|---|---|---|---|
| MMLU-IN | 11 Indic + EN | Knowledge understanding across subjects | Accuracy |
| BoolQ-IN | 11 Indic + EN | Yes/No question answering | Accuracy |
| ARC-Challenge-IN | 11 Indic + EN | Science reasoning questions | Accuracy |
| MILU | 11 Indic + EN | AI4Bharat's multilingual understanding | Accuracy |
| GSM8K-IN | Hindi, Others | Math word problems in Indian languages | Accuracy |
| IndicGenBench | Multiple | Generation tasks for Indic languages | Task-specific |
| Flores-IN | 22 Languages | Translation quality assessment | BLEU, ChrF |
| XQuAD-IN | 11 Languages | Cross-lingual reading comprehension | F1, EM |
- English: Primary evaluation language
- Hindi (hi): Devanagari script with smart character mapping (see the sketch below)
- Bengali (bn): Bengali script
- Gujarati (gu): Gujarati script
- Kannada (kn): Kannada script
- Malayalam (ml): Malayalam script
- Marathi (mr): Devanagari script
- Odia (or): Odia script
- Punjabi (pa): Gurmukhi script
- Tamil (ta): Tamil script
- Telugu (te): Telugu script
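The smart script handling mentioned above can be approximated with Unicode block ranges. A minimal illustrative sketch (an assumption about the approach, not the framework's internal implementation):

```python
# Hypothetical helper: detect the dominant Indic script of a string via Unicode blocks.
UNICODE_BLOCKS = {
    "Devanagari": (0x0900, 0x097F),  # Hindi, Marathi
    "Bengali":    (0x0980, 0x09FF),
    "Gurmukhi":   (0x0A00, 0x0A7F),  # Punjabi
    "Gujarati":   (0x0A80, 0x0AFF),
    "Odia":       (0x0B00, 0x0B7F),
    "Tamil":      (0x0B80, 0x0BFF),
    "Telugu":     (0x0C00, 0x0C7F),
    "Kannada":    (0x0C80, 0x0CFF),
    "Malayalam":  (0x0D00, 0x0D7F),
}

def detect_script(text: str) -> str:
    """Return the script whose Unicode block covers the most characters in `text`."""
    counts = dict.fromkeys(UNICODE_BLOCKS, 0)
    for ch in text:
        cp = ord(ch)
        for name, (lo, hi) in UNICODE_BLOCKS.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    best = max(counts, key=counts.get)
    return best if counts[best] else "Latin/Other"

print(detect_script("यह एक परीक्षण है"))  # Devanagari
```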
Example interactive session:

```text
Enter model source ('1' for Hugging Face, '2' for Local Path): 1
Enter Hugging Face model name: google/gemma-2b

--- Available Benchmark Task Groups ---
1. CODE GENERATION        7. MMLU
2. MATH AND REASONING     8. MMLU-Pro
3. READING COMPREHENSION  9. IFEval
4. COMMONSENSE REASONING  10. BBH
5. WORLD KNOWLEDGE        11. AGIEval
6. LONG CONTEXT           12. INDIC BENCHMARKS
Select task group #(s): 2 12

--- Select benchmarks for MATH AND REASONING ---
1. GSM8K    4. ARC-Challenge
2. MATH     5. ALL
3. GPQA     6. SKIP
Select benchmark #(s): 1 2

[Worker 0 (GPU 0)] Loading model: google/gemma-2b (2.0B parameters)
[Worker 0 (GPU 0)] Running GSM8K evaluation...
[Worker 0 (GPU 0)] GSM8K Accuracy: 42.3% (527/1247)
[Worker 0 (GPU 0)] Running MATH evaluation...
[Worker 0 (GPU 0)] MATH Accuracy: 12.1% (601/5000)

Results saved to: results_output/calculated.csv
```
To add a custom benchmark, provide an evaluation function, a prompt template, and a config entry.

```python
# my_benchmark.py
def evaluate_my_task(pipe, tokenizer, model_name_for_logging, device, **kwargs):
    # Your evaluation logic here
    results = {"MyTask": accuracy_score}
    return results
```

In `prompts/custom/my_task.json`:

```json
{
  "my_task_0shot": {
    "template": "Question: {question}\nAnswer:",
    "description": "Zero-shot prompt for my task"
  },
  "default_few_shot_examples": [
    {"question": "Example question", "answer": "Example answer"}
  ]
}
```

```python
# Add to benchmark_config.py
"MyTask": {
    "description": "My custom evaluation task",
    "evaluation_function": "my_project.my_benchmark.evaluate_my_task",
    "task_args": {
        "prompt_template_name_zeroshot": "my_task_0shot",
        "prompt_file_benchmark_key": "my_task",
        "prompt_file_category": "custom"
    }
}
```
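For orientation, a hypothetical, slightly fuller version of `evaluate_my_task` is sketched below, assuming a Hugging Face text-generation pipeline and an illustrative dataset (`my_org/my_task_dataset` with `question`/`answer` fields is a placeholder); adapt it to the framework's actual call signature and your data:

```python
# Hypothetical expansion of evaluate_my_task; dataset id and field names are illustrative.
from datasets import load_dataset

def evaluate_my_task(pipe, tokenizer, model_name_for_logging, device, **kwargs):
    data = load_dataset("my_org/my_task_dataset", split="test")  # placeholder dataset
    correct = 0
    for example in data:
        prompt = f"Question: {example['question']}\nAnswer:"
        generation = pipe(prompt, max_new_tokens=32, do_sample=False)[0]["generated_text"]
        completion = generation[len(prompt):].strip()
        prediction = completion.splitlines()[0].strip() if completion else ""
        if prediction == example["answer"].strip():
            correct += 1
    return {"MyTask": 100.0 * correct / len(data)}
```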
```bash
# Automatic 4-bit quantization for large models
python3 scripts/run_benchmarks.py \
  --model "meta-llama/Llama-2-70b-hf" \
  --quantization "4bit" \
  --batch_size 1
```
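For reference, 4-bit loading in the Hugging Face stack is normally configured through `BitsAndBytesConfig`; the `--quantization` flag is assumed to wrap a pattern like the one below (illustrative, the exact wiring inside eka-eval may differ):

```python
# Standard transformers + bitsandbytes 4-bit loading pattern (for reference only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_id = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # shard layers across available GPUs
)
```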
```bash
# Distributed evaluation across GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 scripts/run_benchmarks.py \
  --model "microsoft/DialoGPT-large" \
  --task_groups "ALL" \
  --num_gpus 4
```

The run summary is located at `results_output/calculated.csv`:
| Model | Size (B) | Task | Benchmark | Score | Timestamp | Status |
|---|---|---|---|---|---|---|
| gemma-2b | 2.00 | MATH AND REASONING | GSM8K | 42.3% | 2024-01-15T10:30:45 | Completed |
| gemma-2b | 2.00 | INDIC BENCHMARKS | BoolQ-IN | 67.8% | 2024-01-15T11:15:20 | Completed |
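The summary CSV can be loaded for quick model-to-model comparison. A small sketch assuming the columns shown above (file path from this section; everything else illustrative):

```python
# Pivot the run summary into a model x benchmark score matrix.
import pandas as pd

df = pd.read_csv("results_output/calculated.csv")
completed = df[df["Status"] == "Completed"]
print(completed.pivot_table(index="Model", columns="Benchmark", values="Score", aggfunc="first"))
```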
Per-benchmark detailed results are written to `results_output/detailed_results/`:
```json
{
  "question_id": 123,
  "question": "What is 2+2?",
  "correct_answer": "4",
  "predicted_answer": "4",
  "is_correct": true,
  "generated_text": "The answer is 4.",
  "prompt_used": "Question: What is 2+2?\nAnswer:"
}
```

Console summary table:

| Model | Task | Benchmark | Score |
|---|---|---|---|
| gemma-2b | MATH AND REASONING | GSM8K | 42.3% |
| gemma-2b | MATH AND REASONING | MATH | 12.1% |
| gemma-2b | INDIC BENCHMARKS | BoolQ-IN | 67.8% |
| gemma-2b | INDIC BENCHMARKS | MMLU-IN | 39.2% |

Per-language metrics for Indic benchmarks:

```json
{
  "BoolQ-IN": 67.8,
  "BoolQ-IN_hi": 65.2,
  "BoolQ-IN_bn": 70.1,
  "BoolQ-IN_en": 74.5,
  "BoolQ-IN_gu": 63.8
}
```
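Per-instance records like the one shown above are convenient for error analysis. A hypothetical sketch, assuming one JSON object per line and an illustrative file name:

```python
# Inspect incorrect predictions from a detailed results file (file name is illustrative).
import json
from pathlib import Path

detail_file = Path("results_output/detailed_results/gemma-2b_GSM8K.jsonl")
records = [json.loads(line) for line in detail_file.read_text().splitlines() if line.strip()]

wrong = [r for r in records if not r["is_correct"]]
print(f"{len(wrong)}/{len(records)} incorrect")
for r in wrong[:5]:
    print(r["question"], "->", r["predicted_answer"], "(expected:", r["correct_answer"] + ")")
```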
Common issues and fixes:

| Issue | Solution |
|---|---|
| `ModuleNotFoundError: eka_eval` | Run from the project root directory |
| CUDA out of memory | Reduce `generation_batch_size` or use quantization |
| Hugging Face 404 error | Check the model name and authentication |
| `code_eval` metric error | Set the `HF_ALLOW_CODE_EVAL=1` environment variable |
| Prompt template not found | Check that the prompt file exists in the correct category folder |
| Dataset loading failure | Verify the dataset name and internet connection |
```bash
# For large models (>7B parameters)
export CUDA_VISIBLE_DEVICES=0,1,2,3
python3 scripts/run_benchmarks.py \
  --model "meta-llama/Llama-2-70b-hf" \
  --quantization "4bit" \
  --batch_size 1 \
  --max_new_tokens 256

# For faster evaluation
python3 scripts/run_benchmarks.py \
  --model "google/gemma-2b" \
  --batch_size 16 \
  --max_examples 100  # Limit dataset size for testing

# Enable detailed logging
python3 scripts/run_benchmarks.py \
  --model "google/gemma-2b" \
  --log_level DEBUG \
  --save_detailed true
```

We welcome contributions from the community! Here's how you can help:
- Use our issue template
- Include error logs, model names, and reproduction steps
- Test with minimal examples when possible
- Propose new benchmarks or evaluation metrics
- Suggest performance improvements
- Request additional language support
```bash
# Fork the repository
git clone https://github.com/your-username/eka-eval.git
cd eka-eval

# Create feature branch
git checkout -b feature/amazing-feature

# Make changes and test
python3 -m pytest tests/

# Submit pull request
```

- Improve README examples
- Add benchmark documentation
- Create tutorial notebooks
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
If you use eka-eval in your research, please cite:
```bibtex
@software{eka_eval_2024,
  title={eka-eval: The Unified LLM Evaluation Framework for India and the World},
  author={EKA Team},
  year={2024},
  url={https://github.com/your-org/eka-eval},
  version={1.0}
}
```

- MMLU - Hendrycks et al., ICLR 2021
- GSM8K - Cobbe et al., 2021
- HumanEval - Chen et al., 2021
- BBH - Suzgun et al., 2022
- AGIEval - Zhong et al., 2023
- AI4Bharat - IndicNLP toolkit and datasets
- MILU - Multilingual Indic understanding
- IndicGLUE - Indic language evaluation
- Hugging Face Evaluate - Evaluation library
- LM Evaluation Harness - Alternative framework
- OpenCompass - Comprehensive LLM evaluation
- All code and evaluation protocols are freely available
- Transparent methodology with detailed documentation
- Community-driven development and improvement
- Fair and unbiased evaluation practices
- Privacy-preserving evaluation methods
- Responsible AI development guidelines
- Comprehensive coverage of Indian languages
- Cultural and linguistic sensitivity in evaluation
- Supporting the growth of Indic AI capabilities
- Reproducible evaluation protocols
- Standardized metrics and reporting
- Peer-reviewed benchmark implementations
Open • Ethical • Comprehensive • India-First
Website • Discord • Issues • Discussions