eka-eval logo

eka-eval.

The Unified LLM Evaluation Framework for India and the World.

Release v1.0 License Apache 2.0 Open In Colab Discord Community

Key Features • Supported Benchmarks • Getting Started • Reporting • Project Ethos


Overview

eka-eval is the official evaluation pipeline for the EKA project (eka.soket.ai), designed to provide comprehensive, fair, and transparent benchmarking for large language models (LLMs). Our framework supports both global and India-centric evaluations, with special emphasis on multilingual capabilities across Indian languages.

🎯 Why eka-eval?

  • 🌍 Global + India-First: Combines international benchmarks with India-specific evaluations
  • 🔬 Rigorous & Reproducible: Standardized evaluation protocols with detailed logging
  • 🚀 Production-Ready: Optimized for efficiency with quantization and multi-GPU support
  • 🔧 Extensible: Easy integration of custom benchmarks and evaluation logic
  • 📊 Transparent: Comprehensive reporting with detailed error analysis

Key Features

🎯 Comprehensive Benchmark Coverage

  • 17+ English Benchmarks: MMLU, GSM8K, HumanEval, ARC-Challenge, and more
  • 12+ Indic Benchmarks: MMLU-IN, BoolQ-IN, ARC-Challenge-IN, MILU, and others
  • Specialized Tasks: Code generation, mathematical reasoning, long-context understanding
  • Multi-modal Support: Text, code, and multilingual evaluation capabilities

🌍 Multilingual Excellence

  • 11 Indian Languages: Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, Telugu
  • Smart Language Handling: Automatic script recognition and Hindi-English letter mapping
  • Per-Language Metrics: Detailed breakdown of performance across languages

⚡ Performance & Scalability

  • Multi-GPU Support: Distributed evaluation across multiple GPUs
  • Quantization Ready: 4-bit/8-bit quantization for efficient large model evaluation
  • Batched Inference: Optimized throughput with configurable batch sizes
  • Memory Management: Smart resource cleanup and CUDA cache management

🔧 Developer Experience

  • Modular Architecture: Clean separation of concerns with extensible design
  • Prompt System: Template-based prompts with language-specific customization
  • Rich Configuration: JSON-based benchmark configs with validation
  • Detailed Logging: Comprehensive debug information and progress tracking

📊 Advanced Reporting

  • Multiple Output Formats: CSV summaries, JSONL details, console tables
  • Error Analysis: Per-instance results for debugging and improvement
  • Reproducibility: Timestamped results with full configuration tracking
  • Flexible Metrics: Accuracy, F1, BLEU, pass@k, and custom metrics

Installation

1. Clone the Repository

git clone https://github.com/your-org/eka-eval.git
cd eka-eval

2. Environment Setup

# Create virtual environment
python3 -m venv eka-env
source eka-env/bin/activate  # On Windows: eka-env\Scripts\activate

# Install dependencies
pip install -r requirements.txt

3. Required Dependencies

torch>=2.0.0
transformers>=4.35.0
datasets>=2.14.0
evaluate>=0.4.0
accelerate>=0.24.0
bitsandbytes>=0.41.0  # For quantization
pandas>=1.5.0
tqdm>=4.64.0
numpy>=1.24.0

4. Authentication (Optional)

For private models or gated datasets:

huggingface-cli login
# OR
export HF_TOKEN="your_hf_token_here"

🚀 Quick Start

Basic Evaluation

# Run interactive evaluation
python3 scripts/run_benchmarks.py

Command Line Examples

# Evaluate specific model on math benchmarks
python3 scripts/run_benchmarks.py \
    --model "google/gemma-2b" \
    --task_groups "MATH AND REASONING" \
    --benchmarks "GSM8K,MATH"

# Multi-language evaluation
python3 scripts/run_benchmarks.py \
    --model "sarvamai/sarvam-1" \
    --task_groups "INDIC BENCHMARKS" \
    --languages "hi,bn,gu"

# Code generation evaluation
python3 scripts/run_benchmarks.py \
    --model "microsoft/CodeT5-large" \
    --task_groups "CODE GENERATION" \
    --pass_k "1,5,10"

Standalone Benchmark Testing

# Test individual benchmarks
python eka_eval/benchmarks/tasks/math/gsm8k.py --model_name_test gpt2
python eka_eval/benchmarks/tasks/indic/boolq_in.py --target_languages_test hi en
python eka_eval/benchmarks/tasks/long_context/infinitebench.py --dataset_split_test longdialogue_qa_eng

🏗️ Project Structure

eka-eval/
├── 📦 eka_eval/                    # Core library
│   ├── 🧪 benchmarks/              # Evaluation logic
│   │   ├── tasks/
│   │   │   ├── 💻 code/            # HumanEval, MBPP, etc.
│   │   │   ├── 🧮 math/            # GSM8K, MATH, etc.
│   │   │   ├── 🌍 indic/           # Indic language benchmarks
│   │   │   ├── 🧠 reasoning/       # ARC, HellaSwag, etc.
│   │   │   ├── 📚 long_context/    # InfiniteBench, etc.
│   │   │   └── 🎯 general/         # MMLU, AGIEval, etc.
│   │   └── benchmark_registry.py
│   ├── ⚙️ core/                    # Model loading & evaluation
│   ├── 🔧 utils/                   # Utilities & helpers
│   └── 📋 config/                  # Benchmark configurations
├── 🚀 scripts/                     # Execution scripts
│   ├── run_benchmarks.py          # Main orchestrator
│   └── evaluation_worker.py       # Worker process logic
├── 📊 results_output/              # Evaluation results
├── 🎯 prompts/                     # Prompt templates
│   ├── math/                      # Math benchmark prompts
│   ├── indic/                     # Indic benchmark prompts
│   ├── general/                   # General benchmark prompts
│   └── long_context/              # Long context prompts
└── 📝 requirements.txt

Supported Benchmarks

🌍 Global Benchmarks

| Category | Benchmarks | Languages | Metrics |
|----------|------------|-----------|---------|
| 📚 Knowledge | MMLU, MMLU-Pro, TriviaQA, NaturalQuestions | English | Accuracy |
| 🧮 Mathematics | GSM8K, MATH, GPQA, ARC-Challenge | English | Accuracy |
| 💻 Code Generation | HumanEval, MBPP, HumanEval+, MBPP+ | Python, Multi-PL | pass@1, pass@k |
| 🧠 Reasoning | BBH, AGIEval, HellaSwag, WinoGrande | English | Accuracy |
| 📖 Reading | SQuAD, QuAC, BoolQ, XQuAD | English + Others | F1, EM, Accuracy |
| 📏 Long Context | InfiniteBench, ZeroSCROLLS, Needle-in-Haystack | English | Task-specific |
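
The code-generation rows above report pass@k. For reference, the standard unbiased estimator from the HumanEval paper (Chen et al., 2021) for n generated samples of which c pass the unit tests is 1 - C(n-c, k)/C(n, k); a minimal sketch of that formula (not eka-eval's internal implementation):

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n generations (c of which are correct) passes."""
    if n - c < k:
        return 1.0  # fewer than k failing samples, so any draw of k must contain a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 of them pass the tests
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))  # ~0.877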

🇮🇳 India-Centric Benchmarks

| Benchmark | Languages | Description | Metrics |
|-----------|-----------|-------------|---------|
| MMLU-IN | 11 Indic + EN | Knowledge understanding across subjects | Accuracy |
| BoolQ-IN | 11 Indic + EN | Yes/No question answering | Accuracy |
| ARC-Challenge-IN | 11 Indic + EN | Science reasoning questions | Accuracy |
| MILU | 11 Indic + EN | AI4Bharat's multilingual understanding | Accuracy |
| GSM8K-IN | Hindi, Others | Math word problems in Indian languages | Accuracy |
| IndicGenBench | Multiple | Generation tasks for Indic languages | Task-specific |
| Flores-IN | 22 Languages | Translation quality assessment | BLEU, ChrF |
| XQuAD-IN | 11 Languages | Cross-lingual reading comprehension | F1, EM |

Supported Languages

  • English: Primary evaluation language
  • Hindi (hi): देवनागरी script with smart character mapping
  • Bengali (bn): বাংলা script
  • Gujarati (gu): ગુજરાતી script
  • Kannada (kn): ಕನ್ನಡ script
  • Malayalam (ml): മലയാളം script
  • Marathi (mr): मराठी script
  • Odia (or): ଓଡ଼ିଆ script
  • Punjabi (pa): ਪੰਜਾਬੀ script
  • Tamil (ta): தமிழ் script
  • Telugu (te): తెలుగు script
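
For intuition, the "smart language handling" described earlier can be pictured as a Unicode-block lookup over the scripts listed above. The snippet below is an illustrative sketch only; the function name and the language-to-range mapping are ours, not eka-eval's API (Marathi, for instance, shares Devanagari with Hindi and is collapsed here).

# Hypothetical helper: guess a language code from the dominant script.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari (also used by Marathi)
    "bn": (0x0980, 0x09FF),  # Bengali
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
}

def guess_language(text: str) -> str:
    counts = {lang: 0 for lang in SCRIPT_RANGES}
    for ch in text:
        for lang, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= ord(ch) <= hi:
                counts[lang] += 1
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else "en"  # default to English

print(guess_language("यह एक परीक्षण है"))  # -> "hi"
print(guess_language("This is a test"))     # -> "en"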

🔧 Interactive Evaluation Workflow

1. Model Selection

Enter model source ('1' for Hugging Face, '2' for Local Path): 1
Enter Hugging Face model name: google/gemma-2b

2. Task Group Selection

--- Available Benchmark Task Groups ---
1. CODE GENERATION          7. MMLU
2. MATH AND REASONING       8. MMLU-Pro  
3. READING COMPREHENSION    9. IFEval
4. COMMONSENSE REASONING   10. BBH
5. WORLD KNOWLEDGE         11. AGIEval
6. LONG CONTEXT           12. INDIC BENCHMARKS

Select task group #(s): 2 12

3. Benchmark Selection

--- Select benchmarks for MATH AND REASONING ---
1. GSM8K                    4. ARC-Challenge
2. MATH                     5. ALL
3. GPQA                     6. SKIP

Select benchmark #(s): 1 2

4. Execution & Results

[Worker 0 (GPU 0)] Loading model: google/gemma-2b (2.0B parameters)
[Worker 0 (GPU 0)] Running GSM8K evaluation...
[Worker 0 (GPU 0)] GSM8K Accuracy: 42.3% (527/1247)
[Worker 0 (GPU 0)] Running MATH evaluation...
[Worker 0 (GPU 0)] MATH Accuracy: 12.1% (605/5000)

Results saved to: results_output/calculated.csv

🎯 Advanced Usage

Custom Benchmark Integration

1. Create Evaluation Function

# my_benchmark.py
def evaluate_my_task(pipe, tokenizer, model_name_for_logging, device, **kwargs):
    # Your evaluation logic here: generate predictions with `pipe` and score them.
    accuracy_score = 0.0  # replace with the metric you compute
    return {"MyTask": accuracy_score}

2. Add Prompt Configuration

// prompts/custom/my_task.json
{
  "my_task_0shot": {
    "template": "Question: {question}\nAnswer:",
    "description": "Zero-shot prompt for my task"
  },
  "default_few_shot_examples": [
    {"question": "Example question", "answer": "Example answer"}
  ]
}
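
For intuition, a template entry like my_task_0shot above is just a Python format string. A minimal rendering sketch follows, assuming the JSON above (without the path comment) has been saved to prompts/custom/my_task.json; this is illustrative, not eka-eval's actual prompt loader:

import json

with open("prompts/custom/my_task.json") as f:
    prompts = json.load(f)

template = prompts["my_task_0shot"]["template"]
few_shot = prompts["default_few_shot_examples"]

# Prepend the worked examples, then format the actual question.
shots = "\n\n".join(
    f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in few_shot
)
print(shots + "\n\n" + template.format(question="What is 2+2?"))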

3. Register in Config

# Add to benchmark_config.py
"MyTask": {
    "description": "My custom evaluation task",
    "evaluation_function": "my_project.my_benchmark.evaluate_my_task",
    "task_args": {
        "prompt_template_name_zeroshot": "my_task_0shot",
        "prompt_file_benchmark_key": "my_task",
        "prompt_file_category": "custom"
    }
}
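
The evaluation_function value is a dotted import path. Conceptually it can be resolved with importlib, roughly as sketched below (the general pattern, not eka-eval's actual registry code):

import importlib

def resolve_callable(dotted_path: str):
    """Split 'pkg.module.func' into module and attribute, then import it."""
    module_path, func_name = dotted_path.rsplit(".", 1)
    return getattr(importlib.import_module(module_path), func_name)

# evaluate_fn = resolve_callable("my_project.my_benchmark.evaluate_my_task")
# results = evaluate_fn(pipe, tokenizer, "my-model", "cuda:0", **task_args)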

Quantization & Optimization

# Automatic 4-bit quantization for large models
python3 scripts/run_benchmarks.py \
    --model "meta-llama/Llama-2-70b-hf" \
    --quantization "4bit" \
    --batch_size 1
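
Under the hood, 4-bit loading with the transformers/bitsandbytes versions pinned in requirements.txt typically looks like the following; this is a sketch of the general pattern, and eka-eval's own model loader may differ:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across the available GPUs
)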

Multi-GPU Evaluation

# Distributed evaluation across GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 scripts/run_benchmarks.py \
    --model "microsoft/DialoGPT-large" \
    --task_groups "ALL" \
    --num_gpus 4
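
Conceptually, multi-GPU evaluation shards the selected benchmarks across one worker process per GPU. The sketch below shows that idea only; it is not the actual scripts/evaluation_worker.py logic and assumes at least one CUDA device:

import torch
import torch.multiprocessing as mp

def worker(gpu_id, shards):
    # Each spawned process pins itself to one GPU and evaluates its shard.
    torch.cuda.set_device(gpu_id)
    for name in shards[gpu_id]:
        print(f"[Worker {gpu_id}] evaluating {name} on cuda:{gpu_id}")

if __name__ == "__main__":
    all_benchmarks = ["GSM8K", "MATH", "BoolQ-IN", "MMLU-IN"]
    num_gpus = torch.cuda.device_count()
    shards = [all_benchmarks[i::num_gpus] for i in range(num_gpus)]
    mp.spawn(worker, args=(shards,), nprocs=num_gpus)  # fn is called as fn(rank, *args)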

📊 Results and Reporting

📈 Aggregated Results (CSV)

Located at results_output/calculated.csv:

| Model | Size (B) | Task | Benchmark | Score | Timestamp | Status |
|-------|----------|------|-----------|-------|-----------|--------|
| gemma-2b | 2.00 | MATH AND REASONING | GSM8K | 42.3% | 2024-01-15T10:30:45 | Completed |
| gemma-2b | 2.00 | INDIC BENCHMARKS | BoolQ-IN | 67.8% | 2024-01-15T11:15:20 | Completed |
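
A quick way to slice the summary file with pandas, assuming the CSV columns match the headers shown above:

import pandas as pd

df = pd.read_csv("results_output/calculated.csv")

# Keep only the most recent run per (Model, Benchmark) pair.
latest = (df.sort_values("Timestamp")
            .drop_duplicates(["Model", "Benchmark"], keep="last"))
print(latest[["Model", "Task", "Benchmark", "Score"]])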

📋 Detailed Analysis (JSONL)

Per-benchmark detailed results in results_output/detailed_results/:

{
  "question_id": 123,
  "question": "What is 2+2?",
  "correct_answer": "4",
  "predicted_answer": "4", 
  "is_correct": true,
  "generated_text": "The answer is 4.",
  "prompt_used": "Question: What is 2+2?\nAnswer:"
}
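
These per-instance records make error analysis straightforward. A small sketch, assuming one JSON object per line with the fields shown above (the file name is hypothetical):

import json

failures = []
with open("results_output/detailed_results/gsm8k_details.jsonl") as f:  # hypothetical file name
    for line in f:
        record = json.loads(line)
        if not record["is_correct"]:
            failures.append(record)

print(f"{len(failures)} incorrect predictions")
for r in failures[:3]:
    print(f"{r['question']} -> {r['predicted_answer']} (expected {r['correct_answer']})")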

🖥️ Console Output

| Model    | Task               | Benchmark | Score |
|----------|--------------------|-----------|-------|
| gemma-2b | MATH AND REASONING | GSM8K     | 42.3% |
| gemma-2b | MATH AND REASONING | MATH      | 12.1% |
| gemma-2b | INDIC BENCHMARKS   | BoolQ-IN  | 67.8% |
| gemma-2b | INDIC BENCHMARKS   | MMLU-IN   | 39.2% |

📊 Language-Specific Metrics

{
  "BoolQ-IN": 67.8,
  "BoolQ-IN_hi": 65.2,
  "BoolQ-IN_bn": 70.1,
  "BoolQ-IN_en": 74.5,
  "BoolQ-IN_gu": 63.8
}
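
These per-language keys are simple grouped accuracies. Conceptually (a sketch; the presence of a language field on each detailed record is our assumption, not a documented format):

from collections import defaultdict

def per_language_accuracy(records, benchmark="BoolQ-IN"):
    """records: dicts with 'language' and 'is_correct' keys (assumed fields)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["language"]] += 1
        correct[r["language"]] += int(r["is_correct"])
    scores = {f"{benchmark}_{lang}": 100.0 * correct[lang] / total[lang]
              for lang in total}
    scores[benchmark] = 100.0 * sum(correct.values()) / sum(total.values())
    return scores

print(per_language_accuracy([
    {"language": "hi", "is_correct": True},
    {"language": "hi", "is_correct": False},
    {"language": "bn", "is_correct": True},
]))  # per-language scores plus the overall 'BoolQ-IN' average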

⚠️ Troubleshooting

Common Issues & Solutions

| Issue | Solution |
|-------|----------|
| 🔴 ModuleNotFoundError: eka_eval | Run from the project root directory |
| 🔴 CUDA Out of Memory | Reduce generation_batch_size or use quantization |
| 🔴 Hugging Face 404 Error | Check model name and authentication |
| 🔴 code_eval metric error | Set the HF_ALLOW_CODE_EVAL=1 environment variable |
| 🔴 Prompt template not found | Check that the prompt file exists in the correct category folder |
| 🔴 Dataset loading failure | Verify the dataset name and internet connection |

Performance Optimization

# For large models (>7B parameters)
export CUDA_VISIBLE_DEVICES=0,1,2,3
python3 scripts/run_benchmarks.py \
    --model "meta-llama/Llama-2-70b-hf" \
    --quantization "4bit" \
    --batch_size 1 \
    --max_new_tokens 256

# For faster evaluation
python3 scripts/run_benchmarks.py \
    --model "google/gemma-2b" \
    --batch_size 16 \
    --max_examples 100  # Limit dataset size for testing

Debug Mode

# Enable detailed logging
python3 scripts/run_benchmarks.py \
    --model "google/gemma-2b" \
    --log_level DEBUG \
    --save_detailed true

🤝 Contributing

We welcome contributions from the community! Here's how you can help:

🐛 Bug Reports

  • Use our issue template
  • Include error logs, model names, and reproduction steps
  • Test with minimal examples when possible

✨ Feature Requests

  • Propose new benchmarks or evaluation metrics
  • Suggest performance improvements
  • Request additional language support

🔧 Development

# Fork the repository
git clone https://github.com/your-username/eka-eval.git
cd eka-eval

# Create feature branch
git checkout -b feature/amazing-feature

# Make changes and test
python3 -m pytest tests/

# Submit pull request

📚 Documentation

  • Improve README examples
  • Add benchmark documentation
  • Create tutorial notebooks

📄 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.


📚 Citation

If you use eka-eval in your research, please cite:

@software{eka_eval_2024,
  title={eka-eval: The Unified LLM Evaluation Framework for India and the World},
  author={EKA Team},
  year={2024},
  url={https://github.com/your-org/eka-eval},
  version={1.0}
}

🔗 References & Resources

Official Benchmark Papers

  • MMLU - Hendrycks et al., ICLR 2021
  • GSM8K - Cobbe et al., 2021
  • HumanEval - Chen et al., 2021
  • BBH - Suzgun et al., 2022
  • AGIEval - Zhong et al., 2023

Indic Language Resources

  • AI4Bharat - IndicNLP toolkit and datasets
  • MILU - Multilingual Indic understanding
  • IndicGLUE - Indic language evaluation

Related Projects


🌟 Project Ethos

🔓 Open Source

  • All code and evaluation protocols are freely available
  • Transparent methodology with detailed documentation
  • Community-driven development and improvement

⚖️ Ethical AI

  • Fair and unbiased evaluation practices
  • Privacy-preserving evaluation methods
  • Responsible AI development guidelines

🇮🇳 India-First Approach

  • Comprehensive coverage of Indian languages
  • Cultural and linguistic sensitivity in evaluation
  • Supporting the growth of Indic AI capabilities

🔬 Scientific Rigor

  • Reproducible evaluation protocols
  • Standardized metrics and reporting
  • Peer-reviewed benchmark implementations

🚀 eka-eval: Powering the Future of AI Evaluation 🚀

Open • Ethical • Comprehensive • India-First

🌍 Website • 💬 Discord • 🐛 Issues • 💡 Discussions
