
LLM Evaluation for C++ Code Generation

This repository provides a comprehensive framework to evaluate large language models (LLMs) on C++ programming tasks using multiple evaluation strategies.

Our research focuses on measuring how effective different LLMs and retrieval/augmentation techniques are at:

  • Generating complete C++ solutions from problem descriptions
  • Solving advanced multi-class architecture challenges
  • Handling varying levels of problem complexity and instruction detail


Benchmarks

We evaluated all test benches against two main benchmarks:

  1. Exercism Benchmark: 80+ standard C++ coding exercises from Exercism.org covering algorithms, data structures, and OOP principles.
  2. Advanced Benchmark: 8 complex multi-class C++ architecture problems requiring deeper understanding of system design and interactions.

The Exercism benchmark tests basic coding skills, while the Advanced benchmark challenges LLMs on design patterns, class interactions, and system-level thinking. For the Exercism benchmark we tested only the plain LLM models; for the Advanced benchmark we tested the RAG, Function Calling, and Agent Mode strategies.

Test Benches

Models Evaluated

  • GPT-4o (single-shot and multi-shot with 3 attempts)
  • GPT-4.1 (single-shot and multi-shot with 3 attempts)

Benchmark Configurations

We evaluated the following LLM configurations:

Exercism Benchmark (80+ exercises):

| Model   | Multishot Attempts |
|---------|--------------------|
| GPT 4.1 | 1                  |
| GPT 4.1 | 3                  |
| GPT 4o  | 1                  |
| GPT 4o  | 3                  |

Advanced Benchmark (8 exercises):

| Model   | Strategy         | Multishot Attempts | Instruction Detail |
|---------|------------------|--------------------|--------------------|
| GPT 4.1 | RAG              | 3                  | Simple             |
| GPT 4o  | RAG              | 3                  | Simple             |
| GPT 4.1 | RAG              | 3                  | Detailed           |
| GPT 4o  | RAG              | 3                  | Detailed           |
| GPT 4.1 | Function Calling | 3                  | Simple             |
| GPT 4o  | Function Calling | 3                  | Simple             |
| GPT 4.1 | Function Calling | 3                  | Detailed           |
| GPT 4o  | Function Calling | 3                  | Detailed           |
| GPT 4o  | Agent Mode       | 3                  | Simple             |
| GPT 4.1 | Agent Mode       | 3                  | Simple             |
| GPT 4o  | Agent Mode       | 3                  | Detailed           |
| GPT 4.1 | Agent Mode       | 3                  | Detailed           |

📂 Repository Structure

llm-cpp-eval/
│
├── benchmark/                           # All evaluation benchmarks
│   ├── excercism/                       # 80+ Exercism C++ exercises
│   │   ├── hello-world/
│   │   ├── binary-search/
│   │   └── [70+ more exercises]
│   │
│   ├── advanced/                        # 8 complex multi-class challenges
│   │   ├── multi_class_network_stack/   # Network architecture with connection management
│   │   ├── multi_class_compiler_system/ # Lexer, parser, code generator
│   │   ├── multi_class_database_system/ # Query engine with optimization
│   │   ├── multi_class_design/          # OOP design patterns
│   │   ├── multi_class_event_system/    # Event handling architecture
│   │   ├── multi_class_image_processor/ # Image processing pipeline
│   │   ├── multi_class_media_processing_system/
│   │   └── multi_class_design_two/
│   │
│   └── azure_search_docs_exercise_pairing.csv  # Ground truth for RAG retrieval
│
├── src/                                 # Evaluation scripts
│   ├── evaluate_excercism.py            # Exercism benchmark evaluation
│   ├── evaluate_advanced_rag.py         # RAG evaluation on advanced exercises
│   ├── evaluate_advanced_function_calling.py  # Function calling evaluation
│   ├── evaluate_advanced_agent_mode.py  # Agent mode evaluation (in progress)
│   ├── generate_exercise_doc_ground_truth.py  # Create RAG retrieval ground truth
│   └── utils/
│       ├── function_calling.py          # Function calling utilities
│       └── [other helper modules]
│
├── report/                              # Report generation
│   ├── create_report.py                 # HTML/PDF report generation
│   ├── charts_and_tables.py             # Visualization functions
│   ├── exercism_analysis.py             # Analysis for exercism results
│   ├── report.html                      # Generated HTML report
│   ├── report.pdf                       # Generated PDF report
│   └── charts/                          # Generated chart images
│
├── results/                             # Evaluation results
│   ├── excercism/                       # Exercism evaluation results
│   ├── advanced/                        # Advanced exercise results
│   ├── *_evaluation_*.csv               # Summary CSV files
│   └── *_details_*.json                 # Detailed evaluation data
│
├── analysis/                            # Jupyter notebooks
│   ├── excercism_analysis.ipynb         # Exercism data analysis
│   └── finetune.ipynb                   # Fine-tuning analysis
│
├── data/                                # Ground truth and reference data
│   └── truth/                           # Truth datasets for validation
│
├── requirements.txt                     # Python dependencies
├── create_metadata.py                   # Metadata generation for exercises
├── clear_metadata.py                    # Metadata cleanup utilities
└── README.md                            # This file

🚀 Getting Started

1. Install Dependencies

python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

2. Configure Azure AI Foundry and Azure Search

Make sure you have Azure AI Foundry and Azure Search resources on the Azure platform, with the LLMs you want to evaluate (e.g., GPT-4o, GPT-4.1) deployed.

Set required environment variables in a .env file:

# Azure OpenAI API Configuration
AZURE_API_KEY="your_azure_api_key"
AZURE_ENDPOINT="https://your-resource.openai.azure.com/openai/deployments/your-deployment/chat/completions?api-version=2024-05-01-preview"
AZURE_MODEL_NAME="gpt-4o"  # or gpt-4.1

# Azure Search Configuration (for RAG)
AZURE_SEARCH_ENDPOINT="https://your-search-service.search.windows.net"
AZURE_SEARCH_API_KEY="your_search_api_key"

# Azure OpenAI Embeddings (for RAG)
AZURE_OPENAI_EMBEDDING_ENDPOINT="https://your-resource.openai.azure.com/"
AZURE_OPENAI_EMBEDDING_API_KEY="your_embedding_api_key"
AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-3-small"

# LLM Configuration
LLM_TEMP=0.2                    # Temperature for generation
LLM_MAX_TOKENS=1024            # Maximum tokens per response

# Define the LLM models to evaluate here; these are picked up automatically by the evaluation scripts.
# e.g., for GPT 4.1 you can add:
GPT_4_1_AZURE_API_KEY=your_azure_api_key
GPT_4_1_AZURE_ENDPOINT=https://<azure_ai_foundry_name>-foundry.cognitiveservices.azure.com/openai/v1/
GPT_4_1_AZURE_MODEL_NAME=gpt-4.1

# For GPT 4o you can add:
GPT_4O_AZURE_API_KEY=your_azure_api_key
GPT_4O_AZURE_ENDPOINT=https://<azure_ai_foundry_name>-foundry
GPT_4O_AZURE_MODEL_NAME=gpt-4o
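
As a minimal sketch of how an evaluation script might consume these variables, the snippet below loads the .env file with python-dotenv and calls one configured model through the openai client (this assumes the OpenAI-compatible /openai/v1/ endpoint style shown above; the make_client helper and the prompt are illustrative, not the repository's actual code):

```python
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # read the .env file into the process environment


def make_client(prefix: str) -> tuple[OpenAI, str]:
    """Hypothetical helper: build a client for one configured model prefix."""
    client = OpenAI(
        base_url=os.environ[f"{prefix}_AZURE_ENDPOINT"],
        api_key=os.environ[f"{prefix}_AZURE_API_KEY"],
    )
    return client, os.environ[f"{prefix}_AZURE_MODEL_NAME"]


client, model = make_client("GPT_4_1")
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Write a C++ hello world program."}],
    temperature=float(os.getenv("LLM_TEMP", "0.2")),
    max_tokens=int(os.getenv("LLM_MAX_TOKENS", "1024")),
)
print(response.choices[0].message.content)
```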

3. Run Evaluations

Exercism Benchmark

Evaluate the model on 80+ standard C++ coding exercises from Exercism:

python src/evaluate_excercism.py

Advanced Multi-Class Exercises (RAG)

Evaluate on 8 complex architecture problems using Retrieval-Augmented Generation:

python src/evaluate_advanced_rag.py
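
As a rough illustration of what the RAG strategy does, the sketch below pulls documentation snippets from Azure AI Search and prepends them to the exercise prompt; the index name and the content field are assumptions for the example, not the repository's actual configuration:

```python
import os

from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="cpp-docs",  # hypothetical index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_API_KEY"]),
)


def build_rag_prompt(task_description: str, top: int = 3) -> str:
    # Retrieve the most relevant documentation chunks for this exercise
    hits = search.search(search_text=task_description, top=top)
    context = "\n\n".join(doc["content"] for doc in hits)  # "content" field is assumed
    return f"Relevant documentation:\n{context}\n\nTask:\n{task_description}"
```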

Advanced Exercises (Function Calling)

Evaluate using LLM function calling to retrieve relevant documentation:

python src/evaluate_advanced_function_calling.py
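
Conceptually, the function-calling strategy exposes a documentation-search tool that the model can invoke while solving an exercise. The sketch below shows what such a tool definition might look like with the openai client; the tool name, parameters, and prompt are hypothetical:

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url=os.environ["GPT_4_1_AZURE_ENDPOINT"],
    api_key=os.environ["GPT_4_1_AZURE_API_KEY"],
)

# Hypothetical tool the model may call to look up documentation
tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",
        "description": "Search the C++ documentation index for a topic.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

response = client.chat.completions.create(
    model=os.environ["GPT_4_1_AZURE_MODEL_NAME"],
    messages=[{"role": "user", "content": "Implement the multi_class_event_system exercise."}],
    tools=tools,
)
if response.choices[0].message.tool_calls:
    # The model asked for documentation; the caller would now run the search
    print(response.choices[0].message.tool_calls[0].function.arguments)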

Advanced Exercises (Agent Mode)

Evaluate using multi-turn agent reasoning:

python src/evaluate_advanced_agent_mode.py
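
Agent mode differs from single-pass function calling mainly in that the model runs in a loop, with tool results fed back until it stops requesting actions. A highly simplified sketch of such a loop (the execute_tool callback and turn budget are assumptions, not the repository's implementation):

```python
import json


def run_agent(client, model, tools, messages, execute_tool, max_turns=10):
    """Loop until the model returns a final answer or the turn budget is spent."""
    for _ in range(max_turns):
        reply = client.chat.completions.create(model=model, messages=messages, tools=tools)
        msg = reply.choices[0].message
        messages.append(msg)
        if not msg.tool_calls:
            return msg.content  # final answer, e.g. the generated C++ code
        for call in msg.tool_calls:
            result = execute_tool(call.function.name, json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return None
```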

4. View Results

Results are automatically saved to the results/ directory with timestamps. In these directories you will find CSV summary files and detailed JSON metadata files for each evaluation run.
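
For quick inspection outside the report pipeline, the summary CSVs can be loaded directly; the file pattern and the passed column below are assumptions for illustration and may not match the actual schema:

```python
import glob

import pandas as pd

# Pick the most recent summary CSV (naming pattern assumed from the results/ layout)
latest = sorted(glob.glob("results/*_evaluation_*.csv"))[-1]
df = pd.read_csv(latest)

print(df.head())
print("Pass rate:", df["passed"].mean())  # assumes a boolean/0-1 'passed' column
```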

5. Generate Reports

Create comprehensive HTML and PDF reports with charts and analysis:

python report/create_report.py

This will generate:

  • report/report.html - Interactive HTML report
  • report/report.pdf - PDF version with charts
  • report/charts/ - Individual chart files

🧪 Benchmark Details

Exercism Benchmark (80+ exercises)

Standard C++ programming challenges from Exercism.org, covering:

  • Basic algorithms (sorting, searching, prime numbers)
  • Data structures (linked lists, trees, graphs)
  • String manipulation and parsing
  • Mathematical operations
  • OOP principles

Each exercise includes:

  • Problem description
  • Test file with multiple assertions
  • Automated test execution and validation
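
As a conceptual sketch of the compile-and-test step (not the repository's actual harness; file names and compiler flags are illustrative), a generated solution could be validated along these lines:

```python
import subprocess
import tempfile
from pathlib import Path


def compile_and_test(solution_cpp: str, exercise_dir: Path) -> bool:
    """Compile the generated solution with the exercise's test file and
    report whether the resulting test binary exits successfully."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "solution.cpp"
        src.write_text(solution_cpp)
        binary = Path(tmp) / "test_runner"
        build = subprocess.run(
            ["g++", "-std=c++17", str(src), str(exercise_dir / "test.cpp"), "-o", str(binary)],
            capture_output=True,
        )
        if build.returncode != 0:
            return False  # compilation failure counts as a failed attempt
        run = subprocess.run([str(binary)], capture_output=True, timeout=30)
        return run.returncode == 0
```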

Advanced Benchmark (8 exercises)

Complex multi-class C++ architecture problems requiring:

  • Understanding of design patterns
  • Knowledge of system interactions
  • Implementation of multiple coordinated classes
  • Proper separation of concerns

Exercises include:

  1. multi_class_network_stack - Connection management, protocol handling
  2. multi_class_compiler_system - Lexical analysis, parsing, code generation
  3. multi_class_database_system - Query optimization, index management
  4. multi_class_design - SOLID principles, design patterns
  5. multi_class_event_system - Event handling, observer pattern
  6. multi_class_image_processor - Image processing pipeline
  7. multi_class_media_processing_system - Media codec handling
  8. multi_class_design_two - Additional OOP scenarios
