This repository provides a comprehensive framework to evaluate large language models (LLMs) on C++ programming tasks using multiple evaluation strategies.
Our research measures how effective different LLMs and retrieval/augmentation techniques are at:
- Generating complete C++ solutions from problem descriptions
- Solving advanced multi-class architecture challenges
- Handling varying levels of problem complexity and instruction detail
We evaluated every configuration against two main benchmarks:
- Exercism Benchmark: 80+ standard C++ coding exercises from Exercism.org covering algorithms, data structures, and OOP principles.
- Advanced Benchmark: 8 complex multi-class C++ architecture problems requiring deeper understanding of system design and interactions.
The Exercism benchmark tests basic coding skills, while the Advanced benchmark challenges LLMs on design patterns, class interactions, and system-level thinking. For the Exercism benchmark we tested only the plain LLMs; for the Advanced benchmark we also tested the RAG, Function Calling, and Agent Mode strategies.
We evaluated the following LLM configurations:
- GPT-4o (single-shot and multi-shot with 3 attempts)
- GPT-4.1 (single-shot and multi-shot with 3 attempts)
Exercism Benchmark (80+ exercises):
| Model | Multishot Attempts |
|---|---|
| GPT 4.1 | 1 |
| GPT 4.1 | 3 |
| GPT 4o | 1 |
| GPT 4o | 3 |
Advanced Benchmark (8 exercises):
| Model | Strategy | Multishot Attempts | Instruction Detail |
|---|---|---|---|
| GPT 4.1 | RAG | 3 | Simple |
| GPT 4o | RAG | 3 | Simple |
| GPT 4.1 | RAG | 3 | Detailed |
| GPT 4o | RAG | 3 | Detailed |
| GPT 4.1 | Function Calling | 3 | Simple |
| GPT 4o | Function Calling | 3 | Simple |
| GPT 4.1 | Function Calling | 3 | Detailed |
| GPT 4o | Function Calling | 3 | Detailed |
| GPT 4o | Agent Mode | 3 | Simple |
| GPT 4.1 | Agent Mode | 3 | Simple |
| GPT 4o | Agent Mode | 3 | Detailed |
| GPT 4.1 | Agent Mode | 3 | Detailed |
llm-cpp-eval/
│
├── benchmark/                      # All evaluation benchmarks
│   ├── excercism/                  # 80+ Exercism C++ exercises
│   │   ├── hello-world/
│   │   ├── binary-search/
│   │   └── [70+ more exercises]
│   │
│   ├── advanced/                   # 8 complex multi-class challenges
│   │   ├── multi_class_network_stack/            # Network architecture with connection management
│   │   ├── multi_class_compiler_system/          # Lexer, parser, code generator
│   │   ├── multi_class_database_system/          # Query engine with optimization
│   │   ├── multi_class_design/                   # OOP design patterns
│   │   ├── multi_class_event_system/             # Event handling architecture
│   │   ├── multi_class_image_processor/          # Image processing pipeline
│   │   ├── multi_class_media_processing_system/
│   │   └── multi_class_design_two/
│   │
│   └── azure_search_docs_exercise_pairing.csv    # Ground truth for RAG retrieval
│
├── src/                            # Evaluation scripts
│   ├── evaluate_excercism.py       # Exercism benchmark evaluation
│   ├── evaluate_advanced_rag.py    # RAG evaluation on advanced exercises
│   ├── evaluate_advanced_function_calling.py     # Function calling evaluation
│   ├── evaluate_advanced_agent_mode.py           # Agent mode evaluation (in progress)
│   ├── generate_exercise_doc_ground_truth.py     # Create RAG retrieval ground truth
│   └── utils/
│       ├── function_calling.py     # Function calling utilities
│       └── [other helper modules]
│
├── report/                         # Report generation
│   ├── create_report.py            # HTML/PDF report generation
│   ├── charts_and_tables.py        # Visualization functions
│   ├── exercism_analysis.py        # Analysis for exercism results
│   ├── report.html                 # Generated HTML report
│   ├── report.pdf                  # Generated PDF report
│   └── charts/                     # Generated chart images
│
├── results/                        # Evaluation results
│   ├── excercism/                  # Exercism evaluation results
│   ├── advanced/                   # Advanced exercise results
│   ├── *_evaluation_*.csv          # Summary CSV files
│   └── *_details_*.json            # Detailed evaluation data
│
├── analysis/                       # Jupyter notebooks
│   ├── excercism_analysis.ipynb    # Exercism data analysis
│   └── finetune.ipynb              # Fine-tuning analysis
│
├── data/                           # Ground truth and reference data
│   └── truth/                      # Truth datasets for validation
│
├── requirements.txt                # Python dependencies
├── create_metadata.py              # Metadata generation for exercises
├── clear_metadata.py               # Metadata cleanup utilities
└── README.md                       # This file
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
Make sure you have Azure AI Foundry and Azure Search resources on the Azure platform, with the LLMs you want to evaluate (e.g., GPT-4o, GPT-4.1) deployed and configured.
Set required environment variables in a .env file:
# Azure OpenAI API Configuration
AZURE_API_KEY="your_azure_api_key"
AZURE_ENDPOINT="https://your-resource.openai.azure.com/openai/deployments/your-deployment/chat/completions?api-version=2024-05-01-preview"
AZURE_MODEL_NAME="gpt-4o" # or gpt-4.1
# Azure Search Configuration (for RAG)
AZURE_SEARCH_ENDPOINT="https://your-search-service.search.windows.net"
AZURE_SEARCH_API_KEY="your_search_api_key"
# Azure OpenAI Embeddings (for RAG)
AZURE_OPENAI_EMBEDDING_ENDPOINT="https://your-resource.openai.azure.com/"
AZURE_OPENAI_EMBEDDING_API_KEY="your_embedding_api_key"
AZURE_OPENAI_EMBEDDING_MODEL="text-embedding-3-small"
# LLM Configuration
LLM_TEMP=0.2 # Temperature for generation
LLM_MAX_TOKENS=1024 # Maximum tokens per response
# Define the LLM models to evaluate here; they are picked up automatically by the evaluation scripts.
# e.g. for GPT 4.1 you can add:
GPT_4_1_AZURE_API_KEY=your_azure_api_key
GPT_4_1_AZURE_ENDPOINT=https://<azure_ai_foundry_name>-foundry.cognitiveservices.azure.com/openai/v1/
GPT_4_1_AZURE_MODEL_NAME=gpt-4.1
# For GPT 4o you can add:
GPT_4O_AZURE_API_KEY=your_azure_api_key
GPT_4O_AZURE_ENDPOINT=https://<azure_ai_foundry_name>-foundry.cognitiveservices.azure.com/openai/v1/
GPT_4O_AZURE_MODEL_NAME=gpt-4o
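To sanity-check this configuration before running a full evaluation, the snippet below is a minimal sketch (not part of the repository) that loads the .env file with python-dotenv and sends a single request to the Azure OpenAI chat completions endpoint; the prompt is just an example.

```python
# sanity_check.py - minimal configuration check, assuming python-dotenv and requests are installed
import os
import requests
from dotenv import load_dotenv

load_dotenv()  # read variables from the .env file in the working directory

# AZURE_ENDPOINT already contains the deployment name and api-version query string
endpoint = os.environ["AZURE_ENDPOINT"]
headers = {"api-key": os.environ["AZURE_API_KEY"], "Content-Type": "application/json"}
payload = {
    "messages": [
        {"role": "system", "content": "You are a C++ programming assistant."},
        {"role": "user", "content": "Write a C++ function that reverses a std::string."},
    ],
    "temperature": float(os.getenv("LLM_TEMP", "0.2")),
    "max_tokens": int(os.getenv("LLM_MAX_TOKENS", "1024")),
}

response = requests.post(endpoint, headers=headers, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```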
Evaluate the model on 80+ standard C++ coding exercises from Exercism:
python src/evaluate_excercism.py
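Conceptually, this evaluation iterates over the exercise directories, asks the model for a solution, and stores the generated code. The sketch below illustrates that loop under assumed file names (instructions.md, solution.cpp); see src/evaluate_excercism.py for the actual implementation.

```python
# Rough sketch of the per-exercise generation loop; the file names below
# (instructions.md, solution.cpp) are assumptions for illustration only.
from pathlib import Path

BENCHMARK_DIR = Path("benchmark/excercism")

def generate_solutions(ask_llm):
    """ask_llm: callable taking a prompt string and returning generated C++ code."""
    for exercise in sorted(BENCHMARK_DIR.iterdir()):
        if not exercise.is_dir():
            continue
        description = (exercise / "instructions.md").read_text()
        prompt = f"Solve the following C++ exercise:\n\n{description}"
        code = ask_llm(prompt)
        (exercise / "solution.cpp").write_text(code)
```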
Evaluate on 8 complex architecture problems using Retrieval-Augmented Generation:
python src/evaluate_advanced_rag.py
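The RAG strategy augments each prompt with documentation snippets retrieved from the Azure Search index configured above. A minimal retrieval sketch using the azure-search-documents package is shown below; the index name cpp-docs and the content field are assumptions, not necessarily the names used by the script.

```python
# Minimal retrieval sketch using azure-search-documents; the index name
# "cpp-docs" and the "content" field are assumptions for illustration.
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="cpp-docs",  # assumed index name
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_API_KEY"]),
)

def retrieve_context(query: str, top: int = 3) -> str:
    """Return the top matching documentation snippets joined into one context block."""
    results = search_client.search(search_text=query, top=top)
    return "\n\n".join(doc["content"] for doc in results)

context = retrieve_context("How do I manage TCP connections in a C++ network stack?")
```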
Evaluate using LLM function calling to retrieve relevant documentation:
python src/evaluate_advanced_function_calling.py
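With function calling, the model itself decides when to fetch documentation through a tool definition passed along with the request. The illustrative tool schema below shows the general shape; the real schema lives in src/utils/function_calling.py and may differ.

```python
# Illustrative tool definition for documentation lookup; the name and parameters
# are examples, not the schema actually used by src/utils/function_calling.py.
search_docs_tool = {
    "type": "function",
    "function": {
        "name": "search_documentation",
        "description": "Search the indexed C++ documentation for relevant snippets.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query describing the needed API or concept.",
                },
            },
            "required": ["query"],
        },
    },
}
# Passed to the chat completions request as: tools=[search_docs_tool]
```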
Evaluate using multi-turn agent reasoning:
python src/evaluate_advanced_agent_mode.py
Results are automatically saved to the results/ directory with timestamps.
In these directories, you will find a CSV summary file and a detailed JSON metadata file for each evaluation run.
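For a quick look at a run without regenerating the full report, the summary CSVs can be inspected directly. The sketch below assumes pandas is installed, that the timestamped file names sort lexicographically, and that the summary contains a passed column.

```python
# Quick inspection of the most recent summary CSV; the "passed" column is an assumption.
import glob
import pandas as pd

latest = sorted(glob.glob("results/**/*_evaluation_*.csv", recursive=True))[-1]
df = pd.read_csv(latest)
print(df.head())
print(f"Pass rate: {df['passed'].mean():.1%}")
```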
Create comprehensive HTML and PDF reports with charts and analysis:
python report/create_report.py
This will generate:
- report/report.html - Interactive HTML report
- report/report.pdf - PDF version with charts
- report/charts/ - Individual chart images
Standard C++ programming challenges from Exercism.org, covering:
- Basic algorithms (sorting, searching, prime numbers)
- Data structures (linked lists, trees, graphs)
- String manipulation and parsing
- Mathematical operations
- OOP principles
Each exercise includes:
- Problem description
- Test file with multiple assertions
- Automated test execution and validation
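Validation amounts to compiling the generated solution together with the exercise's test file and checking the exit code. A minimal sketch of that step, with assumed file names and compiler flags, looks like this:

```python
# Compile the generated solution together with the exercise tests and run them.
# File names (solution.cpp, tests.cpp) and g++ flags are assumptions for illustration.
import subprocess
from pathlib import Path

def run_exercise_tests(exercise_dir: Path) -> bool:
    binary = exercise_dir / "test_runner"
    compile_cmd = [
        "g++", "-std=c++17",
        str(exercise_dir / "solution.cpp"),
        str(exercise_dir / "tests.cpp"),
        "-o", str(binary),
    ]
    if subprocess.run(compile_cmd, capture_output=True).returncode != 0:
        return False  # compilation failure counts as a failed exercise
    result = subprocess.run([str(binary)], capture_output=True, timeout=30)
    return result.returncode == 0  # non-zero exit means at least one assertion failed
```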
Complex multi-class C++ architecture problems requiring:
- Understanding of design patterns
- Knowledge of system interactions
- Implementation of multiple coordinated classes
- Proper separation of concerns
Exercises include:
- multi_class_network_stack - Connection management, protocol handling
- multi_class_compiler_system - Lexical analysis, parsing, code generation
- multi_class_database_system - Query optimization, index management
- multi_class_design - SOLID principles, design patterns
- multi_class_event_system - Event handling, observer pattern
- multi_class_image_processor - Image processing pipeline
- multi_class_media_processing_system - Media codec handling
- multi_class_design_two - Additional OOP scenarios