Writing clean code is not just about making programs work; it is also about giving variables names that clearly communicate their purpose. This project explores training Llama 3.1 8B to understand and recover missing variable names based on the context of functions in C++.
- Create
.envfile with your Hugging Face token:
echo "HF_TOKEN=hf_your_token_here" > .env- Create and activate conda environment:
conda env create -f environment.yml
conda activate stack-cpp-extract- Run the extraction script:
python extract_functions.pyOutput is saved to data/data_cpp.jsonl (sample in example_output.jsonl).
# Test with 2 samples
python run_inference.py --data example_output.jsonl --max-samples 2
# Check outputs
cat llm_responses.jsonl
cat evaluation_results.jsonNote: First run downloads Llama 3.1 8B (~16GB). Requires GPU with sufficient VRAM.
# Run on sample data (5 examples)
python run_inference.py --data example_output.jsonl
# Run on full dataset
python run_inference.py --data data/data_cpp.jsonl
# Customize parameters
python run_inference.py \
--data data/data_cpp.jsonl \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--max-samples 100 \
--temperature 0.0 \
--tensor-parallel-size 1from data_pipeline import DataPipeline
from evaluate_predictions import PredictionEvaluator
# Load data
pipeline = DataPipeline('example_output.jsonl')
pipeline.load_data()
input_texts, target_texts = pipeline.get_separate_arrays()
# Get LLM responses (see run_inference.py for full implementation)
# llm_responses = [llm(text) for text in input_texts]
# Evaluate
evaluator = PredictionEvaluator()
cleaned = evaluator.prepare_llm_responses(llm_responses)
metrics = evaluator.evaluate(cleaned, target_texts)
evaluator.print_report(metrics)Each JSONL entry contains:
input_text: Code with masked variables (<ID_1>,<ID_2>, etc.)target_text: JSON mapping of placeholders to actual variable names
LLM should output JSON: {"<ID_1>": "varName", "<ID_2>": "anotherVar"}
Exact string matching: 1 point for perfect match, 0 otherwise. Final accuracy = correct_count / total_trials.