NeuralVariableNameRepair

Writing clean code is not just about making programs work; it is also about giving variables names that clearly communicate their purpose. This project explores training Llama 3.1 8B to understand and recover missing variable names based on the context of functions in C++.

Setup

Create .env file with your Hugging Face token:

echo "HF_TOKEN=hf_your_token_here" > .env

Create and activate conda environment:

conda env create -f environment.yml
conda activate stack-cpp-extract

Run the extraction script:

python extract_functions.py

Output is saved to data/data_cpp.jsonl (sample in example_output.jsonl).

Testing

# Test with 2 samples
python run_inference.py --data example_output.jsonl --max-samples 2

# Check outputs
cat llm_responses.jsonl
cat evaluation_results.json

Note: First run downloads Llama 3.1 8B (~16GB). Requires GPU with sufficient VRAM.

Usage

Run Inference with vLLM

# Run on sample data (5 examples)
python run_inference.py --data example_output.jsonl

# Run on full dataset
python run_inference.py --data data/data_cpp.jsonl

# Customize parameters
python run_inference.py \
    --data data/data_cpp.jsonl \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --max-samples 100 \
    --temperature 0.0 \
    --tensor-parallel-size 1

Programmatic Usage

from data_pipeline import DataPipeline
from evaluate_predictions import PredictionEvaluator

# Load data
pipeline = DataPipeline('example_output.jsonl')
pipeline.load_data()
input_texts, target_texts = pipeline.get_separate_arrays()

# Get LLM responses (see run_inference.py for full implementation)
# llm_responses = [llm(text) for text in input_texts]

# Evaluate
evaluator = PredictionEvaluator()
cleaned = evaluator.prepare_llm_responses(llm_responses)
metrics = evaluator.evaluate(cleaned, target_texts)
evaluator.print_report(metrics)

Data Format

Each JSONL entry contains:

input_text: Code with masked variables (<ID_1>, <ID_2>, etc.)
target_text: JSON mapping of placeholders to actual variable names

LLM should output JSON: {"<ID_1>": "varName", "<ID_2>": "anotherVar"}

Evaluation

Exact string matching: 1 point for perfect match, 0 otherwise. Final accuracy = correct_count / total_trials.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
codebert_embeddings.py		codebert_embeddings.py
data_pipeline.py		data_pipeline.py
environment.yml		environment.yml
evaluate_only.py		evaluate_only.py
evaluate_predictions.py		evaluate_predictions.py
example_output.jsonl		example_output.jsonl
extract_functions.py		extract_functions.py
prompt.txt		prompt.txt
requirements_codebert.txt		requirements_codebert.txt
run_inference.py		run_inference.py
run_inference_mac.py		run_inference_mac.py
tinker_finetune.py		tinker_finetune.py
validate_model.py		validate_model.py
validation_results.json		validation_results.json
validation_results_checkpoints.json		validation_results_checkpoints.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

NeuralVariableNameRepair

Setup

Testing

Usage

Run Inference with vLLM

Programmatic Usage

Data Format

Evaluation

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

MYousuf3/NeuralVariableNameRepair

Folders and files

Latest commit

History

Repository files navigation

NeuralVariableNameRepair

Setup

Testing

Usage

Run Inference with vLLM

Programmatic Usage

Data Format

Evaluation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages