Hello! In order to evaluate in-context learning in some LLMs, I ran the following code to check the performance of google/gemma-2-9b on mmlu_elementary_mathematics.
```python
import gc
import json

import torch
import lm_eval

# Setup (values taken from the description above)
model_id = "google/gemma-2-9b"
task_list = ["mmlu_elementary_mathematics"]
shots_list = [0, 1, 2]
results_data = []

for n_shots in shots_list:
    print(f"\nTesting with {n_shots} shot(s)...")
    gc.collect()
    torch.cuda.empty_cache()
    print("Memory cleaned!")

    eval_output = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},load_in_4bit=True",
        tasks=task_list,
        num_fewshot=n_shots,
        limit=10,
        batch_size=8,
        log_samples=True,
        # Correct settings for base (non-chat) models
        apply_chat_template=False,
        fewshot_as_multiturn=False,
    )

    accuracy = eval_output["results"]["mmlu_elementary_mathematics"]["acc,none"]
    results_data.append({
        "shots": n_shots,
        "accuracy": accuracy,
        "samples": json.dumps(eval_output["samples"], indent=4),
    })
```
Results are as follows:
- 0 shots: accuracy = 0.7
- 1 shot: accuracy = 0.5
- 2 shots: accuracy = 0.6
Is it possible for accuracy to decrease when going from 0 to 1 and then to 2 shots? Am I doing something wrong? Do you have any suggestions for how to demonstrate the increase in accuracy with in-context learning?
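For context, here is a quick back-of-the-envelope check (not part of the original run, just an illustration): with `limit=10`, each accuracy value is estimated from only 10 questions, so a swing of 0.1 to 0.2 corresponds to one or two questions and is well within the sampling noise of a single run.

```python
import math

# Binomial standard error of an accuracy estimate over n questions:
# se = sqrt(p * (1 - p) / n). With n = 10 the error bars are very wide.
n = 10
for acc in (0.7, 0.5, 0.6):
    se = math.sqrt(acc * (1 - acc) / n)
    print(f"acc={acc:.1f} -> standard error ~ {se:.2f}")
# acc=0.7 -> standard error ~ 0.14
# acc=0.5 -> standard error ~ 0.16
# acc=0.6 -> standard error ~ 0.15
```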