Hello! In order to evaluate in-context learning in some LLMs, I ran the following code to check the performance of google/gemma-2-9b on mmlu_elementary_mathematics.
```python
import gc
import json

import torch
import lm_eval

# Setup (values taken from the description above)
model_id = "google/gemma-2-9b"
task_list = ["mmlu_elementary_mathematics"]
shots_list = [0, 1, 2]
results_data = []

for n_shots in shots_list:
    print(f"\nTesting with {n_shots} shot(s)...")
    gc.collect()
    torch.cuda.empty_cache()
    print("Memory cleaned!")

    eval_output = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={model_id},load_in_4bit=True",
        tasks=task_list,
        num_fewshot=n_shots,
        limit=10,
        batch_size=8,
        log_samples=True,
        # Correct settings for base (non-chat) models
        apply_chat_template=False,
        fewshot_as_multiturn=False,
    )

    accuracy = eval_output["results"]["mmlu_elementary_mathematics"]["acc,none"]
    results_data.append({
        "shots": n_shots,
        "accuracy": accuracy,
        "samples": json.dumps(eval_output["samples"], indent=4),
    })
```
Results are as follows:
- 0 shots: accuracy = 0.7
- 1 shot: accuracy = 0.5
- 2 shots: accuracy = 0.6
Is it possible for accuracy to decrease when going from 0 to 1 and then to 2 shots? Am I doing something wrong? Do you have any suggestions for how to demonstrate the increase in accuracy with in-context learning?
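For context, here is a quick back-of-the-envelope check (not part of the original run, just an illustration): with `limit=10`, each accuracy value is estimated from only 10 questions, so a swing of 0.1 to 0.2 corresponds to one or two questions and is well within the sampling noise of a single run.

```python
import math

# Binomial standard error of an accuracy estimate over n questions:
# se = sqrt(p * (1 - p) / n). With n = 10 the error bars are very wide.
n = 10
for acc in (0.7, 0.5, 0.6):
    se = math.sqrt(acc * (1 - acc) / n)
    print(f"acc={acc:.1f} -> standard error ~ {se:.2f}")
# acc=0.7 -> standard error ~ 0.14
# acc=0.5 -> standard error ~ 0.16
# acc=0.6 -> standard error ~ 0.15
```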