Results of eval_n_turn do not match the paper #20

@hansjohn

I ran eval_n_turn.py to reproduce the single-turn SQL results with the handicap enabled:

python -m experiments.eval_n_turn \
    --data_path ./data/sql/spider/ic_spider_dev.json \
    --dialogue_limit 5 \
    --env sql \
    --image_name docker-env-sql \
    --log_dir logs/experiments \
    --max_turns 1 \
    --policy chat \
    --template game_sql \
    --model gpt-3.5-turbo \
    --handicap \
    --verbose 

I used this script to compute the success rate:

import json

result_file_path = './logs/experiments/ic_sql_multiturn_gpt-3.5-turbo_1_turns.json'

# Per-difficulty counters, plus an aggregate 'all' bucket.
result = {key: {'success': 0, 'total': 0} for key in ['easy', 'medium', 'hard', 'extra', 'all']}

with open(result_file_path, 'r') as f:
    data = json.load(f)

for entry in data.values():
    hardness = entry['hardness']
    # A task counts as solved only if its max_reward is exactly 1.0.
    if entry['summary']['max_reward'] == 1.0:
        result[hardness]['success'] += 1
        result['all']['success'] += 1
    result[hardness]['total'] += 1
    result['all']['total'] += 1

for key, counts in result.items():
    success, total = counts['success'], counts['total']
    rate = success / total if total else 0.0  # avoid ZeroDivisionError on an empty bucket
    print(f"{key} Success rate: {success}/{total} ({rate:.2%})")

It gives this result:

easy Success rate: 202/248 (81.45%)
medium Success rate: 281/446 (63.00%)
hard Success rate: 75/174 (43.10%)
extra Success rate: 37/166 (22.29%)
all Success rate: 595/1034 (57.54%)

This is lower than the result reported in the paper.
Did I do something wrong?

I also ran eval_n_turn.py to reproduce the single-turn SQL results without the handicap:

python -m experiments.eval_n_turn \
    --data_path ./data/sql/spider/ic_spider_dev.json \
    --dialogue_limit 5 \
    --env sql \
    --image_name docker-env-sql \
    --log_dir logs/experiments \
    --max_turns 1 \
    --policy chat \
    --template game_sql \
    --model gpt-3.5-turbo

The result is:

easy Success rate: 41/248 (16.53%)
medium Success rate: 28/446 (6.28%)
hard Success rate: 3/174 (1.72%)
extra Success rate: 2/166 (1.20%)
all Success rate: 74/1034 (7.16%)

Did I do something wrong?
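
In case it helps narrow this down, here is a minimal sketch that splits the tasks in a log file into zero, partial, and full reward per hardness level, to show whether the failures are near misses or outright zeros. It assumes the log has the same structure my script above relies on (a `hardness` field and a numeric `summary.max_reward` per task). The `reward_buckets` name and the three bucket labels are my own, not part of the repository.

```python
import json
import os
from collections import Counter, defaultdict


def reward_buckets(data):
    """Count tasks per hardness level with zero, partial, or full max_reward."""
    buckets = defaultdict(Counter)
    for entry in data.values():
        reward = entry["summary"]["max_reward"]  # assumed to be a number
        if reward == 1.0:
            label = "full"
        elif reward > 0:
            label = "partial"
        else:
            label = "zero"
        buckets[entry["hardness"]][label] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {hardness: dict(counts) for hardness, counts in buckets.items()}


# Same path as in the script above; adjust to the log file being inspected.
path = "./logs/experiments/ic_sql_multiturn_gpt-3.5-turbo_1_turns.json"
if __name__ == "__main__" and os.path.exists(path):
    with open(path, "r") as f:
        print(reward_buckets(json.load(f)))
```

A large "partial" count would mean many queries are nearly right but miss the exact-match threshold of 1.0, while a large "zero" count would point to something more basic, such as queries failing to execute.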
