Hi SkillOpt team,
Thank you for the quick response in #14 and for sharing the SearchQA split IDs. That is very helpful for reproducing the SearchQA experiments.
Could you also share the exact split details for the remaining benchmarks, especially SpreadsheetBench?
We tried a protocol-level reproduction on SpreadsheetBench using the official SpreadsheetBench data artifact, `data/spreadsheetbench_verified_400.tar.gz`.
Since the exact SkillOpt SpreadsheetBench split is not currently available, we created a deterministic random split from the Verified 400 set:
- train=80
- selection/validation=40
- test=280
- random seed: 42
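For concreteness, here is a minimal sketch of how a split of this shape can be generated. It is an illustration rather than our exact script: the `make_split` helper and the `task-000` style IDs are hypothetical placeholders, and it assumes the 400 task IDs are available as plain strings.

```python
import random


def make_split(task_ids, n_train=80, n_val=40, seed=42):
    """Deterministically shuffle task IDs and cut them into three parts."""
    ids = sorted(task_ids)      # sort first so the input order does not matter
    rng = random.Random(seed)   # local RNG, leaves the global random state alone
    rng.shuffle(ids)
    return {
        "train": ids[:n_train],
        "selection": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # everything left over
    }


# Hypothetical placeholder IDs standing in for the Verified 400 tasks.
split = make_split([f"task-{i:03d}" for i in range(400)])
print({name: len(ids) for name, ids in split.items()})
```

Sorting before shuffling keeps the result independent of the order in which the IDs were read, so the same seed reproduces the same three sets.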
Then we ran two no-skill test baselines with gpt-5.4-nano as the target model:
| Optimizer | Target | Local no-skill test baseline |
| --- | --- | --- |
| gpt-5.4-nano | gpt-5.4-nano | 36.1 |
| gpt-5.5 | gpt-5.4-nano | 36.8 |
These are both much higher than the SpreadsheetBench gpt-5.4-nano baseline reported in Table 5 of the paper (23.5). On our random split, the cell-level baseline is close to that number (23.8), while the sheet-level baseline is much higher (63.2 / 65.5), so the overall score seems quite sensitive to the exact split, task composition, and/or aggregation details.
Would it be possible to share:
- The exact train/selection/test split manifests or stable task IDs for SpreadsheetBench.
- The same split manifests for the other benchmarks (OfficeQA, DocVQA, LiveMathematicianBench, and ALFWorld), if available.
- Any preprocessing/filtering scripts used to construct those splits.
- For SpreadsheetBench specifically, whether the reported score is a plain item-level average over the test split, a macro average over task types, or uses any filtering beyond the public Verified 400 artifact.
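To make the aggregation question concrete, the sketch below shows how an item-level average and a macro average over task types can diverge on the same results. The record layout, function names, and toy scores are assumptions for illustration only; they are not SpreadsheetBench data or our measured numbers.

```python
from collections import defaultdict


def item_level_average(records):
    """Plain mean over all items, regardless of task type."""
    return sum(score for _, score in records) / len(records)


def macro_average(records):
    """Mean of the per-task-type means, so each type counts equally."""
    by_type = defaultdict(list)
    for task_type, score in records:
        by_type[task_type].append(score)
    type_means = [sum(scores) / len(scores) for scores in by_type.values()]
    return sum(type_means) / len(type_means)


# Toy records (task_type, score): three cell-level items, one sheet-level item.
records = [("cell", 0.0), ("cell", 0.0), ("cell", 1.0), ("sheet", 1.0)]

print(item_level_average(records))
print(macro_average(records))
```

When one task type is both rarer and easier, the macro average comes out higher than the item-level average, so the two conventions can produce quite different headline scores from identical per-item results.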
Thanks again for making the SearchQA split available.