Bird-Python bridges the gap between declarative SQL and procedural Python in data analytics by providing a consistent evaluation baseline. This project delivers a rigorously aligned dataset with verified Pandas solutions derived from the BIRD benchmark, integrated with the Logic Completion Framework (LCF) to resolve natural language ambiguity. The codebase includes a complete pipeline for code generation and semantic execution validation.
While Text-to-SQL remains the dominant approach for database interaction, real-world analytics increasingly require the flexibility of general-purpose programming languages and libraries such as Python and Pandas. Bird-Python is a benchmark designed for cross-paradigm evaluation between SQL and Python in data analysis tasks.
We systematically refined the original BIRD benchmark to reduce annotation noise and align execution semantics, establishing a consistent baseline. Our work investigates the paradigmatic divergence between SQL's declarative structure and Python's explicit procedural logic. To address the sensitivity of Python generation to underspecified user intent, we introduce the Logic Completion Framework (LCF), which resolves ambiguity by incorporating latent domain knowledge.
This repository provides:
- Aligned SQL-Python Dataset: A refined version of the BIRD development set with verified Python solutions (using Pandas).
- Generation & Verification Pipeline: Tools to generate Python/SQL code and semantically validate execution results.
- Logic Completion Framework: Implementation of LCF to bridge the reasoning gap caused by missing domain context (see the illustrative sketch following this list).
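To make the LCF idea concrete, below is a minimal, hypothetical sketch of a two-stage prompt: the model first completes the implicit constraints behind an ambiguous question, then generates Pandas code conditioned on that completed logic. The prompt wording and the `call_llm` helper are illustrative assumptions, not the repository's actual templates (those live in `Bird-Python/Logic Completion Framework/`).

```python
# Hypothetical two-stage Logic Completion sketch; prompt wording and call_llm()
# are illustrative assumptions, not the repository's actual templates.

LOGIC_COMPLETION_PROMPT = """You are a data analyst.
Question: {question}
Evidence: {evidence}
List the implicit assumptions and constraints (column meanings, filters,
tie-breaking, NULL handling) needed to answer this question unambiguously."""

CODE_GENERATION_PROMPT = """Write Pandas code that answers the question.
Question: {question}
Completed logic and constraints:
{completed_logic}
The tables are available as CSV files; print only the final result."""

def generate_with_lcf(question: str, evidence: str, call_llm) -> str:
    """Resolve ambiguity first, then generate code conditioned on it."""
    completed_logic = call_llm(
        LOGIC_COMPLETION_PROMPT.format(question=question, evidence=evidence)
    )
    return call_llm(
        CODE_GENERATION_PROMPT.format(question=question, completed_logic=completed_logic)
    )
```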
The core contribution of this repository is the aligned SQL-Python dataset. The data files are located in the `Bird-Python/` directory:

**`Origin_dev_Bird.json`**
- Description: The original development set from the BIRD benchmark.
- Content: Serves as the baseline input, containing natural language questions, gold SQL queries, evidence (external knowledge), and difficulty levels.
- Fields: `question_id`, `db_id`, `question`, `evidence`, `SQL`, `difficulty`.
**`Verified_Bird_Python.json`**
- Description: Our enhanced dataset containing the Python code annotations.
- Content: Includes all the information from the original BIRD dev set, augmented with verified Python code solutions that yield the same execution results as the gold SQL.
- Methodology: The Python code has been generated and rigorously verified against the database to ensure execution accuracy and logical alignment with the original SQL.
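For orientation, here is a minimal sketch of loading the verified dataset. The field names are those documented above; the exact key holding the Pandas solution is not specified in this README, so inspect the record keys to find it.

```python
import json

# Load the aligned dataset; the path follows the repository layout shown below.
with open("Bird-Python/Verified_Bird_Python.json", encoding="utf-8") as f:
    examples = json.load(f)  # assumed to be a list of per-question records

sample = examples[0]
# Fields documented above; the key holding the verified Pandas solution is not
# named in this README -- check sample.keys() to locate it.
print(sample["question_id"], sample["db_id"], sample["difficulty"])
print("Question:", sample["question"])
print("Gold SQL:", sample["SQL"])
print("Available keys:", list(sample.keys()))
```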
The repository is organized as follows:

```
.
└── Bird-Python/
    ├── dev_databases/                 # Original BIRD SQL databases (SQLite)
    ├── excel_database/                # Converted CSV datasets for Python/Pandas analysis
    ├── LCP_Enhanced_Bird/             # Enhanced dataset with explicit constraints (addressing info deficits)
    ├── LLM-based Evaluation/          # LLM-based Semantic Validator Scripts
    ├── Logic Completion Framework/    # LCF prompt templates and Qwen3-Max generated logic
    ├── Text2Python/                   # Test scripts and prompts for Text-to-Python
    ├── Text2SQL/                      # Test scripts and prompts for Text-to-SQL
    ├── Origin_dev_Bird.json           # Original BIRD development set
    ├── Verified_Bird_Python.json      # Verified dataset containing SQL and Python ground truths
    ├── Convert_SQLite_to_CSV.py       # Convert the SQLite databases to CSV files
    └── README.md                      # Project documentation
```
Before running the evaluation scripts, you must configure your API keys and verify file paths.
To ensure consistent experimental conditions, you first need to download the original BIRD development dataset and convert it into the CSV format used for Python analysis.

- Download BIRD Dev Set: Download the development set (`dev.zip`) from the BIRD Benchmark website and unzip it into the `Bird-Python/dev_databases/` directory.
- Convert SQLite to CSV: Run the conversion script to generate the rigorous CSV dataset (preserving float precision and type hints): `python Bird-Python/Convert_SQLite_to_CSV.py`. This will populate the `Bird-Python/excel_database/` directory with the CSV files required for the Text-to-Python pipeline.
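For reference, the conversion step amounts to dumping every table of every SQLite database to CSV. The sketch below is a simplified stand-in for `Convert_SQLite_to_CSV.py` (the shipped script additionally preserves float precision and type hints); the per-database directory layout assumed here follows the standard BIRD dev set structure.

```python
import sqlite3
from pathlib import Path

import pandas as pd

# Simplified illustration of the SQLite -> CSV conversion; the shipped
# Convert_SQLite_to_CSV.py additionally preserves float precision and type hints.
db_root = Path("Bird-Python/dev_databases")
csv_root = Path("Bird-Python/excel_database")

# Assumes the BIRD layout dev_databases/<db_id>/<db_id>.sqlite.
for db_file in db_root.glob("*/*.sqlite"):
    out_dir = csv_root / db_file.parent.name
    out_dir.mkdir(parents=True, exist_ok=True)
    with sqlite3.connect(db_file) as conn:
        tables = pd.read_sql_query(
            "SELECT name FROM sqlite_master WHERE type='table';", conn
        )["name"]
        for table in tables:
            df = pd.read_sql_query(f'SELECT * FROM "{table}";', conn)
            df.to_csv(out_dir / f"{table}.csv", index=False)
```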
- API Key Configuration:
  Navigate to the following files and replace `"YOUR_API_KEY"` with your actual `dashscope` or compatible API key (see the configuration sketch after this list):
* `Bird-Python/LLM-based Evaluation/evaluation.py`
* `Bird-Python/Logic Completion Framework/LCP_CODE_Test.py`
* `Bird-Python/Logic Completion Framework/LCP_SQL_Test.py`
* `Bird-Python/Text2Python/Python-test.py`
* `Bird-Python/Text2SQL/SQL-test.py`
- File Path Verification:
  The scripts use relative paths assuming the default directory structure. If you move files, ensure you update `eval_path`, `db_root_path`, and other path variables in the `if __name__ == '__main__':` section of the respective scripts (see the sketch below).
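As a concrete illustration of both configuration points, the snippet below shows the rough shape of a script's entry point. Apart from `eval_path`, `db_root_path`, and the `"YOUR_API_KEY"` placeholder, the variable names and values are assumptions, so check each script before editing.

```python
import os

# Illustrative shape of a script's entry point -- not the actual code of
# Python-test.py / SQL-test.py. Only eval_path, db_root_path, and the
# "YOUR_API_KEY" placeholder are named in this README; the rest is assumed.
API_KEY = "YOUR_API_KEY"
# Optional alternative: keep the key out of the source tree entirely.
API_KEY = os.environ.get("DASHSCOPE_API_KEY", API_KEY)

if __name__ == '__main__':
    eval_path = "Bird-Python/Verified_Bird_Python.json"   # questions and ground truth
    db_root_path = "Bird-Python/excel_database/"          # converted CSV databases
    output_path = "outputs/predicted_python.json"         # hypothetical output location
    # ... generation / evaluation logic reads these paths and uses API_KEY
```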
We provide a comprehensive pipeline for generating and verifying Python solutions, designed to facilitate reproducible research. Note that all experimental parameters, including file paths and model configurations, are hardcoded within the scripts.
**Code Generation**
- Script: `Bird-Python/Text2Python/Python-test.py` or `Bird-Python/Text2SQL/SQL-test.py`
- Function: Generates Python code or SQL queries for the questions in the dataset.
- Details: The script uses a retrieval-augmented generation approach (if knowledge is enabled) to produce Python logic. The output directory and model parameters are defined in the `__main__` block.
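Conceptually, the generation step reduces to the loop sketched below: build a prompt per question (including the evidence when knowledge is enabled), query the model, and save the predictions for the evaluation step. The OpenAI-compatible client, model name, prompt wording, and output format here are assumptions, not the scripts' exact implementation.

```python
import json
import os

from openai import OpenAI  # assumption: an OpenAI-compatible endpoint (e.g. DashScope)

client = OpenAI(api_key="YOUR_API_KEY",
                base_url="https://dashscope.aliyuncs.com/compatible-mode/v1")

with open("Bird-Python/Verified_Bird_Python.json", encoding="utf-8") as f:
    questions = json.load(f)  # assumed to be a list of per-question records

predictions = {}
for item in questions:
    # Include the external knowledge ("evidence") when knowledge is enabled.
    prompt = (f"Database: {item['db_id']}\n"
              f"Question: {item['question']}\n"
              f"Evidence: {item.get('evidence', '')}\n"
              "Write Pandas code that answers the question.")
    resp = client.chat.completions.create(
        model="qwen-max",  # hypothetical model choice; set per your experiment
        messages=[{"role": "user", "content": prompt}],
    )
    predictions[item["question_id"]] = resp.choices[0].message.content

# Hypothetical output location; point the evaluator's PREDICTED_CODE_PATH here.
os.makedirs("outputs", exist_ok=True)
with open("outputs/predicted_python.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)
```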
**Semantic Verification**
- Script: `Bird-Python/LLM-based Evaluation/evaluation.py`
- Function: Semantically validates the generated code by comparing its execution results against the ground truth.
- Details: This module executes the generated Python code and compares the resulting data structures with the verified ground truth from `Bird-Python/Verified_Bird_Python.json`. An LLM-based validator is employed to determine equivalence, robustly handling format variations (e.g., list vs. tuple, float precision). Ensure that `PREDICTED_CODE_PATH` in this script matches the output location from the generation step.
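A cheap exact check catches most matches before any LLM call is needed. The sketch below shows the kind of normalization the comparison requires (list vs. tuple, float precision); the rounding tolerance is an assumed choice, not a value taken from `evaluation.py`.

```python
def normalize(value, precision: int = 4):
    """Normalize execution results so trivially equivalent outputs compare equal:
    tuples become lists and floats are rounded (the tolerance is an assumed choice)."""
    if isinstance(value, (list, tuple)):
        return [normalize(v, precision) for v in value]
    if isinstance(value, float):
        return round(value, precision)
    return value

def results_match(predicted, gold) -> bool:
    """Exact check after normalization; pairs that still differ would then be
    passed to the LLM-based validator for a semantic equivalence judgment."""
    return normalize(predicted) == normalize(gold)

# e.g. results_match([(1, 2.00001)], [[1, 2.0]]) -> True
```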
Please cite our paper if you use this code or dataset in your work (citation will be updated upon publication):
@inproceedings{BirdPython202x,
  title={SQL vs. Python: Decoupling Ambiguity from Reasoning in Natural Language Data Analysis},
  author={Anonymous Authors},
  booktitle={Under Review at ACL},
  year={2026}
}

License: [Insert License Name, e.g., MIT, CC-BY-4.0]