Unstructured data is exploding in volume and holds immense analytical value. By leveraging large language models (LLMs) to extract table-like attributes from unstructured data, researchers are building LLM-powered systems that analyze documents as if querying a database. These unstructured data analysis (UDA) systems differ widely in query interfaces, optimizations, and operators, making it unclear which works best in which scenario. However, no existing benchmark offers high-quality, large-scale, diverse datasets and rich query workloads to rigorously evaluate them. We present UDA-Bench, a comprehensive UDA benchmark that addresses this need. We curate 6 datasets from different domains and, with the help of 30 graduate students, manually construct a relational database view for each. These relational databases serve as ground truth for evaluating any UDA system, regardless of its interface. We further design diverse queries over the database schemas that exercise various analytical operators at different selectivities and complexities. Using this benchmark, we conduct an in-depth analysis of key UDA components (query interface, optimization, operator design, and data processing) and run exhaustive experiments to evaluate systems and techniques along these dimensions. Our main contributions are: (1) a comprehensive benchmark for rigorous UDA evaluation, and (2) a deeper understanding of the strengths and limitations of current systems, paving the way for future work in UDA.
To help users quickly grasp each dataset's schema, attributes, data distribution, and query workload, we provide an interactive visualization interface. It allows users to browse relational schemas, inspect attribute metadata, view example documents, and explore the query taxonomy, serving as a single, easy-to-use entry point to UDA-Bench. Please Click Here!
Figure 1: System architecture showing the query interface, logical optimization, physical optimization, and unstructured data processing pipeline.
| Dataset | # Attributes | # Files | Tokens (Max/Min/Avg) | Multi-modal |
|---|---|---|---|---|
| Art | 19 | 1,000 | 1,665 / 619 / 789 | ✓ |
| CSPaper | 20 | 200 | 107,710 / 5,325 / 29,951 | ✓ |
| Player | 28 | 225 | 51,378 / 73 / 8,047 | ✗ |
| Legal | 19 | 566 | 45,437 / 340 / 5,609 | ✗ |
| Finance | 30 | 100 | 838,418 / 7,162 / 130,633 | ✗ |
| Healthcare | 51 | 100,000 | 63,234 / 2,759 / 10,649 | ✗ |
Due to the large size of our datasets, we provide access through download links rather than storing them directly in the repository.
| Dataset | Size | Download Link | Ground Truth |
|---|---|---|---|
| Art | ~379MB | Download Art Dataset | Download Ground Truth |
| CSPaper | ~678.3MB | Download CSPaper Dataset | Download Ground Truth |
| Player | ~2.43MB | Download Player Dataset | Download Ground Truth |
| Legal | ~304MB | Download Legal Dataset | Download Ground Truth |
| Finance | ~413.6MB | Download Finance Dataset | Download Ground Truth |
| Healthcare | ~1.7GB | Download Healthcare Dataset | Download Ground Truth |
- Source: WikiArt.org
- Content: Artists and their artworks spanning from the 19th to 21st centuries
- Characteristics: Multimodal dataset containing biographical information, artistic movements, representative works lists, and images of representative works
- Source: Computer science publications (curated collection of CS papers)
- Content: Attributes extracted from each paper, such as title, authors, baselines, and performance
- Characteristics: Crawled from arXiv; 200 research papers annotated with key attributes, including authors, baselines and their performance, and the modalities of the experimental datasets. Notably, some papers report the performance of all baselines in the main text, while others report only the best-performing baselines and leave the remaining results in tables or figures, yielding a mixed-modal analysis scenario.
- Source: Wikipedia
- Content: NBA players, teams, team owners, and related information from the 20th century to the present, covering basic facts and statistics such as player personal honors, team founding year, and owner nationality
- Characteristics: Relatively simple structure
- Source: AustLII
- Content: 570 professional legal cases from Australia between 2006-2009
- Characteristics: Domain-specific dataset containing different case types, such as criminal and administrative cases, requiring semantic reasoning to extract attributes
- Source: Enterprise RAG Challenge
- Content: Annual and quarterly financial reports published in 2022 by 100 listed companies worldwide
- Characteristics: Extremely long documents (averaging 130,633 tokens) with mixed content, from which attributes such as company name, net profit, and total assets are extracted
- Source: MMedC
- Content: A large collection of healthcare documents published since 2020
- Characteristics: Largest-scale dataset, covering drugs, diseases, medical institutions, news, interviews, and other healthcare information
unstractured_analysis_benchmark/
├── README.md                  # Project documentation
├── img/                       # Project-related images
├── Queries/                   # Benchmark queries
├── systems/                   # Evaluation systems
│   ├── evaporate/             # Evaporate system adaptation
│   ├── palimpzest/            # Palimpzest system adaptation
│   ├── lotus/                 # LOTUS system wrapper
│   ├── docetl/                # DocETL system usage examples
│   ├── quest/                 # QUEST system extension
│   ├── zendb/                 # ZenDB system implementation
│   └── uqe/                   # UQE system implementation
└── evaluation/                # Evaluation scripts
    ├── evaluate.py
    ├── evaluate_healthcare.py
    ├── evaluate_agg.py
    └── attr_types.json
- Collect data from original sources
- Use MinerU toolkit to parse complex formats (such as PDF)
- Organize datasets into JSON format, where each object corresponds to an unstructured document
- For Healthcare and Player datasets, divide documents into multiple related domains
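The resulting files can be consumed with a few lines of Python. A minimal loader sketch, assuming each JSON object carries hypothetical `doc_id` and `text` fields; the benchmark's actual field names may differ:

```python
import json


def load_documents(path):
    """Load a UDA-Bench-style dataset file: a JSON array in which each
    object represents one unstructured document.

    The `doc_id` and `text` field names here are illustrative
    assumptions, not the benchmark's actual schema.
    """
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    return {d["doc_id"]: d["text"] for d in docs}
```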
- Recruit 6 Ph.D. students from different disciplines to carefully read the documents
- Identify significant attributes with varying extraction difficulty
- Examples: judge names in the Legal dataset are easy to identify, while case numbers require full-text search and reasoning
- A total of 30 graduate students participated in labeling, spending approximately 4k human hours
- Use multiple LLMs (Deepseek-V3, GPT-4.1, Claude-sonnet-4) for cross-validation
- Adopt semi-automated iterative labeling strategy for large-scale datasets
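The cross-validation step can be sketched as a majority vote over per-attribute labels produced by the different LLMs; this is an illustration of the idea, not the benchmark's exact protocol:

```python
from collections import Counter


def cross_validate(labels):
    """Cross-validate one attribute value labeled by several LLMs
    (e.g. Deepseek-V3, GPT-4.1, Claude-sonnet-4).

    Returns (value, True) when a strict majority of models agree,
    otherwise (None, False) to flag the attribute for human review.
    """
    value, count = Counter(labels).most_common(1)[0]
    if count * 2 > len(labels):
        return value, True
    return None, False
```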
- Experts design query templates based on real-world scenarios
- Support both SQL-like queries and Python code interfaces
- A total of 608 queries, divided into 5 major categories and 42 sub-categories
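To illustrate the two interface styles, here is a hypothetical query over an assumed Player schema, first as a SQL-like string and then as plain Python over already-extracted rows (the attribute names are invented for illustration):

```python
# A hypothetical benchmark query in the SQL-like interface
# (table and attribute names are assumed for illustration):
sql_query = """
SELECT name FROM players
WHERE championships > 3 AND position = 'Guard'
"""


def run_query(rows):
    """The equivalent Python-code interface: a plain filter over
    rows already extracted from the documents."""
    return [r["name"] for r in rows
            if r["championships"] > 3 and r["position"] == "Guard"]
```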
- Download Datasets: Use the provided download links to obtain the datasets you need
- Extract Files: Unzip the downloaded files to your local directory
- Load Data into System: Load the JSON data into your analysis system
- Execute Queries: Run the benchmark queries (provided separately)
- Compare Results: Compare your results with the ground-truth CSV files
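The comparison step can be approximated with a short script. This is a simplified stand-in for the provided evaluation scripts, assuming both files are CSVs keyed by a hypothetical `doc_id` column and using strict exact-match; the real `evaluate.py` may normalize values or tolerate numeric differences:

```python
import csv


def attribute_accuracy(pred_csv, truth_csv, key="doc_id"):
    """Report per-attribute exact-match accuracy of a system's
    extracted table against the ground-truth CSV, over the documents
    both files cover."""
    def index(path):
        with open(path, newline="", encoding="utf-8") as f:
            return {row[key]: row for row in csv.DictReader(f)}

    pred, truth = index(pred_csv), index(truth_csv)
    shared = pred.keys() & truth.keys()
    attrs = [a for a in next(iter(truth.values())) if a != key]
    return {a: sum(pred[d][a] == truth[d][a] for d in shared) / len(shared)
            for a in attrs}
```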
Our benchmark evaluates 7 existing unstructured data analysis systems:
| System | Open Source | Repository | Modifications |
|---|---|---|---|
| Evaporate | ✓ | GitHub | Adaptation |
| Palimpzest (PZ) | ✓ | GitHub | Adaptation |
| LOTUS | ✓ | GitHub | Adaptation |
| DocETL | ✓ | GitHub | Direct Usage |
| QUEST | ✓ | GitHub | Adaptation |
| ZenDB | ✗ | Paper | Implementation |
| UQE | ✗ | Paper | Implementation |
Evaporate: A table-extraction system that derives structured tables from documents and then executes SQL queries on the resulting tables.
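This extract-then-query pattern is easy to sketch with the standard library, assuming an upstream (LLM-based) extraction step has already produced one row per document; the table and column names are illustrative:

```python
import sqlite3


def query_extracted_table(rows, sql):
    """Evaporate-style querying: load rows that an (assumed) LLM
    extraction step produced into an in-memory SQLite table, then
    answer the analytical question with ordinary SQL rather than
    further LLM calls."""
    conn = sqlite3.connect(":memory:")
    cols = list(rows[0])
    conn.execute(f"CREATE TABLE docs ({', '.join(cols)})")
    conn.executemany(
        f"INSERT INTO docs VALUES ({', '.join('?' * len(cols))})",
        [tuple(r[c] for c in cols) for r in rows])
    return conn.execute(sql).fetchall()
```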
Palimpzest (PZ): Provides Python API-based operators for unstructured data processing. We convert each SQL query into the corresponding PZ code, execute it and obtain the results.
LOTUS: Provides an open-source Python library for AI-based data processing with indexing, extraction, filtering, and joining capabilities. We use its interface to execute queries.
DocETL: An agentic query rewriting and evaluation system for complex document processing. We directly use the DocETL library to execute queries without any modifications.
QUEST: A query engine for unstructured databases that accepts a subset of standard SQL syntax. We directly use their code to execute queries.
ZenDB: A system that constructs semantic hierarchical trees to identify relevant document sections. We implement their SHT chunking and filter reordering strategies.
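Filter reordering of this kind is typically driven by estimated selectivity and per-document cost. A minimal sketch of the classic predicate-ordering rule, not ZenDB's actual cost model:

```python
def reorder_filters(filters):
    """Order filter predicates so cheap, highly selective ones run
    first, minimizing expected LLM work in later filters.

    Each filter is a (name, selectivity, cost_per_doc) tuple, where
    selectivity is the expected fraction of documents that PASS.
    Classic rule: ascending cost per filtered-out document.
    """
    return sorted(filters, key=lambda f: f[2] / max(1e-9, 1.0 - f[1]))
```

Running the cheap, tight filter first means the expensive filter only sees the small fraction of documents that survive.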
UQE: A query engine for unstructured databases that supports SQL-like query syntax with sampling-based aggregation capabilities. We implement its filter and aggregate operators, as well as logical optimizations.
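Sampling-based aggregation trades a small error bound for far fewer LLM calls. A sketch of the statistical idea (uniform sampling with a normal-approximation interval), not UQE's actual estimator:

```python
import random
import statistics


def sample_average(values, n, seed=0):
    """Estimate AVG over a large document collection from a uniform
    random sample of size n, instead of processing every document.

    Returns (estimate, (low, high)), where the interval is a
    normal-approximation 95% confidence interval.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, min(n, len(values)))
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / len(sample) ** 0.5
    return mean, (mean - 1.96 * se, mean + 1.96 * se)
```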
| System | Query Interface | Chunking | Embedding | Multi-modal | Extract | Filter | Join | Aggregate | Logical Opt. | Physical Opt. |
|---|---|---|---|---|---|---|---|---|---|---|
| Evaporate | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Palimpzest | Code | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ |
| LOTUS | Code | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ |
| DocETL | Code | ✓ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ |
| ZenDB | SQL-like | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ |
| QUEST | SQL-like | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| UQE | SQL-like | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ |
Table 1: Overview of existing unstructured data analysis systems and their capabilities.
We welcome issue reports, feature requests, and code contributions. Please follow the project's coding standards and testing requirements.
For questions or suggestions, please contact us through:
- Submit GitHub Issues
- Send email to: [Email to be added]
Last updated: 2025



