167 changes: 167 additions & 0 deletions docs/huggingface_datacard.md
@@ -157,6 +157,173 @@ configs:
# ... rest of config
```

### Automatic Metadata Joins

**NEW**: When metadata is stored in separate files (declared with `applies_to`), the system infers join keys from the columns that the data config and the metadata config share, and joins the metadata automatically at query time.

```yaml
configs:
  # Data config
  - config_name: binding_data
    dataset_type: annotated_features
    data_files:
      - split: train
        path: binding_scores.parquet
    dataset_info:
      features:
        - name: sample_id
          dtype: string
          description: Sample identifier
        - name: gene_id
          dtype: string
          description: Gene identifier
        - name: binding_score
          dtype: float64
          description: Binding score value

  # Metadata config - join keys inferred from common columns
  - config_name: experiment_metadata
    dataset_type: metadata
    applies_to: ["binding_data"]
    data_files:
      - split: train
        path: metadata.parquet
    dataset_info:
      features:
        - name: sample_id  # Common with binding_data - used for JOIN
          dtype: string
          description: Sample identifier
        - name: cell_type
          dtype: string
          description: Cell type used in experiment
        - name: treatment
          dtype: string
          description: Treatment condition
```

With this configuration, you can query metadata fields directly without manually writing JOINs:

```python
from tfbpapi import HfQueryAPI

api = HfQueryAPI("username/dataset-repo")

# Query metadata field directly - automatic JOIN is performed
df = api.query(
    "SELECT * FROM binding_data WHERE cell_type = 'K562'",
    "binding_data"
)
# Behind the scenes, the system automatically:
# 1. Detects that 'cell_type' is not in binding_data
# 2. Finds that 'cell_type' is in experiment_metadata
# 3. Identifies 'sample_id' as the common column for joining
# 4. Loads the metadata view
# 5. Rewrites the SQL to: SELECT * FROM binding_data
# LEFT JOIN experiment_metadata ON binding_data.sample_id = experiment_metadata.sample_id

Copilot AI Nov 20, 2025

The documentation example shows the SQL rewrite using ON binding_data.sample_id = experiment_metadata.sample_id, but the actual implementation in _build_join_sql (line 1046) uses USING (sample_id) clause. The documentation should be updated to reflect the actual implementation:

SELECT * FROM binding_data
LEFT JOIN experiment_metadata USING (sample_id)
WHERE cell_type = 'K562'

The USING clause is actually a better choice (as it deduplicates join columns), but the documentation needs to match the implementation.

Suggested change
- # LEFT JOIN experiment_metadata ON binding_data.sample_id = experiment_metadata.sample_id
+ # LEFT JOIN experiment_metadata USING (sample_id)

# WHERE cell_type = 'K562'
```
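
The five steps listed in the comments above can be sketched in plain Python. This is an illustrative approximation only: `plan_metadata_joins` and `rewrite_sql` are hypothetical names, not functions from `tfbpapi`, and the sketch assumes the `USING` form of the join mentioned in the review comment rather than reproducing the library's actual `_build_join_sql`.

```python
from typing import Dict, List


def plan_metadata_joins(
    referenced: List[str],
    data_columns: List[str],
    metadata: Dict[str, List[str]],
) -> Dict[str, List[str]]:
    """Return {metadata_config: join_keys} for the configs a query needs.

    Hypothetical helper: mirrors steps 1-3 of the description above.
    """
    plan: Dict[str, List[str]] = {}
    # Step 1: columns the query references that the data config lacks.
    missing = [c for c in referenced if c not in data_columns]
    for column in missing:
        for name, columns in metadata.items():
            # Step 2: find a metadata config that provides the column.
            if column in columns and name not in plan:
                # Step 3: join keys are the columns both configs share.
                keys = sorted(set(data_columns) & set(columns))
                if keys:
                    plan[name] = keys
    return plan


def rewrite_sql(table: str, where: str, plan: Dict[str, List[str]]) -> str:
    """Step 5: append one LEFT JOIN ... USING clause per planned config."""
    joins = "".join(
        f" LEFT JOIN {name} USING ({', '.join(keys)})"
        for name, keys in plan.items()
    )
    return f"SELECT * FROM {table}{joins} WHERE {where}"


plan = plan_metadata_joins(
    referenced=["cell_type"],
    data_columns=["sample_id", "gene_id", "binding_score"],
    metadata={"experiment_metadata": ["sample_id", "cell_type", "treatment"]},
)
sql = rewrite_sql("binding_data", "cell_type = 'K562'", plan)
print(sql)
```

Step 4 (loading the metadata view) is omitted because it depends on the query engine. A real implementation would also need to extract the referenced columns from the SQL text, which this sketch takes as an input.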

#### Composite Join Keys

When multiple columns are common between data and metadata configs, they are all used as join keys:

```yaml
- config_name: sample_level_metadata
  dataset_type: metadata
  applies_to: ["binding_data"]
  data_files:
    - split: train
      path: sample_metadata.parquet
  dataset_info:
    features:
      - name: sample_id  # Common column 1
        dtype: string
      - name: gene_id    # Common column 2
        dtype: string
      - name: replicate
        dtype: int64
        description: Biological replicate number
```

The system will automatically join on BOTH `sample_id` AND `gene_id`.
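
To see what joining on both keys means in practice, here is a small sketch that uses Python's built-in `sqlite3` as a stand-in query engine (an assumption; the library's actual backend may differ). The table contents are invented for illustration. A metadata row is attached only when `sample_id` and `gene_id` both match.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE binding_data (sample_id TEXT, gene_id TEXT, binding_score REAL);
    CREATE TABLE sample_level_metadata (sample_id TEXT, gene_id TEXT, replicate INTEGER);
    -- Illustrative rows only; not taken from any real dataset.
    INSERT INTO binding_data VALUES ('s1', 'g1', 0.9), ('s1', 'g2', 0.4);
    INSERT INTO sample_level_metadata VALUES ('s1', 'g1', 1);
    """
)

cur = con.execute(
    "SELECT * FROM binding_data "
    "LEFT JOIN sample_level_metadata USING (sample_id, gene_id) "
    "ORDER BY binding_score DESC"
)
columns = [d[0] for d in cur.description]
rows = cur.fetchall()

print(columns)
for row in rows:
    # The ('s1', 'g2') row has no metadata match on BOTH keys,
    # so its replicate value is NULL (None in Python).
    print(row)
```

The second data row shares `sample_id` with the metadata row but not `gene_id`, so the `LEFT JOIN` keeps it with an empty `replicate` instead of attaching the wrong metadata.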

#### Multiple Metadata Configs

A data config can have multiple metadata configs applied to it. Join keys are inferred separately for each metadata config from the columns it shares with the data config:

```yaml
configs:
  - config_name: binding_data
    dataset_type: annotated_features
    dataset_info:
      features:
        - name: sample_id  # For joining with experiment_metadata
        - name: gene_id    # For joining with gene_annotations
        - name: binding_score

  - config_name: experiment_metadata
    dataset_type: metadata
    applies_to: ["binding_data"]
    dataset_info:
      features:
        - name: sample_id  # Common with binding_data
        - name: cell_type
        - name: treatment

  - config_name: gene_annotations
    dataset_type: metadata
    applies_to: ["binding_data"]
    dataset_info:
      features:
        - name: gene_id  # Common with binding_data
        - name: gene_name
        - name: gene_biotype
```

Queries can reference columns from multiple metadata sources:

```python
# Automatically joins BOTH metadata configs
df = api.query(
    "SELECT * FROM binding_data WHERE cell_type = 'K562' AND gene_biotype = 'protein_coding'",
    "binding_data"
)
```
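
The query above presumably expands to one join per metadata config. The following sketch shows what such a rewritten query could look like, run against Python's built-in `sqlite3` as a stand-in engine. The exact SQL the library generates is an assumption here, and all rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE binding_data (sample_id TEXT, gene_id TEXT, binding_score REAL);
    CREATE TABLE experiment_metadata (sample_id TEXT, cell_type TEXT, treatment TEXT);
    CREATE TABLE gene_annotations (gene_id TEXT, gene_name TEXT, gene_biotype TEXT);
    -- Illustrative rows only.
    INSERT INTO binding_data VALUES
        ('s1', 'g1', 0.9), ('s1', 'g2', 0.4), ('s2', 'g1', 0.7);
    INSERT INTO experiment_metadata VALUES
        ('s1', 'K562', 'none'), ('s2', 'HepG2', 'none');
    INSERT INTO gene_annotations VALUES
        ('g1', 'GENE1', 'protein_coding'), ('g2', 'GENE2', 'lncRNA');
    """
)

# One LEFT JOIN per metadata config, each on its own inferred key.
cur = con.execute(
    "SELECT * FROM binding_data "
    "LEFT JOIN experiment_metadata USING (sample_id) "
    "LEFT JOIN gene_annotations USING (gene_id) "
    "WHERE cell_type = 'K562' AND gene_biotype = 'protein_coding'"
)
columns = [d[0] for d in cur.description]
rows = cur.fetchall()

print(columns)
print(rows)
```

Each metadata table is joined on a different key (`sample_id` for the experiment metadata, `gene_id` for the gene annotations), so the two filters can be combined in a single `WHERE` clause.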

#### Disabling Automatic Joins

If you prefer to write JOINs manually, you can disable automatic metadata joining:

```python
df = api.query(
    "SELECT * FROM binding_data",
    "binding_data",
    auto_join_metadata=False  # Disable automatic joins
)
```

#### How Join Keys Are Inferred

Join keys are automatically determined by finding the **intersection of column names** between:
- The data config's features
- The metadata config's features

For example:
- If `binding_data` has columns: `[sample_id, gene_id, binding_score]`
- And `experiment_metadata` has columns: `[sample_id, cell_type, treatment]`
- The join key will be: `[sample_id]` (the only common column)

**Important**: Make sure your common columns have the same name in both configs. The system uses exact name matching.
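
The inference rule described above amounts to a set intersection over feature names. A minimal sketch follows, assuming the keys are returned in sorted order; `infer_join_keys` is a hypothetical name and not necessarily how the library implements it.

```python
from typing import Iterable, List


def infer_join_keys(
    data_features: Iterable[str], metadata_features: Iterable[str]
) -> List[str]:
    """Return the column names present in both configs (exact, case-sensitive match)."""
    common = set(data_features) & set(metadata_features)
    # Sorting makes the result deterministic; the ordering itself is an assumption.
    return sorted(common)


single = infer_join_keys(
    ["sample_id", "gene_id", "binding_score"],
    ["sample_id", "cell_type", "treatment"],
)
print(single)

# Exact name matching: a differently cased column is NOT treated as common.
mismatch = infer_join_keys(["Sample_ID", "binding_score"], ["sample_id", "cell_type"])
print(mismatch)
```

An empty result means no join key can be inferred, which is why the common columns must be named identically in both configs.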

#### Composite Join Keys

When multiple columns are common between configs, **all** common columns are used as join keys. For example:
- If `annotated_features` has: `[id, batch, regulator_symbol, expression_value]`
- And `sample_metadata` has: `[id, batch, cell_type, data_usable]`
- The join keys will be: `[batch, id]` (both common columns)

The system uses SQL `USING` clause for joins, which automatically deduplicates the join key columns in the result. This means you won't see duplicate columns like `id` and `id_1` in your results.
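
The deduplication effect of `USING` can be checked by comparing it with an equivalent `ON` join. The sketch below uses Python's built-in `sqlite3` as a stand-in engine (an assumption; how duplicated columns are named, for example `id_1`, depends on the engine and the DataFrame library). The rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE annotated_features (
        id TEXT, batch TEXT, regulator_symbol TEXT, expression_value REAL
    );
    CREATE TABLE sample_metadata (
        id TEXT, batch TEXT, cell_type TEXT, data_usable INTEGER
    );
    -- Illustrative rows only.
    INSERT INTO annotated_features VALUES ('a1', 'b1', 'GENE1', 2.5);
    INSERT INTO sample_metadata VALUES ('a1', 'b1', 'K562', 1);
    """
)


def result_columns(sql: str) -> list:
    """Return the column names of a query result."""
    return [d[0] for d in con.execute(sql).description]


# ON keeps the join columns from both tables.
on_columns = result_columns(
    "SELECT * FROM annotated_features AS f "
    "LEFT JOIN sample_metadata AS m ON f.id = m.id AND f.batch = m.batch"
)
# USING keeps a single copy of each join column.
using_columns = result_columns(
    "SELECT * FROM annotated_features "
    "LEFT JOIN sample_metadata USING (batch, id)"
)

print(on_columns)
print(using_columns)
```

With `ON`, both `id` and `batch` appear twice in the result; with `USING`, each appears once.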

Comment on lines +307 to +326
Copilot AI Nov 20, 2025

The sections "How Join Keys Are Inferred" (lines 305-316) and "Composite Join Keys" (lines 318-325) are duplicated content that appears immediately after they were already covered earlier in the document (around lines 160-248). This duplication should be removed to avoid confusion and reduce redundancy.

Suggested change (remove the following lines):
Join keys are automatically determined by finding the **intersection of column names** between:
- The data config's features
- The metadata config's features
For example:
- If `binding_data` has columns: `[sample_id, gene_id, binding_score]`
- And `experiment_metadata` has columns: `[sample_id, cell_type, treatment]`
- The join key will be: `[sample_id]` (the only common column)
**Important**: Make sure your common columns have the same name in both configs. The system uses exact name matching.
#### Composite Join Keys
When multiple columns are common between configs, **all** common columns are used as join keys. For example:
- If `annotated_features` has: `[id, batch, regulator_symbol, expression_value]`
- And `sample_metadata` has: `[id, batch, cell_type, data_usable]`
- The join keys will be: `[batch, id]` (both common columns)
The system uses SQL `USING` clause for joins, which automatically deduplicates the join key columns in the result. This means you won't see duplicate columns like `id` and `id_1` in your results.

### Embedded Metadata with `metadata_fields`

When no explicit metadata config exists, you can extract metadata directly from the dataset's own files using the `metadata_fields` field. This specifies which fields should be treated as metadata.