Conversation

@cmatKhan (Member)

This is intended to address issue #47. However, nearly all of the new code is Claude-generated and I haven't had a chance to go over it.

It works, so please use this branch (use the GitHub CLI to pull it: `gh pr checkout --repo BrentLab/tfbpapi <pull_number>`).

It would be great if either or both of you could go over this code and improve it -- the DuckDB implementation in particular is important. That is the actual "database" being created, and this aspect of tfbpapi needs a lot of work.

Copilot AI left a comment

Pull Request Overview

This PR implements automatic metadata join functionality for DuckDB queries. The system automatically infers join keys from common column names between data and metadata configurations, enabling users to query metadata fields without manually writing JOIN clauses.

Key changes:

  • Added `join_keys` field to `MetadataRelationship` model to store inferred join columns
  • Implemented automatic join key inference in `DataCard.get_metadata_relationships()` based on column name intersection
  • Enhanced `HfQueryAPI.query()` with automatic metadata join detection and SQL rewriting capabilities (illustrated in the sketch below)
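To make the described behavior concrete, here is a rough usage sketch. The import path, constructor arguments, and repo id are assumptions, not code from this PR; only `query()` and the table/column names reused from the docs excerpt below are taken from the diff.

```python
# Hypothetical sketch of the auto-join flow summarized above.
from tfbpapi.HfQueryAPI import HfQueryAPI  # import path assumed

api = HfQueryAPI("BrentLab/yeast_tfbp")  # repo id and constructor are illustrative

# cell_type lives in experiment_metadata, not binding_data. query() is expected
# to detect this, infer sample_id as the shared column, and rewrite the SQL to:
#   SELECT * FROM binding_data
#   LEFT JOIN experiment_metadata USING (sample_id)
#   WHERE cell_type = 'K562'
df = api.query("SELECT * FROM binding_data WHERE cell_type = 'K562'")
print(df.head())
```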

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| `tfbpapi/datainfo/models.py` | Added `join_keys` field to `MetadataRelationship` model; formatting improvements to error messages |
| `tfbpapi/datainfo/datacard.py` | Implemented join key inference logic from column intersection; formatting improvements |
| `tfbpapi/HfQueryAPI.py` | Added auto-join functionality with new helper methods for column extraction, metadata matching, and SQL rewriting; updated validation to include metadata columns |
| `tfbpapi/tests/test_metadata_joins.py` | New comprehensive test suite for metadata join functionality, including join key inference and SQL generation |
| `tfbpapi/tests/test_filter_with_metadata_joins.py` | New test suite for filter validation with metadata columns |
| `tfbpapi/tests/test_HfQueryAPI.py` | Updated validation tests to use the new schema-based approach instead of the DataFrame-based one |
| `docs/huggingface_datacard.md` | Added extensive documentation for automatic metadata joins, join key inference, and composite keys |


Comment on lines +307 to +326
Join keys are automatically determined by finding the **intersection of column names** between:
- The data config's features
- The metadata config's features

For example:
- If `binding_data` has columns: `[sample_id, gene_id, binding_score]`
- And `experiment_metadata` has columns: `[sample_id, cell_type, treatment]`
- The join key will be: `[sample_id]` (the only common column)

**Important**: Make sure your common columns have the same name in both configs. The system uses exact name matching.

#### Composite Join Keys

When multiple columns are common between configs, **all** common columns are used as join keys. For example:
- If `annotated_features` has: `[id, batch, regulator_symbol, expression_value]`
- And `sample_metadata` has: `[id, batch, cell_type, data_usable]`
- The join keys will be: `[batch, id]` (both common columns)

The system uses SQL `USING` clause for joins, which automatically deduplicates the join key columns in the result. This means you won't see duplicate columns like `id` and `id_1` in your results.

Copilot AI Nov 20, 2025

The sections "How Join Keys Are Inferred" (lines 305-316) and "Composite Join Keys" (lines 318-325) duplicate content already covered earlier in the document (around lines 160-248). This duplication should be removed to avoid confusion and redundancy.

Suggested change (remove the duplicated lines):

```diff
-Join keys are automatically determined by finding the **intersection of column names** between:
-- The data config's features
-- The metadata config's features
-For example:
-- If `binding_data` has columns: `[sample_id, gene_id, binding_score]`
-- And `experiment_metadata` has columns: `[sample_id, cell_type, treatment]`
-- The join key will be: `[sample_id]` (the only common column)
-**Important**: Make sure your common columns have the same name in both configs. The system uses exact name matching.
-#### Composite Join Keys
-When multiple columns are common between configs, **all** common columns are used as join keys. For example:
-- If `annotated_features` has: `[id, batch, regulator_symbol, expression_value]`
-- And `sample_metadata` has: `[id, batch, cell_type, data_usable]`
-- The join keys will be: `[batch, id]` (both common columns)
-The system uses SQL `USING` clause for joins, which automatically deduplicates the join key columns in the result. This means you won't see duplicate columns like `id` and `id_1` in your results.
```
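For reference, the `USING` deduplication described in this excerpt can be checked with a small standalone DuckDB snippet; the tables and values below are invented for illustration and are not part of the real dataset.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE binding_data AS SELECT 1 AS sample_id, 0.9 AS binding_score")
con.execute("CREATE TABLE experiment_metadata AS SELECT 1 AS sample_id, 'K562' AS cell_type")

# With USING, the join key appears only once in the result (no sample_id_1),
# which is the deduplication behavior the documentation describes.
rel = con.sql(
    "SELECT * FROM binding_data LEFT JOIN experiment_metadata USING (sample_id)"
)
print(rel.columns)    # a single sample_id plus binding_score and cell_type
print(rel.fetchall())
```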

Comment on lines +859 to +863
```python
column_patterns = [
    r"\b(?:SELECT|WHERE|AND|OR|ON|GROUP BY|ORDER BY|HAVING)\s+[\w.]+",
    r"[\w.]+\s*(?:=|!=|<>|<|>|<=|>=|LIKE|IN|IS)",
    r"AS\s+([\w.]+)",
]
```
Copilot AI Nov 20, 2025

The regex pattern `r"\b(?:SELECT|WHERE|AND|OR|ON|GROUP BY|ORDER BY|HAVING)\s+[\w.]+"` only captures a single identifier after the keyword. This will miss columns in queries like `SELECT col1, col2, col3` (it will only capture `col1`).

Additionally, `SELECT *` will capture `*` as a column name, which should be filtered out. Consider adding `*` to the `sql_keywords` set on line 906, or handle it separately in the extraction logic.
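One possible direction, shown only as a sketch (the function name and keyword set below are made up, not the PR's code), is to collect all identifiers and filter them, rather than anchoring patterns on keywords:

```python
import re

# Rough sketch: gather every identifier and drop SQL keywords. '*' never matches
# the identifier pattern, so SELECT * cannot leak in as a column name. Table
# names will still appear in the candidate set and are assumed to be filtered
# against the known schema downstream.
SQL_KEYWORDS = {
    "select", "from", "where", "and", "or", "on", "group", "by", "order",
    "having", "as", "join", "left", "right", "inner", "outer", "using",
    "like", "in", "is", "not", "null", "distinct", "limit",
}

def extract_candidate_columns(sql: str) -> set[str]:
    sql_no_strings = re.sub(r"'[^']*'", "", sql)  # drop string literals first
    identifiers = re.findall(r"\b[A-Za-z_][\w.]*\b", sql_no_strings)
    return {ident for ident in identifiers if ident.lower() not in SQL_KEYWORDS}

print(extract_candidate_columns("SELECT col1, col2, col3 FROM data WHERE x = 1"))
# {'col1', 'col2', 'col3', 'data', 'x'}  (set order may vary)
```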

Comment on lines +756 to +774
```python
if metadata_matches:
    # Load metadata views and build JOIN clauses
    metadata_joins = []
    for metadata_config, join_keys in metadata_matches:
        metadata_table = self._load_metadata_view(
            metadata_config, refresh_cache=refresh_cache
        )
        metadata_joins.append(
            (metadata_config, metadata_table, join_keys)
        )
        self.logger.info(
            f"Auto-joining metadata '{metadata_config}' "
            f"on keys: {join_keys}"
        )

    # Rewrite SQL to include JOINs
    sql_with_table = self._build_join_sql(
        sql_with_table, table_name, metadata_joins
    )
```
Copilot AI Nov 20, 2025

[nitpick] When multiple metadata configs contain the same column name, the current implementation will join all of them. This could lead to ambiguous column references in the final result if multiple metadata tables have non-join columns with the same name.

For example, if both `experiment_metadata` and `sample_metadata` have a `notes` column, the query `SELECT notes FROM data` would be ambiguous after both metadata tables are joined.

Consider:

  1. Detecting and warning about column name conflicts between metadata tables
  2. Using table aliases in JOIN clauses to make column references unambiguous
  3. Documenting this limitation in the docstring or user guide

This is a limitation of the automatic join approach that users should be aware of.
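A minimal sketch of option 1 (detect and warn); the helper name and the tuple shape are hypothetical and only loosely mirror the snippet above:

```python
import logging
from collections import defaultdict

logger = logging.getLogger(__name__)

def warn_on_metadata_column_conflicts(metadata_joins):
    """Warn when two metadata configs share a non-join column name.

    `metadata_joins` is assumed to be an iterable of
    (config_name, column_names, join_keys) tuples -- an illustrative shape,
    not the exact structure used in the PR.
    """
    owners = defaultdict(list)
    for config_name, column_names, join_keys in metadata_joins:
        for col in set(column_names) - set(join_keys):
            owners[col].append(config_name)
    for col, configs in owners.items():
        if len(configs) > 1:
            logger.warning(
                "Column '%s' appears in metadata configs %s; unqualified "
                "references to it will be ambiguous after auto-join.",
                col, ", ".join(sorted(configs)),
            )

# Example: both metadata configs expose a 'notes' column.
warn_on_metadata_column_conflicts([
    ("experiment_metadata", ["sample_id", "cell_type", "notes"], ["sample_id"]),
    ("sample_metadata", ["sample_id", "batch", "notes"], ["sample_id"]),
])
```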

Comment on lines +1063 to +1071
```python
final_sql = (
    sql_before
    + "".join(join_clauses)
    + " "
    + sql_after[:insert_position].strip()
    + " "
    + sql_after[insert_position:]
)
```

Copilot AI Nov 20, 2025

Potential issue with SQL formatting: the logic constructs the final SQL with multiple string concatenations that may result in extra or missing spaces. Consider the case where `sql_after[:insert_position].strip()` is an empty string - this will result in two consecutive spaces before `sql_after[insert_position:]`.

For example, if the original SQL is `SELECT * FROM table WHERE x = 1`, the result might be `SELECT * FROM table LEFT JOIN meta USING (id)  WHERE x = 1` (note the double space before `WHERE`).

Consider using a cleaner approach:

```python
parts = [
    sql_before,
    "".join(join_clauses),
    sql_after[:insert_position].strip(),
    sql_after[insert_position:]
]
final_sql = " ".join(filter(None, parts))
return final_sql.strip()
```
Suggested change

```diff
-final_sql = (
-    sql_before
-    + "".join(join_clauses)
-    + " "
-    + sql_after[:insert_position].strip()
-    + " "
-    + sql_after[insert_position:]
-)
+parts = [
+    sql_before,
+    "".join(join_clauses),
+    sql_after[:insert_position].strip(),
+    sql_after[insert_position:]
+]
+final_sql = " ".join(filter(None, parts))
```

```python
# 3. Identifies 'sample_id' as the common column for joining
# 4. Loads the metadata view
# 5. Rewrites the SQL to: SELECT * FROM binding_data
# LEFT JOIN experiment_metadata ON binding_data.sample_id = experiment_metadata.sample_id
```
Copilot AI Nov 20, 2025

The documentation example shows the SQL rewrite using `ON binding_data.sample_id = experiment_metadata.sample_id`, but the actual implementation in `_build_join_sql` (line 1046) uses a `USING (sample_id)` clause. The documentation should be updated to reflect the actual implementation:

```sql
SELECT * FROM binding_data
LEFT JOIN experiment_metadata USING (sample_id)
WHERE cell_type = 'K562'
```

The `USING` clause is actually a better choice (as it deduplicates join columns), but the documentation needs to match the implementation.

Suggested change

```diff
-# LEFT JOIN experiment_metadata ON binding_data.sample_id = experiment_metadata.sample_id
+# LEFT JOIN experiment_metadata USING (sample_id)
```

Comment on lines +852 to +853
```python
sql_no_strings = re.sub(r"'[^']*'", "", sql)
sql_no_strings = re.sub(r'"[^"]*"', "", sql_no_strings)
```
Copilot AI Nov 20, 2025

The regex pattern `r"'[^']*'"` used to remove string literals doesn't handle escaped quotes (e.g., `'O\'Brien'`). This could lead to incorrect column extraction when SQL contains escaped quotes in string literals.

Consider using a more robust pattern that handles escaped quotes:

```python
# Remove string literals with escaped quotes handled
sql_no_strings = re.sub(r"'(?:[^'\\]|\\.)*'", "", sql)
sql_no_strings = re.sub(r'"(?:[^"\\]|\\.)*"', "", sql_no_strings)
```

This pattern matches either non-quote/non-backslash characters or any escaped character (backslash followed by any character).

Suggested change

```diff
-sql_no_strings = re.sub(r"'[^']*'", "", sql)
-sql_no_strings = re.sub(r'"[^"]*"', "", sql_no_strings)
+sql_no_strings = re.sub(r"'(?:[^'\\]|\\.)*'", "", sql)
+sql_no_strings = re.sub(r'"(?:[^"\\]|\\.)*"', "", sql_no_strings)
```
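A quick sanity check of the two patterns on a backslash-escaped literal (the example string is contrived, not from the codebase):

```python
import re

sql = "SELECT name FROM people WHERE name = 'O\\'Brien' AND active"

# Original pattern stops at the escaped quote, leaving part of the literal behind.
print(re.sub(r"'[^']*'", "", sql))
# SELECT name FROM people WHERE name = Brien' AND active

# Suggested pattern treats \' as part of the literal and removes it entirely.
print(re.sub(r"'(?:[^'\\]|\\.)*'", "", sql))
# SELECT name FROM people WHERE name =  AND active
```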

```python
raise ValueError(f"No table available for config '{config_name}'")

# Replace config name with actual table name in SQL for user convenience
sql_with_table = sql.replace(config_name, table_name)
```
Copilot AI Nov 20, 2025

Using `str.replace()` on line 731 to replace config names with table names could lead to unintended replacements if the config name appears in string literals or comments, or as a substring of column/table names.

For example:

```python
config_name = "data"
table_name = "metadata_data"
sql = "SELECT * FROM data WHERE description = 'my data'"
# Result: "SELECT * FROM metadata_data WHERE description = 'my metadata_data'"
```

Consider using a regex with word boundaries for more precise replacement:

```python
import re

sql_with_table = re.sub(
    r'\b' + re.escape(config_name) + r'\b',
    table_name,
    sql
)
```

Suggested change

```diff
-sql_with_table = sql.replace(config_name, table_name)
+sql_with_table = re.sub(r'\b' + re.escape(config_name) + r'\b', table_name, sql)
```

Comment on lines +260 to +261
```python
# Use common columns as join keys (sorted for consistency)
join_keys = sorted(list(common_columns)) if common_columns else None
```
Copilot AI Nov 20, 2025

[nitpick] The join key inference logic doesn't validate that the inferred join keys are appropriate for joining. While the keys are common columns, there's no check to ensure:

  1. The data types of the join columns match between configs
  2. The join columns contain compatible values (e.g., not joining an ID column with a boolean column that happen to have the same name)

Consider adding validation or at least logging warnings when:

  • Join key data types don't match between configs
  • Join keys have suspiciously generic names (e.g., `id` in both configs could refer to unrelated IDs for different entities)

This could prevent silent data quality issues from accidental joins on inappropriately named columns.

Suggested change

```diff
-# Use common columns as join keys (sorted for consistency)
-join_keys = sorted(list(common_columns)) if common_columns else None
+# Validate join keys: check data types match
+join_keys = []
+data_features = {f.name: f for f in data_config.dataset_info.features}
+meta_features = {f.name: f for f in meta_config.dataset_info.features}
+for col in sorted(list(common_columns)):
+    data_type = getattr(data_features[col], "type", None)
+    meta_type = getattr(meta_features[col], "type", None)
+    if data_type != meta_type:
+        self.logger.warning(
+            f"Join key '{col}' has mismatched types between configs: "
+            f"{data_config.config_name} ({data_type}) vs {meta_config.config_name} ({meta_type})"
+        )
+    else:
+        join_keys.append(col)
+    # Heuristic: warn if join key is generic
+    if col.lower() in {"id", "name", "key"}:
+        self.logger.warning(
+            f"Join key '{col}' is generic and may represent unrelated entities between configs: "
+            f"{data_config.config_name} and {meta_config.config_name}"
+        )
+if not join_keys:
+    join_keys = None
```
