Join metadata to data in duckdb #48
base: huggingface
Conversation
…eed to check it significantly
Pull Request Overview
This PR implements automatic metadata join functionality for DuckDB queries. The system automatically infers join keys from common column names between data and metadata configurations, enabling users to query metadata fields without manually writing JOIN clauses.
Key changes:
- Added `join_keys` field to `MetadataRelationship` model to store inferred join columns
- Implemented automatic join key inference in `DataCard.get_metadata_relationships()` based on column name intersection
- Enhanced `HfQueryAPI.query()` with automatic metadata join detection and SQL rewriting capabilities
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tfbpapi/datainfo/models.py | Added join_keys field to MetadataRelationship model; formatting improvements to error messages |
| tfbpapi/datainfo/datacard.py | Implemented join key inference logic from column intersection; formatting improvements |
| tfbpapi/HfQueryAPI.py | Added auto-join functionality with new helper methods for column extraction, metadata matching, and SQL rewriting; updated validation to include metadata columns |
| tfbpapi/tests/test_metadata_joins.py | New comprehensive test suite for metadata join functionality including join key inference and SQL generation |
| tfbpapi/tests/test_filter_with_metadata_joins.py | New test suite for filter validation with metadata columns |
| tfbpapi/tests/test_HfQueryAPI.py | Updated validation tests to use new schema-based approach instead of DataFrame-based |
| docs/huggingface_datacard.md | Added extensive documentation for automatic metadata joins, join key inference, and composite keys |
> Join keys are automatically determined by finding the **intersection of column names** between:
> - The data config's features
> - The metadata config's features
>
> For example:
> - If `binding_data` has columns: `[sample_id, gene_id, binding_score]`
> - And `experiment_metadata` has columns: `[sample_id, cell_type, treatment]`
> - The join key will be: `[sample_id]` (the only common column)
>
> **Important**: Make sure your common columns have the same name in both configs. The system uses exact name matching.
>
> #### Composite Join Keys
>
> When multiple columns are common between configs, **all** common columns are used as join keys. For example:
> - If `annotated_features` has: `[id, batch, regulator_symbol, expression_value]`
> - And `sample_metadata` has: `[id, batch, cell_type, data_usable]`
> - The join keys will be: `[batch, id]` (both common columns)
>
> The system uses the SQL `USING` clause for joins, which automatically deduplicates the join key columns in the result. This means you won't see duplicate columns like `id` and `id_1` in your results.
**Copilot AI** · Nov 20, 2025
The sections "How Join Keys Are Inferred" (lines 305-316) and "Composite Join Keys" (lines 318-325) duplicate content already covered earlier in the document (around lines 160-248). The duplication should be removed to avoid confusion and reduce redundancy.
```python
column_patterns = [
    r"\b(?:SELECT|WHERE|AND|OR|ON|GROUP BY|ORDER BY|HAVING)\s+[\w.]+",
    r"[\w.]+\s*(?:=|!=|<>|<|>|<=|>=|LIKE|IN|IS)",
    r"AS\s+([\w.]+)",
]
```
**Copilot AI** · Nov 20, 2025
The regex pattern `r"\b(?:SELECT|WHERE|AND|OR|ON|GROUP BY|ORDER BY|HAVING)\s+[\w.]+"` only captures a single identifier after the keyword. This will miss columns in queries like `SELECT col1, col2, col3` (it will only capture `col1`).

Additionally, `SELECT *` will capture `*` as a column name, which should be filtered out. Consider adding `*` to the `sql_keywords` set on line 906, or handle it separately in the extraction logic.
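One way to address both points is to capture the whole `SELECT` list and filter out `*`. This is an illustrative sketch, not the PR's actual extraction helper:

```python
import re

def extract_select_columns(sql: str) -> set[str]:
    """Sketch: pull every identifier from the SELECT list, skipping '*'."""
    match = re.search(r"\bSELECT\s+(.*?)\s+FROM\b", sql, re.IGNORECASE | re.DOTALL)
    if not match:
        return set()
    columns = set()
    for token in match.group(1).split(","):
        token = token.strip()
        if not token or token == "*":
            continue
        # Keep only the leading identifier (drops aliases such as "col AS c")
        columns.add(token.split()[0])
    return columns

assert extract_select_columns("SELECT col1, col2, col3 FROM t") == {"col1", "col2", "col3"}
assert extract_select_columns("SELECT * FROM t") == set()
```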
```python
if metadata_matches:
    # Load metadata views and build JOIN clauses
    metadata_joins = []
    for metadata_config, join_keys in metadata_matches:
        metadata_table = self._load_metadata_view(
            metadata_config, refresh_cache=refresh_cache
        )
        metadata_joins.append(
            (metadata_config, metadata_table, join_keys)
        )
        self.logger.info(
            f"Auto-joining metadata '{metadata_config}' "
            f"on keys: {join_keys}"
        )

    # Rewrite SQL to include JOINs
    sql_with_table = self._build_join_sql(
        sql_with_table, table_name, metadata_joins
    )
```
**Copilot AI** · Nov 20, 2025
[nitpick] When multiple metadata configs contain the same column name, the current implementation will join all of them. This could lead to ambiguous column references in the final result if multiple metadata tables have non-join columns with the same name.

For example, if both `experiment_metadata` and `sample_metadata` have a `notes` column, the query `SELECT notes FROM data` would be ambiguous after both metadata tables are joined.

Consider:
- Detecting and warning about column name conflicts between metadata tables
- Using table aliases in JOIN clauses to make column references unambiguous
- Documenting this limitation in the docstring or user guide

This is a limitation of the automatic join approach that users should be aware of.
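A conflict check along the lines suggested could look like the following sketch (the schema mapping is a hypothetical structure for illustration; the PR stores config schemas differently):

```python
from collections import Counter

def find_column_conflicts(metadata_schemas, join_keys):
    """Return non-join column names that appear in more than one metadata table."""
    counts = Counter(
        col
        for columns in metadata_schemas.values()
        for col in columns
        if col not in join_keys
    )
    return {col for col, n in counts.items() if n > 1}

conflicts = find_column_conflicts(
    {
        "experiment_metadata": ["sample_id", "notes", "cell_type"],
        "sample_metadata": ["sample_id", "notes", "batch"],
    },
    join_keys={"sample_id"},
)
assert conflicts == {"notes"}  # 'notes' would be ambiguous after both joins
```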
```python
final_sql = (
    sql_before
    + "".join(join_clauses)
    + " "
    + sql_after[:insert_position].strip()
    + " "
    + sql_after[insert_position:]
)
```
**Copilot AI** · Nov 20, 2025
Potential issue with SQL formatting: the logic constructs the final SQL with multiple string concatenations that may produce extra or missing spaces. Consider the case where `sql_after[:insert_position].strip()` is an empty string; this results in two consecutive spaces before `sql_after[insert_position:]`.

For example, if the original SQL is `SELECT * FROM table WHERE x = 1`, the result might be `SELECT * FROM table LEFT JOIN meta USING (id)  WHERE x = 1` (note the double space before `WHERE`).

Consider using a cleaner approach:

```python
parts = [
    sql_before,
    "".join(join_clauses),
    sql_after[:insert_position].strip(),
    sql_after[insert_position:],
]
final_sql = " ".join(filter(None, parts))
return final_sql.strip()
```
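The double-space problem and the suggested fix can be verified in isolation (the variable values below are made up for the demonstration):

```python
sql_before = "SELECT * FROM table"
join_clauses = [" LEFT JOIN meta USING (id)"]
sql_after = " WHERE x = 1"
insert_position = 0  # nothing between the joins and the WHERE clause

# Naive concatenation: the empty middle slice leaves consecutive spaces
naive = (
    sql_before
    + "".join(join_clauses)
    + " "
    + sql_after[:insert_position].strip()
    + " "
    + sql_after[insert_position:]
)
assert "  " in naive

# filter(None, ...) drops the empty fragments before joining
parts = [
    sql_before,
    "".join(join_clauses).strip(),
    sql_after[:insert_position].strip(),
    sql_after[insert_position:].strip(),
]
fixed = " ".join(filter(None, parts))
assert fixed == "SELECT * FROM table LEFT JOIN meta USING (id) WHERE x = 1"
```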
```python
# 3. Identifies 'sample_id' as the common column for joining
# 4. Loads the metadata view
# 5. Rewrites the SQL to: SELECT * FROM binding_data
#    LEFT JOIN experiment_metadata ON binding_data.sample_id = experiment_metadata.sample_id
```
**Copilot AI** · Nov 20, 2025
The documentation example shows the SQL rewrite using `ON binding_data.sample_id = experiment_metadata.sample_id`, but the actual implementation in `_build_join_sql` (line 1046) uses a `USING (sample_id)` clause. The documentation should be updated to reflect the actual implementation:

```sql
SELECT * FROM binding_data
LEFT JOIN experiment_metadata USING (sample_id)
WHERE cell_type = 'K562'
```

The `USING` clause is actually a better choice (as it deduplicates join columns), but the documentation needs to match the implementation.
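The deduplication behavior of `USING` is standard SQL, so it can be illustrated with the stdlib `sqlite3` module (DuckDB follows the same rule; the table and column names are taken from the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE binding_data (sample_id TEXT, binding_score REAL)")
con.execute("CREATE TABLE experiment_metadata (sample_id TEXT, cell_type TEXT)")
con.execute("INSERT INTO binding_data VALUES ('s1', 0.9)")
con.execute("INSERT INTO experiment_metadata VALUES ('s1', 'K562')")

cur = con.execute(
    "SELECT * FROM binding_data LEFT JOIN experiment_metadata USING (sample_id)"
)
columns = [d[0] for d in cur.description]

# USING coalesces the join column, so sample_id is not duplicated in the result
assert columns.count("sample_id") == 1
assert "cell_type" in columns
```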
```python
sql_no_strings = re.sub(r"'[^']*'", "", sql)
sql_no_strings = re.sub(r'"[^"]*"', "", sql_no_strings)
```
**Copilot AI** · Nov 20, 2025
The regex pattern `r"'[^']*'"` used to remove string literals doesn't handle escaped quotes (e.g., `'O\'Brien'`). This could lead to incorrect column extraction when SQL contains escaped quotes in string literals.

Consider using a more robust pattern that handles escaped quotes:

```python
# Remove string literals with escaped quotes handled
sql_no_strings = re.sub(r"'(?:[^'\\]|\\.)*'", "", sql)
sql_no_strings = re.sub(r'"(?:[^"\\]|\\.)*"', "", sql_no_strings)
```

This pattern matches either non-quote/non-backslash characters or any escaped character (a backslash followed by any character).
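The difference between the two patterns is easy to demonstrate (the example string is borrowed from the comment above):

```python
import re

sql = "SELECT name FROM people WHERE name = 'O\\'Brien'"

# The naive pattern stops at the escaped quote, leaving part of the literal behind
naive = re.sub(r"'[^']*'", "", sql)
assert "Brien" in naive

# The robust pattern treats backslash-escaped characters as part of the literal
robust = re.sub(r"'(?:[^'\\]|\\.)*'", "", sql)
assert "Brien" not in robust
```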
```python
raise ValueError(f"No table available for config '{config_name}'")

# Replace config name with actual table name in SQL for user convenience
sql_with_table = sql.replace(config_name, table_name)
```
**Copilot AI** · Nov 20, 2025
Using `str.replace()` on line 731 to replace config names with table names could lead to unintended replacements if the config name appears in string literals or comments, or as a substring of column/table names.

For example:

```python
config_name = "data"
table_name = "metadata_data"
sql = "SELECT * FROM data WHERE description = 'my data'"
# Result: "SELECT * FROM metadata_data WHERE description = 'my metadata_data'"
```

Consider using a regex with word boundaries for more precise replacement:

```python
import re

sql_with_table = re.sub(
    r"\b" + re.escape(config_name) + r"\b",
    table_name,
    sql,
)
```
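Note that word boundaries fix the substring problem but, like plain `replace`, still match inside string literals. A quick check (the identifiers are invented for the demo):

```python
import re

config_name = "data"
sql = "SELECT * FROM data JOIN database_info ON data.id = database_info.id"

# Plain replace corrupts 'database_info' because 'data' is a substring of it
assert "metadata_database_info" in sql.replace(config_name, "metadata_data")

# Word boundaries leave the longer identifier intact
bounded = re.sub(r"\b" + re.escape(config_name) + r"\b", "metadata_data", sql)
assert "database_info" in bounded
assert bounded.startswith("SELECT * FROM metadata_data JOIN database_info")
```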
```python
# Use common columns as join keys (sorted for consistency)
join_keys = sorted(list(common_columns)) if common_columns else None
```
**Copilot AI** · Nov 20, 2025
[nitpick] The join key inference logic doesn't validate that the inferred join keys are appropriate for joining. While the keys are common columns, there's no check to ensure:
- The data types of the join columns match between configs
- The join columns contain compatible values (e.g., not joining an ID column with a boolean column that happens to have the same name)

Consider adding validation, or at least logging warnings, when:
- Join key data types don't match between configs
- Join keys have generic names (e.g., `id` could refer to unrelated entities in the two configs)

This could prevent silent data quality issues from accidental joins on inappropriately named columns.
Suggested change:

```python
# Validate join keys: check that data types match between configs
join_keys = []
data_features = {f.name: f for f in data_config.dataset_info.features}
meta_features = {f.name: f for f in meta_config.dataset_info.features}
for col in sorted(list(common_columns)):
    data_type = getattr(data_features[col], "type", None)
    meta_type = getattr(meta_features[col], "type", None)
    if data_type != meta_type:
        self.logger.warning(
            f"Join key '{col}' has mismatched types between configs: "
            f"{data_config.config_name} ({data_type}) vs {meta_config.config_name} ({meta_type})"
        )
    else:
        join_keys.append(col)
        # Heuristic: warn if join key is generic
        if col.lower() in {"id", "name", "key"}:
            self.logger.warning(
                f"Join key '{col}' is generic and may represent unrelated entities between configs: "
                f"{data_config.config_name} and {meta_config.config_name}"
            )
if not join_keys:
    join_keys = None
```
This is intended to address issue #47. However, nearly all of the new code is Claude-generated and I haven't had a chance to go over it.

It works, so please use this branch (use the GitHub CLI to pull it: `gh pr checkout --repo BrentLab/tfbpapi <pull_number>`). It would be great if either or both of you could go over this code and improve it. The duckdb implementation in particular is important: that is the actual "database" that is being created, and this aspect of `tfbpapi` needs a lot of work.