Join metadata to data in duckdb #48
base: huggingface
Conversation
…eed to check it significantly
Pull Request Overview
This PR implements automatic metadata join functionality for DuckDB queries. The system automatically infers join keys from common column names between data and metadata configurations, enabling users to query metadata fields without manually writing JOIN clauses.
Key changes:
- Added `join_keys` field to `MetadataRelationship` model to store inferred join columns
- Implemented automatic join key inference in `DataCard.get_metadata_relationships()` based on column name intersection
- Enhanced `HfQueryAPI.query()` with automatic metadata join detection and SQL rewriting capabilities
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| tfbpapi/datainfo/models.py | Added join_keys field to MetadataRelationship model; formatting improvements to error messages |
| tfbpapi/datainfo/datacard.py | Implemented join key inference logic from column intersection; formatting improvements |
| tfbpapi/HfQueryAPI.py | Added auto-join functionality with new helper methods for column extraction, metadata matching, and SQL rewriting; updated validation to include metadata columns |
| tfbpapi/tests/test_metadata_joins.py | New comprehensive test suite for metadata join functionality including join key inference and SQL generation |
| tfbpapi/tests/test_filter_with_metadata_joins.py | New test suite for filter validation with metadata columns |
| tfbpapi/tests/test_HfQueryAPI.py | Updated validation tests to use new schema-based approach instead of DataFrame-based |
| docs/huggingface_datacard.md | Added extensive documentation for automatic metadata joins, join key inference, and composite keys |
> Join keys are automatically determined by finding the **intersection of column names** between:
> - The data config's features
> - The metadata config's features
>
> For example:
> - If `binding_data` has columns: `[sample_id, gene_id, binding_score]`
> - And `experiment_metadata` has columns: `[sample_id, cell_type, treatment]`
> - The join key will be: `[sample_id]` (the only common column)
>
> **Important**: Make sure your common columns have the same name in both configs. The system uses exact name matching.
>
> #### Composite Join Keys
>
> When multiple columns are common between configs, **all** common columns are used as join keys. For example:
> - If `annotated_features` has: `[id, batch, regulator_symbol, expression_value]`
> - And `sample_metadata` has: `[id, batch, cell_type, data_usable]`
> - The join keys will be: `[batch, id]` (both common columns)
>
> The system uses the SQL `USING` clause for joins, which automatically deduplicates the join key columns in the result. This means you won't see duplicate columns like `id` and `id_1` in your results.
**Copilot AI** · Nov 20, 2025
The sections "How Join Keys Are Inferred" (lines 305-316) and "Composite Join Keys" (lines 318-325) duplicate content already covered earlier in the document (around lines 160-248). The duplication should be removed to avoid confusion and reduce redundancy.
```python
column_patterns = [
    r"\b(?:SELECT|WHERE|AND|OR|ON|GROUP BY|ORDER BY|HAVING)\s+[\w.]+",
    r"[\w.]+\s*(?:=|!=|<>|<|>|<=|>=|LIKE|IN|IS)",
    r"AS\s+([\w.]+)",
]
```
**Copilot AI** · Nov 20, 2025
The regex pattern `r"\b(?:SELECT|WHERE|AND|OR|ON|GROUP BY|ORDER BY|HAVING)\s+[\w.]+"` only captures a single identifier after the keyword. This will miss columns in queries like `SELECT col1, col2, col3` (it will only capture `col1`).

Additionally, `SELECT *` will capture `*` as a column name, which should be filtered out. Consider adding `*` to the `sql_keywords` set on line 906, or handle it separately in the extraction logic.
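One way to address both points is to capture the whole `SELECT` list and filter out `*`. This is an illustrative sketch, not the PR's actual extraction helper:

```python
import re

def extract_select_columns(sql: str) -> set[str]:
    """Sketch: pull every identifier from the SELECT list, skipping '*'."""
    match = re.search(r"\bSELECT\s+(.*?)\s+FROM\b", sql, re.IGNORECASE | re.DOTALL)
    if not match:
        return set()
    columns = set()
    for token in match.group(1).split(","):
        token = token.strip()
        if not token or token == "*":
            continue
        # Keep only the leading identifier (drops aliases such as "col AS c")
        columns.add(token.split()[0])
    return columns

assert extract_select_columns("SELECT col1, col2, col3 FROM t") == {"col1", "col2", "col3"}
assert extract_select_columns("SELECT * FROM t") == set()
```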
```python
if metadata_matches:
    # Load metadata views and build JOIN clauses
    metadata_joins = []
    for metadata_config, join_keys in metadata_matches:
        metadata_table = self._load_metadata_view(
            metadata_config, refresh_cache=refresh_cache
        )
        metadata_joins.append(
            (metadata_config, metadata_table, join_keys)
        )
        self.logger.info(
            f"Auto-joining metadata '{metadata_config}' "
            f"on keys: {join_keys}"
        )

    # Rewrite SQL to include JOINs
    sql_with_table = self._build_join_sql(
        sql_with_table, table_name, metadata_joins
    )
```
**Copilot AI** · Nov 20, 2025
[nitpick] When multiple metadata configs contain the same column name, the current implementation will join all of them. This could lead to ambiguous column references in the final result if multiple metadata tables have non-join columns with the same name.

For example, if both `experiment_metadata` and `sample_metadata` have a `notes` column, the query `SELECT notes FROM data` would be ambiguous after both metadata tables are joined.

Consider:
- Detecting and warning about column name conflicts between metadata tables
- Using table aliases in JOIN clauses to make column references unambiguous
- Documenting this limitation in the docstring or user guide

This is a limitation of the automatic join approach that users should be aware of.
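A conflict check along the lines suggested could look like the following sketch (the schema mapping is a hypothetical structure for illustration; the PR stores config schemas differently):

```python
from collections import Counter

def find_column_conflicts(metadata_schemas, join_keys):
    """Return non-join column names that appear in more than one metadata table."""
    counts = Counter(
        col
        for columns in metadata_schemas.values()
        for col in columns
        if col not in join_keys
    )
    return {col for col, n in counts.items() if n > 1}

conflicts = find_column_conflicts(
    {
        "experiment_metadata": ["sample_id", "notes", "cell_type"],
        "sample_metadata": ["sample_id", "notes", "batch"],
    },
    join_keys={"sample_id"},
)
assert conflicts == {"notes"}  # 'notes' would be ambiguous after both joins
```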
```python
final_sql = (
    sql_before
    + "".join(join_clauses)
    + " "
    + sql_after[:insert_position].strip()
    + " "
    + sql_after[insert_position:]
)
```
**Copilot AI** · Nov 20, 2025
Potential issue with SQL formatting: the logic constructs the final SQL with multiple string concatenations that may produce extra or missing spaces. Consider the case where `sql_after[:insert_position].strip()` is an empty string; this results in two consecutive spaces before `sql_after[insert_position:]`.

For example, if the original SQL is `SELECT * FROM table WHERE x = 1`, the result might be `SELECT * FROM table LEFT JOIN meta USING (id)  WHERE x = 1` (note the double space before `WHERE`).

Consider using a cleaner approach:

```python
parts = [
    sql_before,
    "".join(join_clauses),
    sql_after[:insert_position].strip(),
    sql_after[insert_position:],
]
final_sql = " ".join(filter(None, parts))
return final_sql.strip()
```
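The double-space problem and the suggested fix can be verified in isolation (the variable values below are made up for the demonstration):

```python
sql_before = "SELECT * FROM table"
join_clauses = [" LEFT JOIN meta USING (id)"]
sql_after = " WHERE x = 1"
insert_position = 0  # nothing between the joins and the WHERE clause

# Naive concatenation: the empty middle slice leaves consecutive spaces
naive = (
    sql_before
    + "".join(join_clauses)
    + " "
    + sql_after[:insert_position].strip()
    + " "
    + sql_after[insert_position:]
)
assert "  " in naive

# filter(None, ...) drops the empty fragments before joining
parts = [
    sql_before,
    "".join(join_clauses).strip(),
    sql_after[:insert_position].strip(),
    sql_after[insert_position:].strip(),
]
fixed = " ".join(filter(None, parts))
assert fixed == "SELECT * FROM table LEFT JOIN meta USING (id) WHERE x = 1"
```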
```python
# 3. Identifies 'sample_id' as the common column for joining
# 4. Loads the metadata view
# 5. Rewrites the SQL to: SELECT * FROM binding_data
#    LEFT JOIN experiment_metadata ON binding_data.sample_id = experiment_metadata.sample_id
```
**Copilot AI** · Nov 20, 2025
The documentation example shows the SQL rewrite using `ON binding_data.sample_id = experiment_metadata.sample_id`, but the actual implementation in `_build_join_sql` (line 1046) uses a `USING (sample_id)` clause. The documentation should be updated to reflect the actual implementation:

```sql
SELECT * FROM binding_data
LEFT JOIN experiment_metadata USING (sample_id)
WHERE cell_type = 'K562'
```

The `USING` clause is actually a better choice (as it deduplicates join columns), but the documentation needs to match the implementation.
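The deduplication behavior of `USING` is standard SQL, so it can be illustrated with the stdlib `sqlite3` module (DuckDB follows the same rule; the table and column names are taken from the example above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE binding_data (sample_id TEXT, binding_score REAL)")
con.execute("CREATE TABLE experiment_metadata (sample_id TEXT, cell_type TEXT)")
con.execute("INSERT INTO binding_data VALUES ('s1', 0.9)")
con.execute("INSERT INTO experiment_metadata VALUES ('s1', 'K562')")

cur = con.execute(
    "SELECT * FROM binding_data LEFT JOIN experiment_metadata USING (sample_id)"
)
columns = [d[0] for d in cur.description]

# USING coalesces the join column, so sample_id is not duplicated in the result
assert columns.count("sample_id") == 1
assert "cell_type" in columns
```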
```python
sql_no_strings = re.sub(r"'[^']*'", "", sql)
sql_no_strings = re.sub(r'"[^"]*"', "", sql_no_strings)
```
**Copilot AI** · Nov 20, 2025
The regex pattern `r"'[^']*'"` used to remove string literals doesn't handle escaped quotes (e.g., `'O\'Brien'`). This could lead to incorrect column extraction when SQL contains escaped quotes in string literals.

Consider using a more robust pattern that handles escaped quotes:

```python
# Remove string literals with escaped quotes handled
sql_no_strings = re.sub(r"'(?:[^'\\]|\\.)*'", "", sql)
sql_no_strings = re.sub(r'"(?:[^"\\]|\\.)*"', "", sql_no_strings)
```

This pattern matches either non-quote/non-backslash characters or any escaped character (a backslash followed by any character).
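The difference between the two patterns is easy to demonstrate (the example string is borrowed from the comment above):

```python
import re

sql = "SELECT name FROM people WHERE name = 'O\\'Brien'"

# The naive pattern stops at the escaped quote, leaving part of the literal behind
naive = re.sub(r"'[^']*'", "", sql)
assert "Brien" in naive

# The robust pattern treats backslash-escaped characters as part of the literal
robust = re.sub(r"'(?:[^'\\]|\\.)*'", "", sql)
assert "Brien" not in robust
```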
```python
raise ValueError(f"No table available for config '{config_name}'")

# Replace config name with actual table name in SQL for user convenience
sql_with_table = sql.replace(config_name, table_name)
```
**Copilot AI** · Nov 20, 2025
Using `str.replace()` on line 731 to replace config names with table names could lead to unintended replacements if the config name appears in string literals or comments, or as a substring of column/table names.

For example:

```python
config_name = "data"
table_name = "metadata_data"
sql = "SELECT * FROM data WHERE description = 'my data'"
# Result: "SELECT * FROM metadata_data WHERE description = 'my metadata_data'"
```

Consider using a regex with word boundaries for more precise replacement:

```python
import re

sql_with_table = re.sub(
    r"\b" + re.escape(config_name) + r"\b",
    table_name,
    sql,
)
```
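Note that word boundaries fix the substring problem but, like plain `replace`, still match inside string literals. A quick check (the identifiers are invented for the demo):

```python
import re

config_name = "data"
sql = "SELECT * FROM data JOIN database_info ON data.id = database_info.id"

# Plain replace corrupts 'database_info' because 'data' is a substring of it
assert "metadata_database_info" in sql.replace(config_name, "metadata_data")

# Word boundaries leave the longer identifier intact
bounded = re.sub(r"\b" + re.escape(config_name) + r"\b", "metadata_data", sql)
assert "database_info" in bounded
assert bounded.startswith("SELECT * FROM metadata_data JOIN database_info")
```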
```python
# Use common columns as join keys (sorted for consistency)
join_keys = sorted(list(common_columns)) if common_columns else None
```
**Copilot AI** · Nov 20, 2025
[nitpick] The join key inference logic doesn't validate that the inferred join keys are appropriate for joining. While the keys are common columns, there's no check to ensure:
- The data types of the join columns match between configs
- The join columns contain compatible values (e.g., not joining an ID column with a boolean column that happens to have the same name)

Consider adding validation, or at least logging warnings, when:
- Join key data types don't match between configs
- Join keys have generic names (e.g., `id` could refer to unrelated entities in the two configs)

This could prevent silent data quality issues from accidental joins on inappropriately named columns.
Suggested change:

```python
# Validate join keys: check that data types match between configs
join_keys = []
data_features = {f.name: f for f in data_config.dataset_info.features}
meta_features = {f.name: f for f in meta_config.dataset_info.features}
for col in sorted(list(common_columns)):
    data_type = getattr(data_features[col], "type", None)
    meta_type = getattr(meta_features[col], "type", None)
    if data_type != meta_type:
        self.logger.warning(
            f"Join key '{col}' has mismatched types between configs: "
            f"{data_config.config_name} ({data_type}) vs {meta_config.config_name} ({meta_type})"
        )
    else:
        join_keys.append(col)
        # Heuristic: warn if join key is generic
        if col.lower() in {"id", "name", "key"}:
            self.logger.warning(
                f"Join key '{col}' is generic and may represent unrelated entities between configs: "
                f"{data_config.config_name} and {meta_config.config_name}"
            )
if not join_keys:
    join_keys = None
```
This is intended to address issue #47. However, nearly all of the new code is Claude-generated and I haven't had a chance to go over it.

It works, so please use this branch (use the GitHub CLI to pull it: `gh pr checkout --repo BrentLab/tfbpapi <pull_number>`). It would be great if either or both of you could go over this code and improve it. The duckdb implementation in particular is important: that is the actual "database" that is being created, and this aspect of `tfbpapi` needs a lot of work.