Skip to content

Fixes 21329: exclude temporal table period columns from autoClassification sampling#27960

Open
edg956 wants to merge 4 commits intomainfrom
issue-21329
Open

Fixes 21329: exclude temporal table period columns from autoClassification sampling#27960
edg956 wants to merge 4 commits intomainfrom
issue-21329

Conversation

@edg956
Copy link
Copy Markdown
Contributor

@edg956 edg956 commented May 7, 2026

Summary

  • Fixes list index out of range error in autoClassification ingestion for AzureSQL tables with Temporal Tables enabled (issue autoClassification ingestion fails due to AzureSQL Temporal Table fields #21329)
  • SQL Server temporal tables add GENERATED ALWAYS AS ROW START/END period columns (ValidFrom/ValidTo) that break the autoClassification sampler in two ways: TABLESAMPLE is not supported on temporal tables (SQL Server error 13541), and pyodbc cannot reliably handle period column types
  • Fix 1 (mssql/utils.py): filters generated_always_type IN (1, 2) in get_columns so temporal period columns are never stored in entity metadata
  • Fix 2 (azuresql/sampler.py): adds _get_temporal_column_names() querying sys.columns to detect and exclude temporal period columns in fetch_sample_data, as a defense-in-depth guard

Test plan

  • Unit test added to test_azuresql_sampling.py: verifies fetch_sample_data filters out columns returned by _get_temporal_column_names (mocked)
  • PII processor test case added (test_cases/azuresql_temporal_table.py): table entity has 5 columns including ValidFrom/ValidTo, sample data has only the 3 regular columns — verifies the processor handles this correctly and detects PII on email
  • Integration test added (test_azuresql_temporal_table.py): skipped unless AZURE_SQL_HOST, AZURE_SQL_DATABASE, AZURE_SQL_USERNAME, AZURE_SQL_PASSWORD env vars are set; creates a real temporal table and asserts temporal columns are absent from sample data

Closes #21329

🤖 Generated with Claude Code


Summary by Gitar

  • Dependency management:
    • Added spacy to ingestion/setup.py to pin compatibility and resolve issues with spacy<3.8.
  • Database ingestion/sampling fixes:
    • Modified mssql/utils.py to filter out temporal period columns using generated_always_type.
    • Added _get_temporal_column_names to AzureSQLSampler to identify and exclude period columns from fetch_sample_data.
  • Testing:
    • Added integration tests for AzureSQL temporal tables covering metadata and classification workflows.
    • Added unit tests to verify proper column exclusion and handling of empty column sets in AzureSQLSampler.
    • Added PII processor test case for temporal table scenarios.

This will update automatically on new commits.

@edg956 edg956 requested a review from a team as a code owner May 7, 2026 10:04
@github-actions github-actions Bot added Ingestion safe to test Add this label to run secure Github workflows on PRs labels May 7, 2026
Comment thread ingestion/src/metadata/sampler/sqlalchemy/azuresql/sampler.py Outdated
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

Comment thread ingestion/src/metadata/sampler/sqlalchemy/azuresql/sampler.py Outdated
@edg956 edg956 changed the title fix(azuresql): exclude temporal table period columns from autoClassification sampling Fixes 21329: exclude temporal table period columns from autoClassification sampling May 7, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 7, 2026

🟡 Playwright Results — all passed (15 flaky)

✅ 4013 passed · ❌ 0 failed · 🟡 15 flaky · ⏭️ 86 skipped

Shard Passed Failed Flaky Skipped
✅ Shard 1 299 0 0 4
🟡 Shard 2 751 0 4 8
🟡 Shard 3 756 0 3 7
🟡 Shard 4 788 0 2 18
🟡 Shard 5 686 0 1 41
🟡 Shard 6 733 0 5 8
🟡 15 flaky test(s) (passed on retry)
  • Features/ActivityAPI.spec.ts › Activity event shows the actor who made the change (shard 2, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/DataQuality/BundleSuiteBulkOperations.spec.ts › Create new Bundle Suite with bulk selected test cases (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 2 retries)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Flow/ObservabilityAlerts.spec.ts › Alert operations for a user with and without permissions (shard 3, 2 retries)
  • Flow/PersonaDeletionUserProfile.spec.ts › User profile loads correctly before and after persona deletion (shard 3, 1 retry)
  • Pages/Domains.spec.ts › Domain Rbac (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tag Add, Update and Remove for child entities (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Announcement create, edit & delete (shard 5, 1 retry)
  • Pages/GlossaryImportExport.spec.ts › Glossary CSV import preserves typed relations (shard 6, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for mlModel -> mlModel (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/UserDetails.spec.ts › Admin user can edit teams from the user profile (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@edg956 edg956 added To release Will cherry-pick this PR into the release branch connectors labels May 8, 2026
edg956 and others added 4 commits May 8, 2026 18:24
Query sys.columns for generated_always_type to detect SYSTEM_TIME period
columns (ValidFrom/ValidTo) and skip them in both schema reflection
(mssql/utils.py) and sample data fetching (AzureSQLSampler). Also moves
the catalog round-trip inside the `if columns` guard to avoid the query
when column filtering is not in use.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds sampler unit tests covering period-column filtering and NOT_COMPUTE_PYODBC
exclusion. Adds a PII processor test case for temporal tables using single
first-names to avoid non-deterministic NER matches. Corrects customers_sensitive
expected tags to include address→PII.NonSensitive, which the classifier now
correctly detects.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replaces the isolated sampler unit test with an end-to-end integration test
that registers the AzureSQL service, creates a system-versioned table, runs
MetadataWorkflow then AutoClassificationWorkflow, and asserts that sample
data excludes ValidFrom/ValidTo. Includes SQL permission prerequisites and
troubleshooting guide in the module docstring. Teardown controlled by
AZURE_SQL_CLEANUP env var (default: true).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 8, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Excludes temporal table period columns from autoClassification sampling to resolve SQL Server compatibility issues and pins spacy versioning. Addresses bypass logic for empty column lists and redundant database queries.

✅ 2 resolved
Bug: Filtered columns bypass when all are excluded (empty list falsy)

📄 ingestion/src/metadata/sampler/sqlalchemy/azuresql/sampler.py:87
In fetch_sample_data, if columns is a non-empty list but every column is filtered out (all temporal or all in NOT_COMPUTE_PYODBC), sqa_columns ends up as []. Because empty list is falsy in Python, sqa_columns or columns evaluates to columns — the original unfiltered list — completely defeating the filter. This means the temporal/unsupported columns will still be passed to super().fetch_sample_data(), likely causing the same pyodbc error this PR aims to fix.

The fix should use an explicit None check rather than relying on truthiness.

Performance: Unnecessary DB query when columns is None

📄 ingestion/src/metadata/sampler/sqlalchemy/azuresql/sampler.py:78
_get_temporal_column_names() executes a database round-trip (opening a new session) on every call to fetch_sample_data, even when columns is None. When columns is None, the temporal column names are never used since the filtering loop is skipped entirely. Moving the call inside the if columns: block would avoid the unnecessary query.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 8, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

connectors Ingestion safe to test Add this label to run secure Github workflows on PRs To release Will cherry-pick this PR into the release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

autoClassification ingestion fails due to AzureSQL Temporal Table fields

1 participant