Skip to content

Remove supported entity field for context, exact terms, custom and pattern recognizers#25720

Open
edg956 wants to merge 5 commits intomainfrom
recognizers-without-supported-entity
Open

Remove supported entity field for context, exact terms, custom and pattern recognizers#25720
edg956 wants to merge 5 commits intomainfrom
recognizers-without-supported-entity

Conversation

@edg956
Copy link
Contributor

@edg956 edg956 commented Feb 5, 2026

Describe your changes:

This removes fields we agreed we wouldn't need in recognizers anymore

Type of change:

  • Bug fix

Checklist:


Summary by Gitar

  • Removes redundant supportedEntity field from four PII recognizer schemas (PatternRecognizer, ContextRecognizer, ExactTermsRecognizer, CustomRecognizer), simplifying the configuration model
  • Replaces enum-based entity identification with tag FQN - recognizers now use the tag's fully qualified name (e.g., "PII.EmailAddress") instead of PIIEntity enum values
  • Adds v1.1.20 database migrations for MySQL and PostgreSQL to update existing tag recognizers in the database with the new structure
  • Updates Python recognizer factory to accept tag_fqn parameter and pass it to Presidio recognizers as supported_entity
  • Regenerates TypeScript models to reflect schema changes across the frontend codebase

@edg956 edg956 requested review from a team as code owners February 5, 2026 19:35
@edg956 edg956 added the safe to test Add this label to run secure Github workflows on PRs label Feb 5, 2026
@edg956 edg956 added WIP and removed Ingestion labels Feb 5, 2026
@edg956 edg956 changed the title Remove supported entity field for context, exact terms, custom and pattern recognizers [WIP] Remove supported entity field for context, exact terms, custom and pattern recognizers Feb 5, 2026
@github-actions
Copy link
Contributor

github-actions bot commented Feb 5, 2026

TypeScript types have been updated based on the JSON schema changes in the PR

@github-actions github-actions bot requested a review from a team as a code owner February 5, 2026 19:40
@github-actions
Copy link
Contributor

github-actions bot commented Feb 5, 2026

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 6, 2026

TypeScript types have been updated based on the JSON schema changes in the PR

@edg956
Copy link
Contributor Author

edg956 commented Feb 6, 2026

@gitar-bot review

@github-actions
Copy link
Contributor

github-actions bot commented Feb 6, 2026

TypeScript types have been updated based on the JSON schema changes in the PR

@gitar-bot
Copy link

gitar-bot bot commented Feb 6, 2026

🔍 CI failure analysis for 7847f87: Three distinct failures: (1) Python unit test across 6 runs (test needs fullyQualifiedName - PR-related), (2) Playwright E2E test on KPI widget (environmental/flaky), (3) Maven AwsCredentialsUtilTest errors (environmental - AWS config, unrelated to PR).

Three Distinct CI Failures

Failure 1: Python Unit Test (PR-Related - Requires Fix)

Test: test_presidio_recognizer_factory.py::test_create_recognizers_for_tag_multiple

Error: assert len(recognizers) == 2 → got 0

Affected: 6 runs across Python 3.10 (5 runs) and 3.11 (1 run)

Root Cause: Test creates Tag without fullyQualifiedName. New None validation returns empty list.

Fix: Add fullyQualifiedName="PII.Sensitive.test_tag" to test Tag object

Status: Deterministic test bug

Failure 2: Playwright E2E Test (Environmental)

Job: 62785804406 - playwright-ci-postgresql (shard 2/6)

Results: 363 passed / 1 failed / 20 skipped

Test: Flow-CustomizeWidgets-KPI-Widget

Error: expect(locator).toBeVisible() failed

Status: Flaky E2E test unrelated to PR's backend changes

Failure 3: Maven Java Tests (Environmental - Unrelated)

Job: 62785836393 - maven-postgresql-ci

Results: 7,918 passed / 3 errors / 701 skipped

Failed Tests: All in AwsCredentialsUtilTest:

  • testBuildCredentialsProviderWithEmptyCredentials (line 77)
  • testBuildCredentialsProviderWithNoCredentials (line 52)
  • testBuildCredentialsProviderWithOnlyAccessKey (line 64)

Error: IllegalArgumentException: AWS credentials not configured

Assessment: AWS credential configuration tests unrelated to PR changes. This PR modifies:

  • PII recognizer schemas (removing supportedEntity)
  • Python recognizer factory logic
  • Database migrations for tag recognizers
  • TypeScript types

None affect AWS credential utility classes.

Status: Environmental infrastructure/configuration issue

Code Review ✅ Approved 7 resolved / 7 findings

Clean removal of the supportedEntity field from recognizer schemas, replacing it with tag FQN-based identification. Previous findings have been addressed or dismissed per user feedback. No new issues found.

✅ 7 resolved
Edge Case: Fallback to "Unknown" entity type creates silent misclassification

📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:249 📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:282 📄 ingestion/src/metadata/pii/scanners/custom_ner_scanner.py:219
When tag.fullyQualifiedName is None, the code falls back to "Unknown" as the tag FQN (lines 249 and 282). This value is then passed through to Presidio as supported_entity, meaning:

  1. Presidio recognizers will report entity type "Unknown" in analysis results
  2. In custom_ner_scanner.py line 219, tag.fullyQualifiedName == entity_type will never match "Unknown" for any real tag, so detected entities will be silently dropped

If fullyQualifiedName is None, it's likely a data integrity issue that should be logged as a warning and the tag should be skipped rather than creating a recognizer with an unmatchable entity type. Consider:

name = tag.fullyQualifiedName
if not name:
    logger.warning("Tag has no fullyQualifiedName, skipping recognizer creation")
    return recognizers
Quality: Parameter named name should be tag_fqn for consistency

📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:43-44
The public method create_recognizer uses the parameter name name, but all internal helper methods (_create_pattern_recognizer, _create_exact_terms_recognizer, _create_context_recognizer) use tag_fqn. The docstring also says "fully qualified name of the tag", and the callers set it to tag.fullyQualifiedName.

Using name is ambiguous (could be confused with the recognizer's own name) and inconsistent with the rest of the codebase. Suggest renaming:

def create_recognizer(
    recognizer_config: Recognizer, tag_fqn: str = "Unknown"
) -> Optional[EntityRecognizer]:
Bug: Tuple assignment instead of string: tag_fqn = str, ...

📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:282
Line 282 reads:

tag_fqn = str, tag.fullyQualifiedName or "Unknown"

This is a broken refactor of the original tag_fqn = cast(str, tag.fullyQualifiedName). The comma makes this a tuple assignment: tag_fqn becomes (<class 'str'>, "PII.Sensitive") instead of the string "PII.Sensitive".

This tag_fqn is then used as a dictionary key at line 283-284:

self.recognizers[tag_fqn] = ...

And later in lookups, which will silently fail to match since keys are tuples but lookups would use strings.

Fix: Change to match the pattern used on line 249:

tag_fqn = tag.fullyQualifiedName or "Unknown"
Edge Case: Unsafe cast of potentially None fullyQualifiedName

📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:251
The code uses cast(str, tag.fullyQualifiedName) which is a type hint only and doesn't perform runtime validation. If tag.fullyQualifiedName is None, this will pass None to create_recognizer() which expects a str, potentially causing errors downstream when Presidio tries to use it as the supported_entity.

Impact: Tags without a fullyQualifiedName would cause unexpected behavior or errors in PII detection.

Suggested fix:

if tag.fullyQualifiedName:
    recognizer = PresidioRecognizerFactory.create_recognizer(
        recognizer_config, str(tag.fullyQualifiedName)
    )
else:
    logger.warning(f"Tag {tag.name} missing fullyQualifiedName, skipping recognizer creation")
    continue

Alternatively, use an assertion or explicit check before the cast.

Bug: Missing required tag_fqn argument in invalid predefined test

📄 ingestion/tests/unit/pii/algorithms/test_presidio_recognizer_factory.py:204 📄 ingestion/tests/unit/pii/algorithms/test_presidio_recognizer_factory.py:199
The test test_create_predefined_recognizer_invalid_name also calls PresidioRecognizerFactory.create_recognizer() without the required tag_fqn parameter (visible at line 204 of the original file, though not shown in the diff as it wasn't modified).

Based on the pattern of changes in this PR, this test was likely missed and will fail with:
"TypeError: create_recognizer() missing 1 required positional argument: 'tag_fqn'"

Suggested fix:
Add the tag_fqn parameter to all remaining create_recognizer() calls in the test file that weren't updated in this PR.

  • ✅ Predefined recognizer test lost supported_entities assertion
  • ✅ v1120 migration references v1110 MigrationUtil directly

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@edg956
Copy link
Contributor Author

edg956 commented Feb 6, 2026

@gitar-bot

Code Review 👍 Approved with suggestions 4 resolved / 7 findings
Clean removal of redundant supportedEntity field with proper runtime replacement via tag_fqn. Previous parameter naming issue is resolved; other prior findings (test assertion gap, "Unknown" default, cross-version migration reference) remain unaddressed.

⚠️ Edge Case: Fallback to "Unknown" entity type creates silent misclassification
📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:249 📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:282 📄 ingestion/src/metadata/pii/scanners/custom_ner_scanner.py:219

When tag.fullyQualifiedName is None, the code falls back to "Unknown" as the tag FQN (lines 249 and 282). This value is then passed through to Presidio as supported_entity, meaning:

  1. Presidio recognizers will report entity type "Unknown" in analysis results
  2. In custom_ner_scanner.py line 219, tag.fullyQualifiedName == entity_type will never match "Unknown" for any real tag, so detected entities will be silently dropped

If fullyQualifiedName is None, it's likely a data integrity issue that should be logged as a warning and the tag should be skipped rather than creating a recognizer with an unmatchable entity type. Consider:

name = tag.fullyQualifiedName
if not name:
    logger.warning("Tag has no fullyQualifiedName, skipping recognizer creation")
    return recognizers

This is already applied

⚠️ Edge Case: Predefined recognizer test lost supported_entities assertion
📄 ingestion/tests/unit/pii/algorithms/test_presidio_recognizer_factory.py:193-198

The test test_create_predefined_recognizer removed the assertion assert "EMAIL_ADDRESS" in result.supported_entities but did not replace it with any equivalent assertion. The test now only checks isinstance(result, EntityRecognizer) which is a much weaker assertion.

Since predefined recognizers (like Presidio's built-in EmailRecognizer) have their own supported_entities defined internally, the test should still verify that the recognizer was correctly created with the expected entity type. The tag_fqn parameter "Some.Tag" is passed but it's unclear whether predefined recognizers use it or their own built-in entity type.

Suggested fix: Add back an assertion verifying result.supported_entities contains the expected value (either the built-in entity type or the passed tag FQN, depending on intended behavior).

This is on purpose. We don't care what the supported entities in the recognizer are because our classification mechanism already maps tags to recognizers

💡 Quality: v1120 migration references v1110 MigrationUtil directly
📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/mysql/v1120/Migration.java:19 📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/postgres/v1120/Migration.java:19

Both the MySQL and PostgreSQL v1120 Migration classes instantiate org.openmetadata.service.migration.utils.v1110.MigrationUtil. While reusing utilities across migration versions can work, referencing a previous version's utility class by its fully qualified name rather than extracting shared logic into a common location creates a fragile cross-version dependency.

If v1110's MigrationUtil is refactored or removed in the future, v1120 migrations will break. Consider either:

  1. Moving the shared setRecognizersForSensitiveTags logic to a version-independent utility, or
  2. Adding a comment explaining why v1110's utility is deliberately reused here.

This is a minor concern since migrations are typically run once and frozen.

For this one-off things I'd prefer not to extract for now. If I've got to do this again then I'll do it in the TagDAO

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

governance Ingestion safe to test Add this label to run secure Github workflows on PRs To release Will cherry-pick this PR into the release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant