Remove supported entity field for context, exact terms, custom and pattern recognizers by edg956 · Pull Request #25720 · open-metadata/OpenMetadata

edg956 · 2026-02-05T19:35:34Z

Describe your changes:

This removes fields we agreed we wouldn't need in recognizers anymore

Type of change:

Bug fix

Checklist:

I have read the CONTRIBUTING document.

Summary by Gitar

Removes redundant supportedEntity field from four PII recognizer schemas (PatternRecognizer, ContextRecognizer, ExactTermsRecognizer, CustomRecognizer), simplifying the configuration model
Replaces enum-based entity identification with tag FQN - recognizers now use the tag's fully qualified name (e.g., "PII.EmailAddress") instead of PIIEntity enum values
Adds v1.1.20 database migrations for MySQL and PostgreSQL to update existing tag recognizers in the database with the new structure
Updates Python recognizer factory to accept tag_fqn parameter and pass it to Presidio recognizers as supported_entity
Regenerates TypeScript models to reflect schema changes across the frontend codebase

github-actions · 2026-02-05T19:40:21Z

TypeScript types have been updated based on the JSON schema changes in the PR

github-actions · 2026-02-05T19:42:20Z

The Python checkstyle failed.

Please run make py_format and py_format_check in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with make install_test precommit_install.

ingestion/tests/unit/pii/algorithms/test_presidio_recognizer_factory.py

ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py

github-actions · 2026-02-06T18:23:03Z

TypeScript types have been updated based on the JSON schema changes in the PR

ingestion/tests/unit/pii/algorithms/test_presidio_recognizer_factory.py

edg956 · 2026-02-06T18:30:46Z

@gitar-bot review

github-actions · 2026-02-06T18:31:30Z

TypeScript types have been updated based on the JSON schema changes in the PR

gitar-bot · 2026-02-06T18:32:22Z

🔍 CI failure analysis for 7847f87: Three distinct failures: (1) Python unit test across 6 runs (test needs fullyQualifiedName - PR-related), (2) Playwright E2E test on KPI widget (environmental/flaky), (3) Maven AwsCredentialsUtilTest errors (environmental - AWS config, unrelated to PR).

Three Distinct CI Failures

Failure 1: Python Unit Test (PR-Related - Requires Fix)

Test: test_presidio_recognizer_factory.py::test_create_recognizers_for_tag_multiple

Error: assert len(recognizers) == 2 → got 0

Affected: 6 runs across Python 3.10 (5 runs) and 3.11 (1 run)

Root Cause: Test creates Tag without fullyQualifiedName. New None validation returns empty list.

Fix: Add fullyQualifiedName="PII.Sensitive.test_tag" to test Tag object

Status: Deterministic test bug

Failure 2: Playwright E2E Test (Environmental)

Job: 62785804406 - playwright-ci-postgresql (shard 2/6)

Results: 363 passed / 1 failed / 20 skipped

Test: Flow-CustomizeWidgets-KPI-Widget

Error: expect(locator).toBeVisible() failed

Status: Flaky E2E test unrelated to PR's backend changes

Failure 3: Maven Java Tests (Environmental - Unrelated)

Job: 62785836393 - maven-postgresql-ci

Results: 7,918 passed / 3 errors / 701 skipped

Failed Tests: All in AwsCredentialsUtilTest:

testBuildCredentialsProviderWithEmptyCredentials (line 77)
testBuildCredentialsProviderWithNoCredentials (line 52)
testBuildCredentialsProviderWithOnlyAccessKey (line 64)

Error: IllegalArgumentException: AWS credentials not configured

Assessment: AWS credential configuration tests unrelated to PR changes. This PR modifies:

PII recognizer schemas (removing supportedEntity)
Python recognizer factory logic
Database migrations for tag recognizers
TypeScript types

None affect AWS credential utility classes.

Status: Environmental infrastructure/configuration issue

Code Review ✅ Approved 7 resolved / 7 findings

Clean removal of the supportedEntity field from recognizer schemas, replacing it with tag FQN-based identification. Previous findings have been addressed or dismissed per user feedback. No new issues found.

✅ 7 resolved

✅ Edge Case: Fallback to "Unknown" entity type creates silent misclassification

📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:249 📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:282 📄 ingestion/src/metadata/pii/scanners/custom_ner_scanner.py:219
When tag.fullyQualifiedName is None, the code falls back to "Unknown" as the tag FQN (lines 249 and 282). This value is then passed through to Presidio as supported_entity, meaning:

Presidio recognizers will report entity type "Unknown" in analysis results

In custom_ner_scanner.py line 219, tag.fullyQualifiedName == entity_type will never match "Unknown" for any real tag, so detected entities will be silently dropped

If fullyQualifiedName is None, it's likely a data integrity issue that should be logged as a warning and the tag should be skipped rather than creating a recognizer with an unmatchable entity type. Consider:
name = tag.fullyQualifiedName
if not name:
    logger.warning("Tag has no fullyQualifiedName, skipping recognizer creation")
    return recognizers

✅ Quality: Parameter named name should be tag_fqn for consistency

📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:43-44
The public method create_recognizer uses the parameter name name, but all internal helper methods (_create_pattern_recognizer, _create_exact_terms_recognizer, _create_context_recognizer) use tag_fqn. The docstring also says "fully qualified name of the tag", and the callers set it to tag.fullyQualifiedName.

Using name is ambiguous (could be confused with the recognizer's own name) and inconsistent with the rest of the codebase. Suggest renaming:
def create_recognizer(
    recognizer_config: Recognizer, tag_fqn: str = "Unknown"
) -> Optional[EntityRecognizer]:

✅ Bug: Tuple assignment instead of string: tag_fqn = str, ...

📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:282
Line 282 reads:
tag_fqn = str, tag.fullyQualifiedName or "Unknown"
This is a broken refactor of the original tag_fqn = cast(str, tag.fullyQualifiedName). The comma makes this a tuple assignment: tag_fqn becomes (<class 'str'>, "PII.Sensitive") instead of the string "PII.Sensitive".

This tag_fqn is then used as a dictionary key at line 283-284:
self.recognizers[tag_fqn] = ...
And later in lookups, which will silently fail to match since keys are tuples but lookups would use strings.

Fix: Change to match the pattern used on line 249:
tag_fqn = tag.fullyQualifiedName or "Unknown"

✅ Edge Case: Unsafe cast of potentially None fullyQualifiedName

📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:251
The code uses cast(str, tag.fullyQualifiedName) which is a type hint only and doesn't perform runtime validation. If tag.fullyQualifiedName is None, this will pass None to create_recognizer() which expects a str, potentially causing errors downstream when Presidio tries to use it as the supported_entity.

Impact: Tags without a fullyQualifiedName would cause unexpected behavior or errors in PII detection.

Suggested fix:
if tag.fullyQualifiedName:
    recognizer = PresidioRecognizerFactory.create_recognizer(
        recognizer_config, str(tag.fullyQualifiedName)
    )
else:
    logger.warning(f"Tag {tag.name} missing fullyQualifiedName, skipping recognizer creation")
    continue
Alternatively, use an assertion or explicit check before the cast.

✅ Bug: Missing required tag_fqn argument in invalid predefined test

📄 ingestion/tests/unit/pii/algorithms/test_presidio_recognizer_factory.py:204 📄 ingestion/tests/unit/pii/algorithms/test_presidio_recognizer_factory.py:199
The test test_create_predefined_recognizer_invalid_name also calls PresidioRecognizerFactory.create_recognizer() without the required tag_fqn parameter (visible at line 204 of the original file, though not shown in the diff as it wasn't modified).

Based on the pattern of changes in this PR, this test was likely missed and will fail with:
"TypeError: create_recognizer() missing 1 required positional argument: 'tag_fqn'"

Suggested fix:
Add the tag_fqn parameter to all remaining create_recognizer() calls in the test file that weren't updated in this PR.

✅ Predefined recognizer test lost supported_entities assertion
✅ v1120 migration references v1110 MigrationUtil directly

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

`Auto-apply`	`Compact`
`gitar auto-apply:on`	`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

edg956 · 2026-02-06T18:37:01Z

@gitar-bot

Code Review 👍 Approved with suggestions 4 resolved / 7 findings
Clean removal of redundant supportedEntity field with proper runtime replacement via tag_fqn. Previous parameter naming issue is resolved; other prior findings (test assertion gap, "Unknown" default, cross-version migration reference) remain unaddressed.

⚠️ Edge Case: Fallback to "Unknown" entity type creates silent misclassification
📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:249 📄 ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py:282 📄 ingestion/src/metadata/pii/scanners/custom_ner_scanner.py:219

When tag.fullyQualifiedName is None, the code falls back to "Unknown" as the tag FQN (lines 249 and 282). This value is then passed through to Presidio as supported_entity, meaning:

Presidio recognizers will report entity type "Unknown" in analysis results

In custom_ner_scanner.py line 219, tag.fullyQualifiedName == entity_type will never match "Unknown" for any real tag, so detected entities will be silently dropped

If fullyQualifiedName is None, it's likely a data integrity issue that should be logged as a warning and the tag should be skipped rather than creating a recognizer with an unmatchable entity type. Consider:
name = tag.fullyQualifiedName
if not name:
    logger.warning("Tag has no fullyQualifiedName, skipping recognizer creation")
    return recognizers

This is already applied

⚠️ Edge Case: Predefined recognizer test lost supported_entities assertion
📄 ingestion/tests/unit/pii/algorithms/test_presidio_recognizer_factory.py:193-198

The test test_create_predefined_recognizer removed the assertion assert "EMAIL_ADDRESS" in result.supported_entities but did not replace it with any equivalent assertion. The test now only checks isinstance(result, EntityRecognizer) which is a much weaker assertion.

Since predefined recognizers (like Presidio's built-in EmailRecognizer) have their own supported_entities defined internally, the test should still verify that the recognizer was correctly created with the expected entity type. The tag_fqn parameter "Some.Tag" is passed but it's unclear whether predefined recognizers use it or their own built-in entity type.

Suggested fix: Add back an assertion verifying result.supported_entities contains the expected value (either the built-in entity type or the passed tag FQN, depending on intended behavior).

This is on purpose. We don't care what the supported entities in the recognizer are because our classification mechanism already maps tags to recognizers

💡 Quality: v1120 migration references v1110 MigrationUtil directly
📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/mysql/v1120/Migration.java:19 📄 openmetadata-service/src/main/java/org/openmetadata/service/migration/postgres/v1120/Migration.java:19

Both the MySQL and PostgreSQL v1120 Migration classes instantiate org.openmetadata.service.migration.utils.v1110.MigrationUtil. While reusing utilities across migration versions can work, referencing a previous version's utility class by its fully qualified name rather than extracting shared logic into a common location creates a fragile cross-version dependency.

If v1110's MigrationUtil is refactored or removed in the future, v1120 migrations will break. Consider either:

Moving the shared setRecognizersForSensitiveTags logic to a version-independent utility, or

Adding a comment explaining why v1110's utility is deliberately reused here.

This is a minor concern since migrations are typically run once and frozen.

For this one-off things I'd prefer not to extract for now. If I've got to do this again then I'll do it in the TagDAO

edg956 requested review from a team as code owners February 5, 2026 19:35

edg956 had a problem deploying to test February 5, 2026 19:35 — with GitHub Actions Error

edg956 added the safe to test Add this label to run secure Github workflows on PRs label Feb 5, 2026

edg956 had a problem deploying to test February 5, 2026 19:35 — with GitHub Actions Error

github-actions bot added the Ingestion label Feb 5, 2026

edg956 added WIP and removed Ingestion labels Feb 5, 2026

edg956 changed the title ~~Remove supported entity field for context, exact terms, custom and pattern recognizers~~ [WIP] Remove supported entity field for context, exact terms, custom and pattern recognizers Feb 5, 2026

edg956 had a problem deploying to test February 5, 2026 19:36 — with GitHub Actions Failure

github-actions bot requested a review from a team as a code owner February 5, 2026 19:40

gitar-bot bot reviewed Feb 5, 2026

View reviewed changes

ingestion/tests/unit/pii/algorithms/test_presidio_recognizer_factory.py Show resolved Hide resolved

gitar-bot bot reviewed Feb 5, 2026

View reviewed changes

ingestion/src/metadata/pii/algorithms/presidio_recognizer_factory.py Outdated Show resolved Hide resolved

edg956 had a problem deploying to test February 6, 2026 18:16 — with GitHub Actions Error

edg956 added To release Will cherry-pick this PR into the release branch governance and removed WIP labels Feb 6, 2026

edg956 had a problem deploying to test February 6, 2026 18:19 — with GitHub Actions Error

gitar-bot bot reviewed Feb 6, 2026

View reviewed changes

ingestion/tests/unit/pii/algorithms/test_presidio_recognizer_factory.py Show resolved Hide resolved

edg956 added 2 commits February 6, 2026 19:27

Update ingestion

2d8429a

Update backend tests

4145ff7

edg956 force-pushed the recognizers-without-supported-entity branch from 99b6712 to 4145ff7 Compare February 6, 2026 18:27

edg956 temporarily deployed to test February 6, 2026 18:27 — with GitHub Actions Inactive

edg956 had a problem deploying to test February 6, 2026 18:27 — with GitHub Actions Failure

edg956 temporarily deployed to test February 6, 2026 18:27 — with GitHub Actions Inactive

open-metadata deleted a comment from gitar-bot bot Feb 6, 2026

Update generated TypeScript types

7847f87

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove supported entity field for context, exact terms, custom and pattern recognizers#25720

Remove supported entity field for context, exact terms, custom and pattern recognizers#25720
edg956 wants to merge 5 commits intomainfrom
recognizers-without-supported-entity

edg956 commented Feb 5, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Feb 5, 2026

Uh oh!

github-actions bot commented Feb 5, 2026

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Feb 6, 2026

Uh oh!

Uh oh!

edg956 commented Feb 6, 2026

Uh oh!

github-actions bot commented Feb 6, 2026

Uh oh!

gitar-bot bot commented Feb 6, 2026 •

edited

Loading

Three Distinct CI Failures

Failure 1: Python Unit Test (PR-Related - Requires Fix)

Failure 2: Playwright E2E Test (Environmental)

Failure 3: Maven Java Tests (Environmental - Unrelated)

Uh oh!

edg956 commented Feb 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

edg956 commented Feb 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe your changes:

Type of change:

Checklist:

Summary by Gitar

Uh oh!

github-actions bot commented Feb 5, 2026

Uh oh!

github-actions bot commented Feb 5, 2026

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Feb 6, 2026

Uh oh!

Uh oh!

edg956 commented Feb 6, 2026

Uh oh!

github-actions bot commented Feb 6, 2026

Uh oh!

gitar-bot bot commented Feb 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Three Distinct CI Failures

Failure 1: Python Unit Test (PR-Related - Requires Fix)

Failure 2: Playwright E2E Test (Environmental)

Failure 3: Maven Java Tests (Environmental - Unrelated)

Uh oh!

edg956 commented Feb 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

edg956 commented Feb 5, 2026 •

edited

Loading

gitar-bot bot commented Feb 6, 2026 •

edited

Loading