Fix pinot sampling#25667

Open
edg956 wants to merge 6 commits into main from fix-pinot-random
@edg956 edg956 commented Feb 2, 2026

Describe your changes:

While testing the Pinot profiler for 1.12, I discovered that PinotDB does not support the RANDOM() function that other databases use to build the sampling query.

This PR adds a Pinot-specific sampler that uses hashing and modular arithmetic, so sampling no longer relies on a random-number function.
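
To illustrate the general idea of hash-and-modulo sampling, here is a minimal pure-Python sketch. It is not the code in this PR: the function name `in_sample`, the use of CRC32 as the hash, and the 100-bucket layout are assumptions chosen for illustration only. The actual sampler builds a SQL query, and which hash function it uses in Pinot is not shown here.

```python
import zlib


def in_sample(row_key: str, sample_percent: float, buckets: int = 100) -> bool:
    """Decide whether a row belongs to the sample without any RNG.

    The row key is hashed to an integer, reduced to a bucket in
    [0, buckets), and kept when the bucket falls below the requested
    percentage. The same key always lands in the same bucket, so the
    sample is deterministic across runs.
    """
    # CRC32 is only a stand-in hash for this sketch (assumption).
    bucket = zlib.crc32(row_key.encode("utf-8")) % buckets
    return bucket < sample_percent


# Hypothetical row keys; a real table would hash a column value instead.
rows = [f"row-{i}" for i in range(10_000)]
sample = [r for r in rows if in_sample(r, 50)]
```

With a reasonably uniform hash, about `sample_percent` percent of rows are kept. The trade-off compared with RANDOM() is that the sample is stable: repeated profiler runs select the same rows as long as the hashed values do not change.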

Here's the error output when running the profiler against Pinot with sampling enabled:
Traceback (most recent call last):
  File "/Users/eugenio/repos/work/openmetadata-collate/ingestion-extension/venv/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
    self.dialect.do_execute(
  File "/Users/eugenio/repos/work/openmetadata-collate/ingestion-extension/venv/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
    cursor.execute(statement, parameters)
  File "/Users/eugenio/repos/work/openmetadata-collate/ingestion-extension/venv/lib/python3.11/site-packages/pinotdb/db.py", line 57, in g
    return f(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/eugenio/repos/work/openmetadata-collate/ingestion-extension/venv/lib/python3.11/site-packages/pinotdb/db.py", line 513, in execute
    return self.normalize_query_response(query, r)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/eugenio/repos/work/openmetadata-collate/ingestion-extension/venv/lib/python3.11/site-packages/pinotdb/db.py", line 445, in normalize_query_response
    raise exceptions.DatabaseError(msg)

partitioned_test.duration_ms metric_type.value: (pinotdb.exceptions.DatabaseError) {'errorCode': 720,
 'message': 'QueryPlanningError:\n'
            'Error composing query plan for \'/* {"app": "OpenMetadata", '
            '"version": "1.12.0.0"} */\n'
            'WITH "04cb45497a65712e6ea11a441f3b4011_rnd" AS \n'
            '(SELECT "default".partitioned_test.duration_ms AS duration_ms, '
            'ABS(RANDOM()) * 100 % 100 AS random \n'
            'FROM "default".partitioned_test \n'
            "WHERE region IN ('us-east', 'us-west')), \n"
            '"04cb45497a65712e6ea11a441f3b4011_sample" AS \n'
            '(SELECT "04cb45497a65712e6ea11a441f3b4011_rnd".duration_ms AS '
            'duration_ms, "04cb45497a65712e6ea11a441f3b4011_rnd".random AS '
            'random \n'
            'FROM "04cb45497a65712e6ea11a441f3b4011_rnd" \n'
            'WHERE "04cb45497a65712e6ea11a441f3b4011_rnd".random <= 50)\n'
            ' SELECT avg(duration_ms) AS mean, count(duration_ms) AS '
            '"valuesCount", count(DISTINCT duration_ms) AS "distinctCount", '
            'MIN(duration_ms) AS "min", MAX(duration_ms) AS "max", '
            'coalesce(SUM(CAST(CASE WHEN (duration_ms IS NULL) THEN 1 ELSE 0 '
            'END AS BIGINT)), 0) AS "nullCount", STDDEV_POP(duration_ms) AS '
            'stddev, SUM(CAST(duration_ms AS BIGINT)) AS "sum" \n'
            'FROM "04cb45497a65712e6ea11a441f3b4011_sample"\n'
            " LIMIT 1': From line 3, column 68 to line 3, column 75: No match "
            "found for function signature RANDOM()'\n"
            'org.apache.pinot.query.QueryEnvironment.planQuery(QueryEnvironment.java:136)\n'
            'org.apache.pinot.broker.requesthandler.MultiStageBrokerRequestHandler.handleRequest(MultiStageBrokerRequestHandler.java:158)\n'
            'org.apache.pinot.broker.requesthandler.BaseBrokerRequestHandler.handleRequest(BaseBrokerRequestHandler.java:133)\n'
            'org.apache.pinot.broker.requesthandler.BrokerRequestHandlerDelegate.handleRequest(BrokerRequestHandlerDelegate.java:86)\n'
            'From line 3, column 68 to line 3, column 75: No match found for '
            'function signature RANDOM()\n'
            'jdk.internal.reflect.GeneratedConstructorAccessor67.newInstance(Unknown '
            'Source)\n'
            'java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n'
            'java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)\n'
            'java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)\n'
            'No match found for function signature RANDOM()\n'
            'jdk.internal.reflect.GeneratedConstructorAccessor66.newInstance(Unknown '
            'Source)\n'
            'java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n'
            'java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)\n'
            'java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)\n'}                                                                                                                                                                                                                                                                                                                                                      |             'org.apache.pinot.broker.requesthandler.BrokerRequestHandlerDelegate.handleRequest(BrokerRequestHandlerDelegate.java:86)\n'                                                                                                                                                                                                                                                                                                         |
[SQL: /* {"app": "OpenMetadata", "version": "1.12.0.0"} */                                                                                                                                                                                                                                                                                                                                                                                      |             'From line 3, column 68 to line 3, column 75: No match found for '                                                                                                                                                                                                                                                                                                                                                                  |
WITH "04cb45497a65712e6ea11a441f3b4011_rnd" AS                                                                                                                                                                                                                                                                                                                                                                                                  |             'function signature RANDOM()\n'                                                                                                                                                                                                                                                                                                                                                                                                     |
(SELECT "default".partitioned_test.duration_ms AS duration_ms, ABS(RANDOM()) * 100 %% %(param_3)s AS random                                                                                                                                                                                                                                                                                                                                     |             'jdk.internal.reflect.GeneratedConstructorAccessor67.newInstance(Unknown '                                                                                                                                                                                                                                                                                                                                                          |
FROM "default".partitioned_test                                                                                                                                                                                                                                                                                                                                                                                                                 |             'Source)\n'                                                                                                                                                                                                                                                                                                                                                                                                                         |
WHERE region IN (%(region_1_1)s, %(region_1_2)s)),                                                                                                                                                                                                                                                                                                                                                                                              |             'java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n'                                                                                                                                                                                                                                                                                                         |
"04cb45497a65712e6ea11a441f3b4011_sample" AS                                                                                                                                                                                                                                                                                                                                                                                                    |             'java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)\n'                                                                                                                                                                                                                                                                                                                                             |
(SELECT "04cb45497a65712e6ea11a441f3b4011_rnd".duration_ms AS duration_ms, "04cb45497a65712e6ea11a441f3b4011_rnd".random AS random                                                                                                                                                                                                                                                                                                              |             'java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)\n'                                                                                                                                                                                                                                                                                                                                                       |
FROM "04cb45497a65712e6ea11a441f3b4011_rnd"                                                                                                                                                                                                                                                                                                                                                                                                     |             'No match found for function signature RANDOM()\n'                                                                                                                                                                                                                                                                                                                                                                                  |
WHERE "04cb45497a65712e6ea11a441f3b4011_rnd".random <= %(random_1)s)                                                                                                                                                                                                                                                                                                                                                                            |             'jdk.internal.reflect.GeneratedConstructorAccessor66.newInstance(Unknown '                                                                                                                                                                                                                                                                                                                                                          |
 SELECT avg(duration_ms) AS mean, count(duration_ms) AS "valuesCount", count(DISTINCT duration_ms) AS "distinctCount", MIN(duration_ms) AS "min", MAX(duration_ms) AS "max", coalesce(SUM(CAST(CASE WHEN (duration_ms IS NULL) THEN %(param_1)s ELSE %(param_2)s END AS BIGINT)), %(coalesce_1)s) AS "nullCount", STDDEV_POP(duration_ms) AS stddev, SUM(CAST(duration_ms AS BIGINT)) AS "sum"                                                  |             'Source)\n'                                                                                                                                                                                                                                                                                                                                                                                                                         |
FROM "04cb45497a65712e6ea11a441f3b4011_sample"                                                                                                                                                                                                                                                                                                                                                                                                  |             'java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)\n'                                                                                                                                                                                                                                                                                                         |
 LIMIT %(param_4)s]                                                                                                                                                                                                                                                                                                                                                                                                                             |             'java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)\n'                                                                                                                                                                                                                                                                                                                                             |
[parameters: {'param_1': 1, 'param_2': 0, 'coalesce_1': 0, 'param_3': 100, 'random_1': 50, 'param_4': 1, 'region_1_1': 'us-east', 'region_1_2': 'us-west'}]                                                                                                                                                                                                                                                                                     |             'java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)\n'}                                                                                                                                                                                                                                                                                                                                                      |
(Background on this error at: https://sqlalche.me/e/14/4xp6)  
After fixing the issue above, I encountered the following issue with histograms, which is also fixed in this PR:

```text
'Received error query execution result block: '
'{200=QueryExecutionError:\n'
'org.apache.pinot.spi.exception.BadQueryRequestException: '
'java.lang.NumberFormatException: For input string: "195.0"\n'
'\tat '
'org.apache.pinot.core.operator.filter.predicate.PredicateEvaluatorProvider.getPredicateEvaluator(PredicateEvaluatorProvider.java:94)\n'
'\tat '
'org.apache.pinot.core.operator.filter.predicate.PredicateEvaluatorProvider.getPredicateEvaluator(PredicateEvaluatorProvider.java:100)\n'
'\tat '
'org.apache.pinot.core.plan.FilterPlanNode.constructPhysicalOperator(FilterPlanNode.java:310)\n'
'\tat '
'org.apache.pinot.core.plan.FilterPlanNode.constructPhysicalOperator(FilterPlanNode.java:201)\n'
'...\n'
'Caused by: java.lang.NumberFormatException: For input string: '
'"195.0"\n'
'\tat '
'java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)\n'
'\tat java.base/java.lang.Long.parseLong(Long.java:711)\n'
'\tat java.base/java.lang.Long.parseLong(Long.java:836)\n'
'\tat '
'org.apache.pinot.segment.local.segment.index.readers.LongDictionary.insertionIndexOf(LongDictionary.java:44)}\n'
'org.apache.pinot.query.service.dispatch.QueryDispatcher.runReducer(QueryDispatcher.java:306)\n'
'org.apache.pinot.query.service.dispatch.QueryDispatcher.submitAndReduce(QueryDispatcher.java:96)\n'
'org.apache.pinot.broker.requesthandler.MultiStageBrokerRequestHandler.handleRequest(MultiStageBrokerRequestHandler.java:219)\n'
'org.apache.pinot.broker.requesthandler.BaseBrokerRequestHandler.handleRequest(BaseBrokerRequestHandler.java:133)\n'}
```

Type of change:

  • Bug fix

Checklist:

  • I have read the CONTRIBUTING document.
  • I have commented on my code, particularly in hard-to-understand areas.
  • I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.

@edg956 edg956 self-assigned this Feb 2, 2026
@edg956 edg956 requested a review from a team as a code owner February 2, 2026 16:20
@edg956 edg956 added the Ingestion, safe to test, and To release labels Feb 2, 2026
@github-actions
Contributor

github-actions bot commented Feb 2, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion-base-slim:trivy (debian 12.13)

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (33)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (10)

Package Vulnerability ID Severity Installed Version Fixed Version
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/extended_sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/lineage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_data_aut.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.json

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage.yaml

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /ingestion/pipelines/sample_usage_aut.yaml

No Vulnerabilities Found

@github-actions
Contributor

github-actions bot commented Feb 2, 2026

🛡️ TRIVY SCAN RESULT 🛡️

Target: openmetadata-ingestion:trivy (debian 12.12)

Vulnerabilities (4)

Package Vulnerability ID Severity Installed Version Fixed Version
libpam-modules CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-modules-bin CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam-runtime CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2
libpam0g CVE-2025-6020 🚨 HIGH 1.5.2-6+deb12u1 1.5.2-6+deb12u2

🛡️ TRIVY SCAN RESULT 🛡️

Target: Java

Vulnerabilities (33)

Package Vulnerability ID Severity Installed Version Fixed Version
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.12.7 2.15.0
com.fasterxml.jackson.core:jackson-core CVE-2025-52999 🚨 HIGH 2.13.4 2.15.0
com.fasterxml.jackson.core:jackson-databind CVE-2022-42003 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4.2
com.fasterxml.jackson.core:jackson-databind CVE-2022-42004 🚨 HIGH 2.12.7 2.12.7.1, 2.13.4
com.google.code.gson:gson CVE-2022-25647 🚨 HIGH 2.2.4 2.8.9
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.3.0 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.3.0 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.3.0 3.25.5, 4.27.5, 4.28.2
com.google.protobuf:protobuf-java CVE-2021-22569 🚨 HIGH 3.7.1 3.16.1, 3.18.2, 3.19.2
com.google.protobuf:protobuf-java CVE-2022-3509 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2022-3510 🚨 HIGH 3.7.1 3.16.3, 3.19.6, 3.20.3, 3.21.7
com.google.protobuf:protobuf-java CVE-2024-7254 🚨 HIGH 3.7.1 3.25.5, 4.27.5, 4.28.2
com.nimbusds:nimbus-jose-jwt CVE-2023-52428 🚨 HIGH 9.8.1 9.37.2
com.squareup.okhttp3:okhttp CVE-2021-0341 🚨 HIGH 3.12.12 4.9.2
commons-beanutils:commons-beanutils CVE-2025-48734 🚨 HIGH 1.9.4 1.11.0
commons-io:commons-io CVE-2024-47554 🚨 HIGH 2.8.0 2.14.0
dnsjava:dnsjava CVE-2024-25638 🚨 HIGH 2.1.7 3.6.0
io.netty:netty-codec-http2 CVE-2025-55163 🚨 HIGH 4.1.96.Final 4.2.4.Final, 4.1.124.Final
io.netty:netty-codec-http2 GHSA-xpw8-rcwv-8f8p 🚨 HIGH 4.1.96.Final 4.1.100.Final
io.netty:netty-handler CVE-2025-24970 🚨 HIGH 4.1.96.Final 4.1.118.Final
net.minidev:json-smart CVE-2021-31684 🚨 HIGH 1.3.2 1.3.3, 2.4.4
net.minidev:json-smart CVE-2023-1370 🚨 HIGH 1.3.2 2.4.9
org.apache.avro:avro CVE-2024-47561 🔥 CRITICAL 1.7.7 1.11.4
org.apache.avro:avro CVE-2023-39410 🚨 HIGH 1.7.7 1.11.3
org.apache.derby:derby CVE-2022-46337 🔥 CRITICAL 10.14.2.0 10.14.3, 10.15.2.1, 10.16.1.2, 10.17.1.0
org.apache.ivy:ivy CVE-2022-46751 🚨 HIGH 2.5.1 2.5.2
org.apache.mesos:mesos CVE-2018-1330 🚨 HIGH 1.4.3 1.6.0
org.apache.thrift:libthrift CVE-2019-0205 🚨 HIGH 0.12.0 0.13.0
org.apache.thrift:libthrift CVE-2020-13949 🚨 HIGH 0.12.0 0.14.0
org.apache.zookeeper:zookeeper CVE-2023-44981 🔥 CRITICAL 3.6.3 3.7.2, 3.8.3, 3.9.1
org.eclipse.jetty:jetty-server CVE-2024-13009 🚨 HIGH 9.4.56.v20240826 9.4.57.v20241219
org.lz4:lz4-java CVE-2025-12183 🚨 HIGH 1.8.0 1.8.1

🛡️ TRIVY SCAN RESULT 🛡️

Target: Node.js

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: Python

Vulnerabilities (20)

Package Vulnerability ID Severity Installed Version Fixed Version
Werkzeug CVE-2024-34069 🚨 HIGH 2.2.3 3.0.3
aiohttp CVE-2025-69223 🚨 HIGH 3.12.12 3.13.3
aiohttp CVE-2025-69223 🚨 HIGH 3.13.2 3.13.3
apache-airflow CVE-2025-68438 🚨 HIGH 3.1.5 3.1.6
apache-airflow CVE-2025-68675 🚨 HIGH 3.1.5 3.1.6
azure-core CVE-2026-21226 🚨 HIGH 1.37.0 1.38.0
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 5.3.0 6.1.0
jaraco.context CVE-2026-23949 🚨 HIGH 6.0.1 6.1.0
protobuf CVE-2026-0994 🚨 HIGH 4.25.8 6.33.5
pyasn1 CVE-2026-23490 🚨 HIGH 0.6.1 0.6.2
python-multipart CVE-2026-24486 🚨 HIGH 0.0.20 0.0.22
ray CVE-2025-62593 🔥 CRITICAL 2.47.1 2.52.0
starlette CVE-2025-62727 🚨 HIGH 0.48.0 0.49.1
urllib3 CVE-2025-66418 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2025-66471 🚨 HIGH 1.26.20 2.6.0
urllib3 CVE-2026-21441 🚨 HIGH 1.26.20 2.6.3
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2
wheel CVE-2026-24049 🚨 HIGH 0.45.1 0.46.2

🛡️ TRIVY SCAN RESULT 🛡️

Target: /etc/ssl/private/ssl-cert-snakeoil.key

No Vulnerabilities Found

🛡️ TRIVY SCAN RESULT 🛡️

Target: /home/airflow/openmetadata-airflow-apis/openmetadata_managed_apis.egg-info/PKG-INFO

No Vulnerabilities Found

@sonarqubecloud

sonarqubecloud bot commented Feb 3, 2026

@gitar-bot

gitar-bot bot commented Feb 4, 2026

🔍 CI failure analysis for 03ba57c: Python integration test failing across all Python versions (3.10, 3.11) due to histogram frequency order reversal - requires code fix

Issue - COMPLETE CI STATUS FOR COMMIT 03ba57c

Python Test Failures:

  • py-run-tests (3.10) - Job ID: 62527204697 - FAILED
  • py-run-tests (3.10) - Job ID: 62527204756 - FAILED
  • py-run-tests (3.11) - Job ID: 62527204701 - FAILED

All 3 Python test jobs failed with the SAME test:

FAILED ingestion/tests/integration/postgres/test_profiler.py::test_profiler[financial_transactions-]

Error (consistent across all 3 jobs):

AssertionError: objects mismatched on field: [histogram.frequencies], 
expected: [[3, 2]], actual: [[2, 3]]

Test Results: 1 failed, 532 passed, 21 skipped, 1 xfailed

Playwright Tests:

  • Only 3 flaky E2E UI test failures (Domain, Pipeline features)
  • 4 of 5 jobs succeeded (80% success rate)
  • Unrelated to PR changes

Root Cause

THIS IS DIRECTLY RELATED TO THE PR CHANGES - BLOCKING ISSUE

The test failure is in the profiler integration test, directly related to code modified in this PR:

What Changed:

Commit 03ba57c titled "Fix histogram using values breaking Pinot" modified:

  • ingestion/src/metadata/profiler/metrics/hybrid/histogram.py
  • Changed histogram bucket generation logic to fix PinotDB compatibility

Impact:

The histogram fix for PinotDB has changed the frequency ordering behavior, causing the postgres integration test to fail because it expects the original frequency order.

Why This Is a Real Issue (Not Flaky):

  1. Direct Code Connection: PR modified histogram.py → histogram test fails
  2. Functional Change: Histogram frequency ordering has changed from [3, 2] to [2, 3]
  3. Reproducible: Same failure across all 3 Python test runs (3.10, 3.10, 3.11)
  4. Not Timing-Related: Integration test, not UI/browser test
  5. Consistent Pattern: 100% failure rate in Python tests for this specific test
  6. Only 1 Test Affected: All other 532 tests pass

Analysis

Test Details:

  • Test File: ingestion/tests/integration/postgres/test_profiler.py
  • Test Function: test_profiler[financial_transactions-]
  • Table: financial_transactions (postgres integration test database)
  • Metric: Histogram frequencies
  • Expected Value: [[3, 2]]
  • Actual Value: [[2, 3]]

What This Means:

The histogram frequency counts are being reversed. This could indicate:

  1. The histogram buckets are being generated in reverse order, OR
  2. The frequency counting logic has changed, OR
  3. The bucket-to-frequency mapping has been inverted

Histogram Context:

From the PR description, the histogram fix addresses:

java.lang.NumberFormatException: For input string: "195.0"

This suggests the fix converts floating point values to integers for PinotDB compatibility. This conversion or the associated bucket logic changes may have affected the frequency ordering.
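
As an illustration only, the sketch below shows one way an integer conversion of a bucket boundary could flip a two-bucket histogram from [3, 2] to [2, 3]. The values, the boundary formula, and the helper `two_bucket_counts` are all hypothetical; they are not taken from histogram.py or from the financial_transactions table, and the actual cause may be different.

```python
from typing import List, Sequence


def two_bucket_counts(values: Sequence[float], boundary: float) -> List[int]:
    """Count values below the boundary and values at or above it."""
    lower = sum(1 for v in values if v < boundary)
    return [lower, len(values) - lower]


# Hypothetical data, chosen so that one value sits between the
# float boundary and its truncated integer counterpart.
values = [1, 2, 3, 4, 6]

float_boundary = min(values) + (max(values) - min(values)) / 2  # 3.5
int_boundary = int(float_boundary)  # truncates to 3

float_counts = two_bucket_counts(values, float_boundary)
int_counts = two_bucket_counts(values, int_boundary)

print(float_counts)  # [3, 2]
print(int_counts)    # [2, 3]
```

With the float boundary the value 3 falls in the lower bucket; once the boundary is truncated to 3 it moves to the upper bucket, which is enough to swap the two counts.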

Why This Blocks the Merge

Unlike the 3 flaky Playwright failures, this is a functional regression:

  1. Affects All Python Versions: Fails on 3.10 and 3.11
  2. Integration Test: Tests actual profiler behavior, not UI
  3. Reproducible 100%: Not random or timing-related
  4. Code-Related: Directly caused by histogram.py changes
  5. Semantic Concern: Frequency order matters for histogram interpretation

Current CI Summary:

  • Python tests: 3 failed (same issue across all Python versions)
  • Playwright tests: 3 flaky failures (unrelated to PR)
  • Total blocking issues: 1 (histogram frequency order)

Code Review ✅ Approved 5 resolved / 5 findings

Both previously identified issues have been resolved: the debug log is now correctly placed inside the fallback condition, and the intersperse function now handles empty sequences properly with an early return.

✅ 5 resolved
Bug: Debug log always printed regardless of column selection result

📄 ingestion/src/metadata/sampler/sqlalchemy/pinot/sampler.py:134-140
In _get_sampling_columns(), the debug message "No PRIMARY KEY or UNIQUE constraints found, using all columns for sampling" is logged unconditionally at line 138, even when PRIMARY KEY or UNIQUE columns were found earlier. The log statement should be inside a conditional branch or the logic should be restructured.

Current code:

if not columns:
    columns = [col.name.root for col in self.entity.columns]

logger.debug(
    "No PRIMARY KEY or UNIQUE constraints found, using all columns for sampling"
)

The log message should only be printed when the fallback to all columns is actually used:

if not columns:
    columns = [col.name.root for col in self.entity.columns]
    logger.debug(
        "No PRIMARY KEY or UNIQUE constraints found, using all columns for sampling"
    )
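
The corrected control flow can be exercised in isolation. The following is a minimal sketch, assuming a standalone helper (`pick_sampling_columns` is a hypothetical name, not the sampler's actual method) that receives the key columns and the full column list as plain strings:

```python
import logging
from typing import List, Sequence

logger = logging.getLogger(__name__)


def pick_sampling_columns(
    key_columns: Sequence[str], all_columns: Sequence[str]
) -> List[str]:
    """Prefer key columns; fall back to every column and log only then."""
    columns = list(key_columns)
    if not columns:
        columns = list(all_columns)
        # The message is emitted only on the fallback path.
        logger.debug(
            "No PRIMARY KEY or UNIQUE constraints found, using all columns for sampling"
        )
    return columns


with_keys = pick_sampling_columns(["id"], ["id", "name"])
without_keys = pick_sampling_columns([], ["id", "name"])
```

Here `with_keys` keeps only the key column and produces no log output, while `without_keys` contains every column and triggers the debug message.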

Edge Case: intersperse handles an empty sequence only implicitly

📄 ingestion/src/metadata/utils/lists.py:33-37
When passed an empty sequence, intersperse computes len(sequence) * 2 - 1 = -1 as the list length. In Python, multiplying a list by a negative integer yields an empty list rather than raising, and the subsequent slice assignment of an empty list into it is a no-op, so the function returns [] only as a side effect of these rules.

final_list = [item] * (len(sequence) * 2 - 1)  # -1 when sequence is empty, so final_list == []

The test test_intersperse_empty_list expects an empty list, which the current implementation does return, but the behavior is implicit and easy to break in a later refactor.

Suggested fix:
Add an early return for empty sequences to make the behavior explicit:

def intersperse(sequence: Sequence[Any], item: Any) -> List[Any]:
    if len(sequence) == 0:
        return []
    final_list = [item] * (len(sequence) * 2 - 1)
    final_list[::2] = list(sequence)
    return final_list
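
For reference, a self-contained copy of the suggested function with a few sample calls; the inputs are illustrative and not taken from the project's test suite:

```python
from typing import Any, List, Sequence


def intersperse(sequence: Sequence[Any], item: Any) -> List[Any]:
    """Return the elements of sequence with item placed between each pair."""
    if len(sequence) == 0:
        return []
    # Pre-fill with the separator, then overwrite every even index.
    final_list = [item] * (len(sequence) * 2 - 1)
    final_list[::2] = list(sequence)
    return final_list


empty = intersperse([], 0)
single = intersperse(["a"], ", ")
several = intersperse([1, 2, 3], 0)

print(empty)    # []
print(single)   # ['a']
print(several)  # [1, 0, 2, 0, 3]
```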

Bug: Unused variable `table_query` in row sampling code path

📄 ingestion/src/metadata/sampler/sqlalchemy/pinot/sampler.py:225-231
In the get_sample_query method, when using row count sampling (not percentage), the code creates a table_query variable that is never used:

table_query = client.query(self.raw_dataset)
if self.partition_details:
    table_query = self.get_partitioned_query(table_query)

session_query = self._base_sample_query(column)
query = session_query.limit(self.sample_config.profileSample)

The table_query is constructed and potentially modified for partitioning, but then session_query is created independently from _base_sample_query(column), which appears to ignore the partitioned query entirely.

Suggested fix:
Either use table_query appropriately or remove the dead code if partition handling is done differently in the base class.

Bug: Negative hash values cause incorrect sampling percentages

📄 ingestion/src/metadata/sampler/sqlalchemy/pinot/sampler.py:195 📄 ingestion/src/metadata/sampler/sqlalchemy/pinot/sampler.py:202
The _build_hash_expression method uses modulo 100 on hash values:

return f"HASH(CAST({columns[0]} AS VARCHAR), 'MURMUR3') % 100"

MURMUR3 hashes can be negative integers in PinotDB. In engines where the remainder takes the sign of the dividend, the modulo of a negative hash is also negative (e.g., -5 % 100 = -5). With the condition random <= profileSample (where profileSample is 0-100), every negative remainder satisfies the comparison, so rows with negative hashes would always be selected and the effective sample would be larger than the requested percentage.

Suggested fix:
Wrap with ABS() to ensure non-negative values:

return f"ABS(HASH(CAST({columns[0]} AS VARCHAR), 'MURMUR3')) % 100"
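
The effect on the sampling predicate can be simulated outside the database. This is a sketch under two assumptions: that Pinot's % takes the sign of the dividend (Python's own % does not, so the helper emulates it), and that the filter is `bucket <= profileSample` as described above. Both function names are hypothetical.

```python
def truncated_remainder(value: int, modulus: int) -> int:
    """Remainder whose sign follows the dividend, unlike Python's floored %."""
    remainder = abs(value) % modulus
    return -remainder if value < 0 else remainder


def is_sampled(hash_value: int, profile_sample: int, use_abs: bool) -> bool:
    """Apply the bucket <= profile_sample filter to a single hash value."""
    source = abs(hash_value) if use_abs else hash_value
    return truncated_remainder(source, 100) <= profile_sample
```

For a hash of -95 and a 10% sample, the unwrapped expression gives a bucket of -95, which passes the filter; with the absolute value applied first the bucket is 95 and the row is excluded as intended.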

Security: SQL injection risk in hash expression construction

📄 ingestion/src/metadata/sampler/sqlalchemy/pinot/sampler.py:195 📄 ingestion/src/metadata/sampler/sqlalchemy/pinot/sampler.py:199
The _build_hash_expression method constructs SQL directly using column names via string formatting without any escaping or validation:

return f"HASH(CAST({columns[0]} AS VARCHAR), 'MURMUR3') % 100"
concat_parts.append(f"CAST({col} AS VARCHAR)")

While column names typically come from trusted sources (entity metadata), if a column name contains SQL metacharacters or injection patterns, this could lead to SQL injection. Column names should be properly quoted/escaped using SQLAlchemy's quoting utilities.

Suggested fix:
Use SQLAlchemy's quoted_name or dialect-specific identifier quoting:

from sqlalchemy.sql import quoted_name
# Or use column().compile() to properly escape identifiers
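
To illustrate the idea without depending on a database driver, here is a standard-library sketch of identifier quoting. It assumes the target engine accepts double-quoted identifiers with embedded double quotes doubled, as in standard SQL; whether that matches Pinot's dialect exactly is an assumption. Both function names are hypothetical, and in the real sampler the dialect's own identifier quoting would likely be the better choice.

```python
def quote_identifier(name: str) -> str:
    """Wrap an identifier in double quotes, doubling any embedded double quote."""
    return '"' + name.replace('"', '""') + '"'


def single_column_hash_expression(column: str) -> str:
    """Build the single-column hash expression with the column name quoted."""
    return f"HASH(CAST({quote_identifier(column)} AS VARCHAR), 'MURMUR3') % 100"
```

A column named `id` becomes `"id"` inside the CAST, and a name containing a double quote cannot terminate the identifier early because the quote is doubled.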

Labels: Ingestion, safe to test, To release