Skip to content

[Search] Upgrade Clients#25719

Open
mohityadav766 wants to merge 16 commits intomainfrom
upgrade-clients
Open

[Search] Upgrade Clients#25719
mohityadav766 wants to merge 16 commits intomainfrom
upgrade-clients

Conversation

@mohityadav766
Copy link
Member

@mohityadav766 mohityadav766 commented Feb 5, 2026

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Updated Elasticsearch client:
    • 8.11.0 → 9.2.4 with migration to Rest5Client and Rest5ClientTransport for Apache HttpClient 5.x compatibility
    • Excluded Jackson 3.x dependencies requiring Java 23, maintaining Jackson 2.x for Java 21
  • Updated OpenSearch client:
    • 2.26.0 → 3.5.0 with migration to ApacheHttpClient5Transport replacing deprecated RestClient transport
  • Transport layer migration:
    • Replaced all Apache HttpClient 4.x imports with HttpClient 5.x equivalents across 41 production files
    • Updated connection management to use PoolingAsyncClientConnectionManagerBuilder pattern
    • Added keep-alive timeout configuration for both Elasticsearch and OpenSearch clients
  • New AWS authentication:
    • SigV4Hc5RequestSigningInterceptor provides HC5-compatible AWS IAM authentication for managed Elasticsearch/OpenSearch
    • Throws IllegalStateException when target host is unavailable instead of silently skipping signing
    • Properly handles async requests using UNSIGNED-PAYLOAD header (AWS skips body verification over HTTPS)
  • New utility method:
    • buildHttpHostsForHc5 in SearchUtils.java constructs HC5 HttpHost arrays from configuration
  • Updated Docker images:
    • Elasticsearch: 8.11.4 → 9.3.0 (upgraded from 9.0.0 to fix media-type header validation bug)
    • OpenSearch: 2.7.0/2.11.1/2.19.0/latest → 3.4.0 (upgraded from 3.0.0 for bug fixes)
    • Applied across development, quickstart, and distributed test environments
  • Fixed test files:
    • Updated 25 service test files with correct client library imports (Rest5Client for Elasticsearch, HC5 for both)
    • Updated 5 integration test files with correct client library imports
    • Updated test container images to Elasticsearch 9.3.0 and OpenSearch 3.4.0
  • Code review improvements:
    • Fixed resource leaks: Added try-with-resources for InputStream handling in ElasticSearchClient and OpenMetadataOperations
    • Enhanced SigV4 signing: Proper UNSIGNED-PAYLOAD support for async requests where body can't be extracted (safe over HTTPS to AWS-managed services)
    • Added exception logging in OpenMetadataOperations for better error visibility

This will update automatically on new commits.


akash-jain-10
akash-jain-10 previously approved these changes Feb 6, 2026
RestClientBuilder builder =
RestClient.builder(HttpHost.create(ELASTIC_SEARCH_CONTAINER.getHttpHostAddress()))
String hostAddress = ELASTIC_SEARCH_CONTAINER.getHttpHostAddress();
String[] parts = hostAddress.split(":");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Edge Case: Host address parsing fails if address contains IPv6 or scheme

In multiple test files, getHttpHostAddress() return value is split naively on : to extract host and port:

String hostAddress = ELASTIC_SEARCH_CONTAINER.getHttpHostAddress();
String[] parts = hostAddress.split(":");
HttpHost httpHost = new HttpHost("http", parts[0], Integer.parseInt(parts[1]));

This parsing will break if:

  1. getHttpHostAddress() returns a value containing a scheme prefix (e.g., http://localhost:9200) — parts[0] would be http and parts[1] would be //localhost
  2. The host is an IPv6 address (e.g., [::1]:9200)

While Testcontainers typically returns host:port format, this is fragile. A safer approach would be to use URI parsing:

String hostAddress = ELASTIC_SEARCH_CONTAINER.getHttpHostAddress();
URI uri = URI.create("http://" + hostAddress);
HttpHost httpHost = new HttpHost("http", uri.getHost(), uri.getPort());

Or use HttpHost.create("http://" + hostAddress) if available in HC5.

Was this helpful? React with 👍 / 👎

os.org.opensearch.client.Response response = lowLevelClient.performRequest(request);
} else if (searchClient instanceof OpenSearchClient openSearchClient) {
os.org.opensearch.client.opensearch.generic.OpenSearchGenericClient genericClient =
openSearchClient.getNewClient().generic();
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: NPE risk: getNewClient() may return null in OpenMetadataOperations

In OpenMetadataOperations.java line 2012, openSearchClient.getNewClient().generic() calls .generic() directly on the result of getNewClient(). However, OpenSearchClient.newClient is set by createOpenSearchNewClient(transport) which returns null on failure (line 136 of OpenSearchClient.java). The Lombok @Getter exposes this as getNewClient().

If the client initialization failed for any reason, this would throw a NullPointerException.

Suggested fix: Add a null check before using:

} else if (searchClient instanceof OpenSearchClient openSearchClient) {
    var osClient = openSearchClient.getNewClient();
    if (osClient == null) {
        LOG.error("OpenSearch client not available. Cannot retrieve indices.");
        return indices;
    }
    OpenSearchGenericClient genericClient = osClient.generic();
    // ...
}

Alternatively, use the existing isClientAvailable check.

Was this helpful? React with 👍 / 👎

@gitar-bot
Copy link

gitar-bot bot commented Feb 6, 2026

🔍 CI failure analysis for aefe33b: Eight failures: infrastructure (validate-docker-compose); external repo (maven-collate-ci); UI tests (playwright); database constraints (integration-tests); Python tests (no Python modified); backend unit tests have 5 of 7,918 errors (3 AWS credential tests, 2 workflow timeouts) - 99.94% pass rate.

Issue

Eight CI failures detected:

  1. validate-docker-compose: Docker image pull failure
  2. maven-collate-ci: Build failure in external openmetadata-collate repository
  3. playwright-ci-postgresql (2, 6): Frontend UI test failures
  4. integration-tests-postgres-opensearch: Build failure despite successful tests
  5. py-run-tests (3.10): Python integration test failures
  6. py-run-tests (3.11): Python integration test failures
  7. maven-postgresql-ci: Backend test errors (3 AWS credential tests)
  8. maven-sonarcloud-ci: Backend test errors (5 total: 3 AWS credential + 2 workflow timeout)

Root Cause

validate-docker-compose Failure

Infrastructure issue - snapshot Docker images aren't published to the registry (persistent known issue).

maven-collate-ci Failure

The external openmetadata-collate repository still uses old OpenSearch client imports (persistent known issue).

playwright-ci-postgresql (2, 6) Failure

3 of 359 frontend UI tests failed. Unrelated to backend Elasticsearch/OpenSearch client upgrade.

integration-tests-postgres-opensearch Failure

All 9,017 backend integration tests passed (0 failures, 0 errors), but build failed due to database constraint errors in workflow code. Unrelated to search client changes.

py-run-tests Failures (3.10 and 3.11)

Python integration test failures. This PR modifies only Java files and does not modify any Python code. Completely unrelated to this PR.

maven-postgresql-ci & maven-sonarcloud-ci Failures

5 backend test errors out of 7,918 tests (99.94% pass rate):

3 AWS Credentials Tests:

AwsCredentialsUtilTest.testBuildCredentialsProviderWithEmptyCredentials
AwsCredentialsUtilTest.testBuildCredentialsProviderWithNoCredentials
AwsCredentialsUtilTest.testBuildCredentialsProviderWithOnlyAccessKey

Error: IllegalArgumentException: AWS credentials not configured. Either provide accessKeyId/secretAccessKey or set enabled=true to use IAM authentication via the default credential provider chain.

2 Workflow Timeout Tests:

WorkflowDefinitionResourceTest.test_CustomApprovalWorkflowForNewEntities
WorkflowDefinitionResourceTest.test_reviewerChangeUpdatesApprovalTasks

Error: ConditionTimeout - Condition was not fulfilled within 1-3 minutes

The AWS credential test failures may be related to AWS credential handling changes merged from main (multiple AWS IAM-related commits in branch history). The workflow timeout errors appear to be test environment timing issues unrelated to search client changes.

Details

Most failures are external to this PR's core functionality. The Elasticsearch/OpenSearch client upgrade demonstrates 99.94% backend test pass rate (7,913 of 7,918 tests passed) with the 5 errors being AWS credential validation tests and workflow timing issues, neither directly related to the search client migration itself.

Code Review 👍 Approved with suggestions 8 resolved / 12 findings

Major client library upgrade (ES 9.x, OS 3.x, HC5 transport) is well-executed. One potential NPE in OpenMetadataOperations where getNewClient() may return null, and OpenSearch generic Response objects should ideally be closed to prevent connection pool exhaustion.

⚠️ Edge Case: Host address parsing fails if address contains IPv6 or scheme

📄 openmetadata-service/src/test/java/org/openmetadata/service/JwtAuthOpenMetadataApplicationTest.java:418

In multiple test files, getHttpHostAddress() return value is split naively on : to extract host and port:

String hostAddress = ELASTIC_SEARCH_CONTAINER.getHttpHostAddress();
String[] parts = hostAddress.split(":");
HttpHost httpHost = new HttpHost("http", parts[0], Integer.parseInt(parts[1]));

This parsing will break if:

  1. getHttpHostAddress() returns a value containing a scheme prefix (e.g., http://localhost:9200) — parts[0] would be http and parts[1] would be //localhost
  2. The host is an IPv6 address (e.g., [::1]:9200)

While Testcontainers typically returns host:port format, this is fragile. A safer approach would be to use URI parsing:

String hostAddress = ELASTIC_SEARCH_CONTAINER.getHttpHostAddress();
URI uri = URI.create("http://" + hostAddress);
HttpHost httpHost = new HttpHost("http", uri.getHost(), uri.getPort());

Or use HttpHost.create("http://" + hostAddress) if available in HC5.

⚠️ Bug: NPE risk: getNewClient() may return null in OpenMetadataOperations

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java:2012

In OpenMetadataOperations.java line 2012, openSearchClient.getNewClient().generic() calls .generic() directly on the result of getNewClient(). However, OpenSearchClient.newClient is set by createOpenSearchNewClient(transport) which returns null on failure (line 136 of OpenSearchClient.java). The Lombok @Getter exposes this as getNewClient().

If the client initialization failed for any reason, this would throw a NullPointerException.

Suggested fix: Add a null check before using:

} else if (searchClient instanceof OpenSearchClient openSearchClient) {
    var osClient = openSearchClient.getNewClient();
    if (osClient == null) {
        LOG.error("OpenSearch client not available. Cannot retrieve indices.");
        return indices;
    }
    OpenSearchGenericClient genericClient = osClient.generic();
    // ...
}

Alternatively, use the existing isClientAvailable check.

💡 Performance: OpenSearch generic Response objects not closed after use

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchGenericManager.java:98 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/insights/search/opensearch/OpenSearchDataInsightsClient.java:32 📄 openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java:2013

In OpenSearch 3.x, OpenSearchGenericClient.execute() returns a Response that may hold HTTP resources (connection, body stream). Multiple call sites in the new code discard these responses without closing them, which could lead to connection pool exhaustion under load.

Affected locations:

  1. OpenSearchGenericManager.deleteILMPolicy() (line 98-103) - response used for status check but not closed
  2. OpenSearchGenericManager.dettachIlmPolicyFromIndexes() (line 188-194) - response used for status but not closed
  3. OpenSearchDataInsightsClient.performRequest() (lines 32-42) - callers like createComponentTemplate(), createDataStream(), etc. discard the returned response entirely
  4. OpenMetadataOperations._catIndices (line 2013-2018) - response body is read via .getBody() but response itself not closed

Suggested fix: Use try-with-resources:

try (var response = genericClient.execute(request)) {
    int statusCode = response.getStatus();
    // ...
}

Note: This may be a minor concern if the OpenSearch 3.x generic Response implementation auto-releases resources when the body is fully consumed, but try-with-resources would be the safest approach.

💡 Edge Case: decodeCursorAsFieldValues returns null on error, may cause NPE

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java

The new decodeCursorAsFieldValues method returns null when cursor decoding fails (line 803). The caller at line 547-548 uses this return value directly without null checking before passing to the composite aggregation builder:

Map<String, FieldValue> afterKey =
    request.getCursor() != null ? decodeCursorAsFieldValues(request.getCursor()) : null;

If the cursor string exists but is malformed, the method returns null and this may lead to unexpected behavior in the aggregation. While the existing decodeCursor method has the same pattern, this is a missed opportunity to improve error handling.

Suggested fix: Consider returning an empty map instead of null on decode failure, or throw an exception for invalid cursor format so the caller can provide a meaningful error message to the user.

✅ 8 resolved
Bug: SigV4 interceptor may fail to sign request bodies in async HC5

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SigV4Hc5RequestSigningInterceptor.java:70
In SigV4Hc5RequestSigningInterceptor.extractRequestBody(), the body extraction relies on request instanceof ClassicHttpRequest. However, the ApacheHttpClient5Transport used by the OpenSearch client operates an async HTTP pipeline (CloseableHttpAsyncClient). In this async context, the HttpRequest passed to the interceptor is typically not a ClassicHttpRequest — it's usually a BasicHttpRequest or similar async wrapper.

This means extractRequestBody would return null for all requests with bodies (POST/PUT), resulting in the SigV4 signature being computed without the request body. AWS SigV4 requires the body hash in the signature for non-streaming requests. This would cause authentication failures for any write operations (indexing, search requests, bulk operations, etc.) against AWS-managed OpenSearch.

Impact: AWS IAM-authenticated OpenSearch deployments would fail on any request containing a body, which includes the most common operations (search, index, bulk).

Suggested fix: For async HC5, the body may need to be extracted from EntityDetails or the request entity differently. One common approach is to use a ContentType and body bytes from the async entity producer. Alternatively, consider using the AWS SDK's AwsSdk2Transport which handles SigV4 natively for OpenSearch, or adapt the interceptor to work with the async pipeline. A simpler approach might be to hash an empty body and set x-amz-content-sha256: UNSIGNED-PAYLOAD if body extraction isn't feasible in the async interceptor.

Edge Case: InputStream from getContent().readAllBytes() not closed

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchClient.java:818
Throughout the diff, EntityUtils.toString(response.getEntity()) was replaced with new String(response.getEntity().getContent().readAllBytes(), StandardCharsets.UTF_8). While readAllBytes() fully consumes the stream, it does not close it. EntityUtils.toString() previously handled closing the entity's stream internally.

Not closing the InputStream returned by getContent() can cause connection leaks in the HTTP connection pool when used in production code paths (e.g., isElasticsearch7Version, OpenMetadataOperations.getAllSearchIndexes).

Suggested fix: Use try-with-resources:

try (var is = response.getEntity().getContent()) {
    String responseBody = new String(is.readAllBytes(), StandardCharsets.UTF_8);
}

Or use EntityUtils.toByteArray(entity) / EntityUtils.toString(entity) from HC5 which handles stream closing properly.

This is minor because most production occurrences are on short-lived clients or one-time operations, but it's a pattern that could cause connection pool exhaustion if used in hot paths.

Bug: OpenSearchClient lost compression and chunked encoding settings

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchClient.java:713
The old getLowLevelRestClient() method in OpenSearchClient had:

restClientBuilder.setCompressionEnabled(true);
restClientBuilder.setChunkedEnabled(true);

The new createApacheHttpClient5Transport() method doesn't configure compression at all. The ApacheHttpClient5TransportBuilder supports .setCompressionEnabled(true) — this should be re-added to maintain the same behavior.

Losing compression will increase network bandwidth usage and latency for large search responses, which could cause a noticeable performance regression in production environments.

Fix: Add builder.setCompressionEnabled(true); before builder.build() in createApacheHttpClient5Transport() (around line 712).

Edge Case: ElasticSearch client missing keep-alive timeout configuration

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchClient.java:718
During the HC5 migration, the keepAliveTimeoutSecs configuration was removed from ElasticSearchClient.getLowLevelRestClient() but was not re-implemented with the HC5 equivalent. The old code had:

if (esConfig.getKeepAliveTimeoutSecs() != null && esConfig.getKeepAliveTimeoutSecs() > 0) {
    httpAsyncClientBuilder.setKeepAliveStrategy(
        (response, context) -> esConfig.getKeepAliveTimeoutSecs() * 1000);
}

This is also missing from the OpenSearch createApacheHttpClient5Transport() method.

Impact: Deployments relying on keepAliveTimeoutSecs for long-lived connections to remote search clusters (e.g., AWS-managed services) will silently lose this configuration after the upgrade, potentially leading to premature connection closures or excessive connection churn.

Suggested fix: Add the HC5 equivalent keep-alive configuration using connectionManagerBuilder.setDefaultConnectionConfig() or httpAsyncClientBuilder.setKeepAliveStrategy() from HC5.

Quality: Exception swallowed silently, returns empty array string

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/OpenMetadataOperations.java
In the OpenSearch branch of getAllSearchIndexes, when the response body cannot be read, the exception is caught and silently returns "[]". This could mask I/O errors or other issues when fetching index information, making debugging difficult.

} catch (Exception e) {
  return "[]";  // Exception details lost
}

Suggested fix: At minimum, log the exception before returning the fallback value:

} catch (Exception e) {
  LOG.warn("Failed to read response body for indices", e);
  return "[]";
}

...and 3 more resolved from earlier reviews

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants