Skip to content

Better netty inspection#17674

Open
gortiz wants to merge 8 commits intoapache:masterfrom
gortiz:better-netty-inspection
Open

Better netty inspection#17674
gortiz wants to merge 8 commits intoapache:masterfrom
gortiz:better-netty-inspection

Conversation

@gortiz
Copy link
Contributor

@gortiz gortiz commented Feb 10, 2026

This PR adds a new machinery to inspect, log and monitor the memory used by Netty. As is explained on the NettyInstance javadoc, a Pinot installation has between 2 copies of the Netty code:

  1. The version shade by gRPC
  2. The unshaded version, used by Brokers in SSE to fetch data from servers.

Netty uses static attributes to keep some important information, like whether it can use off-heap memory or not, or how much memory is being allocated. Given we have 2 copies of Netty, we have 2 independent copies of these attributes. The applied shade doesn't just change the package of the classes, it also changes the system properties we need to set in order to customize Netty. We always need to set at least one JAVA_OPT in order to let Netty use off-heap memory in Java 21: io.netty.tryReflectionSetAccessible

As a corollary, for each JAVA_OPT we plan to use to customize Netty we need to provide that option 2 times with different prefixes.

Given that different copies of the same classes are involved here, this is an error-prone process. This is why we introduce a new class, NettyInstance, that offers clean access to the most important Netty properties. Two classes are provided, one for each known Netty copy we include in the classpath.

Another class, NettyInspector, is used to add some checks on the state of important properties of each known NettyInstance. Right now, the only check we have is whether it uses onheap of offheap memory, but more may be added in the future.

This PR also adds 2 new metrics per NettyInstance: memory usage and maximum memory. This is important **because by default each independent Netty copy will consume as much direct memory as specified by XX:MaxDirectMemorySize, which means we may be consuming a maximum of:

  1. MaxDirectMemorySize bytes by normal ByteBuffers
  2. MaxDirectMemorySize bytes by unshade Netty (used in SSE)
  3. MaxDirectMemorySize bytes by grpc-shaded Netty (used in MSE)

Finally, this PR also adds logs when Brokers and Servers start. Specifically:

  • For each NettyInstance, we add a warning log if we are using on-heap memory.
  • For each NettyInstance, we add an info log indicating how much memory it is using (usually 0) and how much memory it can use.
  • A single node indicating the sum of off-heap memory all NettyInstances are using and can use.

@codecov-commenter
Copy link

codecov-commenter commented Feb 10, 2026

Codecov Report

❌ Patch coverage is 55.95238% with 37 lines in your changes missing coverage. Please review.
✅ Project coverage is 34.05%. Comparing base (0f93d52) to head (a07eedd).
⚠️ Report is 19 commits behind head on master.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...rg/apache/pinot/core/transport/NettyInspector.java 61.66% 21 Missing and 2 partials ⚠️
...org/apache/pinot/core/transport/NettyInstance.java 40.00% 10 Missing and 2 partials ⚠️
.../pinot/server/starter/helix/BaseServerStarter.java 0.00% 2 Missing ⚠️

❗ There is a different number of reports uploaded between BASE (0f93d52) and HEAD (a07eedd). Click for more details.

HEAD has 8 uploads less than BASE
Flag BASE (0f93d52) HEAD (a07eedd)
java-21 5 4
unittests1 2 0
unittests 4 2
temurin 10 8
java-11 5 4
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #17674       +/-   ##
=============================================
- Coverage     63.25%   34.05%   -29.21%     
+ Complexity     1499      780      -719     
=============================================
  Files          3174     3181        +7     
  Lines        190373   190794      +421     
  Branches      29089    29160       +71     
=============================================
- Hits         120419    64970    -55449     
- Misses        60610   120428    +59818     
+ Partials       9344     5396     -3948     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 34.04% <55.95%> (-29.17%) ⬇️
java-21 34.04% <55.95%> (-29.19%) ⬇️
temurin 34.05% <55.95%> (-29.21%) ⬇️
unittests 34.04% <55.95%> (-29.21%) ⬇️
unittests1 ?
unittests2 34.04% <55.95%> (-0.03%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@suvodeep-pyne suvodeep-pyne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @gortiz

This will be super useful and alerting on this should give us a much better idea of the state vs relying on string contains checks.

Q: Just to understand: the memory stats are per static instance of netty, as in per pool? I'm guessing 1 for SSE and 1 for MSE? any others?

<!-- Solve NoClassDefFoundError. Borrowed from https://github.com/prometheus/jmx_exporter/issues/802 -->
<exclude>META-INF/versions/9/org/yaml/snakeyaml/internal/**</exclude>

<!-- Exclude NettyInstacne because it includes Netty package literals -->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks a bit fragile. Should we load the strings from somewhere instead? I understand the motivation but thing is 1 rename or package move and it might break.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is as fragile as the other exclusions, right?

Anyway, I'm going to modify the PR because I'm wrong. I was sure we were shading Netty, but we don't do that, so we are double-safe

@gortiz
Copy link
Contributor Author

gortiz commented Feb 11, 2026

Q: Just to understand: the memory stats are per static instance of netty, as in per pool? I'm guessing 1 for SSE and 1 for MSE? any others?

We use Netty to implement SSE communication between brokers and servers and gRPC uses their own shade version of Netty under the hood. We use gRPC in 4 different places:

  • The client gRPC protocol, an alternative to the REST API. Here, the gRPC server side is our broker.
  • The direct client to server gRPC protocol, used for example by Trino/Presto. Here the gRPC server side is our servers
  • The MSE plan protocol, where Pinot servers are the gRPC servers and brokers are gRPC clients.
  • The MSE mailbox protocol, where Pinot servers and brokers play the role of gRPC servers (and only Pinot servers play the role of gRPC clients).

SSE, Client gRPC, direct client to serve gRPC and MSE mailbox protocols; expose their own metrics to report memory usage (note we don't have metrics for the plan protocol, but that is super simple and doesn't consume much).

But each of these gRPC servers is also constrained by the max memory Netty can use (where SSE uses the limits of the unshaded Netty instance and all gRPC use the limits defined on the gRPC shaded Netty instance).

So the metric for used memory by SSE (assuming the fix in #17667 is merged) should be similar to the used memory shown by the new metric for unshaded Netty, but for gRPC it will be the sum of the used memory by the MSE mailbox, MSE plan, and client gRPC protocols.


@Override
public boolean isExplicitTryReflectionSetAccessible() {
return io.netty.util.internal.ReflectionUtil.trySetAccessible(CONSTRUCTOR, true) == null;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still tricky: instead of calling a getter (which is private in Netty), we call the method they use to check whether they can call the private constructors of ByteBuffers. As long as Netty keeps using the same code (which seems very likely), this should be a safe, reflection-free way to know whether Netty can use offheap buffers or not.

@gortiz gortiz force-pushed the better-netty-inspection branch from 13242b8 to b13d346 Compare February 11, 2026 14:08
@gortiz
Copy link
Contributor Author

gortiz commented Feb 11, 2026

I've rewritten NettyInstance in a way we don't have to use reflection to access most properties. I also removed some of the properties we are not using right now. I'm not proud of the way I implemented isExplicitTryReflectionSetAccessible, but it is the best solution I found

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants