Skip to content

Ext_proc filter#12792

Open
kannanjgithub wants to merge 378 commits intogrpc:masterfrom
kannanjgithub:ext-proc
Open

Ext_proc filter#12792
kannanjgithub wants to merge 378 commits intogrpc:masterfrom
kannanjgithub:ext-proc

Conversation

@kannanjgithub
Copy link
Copy Markdown
Contributor

@kannanjgithub kannanjgithub commented May 4, 2026

Implements ext_proc filter from A93 (internal design doc)
Metrics aren't implemented yet.
Includes commits from unmerged Filter API enhancements and channel caching PRs.
Only the ExternaProcessingFilter.java, ExternaProcessingFilterTest.java and the envoy xds proto import and generated code need to be reviewed.
Rebasing commit history caused all received and merged commits to show my name as the committer, ignore all commits for which I'm not shown as the author.

…PlaneRequestIsDrained to use in-process server
…estsAreForwardedImmediately to use in-process server
…roc response but only sent a mode override. As per envoy ext-processing ext-proc cannot omit request_headers (even if empty) in response to a request_headers processing request.
… track close already called for handling immediate response.
…ubHasCorrectDeadline to use real dataPlaneChannel
…xtProcStreamSendsMetadata to use real dataPlaneChannel
…sHeadersAndCallIsBuffered to use real dataPlaneChannel
…henMutationsAreAppliedAndCallIsActivated to use real dataPlaneChannel
…sActivatedImmediately to use real dataPlaneChannel and ServerInterceptor
…henMessagesAreDiscarded to use real dataPlaneChannel
…ExtProcAndSuperHalfCloseIsDeferred to use real dataPlaneChannel
…nSuperHalfCloseIsCalled to use real dataPlaneChannel
…stsAreForwardedImmediately to use real dataPlaneChannel
…sHeadersAndCallIsBuffered to use real dataPlaneChannel
…henMutationsAreAppliedAndCallIsActivated to use real dataPlaneChannel
…sActivatedImmediately to use real dataPlaneChannel
kannanjgithub and others added 20 commits May 4, 2026 07:54
       * Updated io.grpc.xds.internal.MatcherParser to support parsing envoy.type.matcher.v3.ListStringMatcher into an internal list of matchers.
       * Implemented a nested HeaderForwardingRulesConfig class within ExternalProcessorFilter to encapsulate the allow/disallow logic.
       * Refactored ExternalProcessorInterceptor.toHeaderMap to perform case-insensitive header name filtering according to the configured rules.
       * Applied this filtering to all headers sent to the sidecar (initial request, response headers, and trailers).

   * Fixed givenClientStreamingRpc and givenBidiStreamingRpc failures.
       * Threading Fixes: Replaced complex thread pools with directExecutor() for components where appropriate and used dedicated single-thread executors for async responses. This resolved the IllegalStateException: call is closed and the resource leak issues (AssertionError: Resources could not be released in time).
Often there is no cause, but connect(), channel credentials, and call
credentials failures on the control plane RPC can include a useful
causal exception.

This was triggered by seeing an error like below, but it didn't include
the cause, which would have included HTTP error information from the
failure fetching the credential.
```
UNAVAILABLE: Error retrieving LDS resource xdstp://traffic-director-c2p.xds.googleapis.com/envoy.config.listener.v3.Listener/bigtable.googleapis.com: UNAUTHENTICATED: Failed computing credential metadata nodeID: C2P-798500073
```
Metadata was accidentally being retained after the start of the call.
That can be an overwhelming percentage of memory for an idle RPC; don't
do that. The other changes are considerably smaller, but I happened to
notice them and the changes are straight-forward without magic numbers
(e.g., there's many arrays that could be tuned).

The regular interop server uses 4600 bytes per full duplex stream while
idle, but much of that is Census recorded events hanging around. Keeping
the Census integration but removing the Census impl (so a noop is used)
drops that to 3000 bytes. This change brings that down to ~2450 bytes
(which is still including stuff from TestServiceImpl). But there's very
little Metadata in the interop tests, so absolute real-life savings
would be much higher (but relative real-life savings may be lower,
because the application will often have more state).

The measurements were captured using a modified
timeout_on_sleeping_server client that had 100,000 concurrent full
duplex calls on one connection.
This does reduce the largest supported integer from just less than 2^32
to slightly more than 2^29, which does not seem a significant loss.

It would previously produce a corrupted integer, which makes debugging
annoying. Note that continuations can contain just zeros and should
still be detected as resulting in overflow, without waiting for any
eventual 1.

We could leave the encoder supporting up to 2^32-1, but it just seems
wrong to encode values that the same implementation couldn't decode.

Noticed by @August829
Align DelayedClientCall.DelayedListener with ClientCallImpl's existing
behavior for listener exceptions. When the application listener throws
from onHeaders/onMessage/onReady, catch the Throwable, cancel the call
with CANCELLED (cause = the throwable), and swallow subsequent
callbacks. When onClose throws, log and continue, matching
ClientCallImpl.closeObserver. If onClose arrives from the transport
after a prior callback threw, override its status/trailers with the
captured CANCELLED so a server-supplied OK can't mask the local failure.

Previously, a throw from the application listener escaped to the
callExecutor's uncaught-exception handler. The real call was not
cancelled and the transport kept delivering callbacks to an already
broken listener, different from how the same bug behaves on a normal
ClientCallImpl, and a timing-dependent inconsistency depending on
whether callbacks arrived before or after setCall + drain completed.

Trade-off: listener-callback throws are no longer visible to the
executor's UncaughtExceptionHandler (they're attached as Status.cause
instead). This matches ClientCallImpl and is the intended behavior.

Exception handling for the outer drainPendingCalls loop
(realCall.sendMessage/request/halfClose/cancel) remains unaddressed;
that TODO is preserved.

**Note:**
This change only handles exceptions thrown by the application listener.
I don't try and solve the problems that grpc#12737 is attempting to fix. My
motivation is to fix the root cause behind
bazelbuild/bazel#29316

---------

Co-authored-by: Kannan J <kannanjgithub@google.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
This was already being done for servlet, so it was just copied to
jakarta.
This commit introduces a library for handling header mutations as
specified by the xDS protocol. This library provides the core
functionality for modifying request and response headers based on a set
of rules.

The main components of this library are:
- `HeaderMutator`: Applies header mutations to `Metadata` objects.
- `HeaderMutationFilter`: Filters header mutations based on a set of
configurable rules, such as disallowing mutations of system headers.
- `HeaderMutations`: A value class that represents the set of mutations
to be applied to request and response headers.
- `HeaderMutationDisallowedException`: An exception that is thrown when
a disallowed header mutation is attempted.

This commit also includes comprehensive unit tests for the new library.
ExternalProcessorFilter.java
   - Trailers-Only Detection: Added state tracking to ExtProcListener to identify when a gRPC "trailers-only" response occurs (i.e., when onClose is called without a preceding onHeaders).
   - Protocol Compliance: Updated the state machine to send a RESPONSE_HEADERS message to the sidecar with the end_of_stream flag set for trailers-only responses. This satisfies the requirement that headers must be the first message in any response phase.
   - Handshake Handling: Modified onNext to correctly apply sidecar mutations to gRPC trailers and terminate the interaction when a trailers-only handshake is completed.
   - Robustness: Added null checks in header mutation logic to prevent NullPointerException during edge-case state transitions.

  ExternalProcessorFilterTest.java
   - Forward Compatibility: Updated the createBaseProto helper to default to SKIP mode for response headers and trailers. This ensures that the 60+ existing tests (which primarily focus on the request phase) continue to pass without being blocked by the new response handshake.
   - Streaming Robustness: Refactored Category 11 (Client/Bidi Streaming) mock sidecars to handle and acknowledge the full sequence of protocol phases (including the newly added response phases).
   - Category 15 (New): Added a dedicated test case givenTrailersOnly_whenResponseReceived_thenResponseHeadersSentWithEos which validates that:
       1. Trailers are correctly sent to the sidecar as headers when the data plane server returns an immediate error.
       2. The sidecar receives the end_of_stream signal.
       3. Mutated trailers from the sidecar are correctly applied to the final RPC state.
Implemented a FIFO queue in ExternalProcessorFilter to ensure that responses
from the external processor server arrive in the same order as the events
sent by the filter, as required by gRFC A93. Added unit tests to verify that
out-of-order responses correctly trigger a protocol error and fail the stream.
- Added rejection of CONTINUE_AND_REPLACE status in HeadersResponse for
  both request and response headers, treating it as a stream failure.
- Fixed potential hangs by ensuring proceedWithClose() is called upon
  stream failure, especially in fail-open scenarios.
- Added explicit sidecar notification via requestStream.onError() upon
  detecting protocol errors to ensure robust stream termination.
- Added new unit test categories 17 in ExternalProcessorFilterTest
  to verify status enforcement.
Updated ExternalProcessorFilter to include the `protocol_config` field in
 the very first `ProcessingRequest` sent to the sidecar (RequestHeaders).
The configuration includes the `request_body_mode` and `response_body_mode`
derived from the filter's processing mode, as required by gRFC A93.
Added a unit test in Category 4 to verify that `protocol_config` is correctly
populated on the first message and omitted from all subsequent messages on
the stream.

Add tests for trailers HeaderSendMode default to SKIP for DEFAULT.

nit: Renumbered out of order test categories.
…ment rather than a granular merge of its fields.
# Conflicts:
#	xds/src/test/java/io/grpc/xds/internal/headermutations/HeaderMutationsTest.java
@kannanjgithub kannanjgithub marked this pull request as draft May 4, 2026 11:39
# Conflicts:
#	xds/src/main/java/io/grpc/xds/ExternalProcessorFilter.java
@kannanjgithub kannanjgithub marked this pull request as ready for review May 4, 2026 12:39
@kannanjgithub kannanjgithub requested a review from sauravzg May 4, 2026 12:39
Fix style and warning as errors.
static final String TYPE_URL =
"type.googleapis.com/envoy.extensions.filters.http.ext_proc.v3.ExternalProcessor";

final String filterInstanceName;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need filterInstanceName string? Maybe we just remove this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

if (perRoute.hasOverrides()) {
return ExternalProcessorFilterConfig.create(perRoute.getOverrides(), context);
}
return ConfigOrError.fromError("ExtProcPerRoute must have overrides");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://github.com/grpc/proposal/pull/484/changes#diff-5b8fb70550f5331ffbe7083143468b63ee8f27a4b6a464fe3f6713f9711b0addR435-R451 documents overrides as an optional field.

I don't think we are supposed to nack the resource on empty overrides. Maybe we need a notion of empty FilterConfig.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using proto empty instance for overrides config now. In mergeConfigs we have has checks for each proto field before using it from the override config, so it will just ignore the empty proto fields.

config = over;
}
}
checkNotNull(config, "config");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?
If my understanding of the grfc is correct a successfully parsed regular config and a successfullly parsed override config should for all practical purposes be infallible and we need not handle errors.

That aside, what's the expected behavior here. I couldn't conclude it from the , I assume checkNotNull will crash the thread? Should we be returning null (which skips this interceptor) or FailingInterceptor(which always fails all calls) or just crash?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkNotNull is just to throw a more graceful NPE at known checkpoints rather than when trying to dereference it. It is not supposed to fail and if it fails it indicates code error in the caller site or the expected return value etc. After the refactorings I have removed the checkNotNull here as it is not necessary to use it everywhere. I have retained other places where I'm using it like in constructor arguments etc. A lot of grpc java code does have checkNotNull usages also.


ConfigOrError<ExternalProcessorFilterConfig> merged =
ExternalProcessorFilterConfig.create(mergedProtoBuilder.build(), parent.getFilterContext());
checkNotNull(merged, "merged");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above , I am not entirely sure why we would fail given that validated configs should always be mergeable? Is this just defensive programming which we don't expect to happen?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replied above.

return merged.config;
}

static final class ExternalProcessorFilterConfig implements FilterConfig {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, this answers some initial questions that I had.

So, we've merged the override config and the regular config into the same object. This seems to have led to various null or error checks, because we store the original proto , then merge the proto with override and then recreate the object from merged proto which may be a fallible operations.

I briefly tried to skim at the other Filters and it seems to be the first override type whose message is different from the base types message.

So, here's my recommendation ,

  • we need two FilterConfigs , one for base and one for override each having their own distinct structure.
  • Ideally, the FilterConfig should be independent of proto or at least not directly contain the entire ExternalProcesor or ExternalProcessorOverride proto.
  • Create a merge function that doesn't reparse protos and merges them together. Maybe a builder pattern with autovalue might be useful here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created separate filter configs for the top level filter config and the override filter config now.
About making the parsed configs not depend on envoy protos, we have the flexibility of not doing that here since we are not passing around these parsed configs to other code, which is where we have tried to avoid passing envoy protos around. We can think of the filter config classes as blackboxes, and not care how the getters work internally, whether it is simply returning a pre-parsed field, or accessing a field directly from a stored envoy proto. I wouldn't have minded doing it if the types were all primitive data types, but it gets annoying when we have to create mirror proto enums for ProcessingMode.
I have kept the envoy protos in ExternalProcessorFilterConfig and ExternalProcessorFilterConfig encapsulated and have get and has methods for each field we want to access.
With this approach I'm still reparsing envoy ExternalProcessor proto in mergeConfigs. proto.toBuilder already provides a builder pattern so why not just use it? The reasoning here again is that the usages are localized anyway.

private final Optional<HeaderForwardingRulesConfig> forwardRulesConfig;
private final ImmutableList<String> requestAttributes;
private final boolean disableImmediateResponse;
private final long deferredCloseTimeoutNanos;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I am not sure if that's intentional , but there's a small bit of inconcistency, I'd expect the failuremode.. and observability.. fields to be present here as well like the others. I am assuming we plan to access them directly from externalProcessor proto?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to access them all from the internal proto field. I have made it uniform now.


static final class ExternalProcessorFilterConfig implements FilterConfig {

private final ExternalProcessor externalProcessor;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe we need nullable annotations here? or is it acceptable from just the constructor callsite?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are proto fields taken from proto messages so they can never be null.

private final ImmutableList<String> requestAttributes;
private final boolean disableImmediateResponse;
private final long deferredCloseTimeoutNanos;
private final FilterContext filterContext;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need it as a data member? It's only used in creation of GrpcServiceConfig and if we already have that we likely won't need to use this anymore.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

private final ImmutableList<Matchers.StringMatcher> disallowedHeaders;

HeaderForwardingRulesConfig(
@Nullable ImmutableList<Matchers.StringMatcher> allowedHeaders,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

go/java-practices/null#collection - Nullable collections are usually not preferable.

https://github.com/envoyproxy/envoy/blob/08f7c15da7f2e4077404829402bd8465536864f2/api/envoy/type/matcher/string.proto#L78-L80 seems to indicate that the number of elements >=1 , so essentially I don't think we need to treat null and empty containers differently.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm using empty list now if the proto field is absent.

return new HeaderForwardingRulesConfig(allowedHeaders, disallowedHeaders);
}

boolean isAllowed(String headerName) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simplify this somehow by preferring allowed? When combined with the nullability removal , could this probably become like the following which greatly reduces cyclomatic omplexity.

allowed(){
for(disallow: diss..headers){
  if (matches) return false;
}
if(allowed.empty()) return true;
for (allow: allwe..headers){
   if (matches) return true;
}
return false;
}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the logic. If allowed is not empty, it must have the header to be considered allowed.

@sauravzg
Copy link
Copy Markdown
Collaborator

sauravzg commented May 6, 2026

Reviewed approximately ~500 LOC of source(not tests), which currently mostly only covers the config part of things. Sending comments incrementally to kick start progress, since it seems that I'll need probably about 3 days to review all of source.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.