Conversation

fredfp (Contributor) commented Oct 3, 2025

Leverage typelevel/cats-effect#3975 (released with CE 3.6.0) to remove some dispatcher calls.

It seems to work, but I was hoping to measure the actual impact with benchmarks before submitting a PR.

fredfp (Contributor, Author) commented Oct 3, 2025

@seigert I'd love to run the benchmarks you mentioned here; could you run them or share the setup? I think instructions for benchmarking the client side of fs2-grpc are sorely missing.

seigert (Contributor) commented Oct 6, 2025

@fredfp, I cannot push my test client/server right now, as it uses our internal libraries for configuration, startup and other things. But I'll try to cut those out in a few days and upload it to GitHub.

Basically, it just collects simple metrics for request/response times over server and client implementations of this gRPC API:

syntax = "proto3";

package fs2.grpc.bench;

service TestService {
  rpc Identity (Message) returns (Message);
  rpc IdentityStream (stream Message) returns (stream Message);

  rpc Unary (UnaryRequest) returns (Message);

  rpc ClientStreaming (stream UnaryRequest) returns (Message);
  rpc ServerStreaming (StreamingRequest) returns (stream Message);
  rpc BothStreaming (stream StreamingRequest) returns (stream Message);
}

message Message {
  bytes payload = 1;
}

message RequestParams {
  int32 length = 1;
  bool  random_length = 2;
  optional double random_factor_min = 3;
  optional double random_factor_max = 4;
}

message UnaryRequest {
  RequestParams params = 1;
}

message StreamingRequest {
  RequestParams stream_params = 1;
  RequestParams chunk_params = 2;
  RequestParams message_params = 3;
}

The server implementation receives a request and either

  1. sends it back in the case of Message;
  2. generates a single message defined by UnaryRequest.RequestParams
    (length is the payload length; the payload is random bytes);
  3. generates a number of messages defined by StreamingRequest
    (the RequestParams define the total number of messages in the stream, the number of messages in a single chunk, and the length of a single payload).
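
For illustration, here is a hypothetical Scala sketch of those three behaviours (not the actual fs2-grpc-bench sources): it assumes the usual fs2-grpc generated trait TestServiceFs2Grpc[F, Metadata] and ScalaPB case classes, and it omits the chunking driven by chunk_params.

import cats.effect.Async
import cats.effect.std.Random
import cats.syntax.all._
import com.google.protobuf.ByteString
import fs2.Stream
import io.grpc.Metadata

// Hypothetical sketch only: class and helper names are illustrative.
class BenchServer[F[_]: Async](random: Random[F]) extends TestServiceFs2Grpc[F, Metadata] {

  // Build a Message with a random payload of the requested length.
  private def generate(params: RequestParams): F[Message] =
    random.nextBytes(params.length).map(bytes => Message(ByteString.copyFrom(bytes)))

  // 1. echo the request back unchanged
  def identity(request: Message, ctx: Metadata): F[Message] = request.pure[F]
  def identityStream(request: Stream[F, Message], ctx: Metadata): Stream[F, Message] = request

  // 2. a single generated message
  def unary(request: UnaryRequest, ctx: Metadata): F[Message] =
    generate(request.params.getOrElse(RequestParams()))

  // 3. a stream of generated messages; stream_params drives the total count,
  //    message_params the payload length (chunk_params omitted for brevity)
  def serverStreaming(request: StreamingRequest, ctx: Metadata): Stream[F, Message] = {
    val total   = request.streamParams.getOrElse(RequestParams()).length
    val message = request.messageParams.getOrElse(RequestParams())
    Stream.range(0, total).covary[F].evalMap(_ => generate(message))
  }

  def clientStreaming(request: Stream[F, UnaryRequest], ctx: Metadata): F[Message] =
    request.compile.last.flatMap(last => generate(last.flatMap(_.params).getOrElse(RequestParams())))

  def bothStreaming(request: Stream[F, StreamingRequest], ctx: Metadata): Stream[F, Message] =
    request.flatMap(serverStreaming(_, ctx))
}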

seigert (Contributor) commented Oct 13, 2025

@fredfp, I've published our benchmark setup here: https://github.com/seigert/fs2-grpc-bench

fredfp (Contributor, Author) commented Dec 6, 2025

@seigert thank you for publishing your benchmark setup!

I ran benchmarks locally and the current changes seem to bring a substantial improvement for streaming calls.

Method

Here's what I did:

  1. compared fs2-grpc v3.0.0 with a locally built version that had the current MR applied on top of v3.0.0
  2. ran the fs2-grpc-bench server and client at the same version (i.e., not testing a v3.0.0 server against a client with the MR above)
  3. server using default options, same instance for 5 consecutive benchmark runs: fs2-grpc-bench server
  4. client using default options, running fs2-grpc-bench client ids, as my changes only affect streaming calls

Summary of the results

Sample run of fs2-grpc v3.0.0 (baseline)

Test Results (22.426 s elapsed)
===============================

+---------------+-----------+-------------+
|       Counter |     Value |  Per second |
+---------------+-----------+-------------+
|         bytes | 209715200 | 9532509.091 |
|        chunks |    208173 |    9462.409 |
|      messages |    819200 |   37236.364 |
| response (ok) |       200 |       9.091 |
+---------------+-----------+-------------+

Sample run of fs2-grpc v3.0.0 + the current MR

Test Results (5.633 s elapsed)
==============================

+---------------+-----------+--------------+
|       Counter |     Value |   Per second |
+---------------+-----------+--------------+
|         bytes | 209715200 | 41943040.000 |
|        chunks |     54675 |    10935.000 |
|      messages |    819200 |   163840.000 |
| response (ok) |       200 |       40.000 |
+---------------+-----------+--------------+

  case Some(Left(err)) =>
    if (acc.isEmpty) loop(err.asLeft)
-   else F.pure((acc.toIndexedChunk, err.asLeft).some)
+   else F.pure((acc, err.asLeft).some)
fredfp (Contributor, Author)

@seigert notice the removal of .toIndexedChunk here and below. This seems to slightly improve the performance. Was there a specific reason for using .toIndexedChunk?

seigert (Contributor) commented Dec 7, 2025

@fredfp, as I remember, it was an attempt to increase the "locality" of buffered received messages and to avoid passing them along as the Chunk.Queue produced by acc ++ Chunk.singleton(value) in the worst case.

It was used as an alternative to .compact so as not to copy a Chunk.Singleton unnecessarily when only one element was buffered.

Even better would be, of course, to drain the whole incoming buffer into an array slice, but Queue still cannot do this.

I think removing this call is indeed beneficial for the throughput benchmark, but maybe not so much for actual access to gRPC stream elements if processing is done in chunks. 🤔
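
To illustrate the locality concern, a minimal sketch (not the fs2-grpc code): appending received messages one at a time builds a chained Chunk.Queue, and .toIndexedChunk flattens it into a single indexed chunk before it is handed downstream.

import fs2.Chunk

// Appending singletons produces a Chunk.Queue made of many small chunks:
val acc: Chunk[Int] =
  (1 to 4).foldLeft(Chunk.empty[Int])((c, i) => c ++ Chunk.singleton(i))

// Element access on the queue walks the underlying chunks, whereas
// toIndexedChunk copies everything into one indexed chunk up front:
val flat: Chunk[Int] = acc.toIndexedChunk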

fredfp (Contributor, Author)

Thank you, I see three possibilities:

  1. leave the decision to the user (.chunks.map(_.toIndexedChunk).unchunks), as sketched below;
  2. use .toIndexedChunk as you did;
  3. make the loop accumulate elements into an Array of some sort, instead of a Chunk. This avoids the extra traversal and copy that was certainly slowing down the benchmark, but it is not space-efficient, as we'd allocate arrays of size prefetchN (and use Chunk.ArraySlice when partially filled).

(3) is certainly better throughput-wise, but I don't think it's worth the added complexity. I'm fine with keeping (2), .toIndexedChunk; what do you think?
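
For reference, option (1) at the call site could look like this sketch, assuming responses is the client-side Stream[F, Message] (the name is illustrative):

// A user who wants indexed chunks can opt in downstream of the client call:
val indexedResponses = responses.chunks.map(_.toIndexedChunk).unchunks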

seigert (Contributor) commented Dec 9, 2025

I agree with keeping (2) if the benchmarks show, say, less than a 10% difference with and without .toIndexedChunk.

Also, for (3) we could use the same strategy as the fs2.io.readInputStream implementation: allocate and reuse an Array-backed buffer of size prefetchN > 1 and pass it along as a Chunk.ArraySlice with a manually set offset/length when it is time to emit.

seigert (Contributor) commented Dec 9, 2025

Also, I remembered why I did it with a Chunk instead of an Array in the first place: we have no ClassTag[T] evidence, so additional machinery would be required to cast Array[AnyRef] to Array[T], even though this is safe (T is a protobuf message and thus an object, not a primitive).

P.S.: And Chunk.ArraySlice[T] also requires ClassTag[T] for construction. :(
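
For what it's worth, a hypothetical sketch of that machinery (names are illustrative): since T is always a protobuf message, and therefore a reference type, the AnyRef tag can be reused at the cost of an unchecked cast.

import scala.reflect.ClassTag

import fs2.Chunk

// Reuse the AnyRef tag: safe at runtime for message types, unchecked at compile time.
def messageClassTag[T]: ClassTag[T] =
  ClassTag.AnyRef.asInstanceOf[ClassTag[T]]

// Wrap the filled prefix of an AnyRef-backed buffer as a Chunk.ArraySlice.
def emit[T](buffer: Array[AnyRef], filled: Int): Chunk[T] =
  Chunk.ArraySlice(buffer.asInstanceOf[Array[T]], 0, filled)(messageClassTag[T])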

seigert (Contributor)

@fredfp, I've created a PR against your fork with an array-backed buffer implementation: fredfp#1 -- but I haven't had a chance to run any benchmarks against your version as a baseline.

fredfp (Contributor, Author)

Could you open a separate PR to discuss this? I think it's complex enough to deserve its own discussion thread.

seigert (Contributor)

I've created a separate PR: #819 -- I still haven't had time to run the benchmarks. :(

fredfp (Contributor, Author) commented Dec 12, 2025

I've added the .toIndexedChunk call back; let's deal with it in another MR.

Below are the updated benchmark results.

Sample run of fs2-grpc v3.0.0 (baseline)

Test Results (22.426 s elapsed)
===============================

+---------------+-----------+-------------+
|       Counter |     Value |  Per second |
+---------------+-----------+-------------+
|         bytes | 209715200 | 9532509.091 |
|        chunks |    208173 |    9462.409 |
|      messages |    819200 |   37236.364 |
| response (ok) |       200 |       9.091 |
+---------------+-----------+-------------+

Sample run of fs2-grpc v3.0.0 + the current MR (calls .toIndexedChunk)

Test Results (6.871 s elapsed)
==============================

+---------------+-----------+--------------+
|       Counter |     Value |   Per second |
+---------------+-----------+--------------+
|         bytes | 209715200 | 34952533.333 |
|        chunks |     57837 |     9639.500 |
|      messages |    819200 |   136533.333 |
| response (ok) |       200 |       33.333 |
+---------------+-----------+--------------+

seigert (Contributor) commented Dec 12, 2025

|        chunks |    208173 |    9462.409 |   (baseline)

|        chunks |     57837 |    9639.500 |   (with this MR)

3.5-4x throughput is very good, but I wonder why the chunk size changes between runs?

fredfp (Contributor, Author) commented Dec 12, 2025

My understanding is that, without dispatcher calls, we waste much less time adding received messages to the CE queue, i.e., we are much more efficient at it, so StreamIngest manages to extract many more messages before emitting a single chunk, resulting in fewer chunks (but the same number of messages). Does that make sense?
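
As an illustration of that effect, here is a sketch of the general pattern (not fs2-grpc's actual StreamIngest): if the consumer blocks for one element and then drains whatever else is already enqueued, then the cheaper enqueueing is, the more elements pile up between reads and the larger (and fewer) the emitted chunks become.

import cats.effect.IO
import cats.effect.std.Queue
import fs2.{Chunk, Stream}

// Block for the first element, then drain everything already waiting in the queue.
def drainAsChunks[A](q: Queue[IO, A]): Stream[IO, Chunk[A]] =
  Stream.repeatEval(
    q.take.flatMap(head => q.tryTakeN(None).map(rest => Chunk.seq(head :: rest)))
  )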
