Hi, (me again). This time it's less of a question and more of an "FYI, we're seeing this pattern, in case it's at all relevant for any of the changes you're making with v3" (I assume some of these issues might disappear there); and if by any chance you have any thoughts, we'd be more than happy to hear them.
We run K8s pods on Linux (r.n.); they're 8-core instances (limited by cgroups), and when they get a request they reach out to a so-called RT Redis. The timeout for those requests is 3 seconds, and we retry after an additional 3 seconds. We use Azure Cache for Redis as the provider for this type of Redis (migration to AMR pending). We use SER with the .NET threadpool settings (as noted before). There are multiple separate RT Redis instances, each with 5+ shards; so multiple multiplexers, each multiplexing over multiple shards.
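For context, roughly how we wire this up (a minimal sketch; the endpoint names and exact values are illustrative, not our production config):

```csharp
using System.Threading;
using StackExchange.Redis;

// Sketch of our setup assumptions; values and endpoints are illustrative only.
// Raise the minimum threadpool threads so bursts of async completions don't
// queue behind CPU-bound request work.
ThreadPool.SetMinThreads(workerThreads: 64, completionPortThreads: 64);

var options = new ConfigurationOptions
{
    EndPoints = { "rt-redis-shard-0:6379", "rt-redis-shard-1:6379" }, // hypothetical shard endpoints
    AbortOnConnectFail = false, // keep reconnecting in the background instead of failing hard
    SyncTimeout = 3000,         // our per-request timeout: 3 seconds
    AsyncTimeout = 3000,
    // ConnectTimeout left at its default; we don't override it.
};

// One multiplexer per RT Redis instance, each multiplexing over its shards.
var muxer = await ConnectionMultiplexer.ConnectAsync(options);
IDatabase db = muxer.GetDatabase();
```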
We're observing the following pattern:
Node gets overloaded with RPS (nothing to do with Redis), CPU starvation.
Errors rise and keep rising, the threadpool queue fills up (likely RT read retries; it roughly matches the expected order of magnitude).
"Dead socket detected" is logged by SER.
20-40 seconds pass.
A sudden batch of "Connection restored", and the whole system immediately recovers.
What I get, and what I don't get:
I get: the fact that when we overload the CPU, some RT reads would fail.
I get: we should have a circuit breaker / throttling on the retries and not drive the threadpool queue so high when we're failing at 100% (see the sketch after this list).
I somewhat get: the fact that when we overload the CPU, all RT reads / SER reads would fail, given how the completion of those tasks is pipelined across the shared threadpool.
I somewhat get: how the reconnect drains the threadpool (it cancels all pending tasks on those raw connections -> a super fast, immediate drain of the queue).
I don't get: why the reconnect almost always takes 30-40 seconds.
I don't get: why the failures are at 100% for those 30-40 seconds, while the rest of the app somehow works and there's at least some drainage of threadpool tasks. I'd expect the reads to at least sometimes get scheduled correctly, though since it all goes through the global queue, a read might just get unlucky and never succeed within 3 seconds.
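For the CB / throttling point above, this is the kind of gate I mean (a completely hypothetical sketch with made-up thresholds, not something we run today):

```csharp
using System;
using System.Threading;

// Hypothetical retry gate: once we're clearly in the ~100%-failure regime,
// skip retries for a cool-down period instead of queueing more 3-second
// waits onto an already-backed-up threadpool.
sealed class RetryGate
{
    private int _consecutiveFailures;
    private long _openUntilTicks;

    public bool AllowRetry()
    {
        // Gate is "open" (blocking retries) until the cool-down expires.
        return DateTime.UtcNow.Ticks >= Interlocked.Read(ref _openUntilTicks);
    }

    public void RecordSuccess() => Interlocked.Exchange(ref _consecutiveFailures, 0);

    public void RecordFailure()
    {
        // After a streak of failures (20 here, purely illustrative),
        // open the gate for a 5-second cool-down.
        if (Interlocked.Increment(ref _consecutiveFailures) >= 20)
        {
            Interlocked.Exchange(ref _openUntilTicks, DateTime.UtcNow.AddSeconds(5).Ticks);
            Interlocked.Exchange(ref _consecutiveFailures, 0);
        }
    }
}
```

Callers would check `AllowRetry()` before the second 3-second attempt and just fail fast while the gate is open.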
E.g. on the graph below, for the non-crossed-out "pods":
blue: incoming RPS
purple: RT Redis read success
green and orange: also a proxy for success (you can ignore those)
red: RT Redis read failure
magenta (thick): threadpool state estimated from SER exceptions (thanks for that!)
black x on top: "dead socket detected" logline from SER (always when the issues start)
golden thingy (rotated square): "connection restored" (always when it recovers)
dotted red horizontal line: RT errors start
pink shading: only RT read errors happening, no successes
I tried to figure out why it takes 20-40 seconds from the start of the dead sockets to recovery, and I couldn't. We don't override the SER connect timeout, and even if we did, it should be capped at 10 seconds. So I'm not sure where that delay is coming from.
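To pin down where those 30-40 seconds go, we're planning to timestamp the multiplexer's own events; a rough sketch, assuming the standard ConnectionFailed / ConnectionRestored events on the `muxer` from the earlier snippet:

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;

// Rough sketch: timestamp SER's connection events to measure the gap between
// the failure being noticed and "Connection restored" actually firing.
var failedAt = new ConcurrentDictionary<EndPoint, DateTime>();

muxer.ConnectionFailed += (_, e) =>
{
    if (e.EndPoint is not null)
        failedAt[e.EndPoint] = DateTime.UtcNow;
    Console.WriteLine($"[{DateTime.UtcNow:O}] ConnectionFailed {e.EndPoint} {e.FailureType}: {e.Exception?.Message}");
};

muxer.ConnectionRestored += (_, e) =>
{
    var gap = e.EndPoint is not null && failedAt.TryGetValue(e.EndPoint, out var t)
        ? (DateTime.UtcNow - t).TotalSeconds
        : double.NaN;
    Console.WriteLine($"[{DateTime.UtcNow:O}] ConnectionRestored {e.EndPoint} after ~{gap:F1}s");
};
```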
Sometimes this also happens to freshly started pods, roughly 40s after they start, even if they successfully take traffic in those first 40 seconds... (and before the issues start, there's no indication of CPU pressure or anything).