Hi, (me again). This time it's less of a question and more of an "FYI, we're seeing this pattern, in case it's at all relevant for any of the changes you're making with v3" (I assume some of these issues might disappear there); and if by any chance you have any thoughts, we'd be more than happy to hear them.
We run K8s pods on Linux (r.n.); they're 8-core instances (limited by cgroups), and when they get a request they reach out to a so-called RT Redis. The timeout for those requests is 3 seconds, and we retry after an additional 3 seconds. We use Azure Cache for Redis as the provider for this type of Redis (migration to AMR pending). We use SER with the .NET threadpool settings (as noted before). There are multiple separate RT Redis instances, each with 5+ shards; so multiple multiplexers, each multiplexing over multiple shards.
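For context, roughly how we wire this up (a minimal sketch; the endpoint names and exact values are illustrative, not our production config):

```csharp
using System.Threading;
using StackExchange.Redis;

// Sketch of our setup assumptions; values and endpoints are illustrative only.
// Raise the minimum threadpool threads so bursts of async completions don't
// queue behind CPU-bound request work.
ThreadPool.SetMinThreads(workerThreads: 64, completionPortThreads: 64);

var options = new ConfigurationOptions
{
    EndPoints = { "rt-redis-shard-0:6379", "rt-redis-shard-1:6379" }, // hypothetical shard endpoints
    AbortOnConnectFail = false, // keep reconnecting in the background instead of failing hard
    SyncTimeout = 3000,         // our per-request timeout: 3 seconds
    AsyncTimeout = 3000,
    // ConnectTimeout left at its default; we don't override it.
};

// One multiplexer per RT Redis instance, each multiplexing over its shards.
var muxer = await ConnectionMultiplexer.ConnectAsync(options);
IDatabase db = muxer.GetDatabase();
```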
We're observing the following pattern:
Node gets overloaded with RPS (nothing to do with Redis), CPU starvation.
Errors rise and keep rising, the threadpool queue fills up (likely RT read retries; it roughly matches the expected order of magnitude).
"Dead socket detected" is logged by SER.
20-40 seconds pass.
A sudden batch of "Connection restored", and the whole system immediately recovers.
What I get, and what I don't get:
I get: the fact that when we overload the CPU, some RT reads would fail.
I get: we should have a circuit breaker / throttling on the retries and not drive the threadpool queue so high when we're failing at 100% (see the sketch after this list).
I somewhat get: the fact that when we overload the CPU, all RT reads / SER reads would fail, given how the completion of those tasks is pipelined across the shared threadpool.
I somewhat get: how the reconnect drains the threadpool (it cancels all pending tasks on those raw connections -> a super fast, immediate drain of the queue).
I don't get: why the reconnect almost always takes 30-40 seconds.
I don't get: why the failures are at 100% for those 30-40 seconds, while the rest of the app somehow works and there's at least some drainage of threadpool tasks. I'd expect the reads to at least sometimes get scheduled correctly, though since it all goes through the global queue, a read might just get unlucky and never succeed within 3 seconds.
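For the CB / throttling point above, this is the kind of gate I mean (a completely hypothetical sketch with made-up thresholds, not something we run today):

```csharp
using System;
using System.Threading;

// Hypothetical retry gate: once we're clearly in the ~100%-failure regime,
// skip retries for a cool-down period instead of queueing more 3-second
// waits onto an already-backed-up threadpool.
sealed class RetryGate
{
    private int _consecutiveFailures;
    private long _openUntilTicks;

    public bool AllowRetry()
    {
        // Gate is "open" (blocking retries) until the cool-down expires.
        return DateTime.UtcNow.Ticks >= Interlocked.Read(ref _openUntilTicks);
    }

    public void RecordSuccess() => Interlocked.Exchange(ref _consecutiveFailures, 0);

    public void RecordFailure()
    {
        // After a streak of failures (20 here, purely illustrative),
        // open the gate for a 5-second cool-down.
        if (Interlocked.Increment(ref _consecutiveFailures) >= 20)
        {
            Interlocked.Exchange(ref _openUntilTicks, DateTime.UtcNow.AddSeconds(5).Ticks);
            Interlocked.Exchange(ref _consecutiveFailures, 0);
        }
    }
}
```

Callers would check `AllowRetry()` before the second 3-second attempt and just fail fast while the gate is open.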
E.g. on the graph below, for the non-crossed-out "pods":
blue: incoming RPS
purple: RT Redis read success
green and orange: also a proxy for success (you can ignore those)
red: RT Redis read failure
magenta (thick): threadpool state estimated from SER exceptions (thanks for that!)
black x on top: "dead socket detected" logline from SER (always when the issues start)
golden thingy (rotated square): "connection restored" (always when it recovers)
dotted red horizontal line: RT errors start
pink shading: only RT read errors happening, no successes
I tried to figure out why it takes 20-40 seconds from the start of the dead sockets to recovery, and I couldn't. We don't override the SER connect timeout, and even if we did, it should be capped at 10 seconds. So I'm not sure where that delay is coming from.
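To pin down where those 30-40 seconds go, we're planning to timestamp the multiplexer's own events; a rough sketch, assuming the standard ConnectionFailed / ConnectionRestored events on the `muxer` from the earlier snippet:

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;

// Rough sketch: timestamp SER's connection events to measure the gap between
// the failure being noticed and "Connection restored" actually firing.
var failedAt = new ConcurrentDictionary<EndPoint, DateTime>();

muxer.ConnectionFailed += (_, e) =>
{
    if (e.EndPoint is not null)
        failedAt[e.EndPoint] = DateTime.UtcNow;
    Console.WriteLine($"[{DateTime.UtcNow:O}] ConnectionFailed {e.EndPoint} {e.FailureType}: {e.Exception?.Message}");
};

muxer.ConnectionRestored += (_, e) =>
{
    var gap = e.EndPoint is not null && failedAt.TryGetValue(e.EndPoint, out var t)
        ? (DateTime.UtcNow - t).TotalSeconds
        : double.NaN;
    Console.WriteLine($"[{DateTime.UtcNow:O}] ConnectionRestored {e.EndPoint} after ~{gap:F1}s");
};
```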
Sometimes this also happens to freshly started pods, roughly 40s after they start, even if they successfully take traffic in those first 40 seconds... (and before the issues start, there's no indication of CPU pressure or anything).