Kernel-enforced TCP timeouts for Ruby. Threads that wedge in network syscalls die at the deadline you set, enforced at the kernel level rather than at the Ruby level.
Verified on Ruby 3.3 / Linux. Full suite: 49 unit + safety + integration tests, 4 kernel-enforcement tests, 0 failures. The kernel suite forces a wedged TCP write against a non-reading server and proves the deadline fires within ~1.5s of the configured 1s. See
test/linux/kernel_enforcement_test.rb.
TcpUserTimeout.with_timeout(30) do
  Net::HTTP.get(URI("https://example.com"))
end
# If the server stops making forward progress, the Linux kernel kills the
# connection at ~30 seconds and Ruby raises Errno::ETIMEDOUT.

A Ruby worker process sat at 100% memory and 0% CPU for 45 minutes. SolidQueue's reaper logged "claimed_executions: 12" and refused to release them. kill -QUIT produced no thread dump. The supervisor eventually OOM-killed the worker. Twelve user-visible jobs lost in flight.
Root cause: an upstream API stopped reading from its TCP socket. The Ruby write call blocked in sendmsg(2) waiting for an ACK that never came. The job's outer Timeout.timeout(60) did exactly nothing: Timeout.timeout works by raising in the calling thread when control returns to user space, and the calling thread had not returned to user space. It was parked in the kernel.
This is not a Ruby bug, a SolidQueue bug, or a Net::HTTP bug. It's the well-documented limit of MRI's threading model: a thread blocked in a syscall cannot be interrupted from Ruby. Thread#kill, Thread#raise, and Timeout.timeout all set flags that the blocked thread will check the next time it returns from the kernel. If it never returns, the flag is never read.
The fix has to live in the kernel itself. Linux's TCP_USER_TIMEOUT socket option tells the kernel: if data sent on this connection goes unacknowledged for N milliseconds, forcibly close the connection and return ETIMEDOUT to userspace. The kernel drops the connection. The blocking syscall returns. The Ruby thread unblocks. Your rescue Errno::ETIMEDOUT runs. The worker recovers.
This gem makes TCP_USER_TIMEOUT easy to apply (globally, per-block, per-request, per-job) without you having to remember the optname, the level constant, or the platform fallback.
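For orientation, the raw call being wrapped is a single setsockopt on a connected socket. The following is a minimal sketch, not the gem's implementation: apply_tcp_user_timeout is a hypothetical helper name, and it assumes the Ruby build exposes Socket::TCP_USER_TIMEOUT (which it normally does on Linux). On platforms without the option it simply reports false.

```ruby
require "socket"

# Hypothetical helper (not part of the gem): ask the kernel to drop the
# connection when sent data stays unacknowledged for `seconds`.
# Returns true if the option was applied, false if the platform lacks it.
def apply_tcp_user_timeout(sock, seconds)
  # The constant is only defined where the OS headers provide it (Linux).
  return false unless Socket.const_defined?(:TCP_USER_TIMEOUT)

  milliseconds = (seconds * 1000).round # the kernel expects milliseconds
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_USER_TIMEOUT, milliseconds)
  true
rescue SystemCallError
  # e.g. Errno::ENOPROTOOPT when the kernel rejects the option
  false
end
```

Called as apply_tcp_user_timeout(TCPSocket.new(host, port), 30), this bounds only that one socket; the block, middleware, and job APIs described below exist so you do not have to thread such a call through every client library.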
| Mechanism | How it cancels | Wedged-syscall outcome |
|---|---|---|
| Timeout.timeout(N) { ... } | Background thread raises in target thread when it returns to user space | Hangs forever. The target thread is in the kernel and never checks the flag. |
| Thread#kill / Thread#raise | Same flag-based interrupt mechanism | Hangs forever. Same reason. |
| TCP keepalive | Kernel probes idle connections; closes after ~2 hours of inactivity | "Works" eventually. Production-unusable. |
| SO_RCVTIMEO / SO_SNDTIMEO | Per-socket recv/send timeout | Works, but only covers reads/writes, not the actual wedge condition (data sent, no ACK). |
| TCP_USER_TIMEOUT | Kernel forcibly closes the socket when transmitted data goes unacknowledged for N ms | Works. The syscall returns with ETIMEDOUT. Ruby unblocks. |
TCP_USER_TIMEOUT is the only mechanism that addresses the actual failure mode (forward progress stops at the network layer) at the layer where the thread is actually stuck (the kernel).
Run on Linux (or via the included Dockerfile.test from macOS):
$ ruby examples/before_after.rb before
server listening on 127.0.0.1:54321
[hangs indefinitely; ctrl-C after 60s]
$ ruby examples/before_after.rb after
server listening on 127.0.0.1:54321
got IO::TimeoutError: Blocking operation timed out! after 1.0s
Pre-1.0. The core mechanism is small (one setsockopt call + thread-local state + a Module#prepend) and verified end-to-end on Linux. The Rack middleware and ActiveJob concern have unit tests. The Sidekiq middleware is unit-tested with synthesized job hashes; running it inside a real Sidekiq server has not yet been validated by this maintainer. The integrations are thin glue around the same with_timeout block: if one of them misbehaves for you, calling with_timeout directly inside your job's perform is the always-works escape hatch.
To set expectations clearly:
- It does not interrupt CPU-bound or non-network code. A pure-Ruby infinite loop is unaffected. This gem only addresses socket-level wedges.
- It does not cover DNS resolution. getaddrinfo(3) happens before the TCP socket exists. A wedged resolver bypasses TCP_USER_TIMEOUT entirely. Configure resolv.conf (options timeout:1 attempts:2) or use a DNS client with its own timeout.
- It does not cover the TCP connect phase. TCP_USER_TIMEOUT only applies once the socket is established. Use the host library's connect-timeout (Net::HTTP#open_timeout, libpq connect_timeout, etc.) for that phase.
- It does not cover FFI-based clients. libcurl (curb), C-level MySQL clients, the grpc gem, etc., open their own sockets at the C layer and bypass Ruby's TCPSocket. Pure-Ruby HTTP clients (Net::HTTP, httpx, most Faraday adapters, excon) are covered.
- It does not retry. When the kernel kills the connection, your code receives Errno::ETIMEDOUT (or one of the related exception classes; see Catching the timeout). Retry policy is yours.
- It does not work on macOS, BSD, or Windows. No equivalent socket option exists. The gem silently no-ops on those platforms so dev workflows are unaffected, but production must run on Linux for the deadline to actually fire.
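Because the connect phase falls outside TCP_USER_TIMEOUT, it has to be bounded by the client library itself. A minimal sketch using only Net::HTTP's standard timeout accessors; bounded_http is a hypothetical helper name and the default values are placeholders, not recommendations from this gem:

```ruby
require "net/http"

# Hypothetical helper: build a Net::HTTP client whose connect phase and
# individual reads/writes are bounded by Net::HTTP itself.
# No connection is opened until a request is actually made.
def bounded_http(host, port, connect_seconds: 2, io_seconds: 10)
  http = Net::HTTP.new(host, port)
  http.open_timeout  = connect_seconds # bounds the TCP connect phase
  http.read_timeout  = io_seconds      # bounds each blocking read
  http.write_timeout = io_seconds      # bounds each blocking write
  http
end
```

These library-level timeouts and the kernel deadline are complementary: the former cover connection setup, the latter covers a socket that has stopped making forward progress after it was established.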
Out-of-the-box coverage for a typical Rails-on-Linux production stack:
| Layer | Tool | Mechanism |
|---|---|---|
| Web server | Puma, Falcon, Passenger, Thin, Unicorn | Rack middleware bounds the request |
| Job queue | Sidekiq | Server middleware (this gem) |
| Job queue | SolidQueue | require "tcp_user_timeout/solid_queue" |
| Job queue | GoodJob | ActiveJob concern |
| Job queue | Resque, DelayedJob | Wrap perform with with_timeout |
| HTTP client | Net::HTTP, httpx, Faraday (Net::HTTP/httpx adapters), Excon, RestClient | Socket-layer hook |
| HTTP client | curb (libcurl) | Not covered; use libcurl's own timeouts |
| Database | PostgreSQL via libpq | Native tcp_user_timeout connection param |
| Database | MySQL via Trilogy (Rails 7.1+) | Socket-layer hook |
| Database | MySQL via mysql2 | Not covered; use adapter read_timeout |
| Database | MongoDB Ruby driver | Socket-layer hook |
| Cache / store | Redis (redis-rb), Memcached (dalli) | Socket-layer hook |
| Email | Net::SMTP, Mail, ActionMailer | Socket-layer hook |
| RPC | gRPC (grpc gem) | Not covered; use gRPC deadline: |
| WebSockets / SSE / pub/sub | Action Cable, Redis pub/sub | Use exempt_hosts to skip |
If your stack is in the "Not covered" rows, either use that layer's own timeout primitive or rely on global_default_seconds as a coarse safety net for everything else.
| Requirement | Version |
|---|---|
| Ruby | 3.2+ (uses Fiber[] inheritable storage so child threads/fibers spawned inside with_timeout get the deadline) |
| Linux kernel | 2.6.37+ (TCP_USER_TIMEOUT was added in 2010) |
| Rails | 7.0+ (optional; only needed for the Railtie and ActiveJob concern) |
| Sidekiq | any version with server-middleware support (optional) |
| SolidQueue | any version (optional; uses ActiveJob middleware) |
macOS, BSD, and Windows are supported as silent no-ops so dev workflows are unaffected. Kernel-level enforcement requires Linux.
# Gemfile
gem "tcp_user_timeout"

For non-Rails apps, install the Socket hooks once at app boot:

require "tcp_user_timeout"
TcpUserTimeout.install!

For Rails apps, the included Railtie handles install at boot. No initializer required unless you want to set a global ceiling:

# config/initializers/tcp_user_timeout.rb
TcpUserTimeout.global_default_seconds = 600 # 10 minute safety net

A complete production initializer covering all three tiers:

# config/initializers/tcp_user_timeout.rb
# Global ceiling: every outbound TCP socket gets at most 10 minutes.
# This is your "no thread wedges forever" insurance policy.
TcpUserTimeout.global_default_seconds = 600
# Connections that legitimately stay idle for long periods.
# (Action Cable, Server-Sent Events, Redis pub/sub, message-broker subscribers.)
TcpUserTimeout.exempt_hosts = [
  /\.internal\z/, # service mesh (managed elsewhere)
  /actioncable/,  # WebSocket endpoints
  /redis-pubsub/  # subscriber connections
]

# config/application.rb
# Web tier: bound every request below Puma's worker_timeout.
config.middleware.use TcpUserTimeout::Middleware, timeout: 30

# app/jobs/application_job.rb
class ApplicationJob < ActiveJob::Base
  include ActiveJob::MaxExecutionTime
end

# Any job:
class FetchUpstreamJob < ApplicationJob
  self.max_execution_time = 30.seconds

  def perform(url)
    Net::HTTP.get(URI(url))
  end
end

That's it. The web request, the job, and any specific call inside either of them are all bounded by the kernel.
Scope a deadline to a specific operation:
TcpUserTimeout.with_timeout(30) do
  result = AnthropicClient.create_message(...)
end
# Inside the block, every newly opened TCP socket gets TCP_USER_TIMEOUT = 30s.
# Outside the block, the previous setting (or the global default) applies.

with_timeout is thread-local and exception-safe. It nests cleanly:

TcpUserTimeout.with_timeout(60) do   # outer bound: 60s
  TcpUserTimeout.with_timeout(5) do  # tighter inner: 5s
    risky_call
  end
  # back to 60s here
end

Bound every web request:

# config/application.rb
config.middleware.use TcpUserTimeout::Middleware, timeout: 30

Works with any Rack server: Puma, Falcon, Passenger, Thin, Unicorn. Set the timeout below your web server's request kill threshold (Puma's worker_timeout, Falcon's deadline, Passenger's max_request_time, NGINX's proxy_read_timeout) so the kernel-level kill happens before the supervisor takes the worker down. A common shape:
| Layer | Bound |
|---|---|
| NGINX proxy_read_timeout | 60s |
| Puma worker_timeout (or Passenger max_request_time) | 60s |
| TcpUserTimeout::Middleware | 30s |
| Per-call TcpUserTimeout.with_timeout | 5–15s |
Skip the bound for specific requests:
class StreamingController < ApplicationController
  before_action { request.env["tcp_user_timeout.skip"] = true }
end

Most queue libraries treat job timeouts as observability: they fire an alert if the job runs too long, but don't actually bound anything. TcpUserTimeout makes the contract real:

class FetchUpstreamJob < ApplicationJob
  include ActiveJob::MaxExecutionTime
  self.max_execution_time = 30.seconds

  def perform(url)
    Net::HTTP.get(URI(url))
  end
end

If the upstream wedges, the kernel closes the socket at ~25s (5s headroom for the rescue handler) and the job fails cleanly instead of leaking the worker thread until the process restarts.
To apply this to every job:
class ApplicationJob < ActiveJob::Base
  include ActiveJob::MaxExecutionTime
end

If you're on SolidQueue, one require wires everything up:

# config/initializers/tcp_user_timeout.rb
require "tcp_user_timeout/solid_queue"

This installs the Socket hooks and includes ActiveJob::MaxExecutionTime into ActiveJob::Base, so any job that declares self.max_execution_time = N.seconds gets enforced bounds without further changes.
Sidekiq doesn't go through ActiveJob by default, so the binding mechanism is a server middleware that reads max_execution_time from sidekiq_options:
# config/initializers/sidekiq.rb
require "tcp_user_timeout/sidekiq"

class FetchUpstreamWorker
  include Sidekiq::Worker
  sidekiq_options max_execution_time: 30

  def perform(url)
    Net::HTTP.get(URI(url))
  end
end

If you also use ActiveJob on top of Sidekiq, additionally include ActiveJob::MaxExecutionTime into ApplicationJob. Both layers compose (innermost with_timeout wins).

Same pattern as Sidekiq: wrap each job's perform in TcpUserTimeout.with_timeout(seconds). For GoodJob (which uses ActiveJob), ActiveJob::MaxExecutionTime is enough; just install the hooks at boot:

# config/initializers/tcp_user_timeout.rb
require "tcp_user_timeout"
TcpUserTimeout.install!
ActiveSupport.on_load :active_job do
  include ActiveJob::MaxExecutionTime
end

For Resque (or any queue with a worker-class hook), wrap perform directly:

class FetchUpstreamWorker
  MAX_EXECUTION_TIME = 30

  def self.perform(*args)
    TcpUserTimeout.with_timeout(MAX_EXECUTION_TIME) do
      new.do_work(*args)
    end
  end
end

The general principle: any queue with a "run this code" extension point can wrap that point in with_timeout. The Sidekiq middleware and ActiveJob concern in this gem are convenience wrappers around exactly this pattern.
Some connections legitimately stay idle for long periods and should not be torn down by the kernel. Configure exempt hosts to skip those:
TcpUserTimeout.exempt_hosts = [
  /\.internal\z/,   # internal mesh, managed elsewhere
  "kafka-broker-1", # specific broker
  /redis-pubsub/,   # subscriber connections
  /actioncable/     # WebSocket / SSE endpoints
]

Strings match exactly. Regexps match via =~. The exempt list applies to any socket opened with a known host (i.e., everything that goes through Socket.tcp / TCPSocket.new, which is almost everything pure-Ruby).
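To make the matching rule concrete, here is a small sketch of how such a list could be evaluated. It is an illustration of the stated rule only; exempt_host? is a hypothetical name and this is not the gem's internal code.

```ruby
# Hypothetical matcher illustrating "strings match exactly, regexps via =~".
def exempt_host?(host, rules)
  rules.any? do |rule|
    case rule
    when Regexp then !(rule =~ host).nil? # pattern match anywhere in the host
    when String then rule == host         # whole-string equality only
    else false
    end
  end
end

rules = [/\.internal\z/, "kafka-broker-1", /redis-pubsub/]

exempt_host?("billing.internal", rules) # => true  (regexp matches the suffix)
exempt_host?("kafka-broker-1", rules)   # => true  (exact string)
exempt_host?("kafka-broker-10", rules)  # => false (strings are not prefixes)
```

The practical consequence is the last line: a string entry never acts as a prefix or substring match, so use a Regexp when a family of hosts should be exempt.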
TCP_USER_TIMEOUT forces the kernel to kill connections that haven't made forward progress within the deadline. That's wrong for:
- WebSockets and Action Cable. Idle gaps between messages are normal. Add the host to exempt_hosts, or set the per-block timeout to something larger than the maximum expected idle period. For Action Cable specifically:

      # config/initializers/tcp_user_timeout.rb
      TcpUserTimeout.exempt_hosts = [
        /actioncable/, # Action Cable mount point
        /\.cable\./    # any host containing ".cable."
      ]

- Server-Sent Events (SSE). Same as WebSockets: a long-lived idle stream.

- Message broker subscribers (Kafka consumer connections, Redis pub/sub subscribers). The connection is supposed to sit there waiting for messages. Either exempt the host, or set the timeout to something well above your producer's max-idle.

- Long-poll endpoints. /poll?wait=300 etc. Exempt the host or scope the timeout to be larger than your poll deadline.

- Pooled connections you don't control. See the persistent-pool note in Failure modes below.
Production-quality timeout configs operate at three tiers, each tightening as you get closer to the actual call:
| Tier | Mechanism | Typical bound |
|---|---|---|
| Outer (transport) | Web server / queue supervisor | 60s |
| Middle (request/job scope) | TcpUserTimeout::Middleware / max_execution_time | 30s |
| Inner (per call) | TcpUserTimeout.with_timeout(N) around specific calls | 5–15s |
Inner bounds are tighter than outer bounds so the inner failure raises a recoverable exception (your rescue runs, the request returns 504, the job retries) before the outer supervisor pulls the plug on the whole worker.
The TCP_USER_TIMEOUT we set is slightly less than the declared deadline so the kernel kills the socket before any outer guard fires:
| max_execution_time | TCP_USER_TIMEOUT |
|---|---|
| 1s | 0.9s |
| 10s | 5s |
| 30s | 25s |
| 90s | 85s |
| 10min | 9min 55s |
5s headroom at production scales (≥10s); 90% of max below that so very short timeouts in tests still get enforced. Implementation: ActiveJob::MaxExecutionTime.headroom_seconds.
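The rule behind the table can be written out as a short function. This is a sketch that reproduces the table's numbers, not the gem's code: kernel_timeout_for and both constants are hypothetical names, and the real logic lives behind ActiveJob::MaxExecutionTime.headroom_seconds, whose exact signature is not shown in this document.

```ruby
# Illustrative constants taken from the rule stated above.
HEADROOM_SECONDS = 5          # fixed headroom at production scales
SHORT_TIMEOUT_THRESHOLD = 10  # below this, use 90% of the declared maximum

# Hypothetical helper: map a declared max_execution_time (in seconds) to the
# TCP_USER_TIMEOUT value (in seconds) that leaves room for the rescue handler.
def kernel_timeout_for(max_execution_seconds)
  if max_execution_seconds >= SHORT_TIMEOUT_THRESHOLD
    max_execution_seconds - HEADROOM_SECONDS
  else
    max_execution_seconds * 0.9
  end
end

kernel_timeout_for(30)  # => 25
kernel_timeout_for(10)  # => 5
kernel_timeout_for(600) # => 595 (9min 55s)
kernel_timeout_for(1)   # => 0.9
```

Note the step at the threshold: a 10s declaration yields a 5s kernel bound, while a 9s declaration would yield 8.1s under the same rule.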
The block API is safe under MRI's threading, fiber, and fork models. State is held in Fiber[] (Ruby 3.2+ inheritable fiber storage) with a Thread.current[] fallback for older Rubies.
- Threads spawned inside with_timeout inherit the deadline. If a job spawns helper threads to do parallel I/O, those threads are bounded by the same kernel deadline as the job itself. (This is what you want; the alternative leaves sub-thread I/O unbounded.)
- Fibers spawned inside with_timeout inherit the deadline. Falcon, Async, and any fiber-per-task pattern get the deadline propagated to spawned subtasks.
- Threads/fibers spawned outside with_timeout never see it. No global leakage.
- Concurrent with_timeout calls on different top-level threads are isolated. Each thread's main fiber has its own storage slot.
- Fork. install! is idempotent and survives fork(2): the child inherits the prepended hooks, and a second install! in the child does not double-prepend.
- Concurrent install!. Calling install! from multiple threads at boot is safe; the prepend will not happen more than once.
- Pre-existing sockets are never rebound. Hooks fire on socket creation, not on every operation. Boot-time pool connections (ActiveRecord, persistent HTTP pools, Redis) keep their original behavior; only new sockets opened inside a with_timeout block get bound. This is the property that makes the gem safe to drop into a Rails app with long-lived pools.
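The inheritance rules listed above follow from how Ruby 3.2+ fiber storage behaves: a new thread or fiber receives a copy of its creator's storage at creation time. The sketch below demonstrates that behavior in isolation. DeadlineScope and its storage key are hypothetical names chosen for illustration; this is not the gem's implementation.

```ruby
# Requires Ruby 3.2+ (Fiber[] / Fiber[]= inheritable storage).
module DeadlineScope
  KEY = :sketch_deadline_seconds # hypothetical storage key

  # Set the deadline for the duration of the block, then restore the
  # previous value even if the block raises.
  def self.with_deadline(seconds)
    previous = Fiber[KEY]
    Fiber[KEY] = seconds
    yield
  ensure
    Fiber[KEY] = previous
  end

  # The deadline visible to the currently running fiber, or nil.
  def self.current
    Fiber[KEY]
  end
end

DeadlineScope.with_deadline(60) do
  DeadlineScope.with_deadline(5) { DeadlineScope.current } # => 5 (inner wins)
  DeadlineScope.current                                    # => 60 (restored)
  Thread.new { DeadlineScope.current }.value               # => 60 (inherited)
end
Thread.new { DeadlineScope.current }.value                 # => nil (no leakage)
```

Because the child gets a copy rather than a shared reference, a thread spawned inside the block keeps the deadline it was born with, and nothing it sets can leak back into the parent.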
global_default_seconds and exempt_hosts are unsynchronized class-level state and should be set once at boot, before threads are accepting work. Both are immutable in steady state, so there is no read/write contention concern after boot.
When TCP_USER_TIMEOUT fires, the exception class your code sees depends on which Ruby IO layer was on top of the socket. Production rescues should catch all four:
require "net/http"
require "openssl"

begin
  TcpUserTimeout.with_timeout(30) do
    Net::HTTP.get(URI("https://upstream.example.com"))
  end
rescue Errno::ETIMEDOUT,            # raw read/write surface
       IO::TimeoutError,            # Ruby 3.2+ wrapper for some IO operations
       Net::ReadTimeout,            # Net::HTTP's wrapping of either of the above
       OpenSSL::SSL::SSLError => e  # TLS connections may surface socket death as a TLS error
  # "=> e" must sit on the same line as the last class; on its own line it is a syntax error.
  Rails.logger.warn("upstream wedged: #{e.class}: #{e.message}")
  # Retry, return a stale cache, surface a 504, etc.
end

OpenSSL::SSL::SSLError is included because OpenSSL on top of TCP can wrap the underlying socket's ETIMEDOUT as a TLS error rather than passing it through cleanly. Both behaviors are "valid" kernel-enforced timeouts; you just need to catch both classes.
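Since retry policy is left to the caller, one possible shape is a small helper that retries only on the four classes above. This is a sketch under stated assumptions: with_timeout_retries and KERNEL_TIMEOUT_ERRORS are hypothetical names, not part of the gem, and IO::TimeoutError requires Ruby 3.2+.

```ruby
require "net/http"
require "openssl"

# The four classes named in the rescue above.
KERNEL_TIMEOUT_ERRORS = [
  Errno::ETIMEDOUT,
  IO::TimeoutError,
  Net::ReadTimeout,
  OpenSSL::SSL::SSLError
].freeze

# Hypothetical helper: run the block up to `attempts` times in total,
# retrying only when a timeout-related class is raised. The block receives
# the 1-based attempt number. Any other exception propagates immediately.
def with_timeout_retries(attempts: 3, pause: 0)
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *KERNEL_TIMEOUT_ERRORS
    raise if tries >= attempts # out of attempts: re-raise the last error
    sleep(pause) if pause.positive?
    retry
  end
end
```

A call such as with_timeout_retries(attempts: 2) { TcpUserTimeout.with_timeout(5) { fetch } } (with fetch standing in for your own call) is only appropriate for idempotent operations. Also keep in mind that OpenSSL::SSL::SSLError is broad, so this helper would retry genuine TLS failures such as certificate errors as well.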
This gem is a sharp tool. Read these before relying on it.
- Exception class translation. When the kernel kills the connection, Ruby surfaces it as either Errno::ETIMEDOUT (lower-level reads/writes) or IO::TimeoutError (Ruby 3.2+'s wrapping of certain IO operations). Net::HTTP may further wrap as Net::ReadTimeout. Production rescues should catch all three, plus OpenSSL::SSL::SSLError for TLS-wrapped sockets where the SSL layer surfaces the underlying socket death as a TLS error.

- TLS-wrapped sockets. With OpenSSL on top of TCP, when the kernel kills the underlying socket mid-SSL_read, the SSL layer can surface a clean Errno::ETIMEDOUT or an OpenSSL::SSL::SSLError. Both are valid kernel-enforced timeouts; rescues should catch both classes.

- Persistent connection pools. Pooled connections retain the TCP_USER_TIMEOUT value from whichever request created them. Per-block tightening is best-effort for pooled connections; global_default_seconds is the operative ceiling. See the pooled-client section for client-specific guidance.

- DNS not covered. TCP_USER_TIMEOUT only applies after the TCP socket is established. getaddrinfo wedges (slow/wedged resolver) bypass this entirely. Mitigate via resolv.conf (options timeout:1 attempts:2) or via your platform's DNS settings.

- Connect timeout vs read timeout. TCP_USER_TIMEOUT covers post-connection wedges. The connect phase has its own timeout: Net::HTTP#open_timeout, libpq connect_timeout, etc. Both layers are needed.

- FFI-based clients bypass the hooks. This gem prepends Socket.tcp, TCPSocket.new, and TCPSocket.open. Anything wrapping libcurl (curb), or other FFI-based HTTP clients, bypasses Ruby's socket layer and isn't covered. Pure-Ruby HTTP clients (Net::HTTP, httpx, most Faraday adapters, excon) all go through the hooked methods.

- Linux only. macOS, BSD, and Windows don't support TCP_USER_TIMEOUT. The gem silently no-ops via Errno::ENOPROTOOPT on those platforms, which keeps dev work unaffected but also means you can't validate kernel enforcement locally on a Mac. Use the included Dockerfile.test to run the kernel tests against Linux from anywhere.
This gem hooks Ruby's socket layer, which means it covers any pure-Ruby client (Net::HTTP, httpx, most Faraday adapters, redis-rb, excon, mongo ruby driver). It does not cover clients whose sockets are managed at the C level (libpq, libmysqlclient, libcurl). For those, the underlying library typically exposes its own equivalent of TCP_USER_TIMEOUT β set it there, and let this gem cover everything else.
libpq supports tcp_user_timeout as a connection parameter directly (PG 12+). For Postgres connections specifically, set it server-side without going through this gem's Ruby socket hooks:
# config/database.yml
default: &default
adapter: postgresql
connect_timeout: 5
variables:
statement_timeout: '120s'
tcp_user_timeout: 600000 # ms; 10 min safety net for non-job DB calls

The MySQL C client (libmysqlclient, used by mysql2) doesn't expose TCP_USER_TIMEOUT directly, but Trilogy (Ruby 3.0+, a built-in adapter in Rails 7.1+) goes through Ruby sockets and is covered by this gem. If you're still on the legacy mysql2 adapter, rely on read_timeout / write_timeout at the adapter level. They don't kill wedged connections at the kernel layer, but they do bound TCP-active reads.

The MongoDB Ruby driver opens sockets via TCPSocket.new, so this gem covers them. The driver also exposes socket_timeout / connect_timeout as connection options; set both for belt-and-suspenders.
redis-rb uses Ruby sockets, so it is covered. Set connect_timeout, read_timeout, write_timeout at client construction as a separate layer. Note that pub/sub subscribers (subscribe, psubscribe) hold connections idle for arbitrary durations; add those hosts to exempt_hosts.
dalli opens TCP sockets via TCPSocket.new for non-Unix-socket connections, so it is covered by the gem's hooks. The dalli client also supports socket_timeout: at construction (default 0.5s); both layers compose. Memcached connections are typically short and fast, so the kernel deadline rarely fires in practice but is there as the safety net.
Net::SMTP opens sockets via TCPSocket.open, so it is covered. Mail's SMTP delivery method goes through Net::SMTP, and ActionMailer uses Mail underneath, so the entire mail-delivery stack is covered. SMTP servers can stop responding mid-transaction (relay troubles, greylisting); the deadline catches that.
The official grpc gem links libgrpc at the C level, so its sockets are NOT visible to this gem's hooks. gRPC has its own deadline/timeout API (deadline: on every call); use it. The grpc proposal A18 documents kernel-level TCP_USER_TIMEOUT for gRPC at the C layer, but you have to enable it via channel args.
Pooled clients (ActiveRecord, redis-rb with a connection pool, custom HTTP keep-alive pools) retain whatever TCP_USER_TIMEOUT was set when the connection was created. with_timeout only affects sockets opened inside the block; pooled connections opened earlier keep their original setting.
In practice this means:
- global_default_seconds is the operative ceiling for all pooled connections.
- with_timeout is best-effort for pooled clients: if the pool already has a warm connection, it'll be used as-is.
- For tighter per-call bounds against a pooling client, either force a fresh connection, or live with the pool's existing bound and rely on application-level retries.
The cross-platform unit tests run anywhere:
bundle exec rake test

The Linux-only kernel enforcement tests need a Linux box. Use the bundled Docker image from a macOS dev machine:

docker build -f Dockerfile.test -t tcp_user_timeout:linux .
docker run --rm -v $PWD:/app -w /app tcp_user_timeout:linux \
  bash -c "bundle install && bundle exec rake test:linux"

Expected: 0 failures, 0 errors on Linux.
- tcp(7) man page: canonical Linux reference
- Cloudflare: When TCP sockets refuse to die (definitive blog on the underlying problem)
- Instacart: The Vanishing Thread and PostgreSQL TCP Connection Parameters (production war story)
- gRPC proposal A18-tcp-user-timeout (design rationale)
- Linux kernel commit
- Ankane: The Ultimate Guide to Ruby Timeouts (reference for how various Ruby gems surface timeout exceptions)
See CHANGELOG.md for the full release history.
MIT. See LICENSE.