rubymonolith/tcp_user_timeout

tcp_user_timeout

Kernel-enforced TCP timeouts for Ruby. Threads that wedge in network syscalls die at the deadline you set: at the kernel level, not at the Ruby level.

Verified on Ruby 3.3 / Linux. Full suite: 49 unit + safety + integration tests, 4 kernel-enforcement tests, 0 failures. The kernel suite forces a wedged TCP write against a non-reading server and proves the deadline fires within ~1.5s of the configured 1s. See test/linux/kernel_enforcement_test.rb.

TcpUserTimeout.with_timeout(30) do
  Net::HTTP.get(URI("https://example.com"))
end
# If the server stops making forward progress, the Linux kernel kills the
# connection at ~30 seconds and Ruby raises Errno::ETIMEDOUT.

The wedge that motivated this gem

A Ruby worker process sat at 100% memory and 0% CPU for 45 minutes. SolidQueue's reaper logged "claimed_executions: 12" and refused to release them. kill -QUIT produced no thread dump. The supervisor eventually OOM-killed the worker. Twelve user-visible jobs lost in flight.

Root cause: an upstream API stopped reading from its TCP socket. The Ruby write call blocked in sendmsg(2) waiting for an ACK that never came. The job's outer Timeout.timeout(60) did exactly nothing: Timeout.timeout works by raising in the calling thread when control returns to user space, and the calling thread had not returned to user space. It was parked in the kernel.

This is not a Ruby bug, a SolidQueue bug, or a Net::HTTP bug. It's the well-documented limit of MRI's threading model: a thread blocked in a syscall cannot be interrupted from Ruby. Thread#kill, Thread#raise, and Timeout.timeout all set flags that the blocked thread will check the next time it returns from the kernel. If it never returns, the flag is never read.

The fix has to live in the kernel itself. Linux's TCP_USER_TIMEOUT socket option tells the kernel: if data sent on this connection goes unacknowledged for N milliseconds, forcibly close the connection and return ETIMEDOUT to userspace. The kernel drops the connection. The blocking syscall returns. The Ruby thread unblocks. Your rescue Errno::ETIMEDOUT runs. The worker recovers.
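As a rough illustration of the mechanism described above (not this gem's actual implementation), the option can be applied to a single socket with one setsockopt call. The helper name apply_user_timeout is hypothetical. The sketch assumes Ruby's socket library defines Socket::TCP_USER_TIMEOUT on Linux and treats a missing constant as an unsupported platform.

```ruby
require "socket"

# Hypothetical helper (not part of the gem): ask the kernel to drop the
# connection if transmitted data stays unacknowledged for `seconds`.
# Returns true when the option was applied, false when it is unsupported.
def apply_user_timeout(sock, seconds)
  # Ruby only defines the constant where the platform headers provide it.
  return false unless Socket.const_defined?(:TCP_USER_TIMEOUT)

  milliseconds = (seconds * 1000).round # the kernel expects milliseconds
  sock.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_USER_TIMEOUT, milliseconds)
  true
rescue Errno::ENOPROTOOPT, Errno::EINVAL
  false # the kernel rejected the option; behave as a no-op
end
```

Calling apply_user_timeout(sock, 30) on a connected TCPSocket would bound unacknowledged writes on that one socket at roughly 30 seconds. Everything else in this README is about applying the same call automatically to newly opened sockets.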

This gem makes TCP_USER_TIMEOUT easy to apply (globally, per-block, per-request, per-job) without you having to remember the optname, the level constant, or the platform fallback.

Why this works when Timeout.timeout does not

| Mechanism | How it cancels | Wedged-syscall outcome |
| --- | --- | --- |
| Timeout.timeout(N) { ... } | Background thread raises in target thread when it returns to user space | Hangs forever. The target thread is in the kernel and never checks the flag. |
| Thread#kill / Thread#raise | Same flag-based interrupt mechanism | Hangs forever. Same reason. |
| TCP keepalive | Kernel probes idle connections; closes after ~2 hours of inactivity | "Works" eventually. Production-unusable. |
| SO_RCVTIMEO / SO_SNDTIMEO | Per-socket recv/send timeout | Works, but only covers reads/writes, not the actual wedge condition (data sent, no ACK). |
| TCP_USER_TIMEOUT | Kernel forcibly closes the socket when transmitted data goes unacknowledged for N ms | Works. The syscall returns with ETIMEDOUT. Ruby unblocks. |

TCP_USER_TIMEOUT is the only mechanism that addresses the actual failure mode (forward progress stops at the network layer) at the layer where the thread is actually stuck (the kernel).

Before / after

Run on Linux (or via the included Dockerfile.test from macOS):

$ ruby examples/before_after.rb before
server listening on 127.0.0.1:54321
[hangs indefinitely; ctrl-C after 60s]

$ ruby examples/before_after.rb after
server listening on 127.0.0.1:54321
got IO::TimeoutError: Blocking operation timed out! after 1.0s

Status

Pre-1.0. The core mechanism is small (one setsockopt call + thread-local state + a Module#prepend) and verified end-to-end on Linux. The Rack middleware and ActiveJob concern have unit tests. The Sidekiq middleware is unit-tested with synthesized job hashes; running it inside a real Sidekiq server has not yet been validated by this maintainer. The integrations are thin glue around the same with_timeout block. If one of them misbehaves for you, calling with_timeout directly inside your job's perform is the always-works escape hatch.

What this gem does NOT do

To set expectations clearly:

  • It does not interrupt CPU-bound or non-network code. A pure-Ruby infinite loop is unaffected. This gem only addresses socket-level wedges.
  • It does not cover DNS resolution. getaddrinfo(3) happens before the TCP socket exists. A wedged resolver bypasses TCP_USER_TIMEOUT entirely. Configure resolv.conf (options timeout:1 attempts:2) or use a DNS client with its own timeout.
  • It does not cover the TCP connect phase. TCP_USER_TIMEOUT only applies once the socket is established. Use the host library's connect-timeout (Net::HTTP#open_timeout, libpq connect_timeout, etc.) for that phase.
  • It does not cover FFI-based clients. libcurl (curb), C-level MySQL clients, libsodium, etc., open their own sockets at the C layer and bypass Ruby's TCPSocket. Pure-Ruby HTTP clients (Net::HTTP, httpx, most Faraday adapters, excon) are covered.
  • It does not retry. When the kernel kills the connection, your code receives Errno::ETIMEDOUT (or one of the related exception classes; see Catching the timeout). Retry policy is yours.
  • It does not work on macOS, BSD, or Windows. No equivalent socket option exists. The gem silently no-ops on those platforms so dev workflows are unaffected, but production must run on Linux for the deadline to actually fire.
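The DNS and connect phases listed above need their own bounds. A minimal sketch using only the Ruby standard library follows. The helper names and the timeout values are illustrative assumptions, not part of this gem.

```ruby
require "net/http"
require "resolv"
require "uri"

# Hypothetical helper: resolve a host with per-attempt DNS timeouts
# (in seconds) instead of relying on the system resolver defaults.
# Returns the first address as a string, or nil if resolution fails.
def resolve_with_deadline(host, timeouts: [1, 1])
  Resolv::DNS.open do |dns|
    dns.timeouts = timeouts
    dns.getaddress(host).to_s
  end
rescue Resolv::ResolvError
  nil
end

# Hypothetical helper: build a Net::HTTP client whose connect phase is
# bounded by open_timeout. The connection is not opened here.
def http_with_phase_timeouts(uri, open_timeout: 5, read_timeout: 30)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout = open_timeout # bounds the TCP connect phase
  http.read_timeout = read_timeout # library-level read bound
  http
end
```

With these in place, the kernel deadline only has to cover the established-connection phase, which is the phase TCP_USER_TIMEOUT applies to.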

Production Linux Rails coverage

Out-of-the-box coverage for a typical Rails-on-Linux production stack:

| Layer | Tool | Mechanism |
| --- | --- | --- |
| Web server | Puma, Falcon, Passenger, Thin, Unicorn | Rack middleware bounds the request |
| Job queue | Sidekiq | Server middleware (this gem) |
| Job queue | SolidQueue | require "tcp_user_timeout/solid_queue" |
| Job queue | GoodJob | ActiveJob concern |
| Job queue | Resque, DelayedJob | Wrap perform with with_timeout |
| HTTP client | Net::HTTP, httpx, Faraday (Net::HTTP/httpx adapters), Excon, RestClient | Socket-layer hook |
| HTTP client | curb (libcurl) | Not covered; use libcurl's own timeouts |
| Database | PostgreSQL via libpq | Native tcp_user_timeout connection param |
| Database | MySQL via Trilogy (Rails 7.1+) | Socket-layer hook |
| Database | MySQL via mysql2 | Not covered; use adapter read_timeout |
| Database | MongoDB Ruby driver | Socket-layer hook |
| Cache / store | Redis (redis-rb), Memcached (dalli) | Socket-layer hook |
| Email | Net::SMTP, Mail, ActionMailer | Socket-layer hook |
| RPC | gRPC (grpc gem) | Not covered; use gRPC deadline: |
| WebSockets / SSE / pub/sub | Action Cable, Redis pub/sub | Use exempt_hosts to skip |

If your stack is in the "Not covered" rows, either use that layer's own timeout primitive or rely on global_default_seconds as a coarse safety net for everything else.

Compatibility

| Requirement | Version |
| --- | --- |
| Ruby | 3.2+ (uses Fiber[] inheritable storage so child threads/fibers spawned inside with_timeout get the deadline) |
| Linux kernel | 2.6.37+ (TCP_USER_TIMEOUT was added in 2010) |
| Rails | 7.0+ (optional; only needed for the Railtie and ActiveJob concern) |
| Sidekiq | any version with server-middleware support (optional) |
| SolidQueue | any version (optional; uses ActiveJob middleware) |

macOS, BSD, and Windows are supported as silent no-ops so dev workflows are unaffected. Kernel-level enforcement requires Linux.

Install

# Gemfile
gem "tcp_user_timeout"

For non-Rails apps, install the Socket hooks once at app boot:

require "tcp_user_timeout"
TcpUserTimeout.install!

For Rails apps, the included Railtie handles install at boot. No initializer required unless you want to set a global ceiling:

# config/initializers/tcp_user_timeout.rb
TcpUserTimeout.global_default_seconds = 600  # 10 minute safety net

Recommended Rails setup

A complete production initializer covering all three tiers:

# config/initializers/tcp_user_timeout.rb

# Global ceiling: every outbound TCP socket gets at most 10 minutes.
# This is your "no thread wedges forever" insurance policy.
TcpUserTimeout.global_default_seconds = 600

# Connections that legitimately stay idle for long periods.
# (Action Cable, Server-Sent Events, Redis pub/sub, message-broker subscribers.)
TcpUserTimeout.exempt_hosts = [
  /\.internal\z/,         # service mesh (managed elsewhere)
  /actioncable/,          # WebSocket endpoints
  /redis-pubsub/          # subscriber connections
]

# config/application.rb

# Web tier: bound every request below Puma's worker_timeout.
config.middleware.use TcpUserTimeout::Middleware, timeout: 30

# app/jobs/application_job.rb

class ApplicationJob < ActiveJob::Base
  include ActiveJob::MaxExecutionTime
end

# Any job:

class FetchUpstreamJob < ApplicationJob
  self.max_execution_time = 30.seconds

  def perform(url)
    Net::HTTP.get(URI(url))
  end
end

That's it. The web request, the job, and any specific call inside either of them are all bounded by the kernel.

Block API

Scope a deadline to a specific operation:

TcpUserTimeout.with_timeout(30) do
  result = AnthropicClient.create_message(...)
end
# Inside the block, every newly opened TCP socket gets TCP_USER_TIMEOUT = 30s.
# Outside the block, the previous setting (or the global default) applies.

with_timeout is thread-local and exception-safe. It nests cleanly:

TcpUserTimeout.with_timeout(60) do  # outer bound: 60s
  TcpUserTimeout.with_timeout(5) do # tighter inner: 5s
    risky_call
  end
  # back to 60s here
end
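To make the "thread-local, exception-safe, nests cleanly" behavior concrete, here is a minimal sketch of a scoped value built on Fiber[] storage. ScopedDeadline and its key are hypothetical stand-ins, not the gem's internals. It assumes Ruby 3.2+ for inheritance into child threads and falls back to Thread.current[] (no inheritance) elsewhere.

```ruby
# Hypothetical stand-in for scoped deadline state; not the gem's code.
module ScopedDeadline
  KEY = :scoped_deadline_seconds

  # Fiber[] (Ruby 3.2+) is inherited by child threads and fibers;
  # Thread.current[] is the fallback and is not inherited.
  def self.store
    Fiber.respond_to?(:[]) ? Fiber : Thread.current
  end

  # The deadline visible to the current fiber, or nil outside any scope.
  def self.current
    store[KEY]
  end

  # Set the deadline for the duration of the block, then restore the
  # previous value, even when the block raises.
  def self.with(seconds)
    previous = store[KEY]
    store[KEY] = seconds
    yield
  ensure
    store[KEY] = previous
  end
end
```

Saving and restoring the previous value in ensure is what lets an inner scope tighten the bound and have the outer bound reappear afterwards.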

Rack middleware

Bound every web request:

# config/application.rb
config.middleware.use TcpUserTimeout::Middleware, timeout: 30

Works with any Rack server: Puma, Falcon, Passenger, Thin, Unicorn. Set the timeout below your web server's request kill threshold (Puma's worker_timeout, Falcon's deadline, Passenger's max_request_time, NGINX's proxy_read_timeout) so the kernel-level kill happens before the supervisor takes the worker down. A common shape:

| Layer | Bound |
| --- | --- |
| NGINX proxy_read_timeout | 60s |
| Puma worker_timeout (or Passenger max_request_time) | 60s |
| TcpUserTimeout::Middleware | 30s |
| Per-call TcpUserTimeout.with_timeout | 5–15s |

Skip the bound for specific requests:

class StreamingController < ApplicationController
  before_action { request.env["tcp_user_timeout.skip"] = true }
end

ActiveJob: max_execution_time as an enforced contract

Most queue libraries treat job timeouts as observability: fire an alert if the job runs too long, but don't actually bound anything. TcpUserTimeout makes the contract real:

class FetchUpstreamJob < ApplicationJob
  include ActiveJob::MaxExecutionTime
  self.max_execution_time = 30.seconds

  def perform(url)
    Net::HTTP.get(URI(url))
  end
end

If the upstream wedges, the kernel closes the socket at ~25s (5s headroom for the rescue handler) and the job fails cleanly instead of leaking the worker thread until the process restarts.

To apply this to every job:

class ApplicationJob < ActiveJob::Base
  include ActiveJob::MaxExecutionTime
end

SolidQueue integration

If you're on SolidQueue, one require wires everything up:

# config/initializers/tcp_user_timeout.rb
require "tcp_user_timeout/solid_queue"

This installs the Socket hooks and includes ActiveJob::MaxExecutionTime into ActiveJob::Base, so any job that declares self.max_execution_time = N.seconds gets enforced bounds without further changes.

Sidekiq integration

Sidekiq doesn't go through ActiveJob by default, so the binding mechanism is a server middleware that reads max_execution_time from sidekiq_options:

# config/initializers/sidekiq.rb
require "tcp_user_timeout/sidekiq"

class FetchUpstreamWorker
  include Sidekiq::Worker
  sidekiq_options max_execution_time: 30

  def perform(url)
    Net::HTTP.get(URI(url))
  end
end

If you also use ActiveJob on top of Sidekiq, additionally include ActiveJob::MaxExecutionTime into ApplicationJob. Both layers compose (innermost with_timeout wins).

GoodJob, Resque, and other queues

Same pattern as Sidekiq: wrap each job's perform in TcpUserTimeout.with_timeout(seconds). For GoodJob (which uses ActiveJob), ActiveJob::MaxExecutionTime is enough; just install the hooks at boot:

# config/initializers/tcp_user_timeout.rb
require "tcp_user_timeout"
TcpUserTimeout.install!

ActiveSupport.on_load :active_job do
  include ActiveJob::MaxExecutionTime
end

For Resque (or any queue with a worker-class hook), wrap perform directly:

class FetchUpstreamWorker
  MAX_EXECUTION_TIME = 30

  def self.perform(*args)
    TcpUserTimeout.with_timeout(MAX_EXECUTION_TIME) do
      new.do_work(*args)
    end
  end
end

The general principle: any queue with a "run this code" extension point can wrap that point in with_timeout. The Sidekiq middleware and ActiveJob concern in this gem are convenience wrappers around exactly this pattern.

Per-host exempt list

Some connections legitimately stay idle for long periods and should not be torn down by the kernel. Configure exempt hosts to skip those:

TcpUserTimeout.exempt_hosts = [
  /\.internal\z/,         # internal mesh (managed elsewhere)
  "kafka-broker-1",       # specific broker
  /redis-pubsub/,         # subscriber connections
  /actioncable/           # WebSocket / SSE endpoints
]

Strings match exactly. Regexps match via =~. The exempt list applies to any socket opened with a known host (i.e., everything that goes through Socket.tcp / TCPSocket.new, which is almost everything pure-Ruby).
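The matching rules can be sketched as follows. exempt_host? is a hypothetical helper written for illustration and is not the gem's own matcher.

```ruby
# Hypothetical matcher illustrating the rules described above:
# strings must equal the host, regexps are tested with =~.
def exempt_host?(host, exempt_hosts)
  exempt_hosts.any? do |pattern|
    case pattern
    when Regexp then !(pattern =~ host).nil?
    when String then pattern == host
    else false # ignore entries of any other type
    end
  end
end
```

Under these rules "kafka-broker-1" exempts only that exact host name, while an unanchored regexp such as /redis-pubsub/ exempts any host containing that substring. Anchor patterns (for example with \z) when a substring match would be too broad.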

Where you don't want this

TCP_USER_TIMEOUT forces the kernel to kill connections that haven't made forward progress within the deadline. That's wrong for:

  • WebSockets and Action Cable. Idle gaps between messages are normal. Add the host to exempt_hosts, or set the per-block timeout to something larger than the maximum expected idle period. For Action Cable specifically:

    # config/initializers/tcp_user_timeout.rb
    TcpUserTimeout.exempt_hosts = [
      /actioncable/,                    # Action Cable mount point
      /\.cable\./                       # any host containing ".cable."
    ]
  • Server-Sent Events (SSE). Same as WebSockets: a long-lived idle stream.

  • Message broker subscribers (Kafka consumer connections, Redis pub/sub subscribers). The connection is supposed to sit there waiting for messages. Either exempt the host, or set the timeout to something well above your producer's max-idle.

  • Long-poll endpoints. /poll?wait=300 etc. Exempt the host or scope the timeout to be larger than your poll deadline.

  • Pooled connections you don't control. See the persistent-pool note in Failure modes below.

Deployment pattern: layered timeouts

Production-quality timeout configs operate at three tiers, each tightening as you get closer to the actual call:

| Tier | Mechanism | Typical bound |
| --- | --- | --- |
| Outer (transport) | Web server / queue supervisor | 60s |
| Middle (request/job scope) | TcpUserTimeout::Middleware / max_execution_time | 30s |
| Inner (per call) | TcpUserTimeout.with_timeout(N) around specific calls | 5–15s |

Inner bounds are tighter than outer bounds so the inner failure raises a recoverable exception (your rescue runs, the request returns 504, the job retries) before the outer supervisor pulls the plug on the whole worker.

Headroom math

The TCP_USER_TIMEOUT we set is slightly less than the declared deadline so the kernel kills the socket before any outer guard fires:

| max_execution_time | TCP_USER_TIMEOUT |
| --- | --- |
| 1s | 0.9s |
| 10s | 5s |
| 30s | 25s |
| 90s | 85s |
| 10min | 9min 55s |

5s headroom at production scales (≥10s); 90% of max below that so very short timeouts in tests still get enforced. Implementation: ActiveJob::MaxExecutionTime.headroom_seconds.
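The rule behind the table can be restated as a small function. This is a sketch inferred from the table and the sentence above, not a copy of the gem's headroom_seconds implementation; the method and constant names are hypothetical.

```ruby
# Hypothetical re-statement of the headroom rule from the table above.
HEADROOM_SECONDS = 5      # fixed headroom at production scales
HEADROOM_THRESHOLD = 10   # below this, use a proportional bound instead

# Returns the kernel timeout (in seconds) for a declared deadline.
def kernel_timeout_for(max_execution_time)
  if max_execution_time >= HEADROOM_THRESHOLD
    max_execution_time - HEADROOM_SECONDS
  else
    max_execution_time * 0.9
  end
end
```

Note the step at the threshold: a 10s deadline yields a 5s kernel timeout, while a deadline just under 10s yields about 9s, so declared deadlines close to 10s behave quite differently on either side of it.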

Concurrency safety

The block API is safe under MRI's threading, fiber, and fork models. State is held in Fiber[] (Ruby 3.2+ inheritable fiber storage) with a Thread.current[] fallback for older Rubies.

  • Threads spawned inside with_timeout inherit the deadline. If a job spawns helper threads to do parallel I/O, those threads are bounded by the same kernel deadline as the job itself. (This is what you want: the alternative leaves sub-thread I/O unbounded.)
  • Fibers spawned inside with_timeout inherit the deadline. Falcon, Async, and any fiber-per-task pattern get the deadline propagated to spawned subtasks.
  • Threads/fibers spawned outside with_timeout never see it. No global leakage.
  • Concurrent with_timeout calls on different top-level threads are isolated. Each thread's main fiber has its own storage slot.
  • Fork. install! is idempotent and survives fork(2): the child inherits the prepended hooks, and a second install! in the child does not double-prepend.
  • Concurrent install!. Calling install! from multiple threads at boot is safe; the prepend will not happen more than once.
  • Pre-existing sockets are never rebound. Hooks fire on socket creation, not on every operation. Boot-time pool connections (ActiveRecord, persistent HTTP pools, Redis) keep their original behavior; only new sockets opened inside a with_timeout block get bound. This is the property that makes the gem safe to drop into a Rails app with long-lived pools.

global_default_seconds and exempt_hosts are unsynchronized class-level state and should be set once at boot, before threads are accepting work. Both are immutable in steady state, so there is no read/write contention concern after boot.
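The install-time guarantees above (idempotent, safe under concurrent calls, hooks firing only at socket creation) can be illustrated with a small Module#prepend sketch. SketchSocketHook, SketchInstaller, and the :sketch_deadline_ms thread-local key are hypothetical names; this is not the gem's code, and it hooks only TCPSocket.new.

```ruby
require "socket"

# Hypothetical hook: after the socket is created, apply whatever
# deadline (in milliseconds) the current thread has stored, if any.
module SketchSocketHook
  def initialize(*args, **kwargs, &block)
    super # open the connection exactly as TCPSocket normally would
    ms = Thread.current[:sketch_deadline_ms]
    return unless ms && Socket.const_defined?(:TCP_USER_TIMEOUT)

    begin
      setsockopt(Socket::IPPROTO_TCP, Socket::TCP_USER_TIMEOUT, ms)
    rescue Errno::ENOPROTOOPT, Errno::EINVAL
      # Unsupported platform or rejected value: leave the socket as-is.
    end
  end
end

# Hypothetical installer: the mutex plus the ancestors check make
# repeated or concurrent calls prepend the hook at most once.
module SketchInstaller
  LOCK = Mutex.new

  def self.install!
    LOCK.synchronize do
      return false if TCPSocket.ancestors.include?(SketchSocketHook)

      TCPSocket.prepend(SketchSocketHook)
      true
    end
  end
end
```

Because the hook runs inside initialize, sockets that already exist when a deadline is set are never touched, which matches the "pre-existing sockets are never rebound" property. A forked child inherits the prepended module along with the rest of the class hierarchy, so a second install! there returns false without prepending again.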

Catching the timeout in your rescue blocks

When TCP_USER_TIMEOUT fires, the exception class your code sees depends on which Ruby IO layer was on top of the socket. Production rescues should catch all four:

require "net/http"
require "openssl"

begin
  TcpUserTimeout.with_timeout(30) do
    Net::HTTP.get(URI("https://upstream.example.com"))
  end
rescue Errno::ETIMEDOUT,         # raw read/write surface
       IO::TimeoutError,          # Ruby 3.2+ wrapper for some IO operations
       Net::ReadTimeout,          # Net::HTTP's wrapping of either of the above
       OpenSSL::SSL::SSLError => e # TLS connections may surface socket death as a TLS error
  Rails.logger.warn("upstream wedged: #{e.class}: #{e.message}")
  # Retry, return a stale cache, surface a 504, etc.
end

OpenSSL::SSL::SSLError is included because OpenSSL on top of TCP can wrap the underlying socket's ETIMEDOUT as a TLS error rather than passing it through cleanly. Both behaviors are "valid" kernel-enforced timeouts; you just need to catch both classes.

Failure modes worth knowing

This gem is a sharp tool. Read these before relying on it.

  • Exception class translation. When the kernel kills the connection, Ruby surfaces it as either Errno::ETIMEDOUT (lower-level reads/writes) or IO::TimeoutError (Ruby 3.2+'s wrapping of certain IO operations). Net::HTTP may further wrap as Net::ReadTimeout. Production rescues should catch all three, plus OpenSSL::SSL::SSLError for TLS-wrapped sockets where the SSL layer surfaces the underlying socket death as a TLS error.

  • TLS-wrapped sockets. With OpenSSL on top of TCP, when the kernel kills the underlying socket mid-SSL_read, the SSL layer can surface a clean Errno::ETIMEDOUT or an OpenSSL::SSL::SSLError. Both are valid kernel-enforced timeouts; rescues should catch both classes.

  • Persistent connection pools. Pooled connections retain the TCP_USER_TIMEOUT value from whichever request created them. Per-block tightening is best-effort for pooled connections; global_default_seconds is the operative ceiling. See the pooled-client section for client-specific guidance.

  • DNS not covered. TCP_USER_TIMEOUT only applies after the TCP socket is established. getaddrinfo wedges (slow/wedged resolver) bypass this entirely. Mitigate via resolv.conf (options timeout:1 attempts:2) or via your platform's DNS settings.

  • Connect timeout vs read timeout. TCP_USER_TIMEOUT covers post-connection wedges. The connect phase has its own timeout: Net::HTTP#open_timeout, libpq connect_timeout, etc. Both layers are needed.

  • FFI-based clients bypass the hooks. This gem prepends Socket.tcp, TCPSocket.new, and TCPSocket.open. Anything wrapping libcurl (curb), or other FFI-based HTTP clients, bypasses Ruby's socket layer and isn't covered. Pure-Ruby HTTP clients (Net::HTTP, httpx, most Faraday adapters, excon) all go through the hooked methods.

  • Linux only. macOS, BSD, and Windows don't support TCP_USER_TIMEOUT. The gem silently no-ops via Errno::ENOPROTOOPT on those platforms, which keeps dev work unaffected but also means you can't validate kernel enforcement locally on a Mac. Use the included Dockerfile.test to run the kernel tests against Linux from anywhere.

Database and pooled-client integrations

This gem hooks Ruby's socket layer, which means it covers any pure-Ruby client (Net::HTTP, httpx, most Faraday adapters, redis-rb, excon, mongo ruby driver). It does not cover clients whose sockets are managed at the C level (libpq, libmysqlclient, libcurl). For those, the underlying library typically exposes its own equivalent of TCP_USER_TIMEOUT. Set it there, and let this gem cover everything else.

PostgreSQL (libpq)

libpq supports tcp_user_timeout as a connection parameter directly (PG 12+). For Postgres connections specifically, set it in the connection configuration (it is a client-side libpq parameter) rather than going through this gem's Ruby socket hooks:

# config/database.yml
default: &default
  adapter: postgresql
  connect_timeout: 5
  variables:
    statement_timeout: '120s'
  tcp_user_timeout: 600000  # ms; 10 min safety net for non-job DB calls

MySQL (libmysqlclient)

The C client doesn't expose TCP_USER_TIMEOUT directly, but Trilogy (Ruby 3.0+, available as a built-in Rails adapter since 7.1) goes through Ruby sockets and is covered by this gem. If you're still on the legacy mysql2 adapter, rely on read_timeout / write_timeout at the adapter level: they don't kill wedged connections at the kernel layer, but they do bound TCP-active reads.

MongoDB

The Ruby driver opens sockets via TCPSocket.new, so this gem covers them. The driver also exposes socket_timeout / connect_timeout as connection options; set both for belt-and-suspenders.

Redis

redis-rb uses Ruby sockets, so it is covered. Set connect_timeout, read_timeout, write_timeout at client construction as a separate layer. Note that pub/sub subscribers (subscribe, psubscribe) hold connections idle for arbitrary durations; add those hosts to exempt_hosts.

Memcached (dalli)

dalli opens TCP sockets via TCPSocket.new for non-Unix-socket connections, so it is covered by the gem's hooks. The dalli client also supports socket_timeout: at construction (default 0.5s); both layers compose. Memcached connections are typically short and fast, so the kernel deadline rarely fires in practice but is there as the safety net.

SMTP (Net::SMTP / Mail / ActionMailer)

Net::SMTP opens sockets via TCPSocket.open, so it is covered. Mail's SMTP delivery method goes through Net::SMTP, and ActionMailer uses Mail underneath, so the entire mail-delivery stack is covered. SMTP servers can stop responding mid-transaction (relay troubles, greylisting); the deadline catches that.

gRPC (FFI-based)

The official grpc gem links libgrpc at the C level, so its sockets are NOT visible to this gem's hooks. gRPC has its own deadline/timeout API (deadline: on every call); use it. The grpc proposal A18 documents kernel-level TCP_USER_TIMEOUT for gRPC at the C layer, but you have to enable it via channel args.

Persistent connection pools

Pooled clients (ActiveRecord, redis-rb with a connection pool, custom HTTP keep-alive pools) retain whatever TCP_USER_TIMEOUT was set when the connection was created. with_timeout only affects sockets opened inside the block; pooled connections opened earlier keep their original setting.

In practice this means:

  • global_default_seconds is the operative ceiling for all pooled connections.
  • with_timeout is best-effort for pooled clients: if the pool already has a warm connection, it'll be used as-is.
  • For tighter per-call bounds against a pooling client, either force a fresh connection, or live with the pool's existing bound and rely on application-level retries.

Testing

The cross-platform unit tests run anywhere:

bundle exec rake test

The Linux-only kernel enforcement tests need a Linux box. Use the bundled Docker image from a macOS dev machine:

docker build -f Dockerfile.test -t tcp_user_timeout:linux .
docker run --rm -v $PWD:/app -w /app tcp_user_timeout:linux \
  bash -c "bundle install && bundle exec rake test:linux"

Expected: 0 failures, 0 errors on Linux.

Changelog

See CHANGELOG.md for the full release history.

License

MIT. See LICENSE.
