
FlagdProvider default keep_alive_time=0 causes GOAWAY crash with flagd server #365

@aepfli

Description

The FlagdProvider defaults keep_alive_time to 0 (DEFAULT_KEEP_ALIVE = 0 in config.py), which is passed directly to grpc.keepalive_time_ms as a channel option.

In the Python gRPC C-core implementation, keepalive_time_ms=0 is not interpreted as "disabled". Instead, the client sends pings immediately and then roughly every 1 ms. This causes flagd's Go gRPC server to reject the connection with:

GOAWAY [ENHANCE_YOUR_CALM] "too_many_pings"

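flagd's Go gRPC server uses grpc-go's default keepalive enforcement policy, whose minimum ping interval (EnforcementPolicy.MinTime) is 5 minutes. A client that keeps pinging more often than that accumulates "ping strikes", and once the server's small strike allowance is used up it sends the GOAWAY. In practice, any client keepalive interval below 5 minutes ends in a GOAWAY.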
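To see why a sub-5-minute interval ends in a GOAWAY after a few pings rather than on the first one, here is a simplified Python model of the server's ping-strike counter, written from our reading of grpc-go's server-side HTTP/2 transport: each ping that arrives less than MinTime after the previous one adds a strike, and the GOAWAY is sent once strikes exceed 2. The class and function names are hypothetical. The model also ignores two details of the real server: strikes are reset when the server sends data or headers, and a stricter rule applies when no streams are active and PermitWithoutStream is false.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PingStrikeModel:
    """Simplified model of grpc-go's server-side keepalive enforcement (hypothetical)."""

    min_time_ms: int = 5 * 60 * 1000  # grpc-go default EnforcementPolicy.MinTime
    max_ping_strikes: int = 2  # strikes tolerated before the server sends GOAWAY
    last_ping_ms: Optional[int] = None
    strikes: int = 0

    def on_ping(self, now_ms: int) -> bool:
        """Record one client ping; return True if the server would now send GOAWAY."""
        # A ping arriving sooner than MinTime after the previous ping is a strike.
        # The very first ping has no predecessor, so it never counts.
        if self.last_ping_ms is not None and now_ms - self.last_ping_ms < self.min_time_ms:
            self.strikes += 1
        self.last_ping_ms = now_ms
        return self.strikes > self.max_ping_strikes


def pings_until_goaway(interval_ms: int, max_pings: int = 1000) -> Optional[int]:
    """Count pings sent at a fixed interval until GOAWAY, or None if it never happens."""
    model = PingStrikeModel()
    for n in range(max_pings):
        if model.on_ping(n * interval_ms):
            return n + 1
    return None
```

With the roughly 1 ms interval described above, pings_until_goaway(1) returns 4 (the first ping is free, the next three are strikes), while pings_until_goaway(600_000) returns None because a 10-minute interval never produces a strike.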

Impact

  • On aarch64/ARM (e.g., Raspberry Pi), the GOAWAY triggers a fatal gRPC C-core assertion failure in ev_epoll1_linux.cc (Check failed: next_worker->state == KICKED), crashing the entire Python process
  • On x86_64, the client typically reconnects but logs repeated GOAWAY warnings, degrading performance and reliability

This affects both the IN_PROCESS and RPC resolver types since both create gRPC channels with the same keepalive options.

Reproduction

from openfeature import api
from openfeature.contrib.provider.flagd import FlagdProvider
from openfeature.contrib.provider.flagd.config import ResolverType

# Default keep_alive_time=0 → immediate crash/GOAWAY with flagd server
provider = FlagdProvider(
    host="flagd",
    port=8015,
    resolver_type=ResolverType.IN_PROCESS,
)

# Registering the provider initializes it and opens the gRPC connection
api.set_provider(provider)

Client-side gRPC log (the GOAWAY itself is sent by the flagd server):

Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings"

Client crash (ARM only):

F0000 00:00:1773608949.106482  38 ev_epoll1_linux.cc:1125] Check failed: next_worker->state == KICKED

Suggested Fix

Change the default from 0 to a value that respects flagd's server-side minimum ping interval of 5 minutes:

# config.py
DEFAULT_KEEP_ALIVE = 600000  # 10 minutes (ms), above flagd's 5-minute minimum ping interval

Alternatively, don't set grpc.keepalive_time_ms at all when keep_alive_time=0, so the gRPC library falls back to its own default (in C-core, client keepalive pings are disabled by default):

# grpc.py - _generate_channel()
options = []
if config.keep_alive_time > 0:
    options.append(("grpc.keepalive_time_ms", config.keep_alive_time))
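Both fixes can be combined in one place. The sketch below uses a hypothetical helper name, keepalive_channel_options, not the provider's actual code. It treats 0 as "disabled" and emits no option, and it still passes through positive values below grpc-go's default 5-minute minimum but warns about them, since a user may be talking to a server with a more permissive policy.

```python
import warnings
from typing import List, Tuple

# grpc-go's default keepalive EnforcementPolicy.MinTime; per this issue,
# flagd does not override it.
SERVER_MIN_PING_INTERVAL_MS = 5 * 60 * 1000


def keepalive_channel_options(keep_alive_time_ms: int) -> List[Tuple[str, int]]:
    """Hypothetical helper: turn keep_alive_time into gRPC channel options.

    0 means "disabled": no keepalive option is emitted, so gRPC keeps its own
    default. Positive values below the server minimum are passed through with
    a warning, because a grpc-go server with default settings would eventually
    answer them with GOAWAY (too_many_pings).
    """
    if keep_alive_time_ms < 0:
        raise ValueError(f"keep_alive_time must be >= 0 ms, got {keep_alive_time_ms}")
    if keep_alive_time_ms == 0:
        return []
    if keep_alive_time_ms < SERVER_MIN_PING_INTERVAL_MS:
        warnings.warn(
            f"keep_alive_time={keep_alive_time_ms} ms is below grpc-go's default "
            f"minimum ping interval ({SERVER_MIN_PING_INTERVAL_MS} ms)",
            stacklevel=2,
        )
    return [("grpc.keepalive_time_ms", keep_alive_time_ms)]
```

For example, grpc.insecure_channel("flagd:8015", options=keepalive_channel_options(600000)) would open a channel that sends keepalive pings at a 10-minute interval, while keepalive_channel_options(0) returns an empty list and leaves keepalive to gRPC's default.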

Workaround

Set keep_alive_time explicitly when creating the provider:

provider = FlagdProvider(
    host="flagd",
    port=8015,
    resolver_type=ResolverType.IN_PROCESS,
    keep_alive_time=600000,  # 10 minutes
)

Or via environment variable:

FLAGD_KEEP_ALIVE_TIME_MS=600000
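The workaround offers two ways to set the value. The helper below is a hypothetical sketch, not the provider's code: it assumes a precedence of explicit argument, then FLAGD_KEEP_ALIVE_TIME_MS, then a default, and it uses the 10-minute default proposed under Suggested Fix rather than today's 0. Check the provider's config.py for the order it actually applies.

```python
import os
from typing import Mapping, Optional

# Environment variable named in the workaround above.
ENV_KEEP_ALIVE = "FLAGD_KEEP_ALIVE_TIME_MS"
# The 10-minute default proposed under "Suggested Fix" (assumed, not current behavior).
PROPOSED_DEFAULT_MS = 600_000


def resolve_keep_alive_ms(
    explicit: Optional[int] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> int:
    """Hypothetical resolution order: explicit argument, then env var, then default."""
    if explicit is not None:
        value = explicit
    else:
        env = os.environ if environ is None else environ
        raw = env.get(ENV_KEEP_ALIVE, "").strip()
        # An unset or empty variable falls back to the proposed default.
        value = int(raw) if raw else PROPOSED_DEFAULT_MS
    if value < 0:
        raise ValueError(f"keep-alive time must be >= 0 ms, got {value}")
    return value
```

Passing environ explicitly keeps the function easy to test; in normal use it reads os.environ.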

Environment

  • openfeature-provider-flagd version: 0.2.7
  • flagd version: v0.13.2
  • Python: 3.11 (aarch64-linux-gnu)
  • grpcio: compiled with C-core (not pure-Python)
  • Platform: Raspberry Pi 5 (ARM64)
