Cleanup job blocks event loop, causing liveness probe failures in Kubernetes #175

Problem

When running the cache server in Kubernetes with a large number of cache entries, the nightly cleanup job blocks the Node.js event loop, causing liveness probe failures and container restarts.

Environment

  • Version: v8.1.4
  • Deployment: Kubernetes with node-cluster preset
  • Storage: GCS
  • Database: PostgreSQL
  • Cache entries: ~18,000

Root Cause

The cleanup job has two architectural issues:

1. Runs on the primary worker

plugins/cleanup.ts#L9:

if (!cluster.isPrimary) return

The primary worker in Nitro's node-cluster preset routes incoming connections. When blocked, it can't route liveness probe requests to any worker.

2. Unbatched concurrent deletions

lib/storage/drivers/gcs.ts#L26-L30:

async function deleteMany(objectNames: string[]) {
  await Promise.all(
    objectNames.map((objectName) => bucket.file(objectName).delete({ ignoreNotFound: true })),
  )
}

With 18,000 entries, this fires 18,000 concurrent HTTP requests, overwhelming the event loop.

Evidence from container logs

When the cleanup job starts, the following warnings appear immediately:

(node:1) TeenyStatisticsWarning: Possible excessive concurrent requests detected.
5000 requests in-flight, which exceeds the configured threshold of 5000.
Use the TEENY_REQUEST_WARN_CONCURRENT_REQUESTS environment variable or the
concurrentRequests option of teeny-request to increase or disable (0) this warning.

Followed by memory leak warnings from too many concurrent operations:

(node:30) MaxListenersExceededWarning: Possible EventEmitter memory leak detected.
11 error listeners added to [Writable]. MaxListeners is 10.
Use emitter.setMaxListeners() to increase limit

And eventually connection failures:

[unhandledRejection] FetchError: request to https://storage.googleapis.com/storage/v1/b/actions-cache-c8fe/o/gh-actions-cache%2F... failed, reason: socket hang up
{
  type: 'system',
  errno: 'ECONNRESET',
  code: 'ECONNRESET'
}

Suggested Solutions

1. Batch deletions

Process deletions in chunks (e.g., 100 at a time), awaiting each batch before starting the next:

async function deleteMany(objectNames: string[], batchSize = 100) {
  for (let i = 0; i < objectNames.length; i += batchSize) {
    const batch = objectNames.slice(i, i + batchSize);
    await Promise.all(
      batch.map((name) => bucket.file(name).delete({ ignoreNotFound: true }))
    );
  }
}
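
An alternative that keeps a steady number of requests in flight, rather than pausing at batch boundaries, would be a small concurrency limiter such as p-limit. The sketch below assumes the driver's bucket instance is in scope and that adding p-limit as a dependency is acceptable; deleteManyLimited and the default of 100 are illustrative, not existing code:

import pLimit from 'p-limit'

// Cap in-flight GCS deletions at `concurrency` instead of firing them all at once
async function deleteManyLimited(objectNames: string[], concurrency = 100) {
  const limit = pLimit(concurrency)
  await Promise.all(
    objectNames.map((name) =>
      limit(() => bucket.file(name).delete({ ignoreNotFound: true })),
    ),
  )
}

Either variant bounds the number of open sockets, which should also make the TeenyStatisticsWarning and MaxListenersExceededWarning from the logs above go away.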

2. Run cleanup on a worker, not primary

Fork a dedicated worker for cleanup so the primary remains responsive.
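
One possible shape, sketched below (not the project's current code): keep the schedule on the primary, but fork a short-lived child process for the actual pruning so the primary's event loop stays free to route probes. ./cleanup-worker.mjs is a hypothetical script that opens the database/storage connections and calls pruneCaches():

// plugins/cleanup.ts (illustrative sketch only)
import cluster from 'node:cluster'
import { fork } from 'node:child_process'

// defineNitroPlugin is Nitro's auto-imported plugin helper
export default defineNitroPlugin(() => {
  if (!cluster.isPrimary) return

  // Hypothetical fixed daily interval standing in for the existing CRON schedule
  setInterval(() => {
    const child = fork(new URL('./cleanup-worker.mjs', import.meta.url))
    child.on('exit', (code) => {
      if (code !== 0) console.error(`cleanup worker exited with code ${code}`)
    })
  }, 24 * 60 * 60 * 1000)
})

Because the child is a separate process, a long or stalled prune run cannot block connection routing; at worst the child is killed and retried on the next run.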

3. Use Nitro Tasks instead of plugin-based cron

Nitro has an experimental Tasks API designed for one-off operations. Instead of running cleanup in a plugin on the primary worker, define it as a scheduled task:

// server/tasks/cache/cleanup.ts
export default defineTask({
  meta: {
    name: "cache:cleanup",
    description: "Prune old cache entries",
  },
  async run({ payload }) {
    const storage = await useStorageAdapter()
    await storage.pruneCaches(ENV.CACHE_CLEANUP_OLDER_THAN_DAYS)
    return { result: "Success" }
  },
})

Configure in nitro.config.ts:

export default defineNitroConfig({
  experimental: {
    tasks: true
  },
  scheduledTasks: {
    // Use CACHE_CLEANUP_CRON value, e.g., '0 0 * * *' for midnight
    '0 0 * * *': ['cache:cleanup']
  }
})

Benefits:

  • Tasks are designed for async operations and don't block the primary worker
  • Built-in concurrency control (only one instance runs at a time)
  • Can be triggered via API (/_nitro/tasks/cache:cleanup) or CLI (nitro task run cache:cleanup)
  • Platform support includes the node-server preset (which node-cluster is based on)

This would require enabling the experimental tasks feature and migrating from the current plugins/cleanup.ts approach.

4. Add CLI mode for external cleanup (Kubernetes CronJob)

Allow running cleanup as a separate process via CLI flag or environment variable:

node .output/server/index.mjs --cleanup-only
# or
CLEANUP_ONLY=true node .output/server/index.mjs
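
A rough sketch of the environment-variable variant is below; it is hypothetical, and useStorageAdapter / ENV.CACHE_CLEANUP_OLDER_THAN_DAYS are simply the names already used earlier in this issue, not confirmed entry points:

// server/plugins/cleanup-only.ts (hypothetical sketch, not existing code)
export default defineNitroPlugin(async () => {
  if (process.env.CLEANUP_ONLY !== 'true') return

  // Prune once, then exit so a Kubernetes Job/CronJob pod completes
  // instead of staying up as a long-running server.
  const storage = await useStorageAdapter()
  await storage.pruneCaches(ENV.CACHE_CLEANUP_OLDER_THAN_DAYS)
  process.exit(0)
})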

This would allow Kubernetes operators to run cleanup as a separate CronJob that doesn't affect the running cache server:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cache-cleanup
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cleanup
            image: ghcr.io/falcondev-oss/github-actions-cache-server:v8
            args: ["--cleanup-only"]
            env:
              - name: DATABASE_URL
                valueFrom:
                  secretKeyRef:
                    name: cache-server-secrets
                    key: database-url
          restartPolicy: OnFailure

Benefits:

  • Cleanup can run for as long as needed without affecting liveness probes
  • If cleanup fails or is killed, the cache server continues serving requests
  • Cleanup resources (CPU/memory) can be configured independently

Workaround

We increased the Kubernetes liveness probe failureThreshold to 12 (allows ~2 minutes of unresponsiveness). This is not ideal as it delays detection of actual failures.

Related

GCS Batch API

GCS has a batch API that supports up to 100 operations per request. However, the Node.js client library (@google-cloud/storage) does not support it; see googleapis/nodejs-storage#1040.

Other storage drivers

This also affects the S3 and filesystem drivers, since the batching issue is in the storage adapter layer rather than in the GCS driver alone.
