Cleanup job blocks event loop, causing liveness probe failures in Kubernetes #175

Problem

When running the cache server in Kubernetes with a large number of cache entries, the nightly cleanup job blocks the Node.js event loop, causing liveness probe failures and container restarts.

Environment

  • Version: v8.1.4
  • Deployment: Kubernetes with node-cluster preset
  • Storage: GCS
  • Database: PostgreSQL
  • Cache entries: ~18,000

Root Cause

The cleanup job has two architectural issues:

1. Runs on the primary worker

plugins/cleanup.ts#L9:

if (!cluster.isPrimary) return

The primary worker in Nitro's node-cluster preset routes incoming connections. When blocked, it can't route liveness probe requests to any worker.

2. Unbatched concurrent deletions

lib/storage/drivers/gcs.ts#L26-L30:

async function deleteMany(objectNames: string[]) {
  await Promise.all(
    objectNames.map((objectName) => bucket.file(objectName).delete({ ignoreNotFound: true })),
  )
}

With 18,000 entries, this fires 18,000 concurrent HTTP requests, overwhelming the event loop.

Evidence from container logs

When the cleanup job starts, the following warnings appear immediately:

(node:1) TeenyStatisticsWarning: Possible excessive concurrent requests detected.
5000 requests in-flight, which exceeds the configured threshold of 5000.
Use the TEENY_REQUEST_WARN_CONCURRENT_REQUESTS environment variable or the
concurrentRequests option of teeny-request to increase or disable (0) this warning.

Followed by memory leak warnings from too many concurrent operations:

(node:30) MaxListenersExceededWarning: Possible EventEmitter memory leak detected.
11 error listeners added to [Writable]. MaxListeners is 10.
Use emitter.setMaxListeners() to increase limit

And eventually connection failures:

[unhandledRejection] FetchError: request to https://storage.googleapis.com/storage/v1/b/actions-cache-c8fe/o/gh-actions-cache%2F... failed, reason: socket hang up
{
  type: 'system',
  errno: 'ECONNRESET',
  code: 'ECONNRESET'
}

Suggested Solutions

1. Batch deletions

Process deletions in chunks (e.g., 100 at a time), awaiting each batch before starting the next:

async function deleteMany(objectNames: string[], batchSize = 100) {
  for (let i = 0; i < objectNames.length; i += batchSize) {
    const batch = objectNames.slice(i, i + batchSize);
    await Promise.all(
      batch.map((name) => bucket.file(name).delete({ ignoreNotFound: true }))
    );
  }
}
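
An alternative that keeps a steady number of requests in flight, rather than pausing at batch boundaries, would be a small concurrency limiter such as p-limit. The sketch below assumes the driver's bucket instance is in scope and that adding p-limit as a dependency is acceptable; deleteManyLimited and the default of 100 are illustrative, not existing code:

import pLimit from 'p-limit'

// Cap in-flight GCS deletions at `concurrency` instead of firing them all at once
async function deleteManyLimited(objectNames: string[], concurrency = 100) {
  const limit = pLimit(concurrency)
  await Promise.all(
    objectNames.map((name) =>
      limit(() => bucket.file(name).delete({ ignoreNotFound: true })),
    ),
  )
}

Either variant bounds the number of open sockets, which should also make the TeenyStatisticsWarning and MaxListenersExceededWarning from the logs above go away.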

2. Run cleanup on a worker, not primary

Fork a dedicated worker for cleanup so the primary remains responsive.
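
One possible shape, sketched below (not the project's current code): keep the schedule on the primary, but fork a short-lived child process for the actual pruning so the primary's event loop stays free to route probes. ./cleanup-worker.mjs is a hypothetical script that opens the database/storage connections and calls pruneCaches():

// plugins/cleanup.ts (illustrative sketch only)
import cluster from 'node:cluster'
import { fork } from 'node:child_process'

// defineNitroPlugin is Nitro's auto-imported plugin helper
export default defineNitroPlugin(() => {
  if (!cluster.isPrimary) return

  // Hypothetical fixed daily interval standing in for the existing CRON schedule
  setInterval(() => {
    const child = fork(new URL('./cleanup-worker.mjs', import.meta.url))
    child.on('exit', (code) => {
      if (code !== 0) console.error(`cleanup worker exited with code ${code}`)
    })
  }, 24 * 60 * 60 * 1000)
})

Because the child is a separate process, a long or stalled prune run cannot block connection routing; at worst the child is killed and retried on the next run.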

3. Use Nitro Tasks instead of plugin-based cron

Nitro has an experimental Tasks API designed for one-off operations. Instead of running cleanup in a plugin on the primary worker, define it as a scheduled task:

// server/tasks/cache/cleanup.ts
export default defineTask({
  meta: {
    name: "cache:cleanup",
    description: "Prune old cache entries",
  },
  async run({ payload }) {
    const storage = await useStorageAdapter()
    await storage.pruneCaches(ENV.CACHE_CLEANUP_OLDER_THAN_DAYS)
    return { result: "Success" }
  },
})

Configure in nitro.config.ts:

export default defineNitroConfig({
  experimental: {
    tasks: true
  },
  scheduledTasks: {
    // Use CACHE_CLEANUP_CRON value, e.g., '0 0 * * *' for midnight
    '0 0 * * *': ['cache:cleanup']
  }
})

Benefits:

  • Tasks are designed for async operations and don't block the primary worker
  • Built-in concurrency control (only one instance runs at a time)
  • Can be triggered via API (/_nitro/tasks/cache:cleanup) or CLI (nitro task run cache:cleanup)
  • Platform support includes the node-server preset (which node-cluster is based on)

This would require enabling the experimental tasks feature and migrating from the current plugins/cleanup.ts approach.

4. Add CLI mode for external cleanup (Kubernetes CronJob)

Allow running cleanup as a separate process via CLI flag or environment variable:

node .output/server/index.mjs --cleanup-only
# or
CLEANUP_ONLY=true node .output/server/index.mjs
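
A rough sketch of the environment-variable variant is below; it is hypothetical, and useStorageAdapter / ENV.CACHE_CLEANUP_OLDER_THAN_DAYS are simply the names already used earlier in this issue, not confirmed entry points:

// server/plugins/cleanup-only.ts (hypothetical sketch, not existing code)
export default defineNitroPlugin(async () => {
  if (process.env.CLEANUP_ONLY !== 'true') return

  // Prune once, then exit so a Kubernetes Job/CronJob pod completes
  // instead of staying up as a long-running server.
  const storage = await useStorageAdapter()
  await storage.pruneCaches(ENV.CACHE_CLEANUP_OLDER_THAN_DAYS)
  process.exit(0)
})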

This would allow Kubernetes operators to run cleanup as a separate CronJob that doesn't affect the running cache server:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: cache-cleanup
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cleanup
            image: ghcr.io/falcondev-oss/github-actions-cache-server:v8
            args: ["--cleanup-only"]
            env:
              - name: DATABASE_URL
                valueFrom:
                  secretKeyRef:
                    name: cache-server-secrets
                    key: database-url
          restartPolicy: OnFailure

Benefits:

  • Cleanup can run for as long as needed without affecting liveness probes
  • If cleanup fails or is killed, the cache server continues serving requests
  • Cleanup resources (CPU/memory) can be configured independently

Workaround

We increased the Kubernetes liveness probe failureThreshold to 12 (allows ~2 minutes of unresponsiveness). This is not ideal as it delays detection of actual failures.

Related

GCS Batch API

GCS has a batch API that supports up to 100 operations per request. However, the Node.js client library (@google-cloud/storage) does not support it; see googleapis/nodejs-storage#1040.

Other storage drivers

This also affects the S3 and filesystem drivers, since the batching issue is in the storage adapter layer rather than in the GCS driver alone.
