## Problem
When running the cache server in Kubernetes with a large number of cache entries, the nightly cleanup job blocks the Node.js event loop, causing liveness probe failures and container restarts.
## Environment

- Version: v8.1.4
- Deployment: Kubernetes with `node-cluster` preset
- Storage: GCS
- Database: PostgreSQL
- Cache entries: ~18,000
## Root Cause
The cleanup job has two architectural issues:
### 1. Cleanup runs on the primary worker

```ts
if (!cluster.isPrimary) return
```

In Nitro's `node-cluster` preset, the primary worker routes incoming connections. While the primary is blocked, it can't route liveness probe requests to any worker.
### 2. Unbatched concurrent deletions

`lib/storage/drivers/gcs.ts#L26-L30`:

```ts
async function deleteMany(objectNames: string[]) {
  await Promise.all(
    objectNames.map((objectName) => bucket.file(objectName).delete({ ignoreNotFound: true })),
  )
}
```

With ~18,000 entries, this fires ~18,000 concurrent HTTP requests at once, overwhelming the event loop.
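The effect of unbounded `Promise.all` is easy to demonstrate with a self-contained stub: replacing the GCS call with a fake delete that tracks in-flight requests shows every request starting at once, while a batched variant caps concurrency. `fakeDelete` and the counters are illustrative only, not part of the project.

```typescript
// Illustrative stub: counts how many "deletes" are running concurrently.
let inFlight = 0
let maxInFlight = 0

async function fakeDelete(_name: string): Promise<void> {
  inFlight++
  maxInFlight = Math.max(maxInFlight, inFlight)
  await new Promise((resolve) => setTimeout(resolve, 1))
  inFlight--
}

// Unbatched, like the current driver: every request starts immediately.
async function deleteManyUnbatched(names: string[]): Promise<void> {
  await Promise.all(names.map(fakeDelete))
}

// Batched: at most `batchSize` requests are ever in flight.
async function deleteManyBatched(names: string[], batchSize = 100): Promise<void> {
  for (let i = 0; i < names.length; i += batchSize) {
    await Promise.all(names.slice(i, i + batchSize).map(fakeDelete))
  }
}

const names = Array.from({ length: 1000 }, (_, i) => `cache/entry-${i}`)

await deleteManyUnbatched(names)
const unbatchedPeak = maxInFlight
console.log(unbatchedPeak) // 1000: everything in flight at once

maxInFlight = 0
await deleteManyBatched(names, 100)
console.log(maxInFlight) // 100: concurrency capped by the batch size
```

With the real GCS client, each of those in-flight requests also holds a socket and response buffers, which is why the warnings below appear at 5,000 concurrent requests.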
## Evidence from container logs

When the cleanup job starts, the following warnings appear immediately:
```
(node:1) TeenyStatisticsWarning: Possible excessive concurrent requests detected.
5000 requests in-flight, which exceeds the configured threshold of 5000.
Use the TEENY_REQUEST_WARN_CONCURRENT_REQUESTS environment variable or the
concurrentRequests option of teeny-request to increase or disable (0) this warning.
```
Followed by memory leak warnings from too many concurrent operations:
```
(node:30) MaxListenersExceededWarning: Possible EventEmitter memory leak detected.
11 error listeners added to [Writable]. MaxListeners is 10.
Use emitter.setMaxListeners() to increase limit
```
And eventually connection failures:
```
[unhandledRejection] FetchError: request to https://storage.googleapis.com/storage/v1/b/actions-cache-c8fe/o/gh-actions-cache%2F... failed, reason: socket hang up
{
  type: 'system',
  errno: 'ECONNRESET',
  code: 'ECONNRESET'
}
```
## Suggested Solutions

### 1. Batch deletions

Process deletions in chunks (e.g., 100 at a time), awaiting each batch before starting the next:
```ts
async function deleteMany(objectNames: string[], batchSize = 100) {
  for (let i = 0; i < objectNames.length; i += batchSize) {
    const batch = objectNames.slice(i, i + batchSize);
    await Promise.all(
      batch.map((name) => bucket.file(name).delete({ ignoreNotFound: true }))
    );
  }
}
```

### 2. Run cleanup on a worker, not the primary

Fork a dedicated worker for cleanup so the primary remains responsive.
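A minimal sketch of what that could look like, assuming the primary can fork an extra short-lived worker; the `CLEANUP_WORKER` marker variable and the `runCleanup` callback are hypothetical names, not part of the project:

```typescript
import cluster from 'node:cluster'

// Hypothetical marker env var identifying the dedicated cleanup worker.
const CLEANUP_ENV = 'CLEANUP_WORKER'

// Pure helper so the routing decision is testable in isolation.
function isCleanupWorker(env: Record<string, string | undefined>): boolean {
  return env[CLEANUP_ENV] === '1'
}

function scheduleCleanup(runCleanup: () => Promise<void>): void {
  if (cluster.isPrimary) {
    // The primary only forks; the heavy work never touches its event loop,
    // so it keeps routing liveness probes while cleanup runs.
    cluster.fork({ [CLEANUP_ENV]: '1' })
  } else if (isCleanupWorker(process.env)) {
    // Dedicated worker: run the job, then exit so the process is reaped.
    runCleanup().then(() => process.exit(0))
  }
}
```

The regular request-serving workers never match either branch, so they are unaffected.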
### 3. Use Nitro Tasks instead of plugin-based cron

Nitro has an experimental Tasks API designed for one-off operations. Instead of running cleanup in a plugin on the primary worker, define it as a scheduled task:
```ts
// server/tasks/cache/cleanup.ts
export default defineTask({
  meta: {
    name: "cache:cleanup",
    description: "Prune old cache entries",
  },
  async run({ payload }) {
    const storage = await useStorageAdapter()
    await storage.pruneCaches(ENV.CACHE_CLEANUP_OLDER_THAN_DAYS)
    return { result: "Success" }
  },
})
```

Configure it in `nitro.config.ts`:
```ts
export default defineNitroConfig({
  experimental: {
    tasks: true
  },
  scheduledTasks: {
    // Use the CACHE_CLEANUP_CRON value, e.g. '0 0 * * *' for midnight
    '0 0 * * *': ['cache:cleanup']
  }
})
```

Benefits:
- Tasks are designed for async operations and don't block the primary worker
- Built-in concurrency control (only one instance runs at a time)
- Can be triggered via the API (`/_nitro/tasks/cache:cleanup`) or the CLI (`nitro task run cache:cleanup`)
- Platform support includes the `node-server` preset (which `node-cluster` is based on)
This would require enabling the experimental tasks feature and migrating from the current `plugins/cleanup.ts` approach.
### 4. Add a CLI mode for external cleanup (Kubernetes CronJob)

Allow running cleanup as a separate process via a CLI flag or environment variable:

```sh
node .output/server/index.mjs --cleanup-only
# or
CLEANUP_ONLY=true node .output/server/index.mjs
```

This would allow Kubernetes operators to run cleanup as a separate CronJob that doesn't affect the running cache server:
```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cache-cleanup
spec:
  schedule: "0 0 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cleanup
              image: ghcr.io/falcondev-oss/github-actions-cache-server:v8
              args: ["--cleanup-only"]
              env:
                - name: DATABASE_URL
                  valueFrom:
                    secretKeyRef:
                      name: cache-server-secrets
                      key: database-url
          restartPolicy: OnFailure
```

Benefits:
- Cleanup can run for as long as needed without affecting liveness probes
- If cleanup fails or is killed, the cache server continues serving requests
- Cleanup resources (CPU/memory) can be configured independently
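A sketch of how the entrypoint could branch on the flag; `shouldRunCleanupOnly`, `runCleanup`, and `startServer` are hypothetical names standing in for the project's real functions:

```typescript
// Decide the process mode from CLI args and environment.
function shouldRunCleanupOnly(
  argv: string[],
  env: Record<string, string | undefined>,
): boolean {
  return argv.includes('--cleanup-only') || env.CLEANUP_ONLY === 'true'
}

async function runCleanup(): Promise<void> {
  // placeholder: prune expired cache entries, then return
}

async function startServer(): Promise<void> {
  // placeholder: start the long-running Nitro cache server
}

async function main(): Promise<void> {
  if (shouldRunCleanupOnly(process.argv.slice(2), process.env)) {
    await runCleanup()
    process.exit(0) // exit cleanly so the CronJob pod completes
  }
  await startServer()
}
```

Since the cleanup process only needs the database and storage credentials, it can reuse the server image unchanged, as in the CronJob manifest above.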
## Workaround

We increased the Kubernetes liveness probe `failureThreshold` to 12 (allows ~2 minutes of unresponsiveness). This is not ideal, as it delays detection of genuine failures.
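For reference, the workaround looks roughly like this in the pod spec; the health endpoint path, port, and `periodSeconds` are assumptions (12 failures at a 10-second period gives the ~2-minute window):

```yaml
livenessProbe:
  httpGet:
    path: /            # assumed health endpoint
    port: 3000         # assumed server port
  periodSeconds: 10    # assumed default probe period
  failureThreshold: 12 # 12 × 10 s ≈ 2 minutes of tolerated unresponsiveness
```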
## Related

### GCS Batch API
GCS has a batch API that supports up to 100 operations per request. However, the Node.js client library (`@google-cloud/storage`) does not support it; see googleapis/nodejs-storage#1040.
### Other storage drivers

This also affects the S3 and filesystem drivers, since the batching issue is in the storage adapter layer: