monitor lifecycle conductor#2723
Conversation
Hello benzekrimaha,My role is to assist you with the merge of this Available options
Available commands
Status report is not available. |
380069a to
25ea9d5
Compare
Codecov Report❌ Patch coverage is
Additional details and impacted files
... and 5 files with indirect coverage changes
@@ Coverage Diff @@
## development/9.3 #2723 +/- ##
===================================================
+ Coverage 74.48% 74.83% +0.35%
===================================================
Files 200 200
Lines 13603 13701 +98
===================================================
+ Hits 10132 10253 +121
+ Misses 3461 3438 -23
Partials 10 10
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
8316f88 to
408c96c
Compare
408c96c to
e1c5b13
Compare
e1c5b13 to
aefb677
Compare
| const log = this.logger.newRequestLogger(); | ||
| const start = new Date(); | ||
| const start = Date.now(); | ||
| this._scanId = uuid(); |
There was a problem hiding this comment.
Hmm, we're storing the scan ID as a "global" field variable, but it sounds like it is really relevant/used only inside this function (through indirect calls). Could we drop the global field and instead pass it through to whatever uses it? Maybe in _createBucketTaskMessages?
725c3df to
11a94ea
Compare
a2128cf to
a464b39
Compare
| 'configured bucket processor retention interval without update to ' + | ||
| 'avoid unbounded process memory growth. ' + | ||
| 'Prometheus retains scraped scan-id series until TSDB retention.', | ||
| labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID], |
There was a problem hiding this comment.
The conductor_scan_id label on bucketProcessorScanMessagesProcessed has unbounded cardinality — each scan produces a new UUID. The local cleanup timer limits prom-client memory, but Prometheus TSDB retains every scraped series until its own retention expires. Over weeks/months this can cause significant storage growth and slow queries on the TSDB side.
Consider whether you've estimated the TSDB cardinality impact (e.g. one new scan-id per hour × 30-day retention = 720 label values × replicas). If this is expected and acceptable, no change needed, but it's worth documenting the expected growth rate in a runbook or the alert description.
— Claude Code
There was a problem hiding this comment.
This is an intentional trade-off for the overlapping-scan alert: the alert needs to distinguish scan IDs to detect bucket processors handling work from more than one conductor scan. Local prom-client cleanup is configurable through bucketProcessor.scanMetricRetentionMs, and the metric help text now states that Prometheus retains scraped scan-id series until TSDB retention. I kept the wording concise to avoid over-explaining this internal troubleshooting detail in user-facing monitoring text.
| const DEFAULT_SCAN_METRIC_RETENTION_MS = 24 * 60 * 60 * 1000; | ||
| const CONDUCTOR_ORIGIN = 'conductor'; | ||
| const BUCKET_PROCESSOR_ORIGIN = 'bucket_processor'; | ||
| let scanMetricRetentionMs = DEFAULT_SCAN_METRIC_RETENTION_MS; |
There was a problem hiding this comment.
Module-level mutable state (let scanMetricRetentionMs) shared across tests. The resetLifecycleScanMetricCleanupTimers function handles this for tests, but in production, if configureLifecycleScanMetricRetention is called multiple times (e.g. multiple LifecycleBucketProcessor instances in the same process), the last call wins silently. This is likely fine given the current single-instance architecture, but worth noting.
— Claude Code
There was a problem hiding this comment.
This is process-wide state by design for the current bucket processor deployment model: metrics are process-wide and there is a single lifecycle bucket processor service configuration per process. Tests reset it through resetLifecycleScanMetricCleanupTimers(). If we ever run multiple differently configured lifecycle bucket processors in one process, this should be revisited with instance-scoped metric state.
| details, | ||
| }; | ||
| } | ||
|
|
There was a problem hiding this comment.
_getContinuationEntry spreads ...bucketData and then overwrites contextInfo and details. This is correct as long as bucketData doesn't have additional properties that shouldn't be forwarded (e.g. stale details from a previous continuation). The previous code used Object.assign({}, bucketData, ...) which had the same behavior, so this is not a regression — just noting that the spread inherits all properties from bucketData, which is the intended behavior for continuation entries.
— Claude Code
There was a problem hiding this comment.
Yes, this is intentional and matches the previous Object.assign({}, bucketData, ...) behavior. Continuation entries preserve the bucket task envelope and only replace contextInfo / details for the continuation message.
|
|
||
| /** | ||
| * This function forces syncing the latest data mover topic | ||
| * offsets to the 'lifecycle' metrics snapshot. It is called when |
There was a problem hiding this comment.
Minor: the _applyTransitionRule call in _processCurrents at line 1191 passes ...this._getScanContext(bucketData) into the params object, but _applyTransitionRule then builds context with conductorScanId: params.conductorScanId. This works because _getScanContext returns exactly those two keys. However, the scan context is propagated via two different mechanisms in this file: via _getActionContext (which calls _getScanContext internally) for delete/MPU actions, and via spreading _getScanContext into params objects for transition actions. A single pattern would be slightly less error-prone.
— Claude Code
There was a problem hiding this comment.
Fixed. Transition action entries now use the same action-context pattern as the other lifecycle actions: callers pass actionContext: this._getActionContext(bucketData, log, 'transition'), and _getTransitionActionEntry adds that context to the entry. I also updated the V2 transition test to assert the context through actionContext.
|
85f1055 to
4da31b3
Compare
| attempt, | ||
| }); | ||
| entry.addContext({ | ||
| entry.addContext(params.actionContext || { |
There was a problem hiding this comment.
The fallback means that if params.actionContext is undefined (callers that haven't been updated), the context won't include scan fields. This is fine for rolling upgrades, but every call site in this PR that calls _applyTransitionRule should pass actionContext. I verified the callers in this PR and they all do — just flagging that any future caller that forgets actionContext will silently lose scan context tracing without any warning.
— Claude Code
There was a problem hiding this comment.
Agreed. The fallback has been removed, and current callers now pass actionContext explicitly so missing scan context is not silently hidden.
| processBucketEntry(bucketLCRules, bucketData, s3target, | ||
| backbeatMetadataProxy, nbRetries, done) { | ||
| const log = this.log.newRequestLogger(); | ||
| if (log.addDefaultFields) { |
There was a problem hiding this comment.
The guard if (log.addDefaultFields) suggests this might be called with a logger that doesn't have addDefaultFields. However, just a few lines down in the same method, this.log.newRequestLogger() is called — and in the conductor and bucket processor, log.addDefaultFields is called unconditionally. This inconsistency suggests either the guard is unnecessary (and could mask a bug if the wrong logger type is passed), or there's a known code path where a non-request logger is used. If it's the latter, a comment explaining when that happens would help.
— Claude Code
There was a problem hiding this comment.
This was addressed: the guard was unnecessary and has been removed, so addDefaultFields is now called consistently with the conductor and bucket processor paths.
|
Carry conductor scan start timestamps through lifecycle bucket messages, task contexts, and action entries for consistent downstream observability. Issue: BB-740
Record bucket-task message pickup by conductor scan id before scheduler dispatch, observe message age, and make per-scan metric retention configurable. Issue: BB-740
Separate stalled-scan thresholds from lifecycle latency and alert only when bucket processors handle multiple scan ids in parallel. Issue: BB-740
Clarify bucket processor pickup panels, split pod and scan-id views, and render scan message age as a heatmap. Issue: BB-740
4da31b3 to
4520f5a
Compare
| const unknownCanonicalIds = this._accountIdCache.getMisses(); | ||
| if (err && err.Throttling) { | ||
| log.info('not starting new lifecycle batch', { reason: err }); | ||
| if (scanStarted) { |
There was a problem hiding this comment.
When _controlBacklog throttles but scanStarted is true (meaning the waterfall step that sets scanStarted = true already ran, then listBuckets returned a Throttling error), this block resets _batchInProgress = false and clears _currentScanId/_currentScanStartTimestamp. That's correct for cleanup.
However, this block does not call LifecycleMetrics.onProcessBuckets to undo the heartbeat set at line 446, and it does not call onConductorScanComplete. This means s3_lifecycle_latest_batch_start_time was already set (line 446), but s3_lifecycle_latest_batch_end_time never gets updated. The LifecycleStalledScan alert will fire because start_time > end_time and enough wall-clock time passes — even though the conductor is not actually stalled, it just got throttled mid-scan. Consider calling onConductorScanComplete here too (with the partial totalBucketsListed count and elapsed time) so the alert self-resolves.
— Claude Code
There was a problem hiding this comment.
I updated the conductor so that when throttling happens after a scan has already started, we still record the scan end metrics before clearing the in-progress scan state. This avoids leaving latest_batch_start_time > latest_batch_end_time and prevents LifecycleStalledScan from firing for a scan that already exited.
| 'configured bucket processor retention interval without update to ' + | ||
| 'avoid unbounded process memory growth. ' + | ||
| 'Prometheus retains scraped scan-id series until TSDB retention.', | ||
| labelNames: [LIFECYCLE_LABEL_ORIGIN, LIFECYCLE_LABEL_CONDUCTOR_SCAN_ID], |
There was a problem hiding this comment.
The conductor_scan_id label on bucketProcessorScanMessagesProcessed can take a unique UUID value per scan. Since each scan generates a new UUID, this creates a new time series for every scan. While the cleanup timer in setScanMetricTimeout bounds prom-client's in-process memory, Prometheus server will retain all scraped series until TSDB retention expires. For a system scanning every few minutes, this could produce thousands of unique label values per day. Consider whether the LifecycleBucketProcessorMultipleParallelScans alert (which is the main consumer of this cardinality) could be achieved with a lower-cardinality approach, e.g. a gauge tracking the count of distinct active scans rather than one series per scan UUID.
— Claude Code
There was a problem hiding this comment.
This is a known trade-off and matches the compromise discussed with François earlier. We keep conductor_scan_id as a label because it makes overlapping scans visible if the expected “one scan at a time” invariant breaks. The prom-client side is bounded by configurable local cleanup, and the help text explicitly notes that Prometheus retains scraped scan-id series until TSDB retention. So I do not think we should replace this with a lower-cardinality gauge in this PR unless we decide to change that accepted trade-off.
| this._batchInProgress = false; | ||
| this._currentScanId = null; | ||
| this._currentScanStartTimestamp = null; | ||
| LifecycleMetrics.onConductorScanComplete( |
There was a problem hiding this comment.
When the scan was started (scanStarted = true) and a Throttling error occurs during listing, onConductorScanComplete is called. But on the non-throttling error path (line ~502–506), the same logic runs — this is correct. However, I notice the _currentScanId is reset to null inside the Throttling branch but _batchInProgress is also set to false. If a concurrent processBuckets call were to sneak in between these assignments (Node.js is single-threaded so this can't actually happen), it could see inconsistent state. As-is this is fine for single-threaded Node.js, just noting for future multi-instance awareness.
— Claude Code
| ruleType: 'transition', | ||
| reqId: log.getSerializedUids(), | ||
| }); | ||
| entry.addContext(params.actionContext); |
There was a problem hiding this comment.
_applyTransitionRule receives params.actionContext and passes it directly to entry.addContext(). Callers construct actionContext via this._getActionContext(bucketData, log, 'transition') which always includes origin, ruleType, reqId, and scan context. However, nothing prevents a caller from omitting actionContext entirely (it's not validated), in which case entry.addContext(undefined) would be called. Consider adding a guard or documenting actionContext as required in the JSDoc for _applyTransitionRule.
— Claude Code
| const DEFAULT_SCAN_METRIC_RETENTION_MS = 24 * 60 * 60 * 1000; | ||
| const CONDUCTOR_ORIGIN = 'conductor'; | ||
| const BUCKET_PROCESSOR_ORIGIN = 'bucket_processor'; | ||
| let scanMetricRetentionMs = DEFAULT_SCAN_METRIC_RETENTION_MS; |
There was a problem hiding this comment.
scanMetricRetentionMs is module-level mutable state shared across all callers in the same process. configureLifecycleScanMetricRetention is called from LifecycleBucketProcessor constructor. If a process ever created two LifecycleBucketProcessor instances with different configs, the last one wins silently. This is probably fine for the current single-instance usage, but worth noting.
— Claude Code
|
PR Review Summary |
|
PR is kept as-is for these points:
|
Issue: BB-740