fix(health): gate memory_critical on real memory pressure#735
Conversation
memory_critical fired whenever heapUsed/heapTotal > 95% and rss > 512MB. A busy Node process keeps its heap near-full by design, so that condition is effectively always true under normal load, producing chronic false-positive critical alerts that drown out real signals. Add a real-pressure gate: memory_critical now additionally requires a genuine pressure signal - either this process's absolute RSS has crossed a high ceiling (default 4GB), or the host as a whole has run low on free RAM (default: less than 10% of total free). Both bounds, plus the existing percent and RSS-floor knobs, are env-overridable (AGENTMEMORY_MEMORY_CRITICAL_RSS_MB, AGENTMEMORY_MEMORY_SYSTEM_FREE_FLOOR_RATIO, AGENTMEMORY_MEMORY_CRITICAL_PERCENT, AGENTMEMORY_MEMORY_WARN_PERCENT, AGENTMEMORY_MEMORY_RSS_FLOOR_MB) via resolveThresholdConfig(), which monitor.ts reads once at startup. The warn-level threshold is unchanged. Uses only node:os; no new dependencies. Signed-off-by: MarvinFS <MarvinFS@users.noreply.github.com>
|
@MarvinFS is attempting to deploy a commit to the rohitg00's projects Team on Vercel. A member of the Team first needs to authorize it. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughThe PR adds environment-configurable memory critical gating to the health monitor. New ChangesMemory Critical Threshold Configuration and Evaluation
sequenceDiagram
participant registerHealthMonitor
participant resolveThresholdConfig
participant evaluateHealth
participant HostOS
registerHealthMonitor->>resolveThresholdConfig: read env vars
resolveThresholdConfig-->>registerHealthMonitor: thresholdConfig
registerHealthMonitor->>evaluateHealth: (healthSnapshot, thresholdConfig)
evaluateHealth->>HostOS: os.freemem()/os.totalmem() (real-pressure)
evaluateHealth-->>registerHealthMonitor: status/alerts/notes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ESLint
ESLint skipped: no ESLint configuration detected in root package.json. To enable, add Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
test/health-thresholds.test.ts (1)
47-52: ⚡ Quick winAdd direct coverage for the host-memory gate as well.
These updates only exercise the new absolute-RSS path. The
memorySystemFreeFloorRatiobranch, and the documented0disable case, still have no direct test, so a regression in the new host-memory gate would slip through.Also applies to: 97-105
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/health-thresholds.test.ts` around lines 47 - 52, The test only exercises the absolute-RSS branch of evaluateHealth via memoryCriticalRssBytes; add direct tests for the host-memory gate by calling evaluateHealth with memorySystemFreeFloorRatio set to a non-zero value and with system-free bytes in the synthetic stats (s) low enough to trip the gate, asserting that status/alerts reflect memory_critical, and also add a test for the documented disable case by passing memorySystemFreeFloorRatio: 0 to ensure the host-memory check is skipped; update or add tests near the existing blocks (the evaluateHealth calls around lines 47 and 97-105) to supply appropriate s.systemFreeBytes and corresponding assertions for both the tripping and disabled behaviors.src/health/thresholds.ts (1)
12-23: ⚡ Quick winPrefer self-describing names/helpers over the new explanatory comments.
Most of these comments restate behavior that the field names and gate logic can carry directly. Extracting the pressure check into a helper such as
isUnderRealMemoryPressure(...)would let you drop most of the prose.As per coding guidelines,
src/**/*.{ts,js}: No code comments explaining WHAT — use clear naming instead.Also applies to: 52-55, 73-75, 137-141
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/health/thresholds.ts` around lines 12 - 23, Replace the long explanatory comments in thresholds.ts by using self-describing names and a small helper: rename/ensure the fields memoryCriticalRssBytes and memorySystemFreeFloorRatio remain clearly named, remove the prose above them, and implement an exported helper function isUnderRealMemoryPressure(totalRamBytes, freeRamBytes, processRssBytes, processHeapUsedBytes, memorySystemFreeFloorRatio, memoryCriticalRssBytes) that encapsulates the existing gate logic (system free-ratio escape hatch plus the RSS+heap real-pressure check) and use that helper wherever the gate is evaluated (e.g., the memory critical gate logic referenced around lines ~52-55, ~73-75, ~137-141) so the comments can be removed and behavior is expressed by the function and field names.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/health/thresholds.ts`:
- Around line 38-50: Environment-parsed thresholds can be set to impossible
values; update parseIntEnv and parseFloatEnv and the caller
resolveThresholdConfig to validate and clamp overrides for specific keys (e.g.
AGENTMEMORY_MEMORY_SYSTEM_FREE_FLOOR_RATIO, AGENTMEMORY_MEMORY_CRITICAL_RSS_MB)
before they affect underRealPressure logic: enforce non-negative integers for MB
thresholds, enforce ratio values to be within [0,1] (or other sensible min/max
per threshold), and fall back to the default if the parsed value is out of
bounds; reference the parseIntEnv, parseFloatEnv helpers and
resolveThresholdConfig so you add clamping/validation there (reject or clamp
invalid values and optionally log a warning) to prevent impossible env values
from enabling false positives.
---
Nitpick comments:
In `@src/health/thresholds.ts`:
- Around line 12-23: Replace the long explanatory comments in thresholds.ts by
using self-describing names and a small helper: rename/ensure the fields
memoryCriticalRssBytes and memorySystemFreeFloorRatio remain clearly named,
remove the prose above them, and implement an exported helper function
isUnderRealMemoryPressure(totalRamBytes, freeRamBytes, processRssBytes,
processHeapUsedBytes, memorySystemFreeFloorRatio, memoryCriticalRssBytes) that
encapsulates the existing gate logic (system free-ratio escape hatch plus the
RSS+heap real-pressure check) and use that helper wherever the gate is evaluated
(e.g., the memory critical gate logic referenced around lines ~52-55, ~73-75,
~137-141) so the comments can be removed and behavior is expressed by the
function and field names.
In `@test/health-thresholds.test.ts`:
- Around line 47-52: The test only exercises the absolute-RSS branch of
evaluateHealth via memoryCriticalRssBytes; add direct tests for the host-memory
gate by calling evaluateHealth with memorySystemFreeFloorRatio set to a non-zero
value and with system-free bytes in the synthetic stats (s) low enough to trip
the gate, asserting that status/alerts reflect memory_critical, and also add a
test for the documented disable case by passing memorySystemFreeFloorRatio: 0 to
ensure the host-memory check is skipped; update or add tests near the existing
blocks (the evaluateHealth calls around lines 47 and 97-105) to supply
appropriate s.systemFreeBytes and corresponding assertions for both the tripping
and disabled behaviors.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 1e6ba365-14f2-493f-a08a-43022801a5d6
📒 Files selected for processing (3)
src/health/monitor.tssrc/health/thresholds.tstest/health-thresholds.test.ts
…e helper Addresses CodeRabbit review feedback on the memory_critical gate: - resolveThresholdConfig() validated nothing, so a typo such as AGENTMEMORY_MEMORY_SYSTEM_FREE_FLOOR_RATIO=2 or AGENTMEMORY_MEMORY_CRITICAL_RSS_MB=-1 made the real-pressure gate effectively always true, reintroducing the false-positive memory_critical alerts this change is meant to remove. parseIntEnv/parseFloatEnv now take min/max bounds and fall back to the default when an override is out of range (percent 0-100, ratio 0-1, MB >= 0). - Extract isUnderRealMemoryPressure() so the gate is expressed by a name rather than a comment, and trim the explanatory prose. - Add direct coverage for the host-free-RAM gate (both the tripping path and the ratio=0 disable case, via mocked os.totalmem/os.freemem) and for the env-bounds rejection. Signed-off-by: MarvinFS <MarvinFS@users.noreply.github.com>
|
Thanks for the review — addressed in d23ee8f:
Full suite green (1294 tests) on a fresh |
memory_criticalfires wheneverheapUsed/heapTotal > 95% && rss > 512MB. A busy Node process keeps its heap near-full by design, so that condition is effectively always true under normal load, producing chronic false-positive critical alerts that drown out real signals.This adds a real-pressure gate:
memory_criticalnow additionally requires a genuine pressure signal — either this process's absolute RSS has crossed a high ceiling (default 4 GB), or the host as a whole has run low on free RAM (default: < 10% of total free). The warn-level threshold is unchanged.All bounds are env-overridable via
resolveThresholdConfig(), whichmonitor.tsreads once at startup:AGENTMEMORY_MEMORY_CRITICAL_RSS_MB(default 4096)AGENTMEMORY_MEMORY_SYSTEM_FREE_FLOOR_RATIO(default 0.1; set to 0 to disable the system-memory check)AGENTMEMORY_MEMORY_CRITICAL_PERCENT,AGENTMEMORY_MEMORY_WARN_PERCENT,AGENTMEMORY_MEMORY_RSS_FLOOR_MBUses only
node:os; no new dependencies.test/health-thresholds.test.tsis updated for the new gate. Verifiednpm run buildclean andnpm testgreen on a freshmaincheckout.Summary by CodeRabbit
New Features
Bug Fixes
Tests