feat(prediction): add predictive cooldown with historical usage patterns #19
owaindjones wants to merge 52 commits into
…atterns
Learns from daily system metric snapshots to dynamically extend the idle cooldown duration before releasing sleep inhibition. Uses a time-aware statistical model that scores CPU and network activity by hour-of-day, with the extension capped at a configurable maximum (60 seconds by default).
Key changes:
- New prediction:: module with binary history log (bincode v2) using date-partitioned files under XDG_DATA_HOME or /var/lib/rouser
- PredictionModel scores historical patterns and predicts additional cooldown seconds when metrics drop below threshold
- Service.rs wires recording into tick() loop and applies predictions during cooldown transitions with info-level logging
- Config adds [prediction] section with max_extension_secs (default 60s)
- All clippy warnings resolved, tests pass (74+74 across lib/bin)
…ign all prediction fields
- Replace u64 max_extension_secs with humantime_serde-parsed Duration (default 1h) across config, model, and service layers
- Rename CooldownPrediction.additional_seconds to additional_time as std::time::Duration for consistency with other timing fields
- Update DataManager.predicted_extension_secs → predicted_additional_time
- Add pruning debug logging in HistoryLog when files are removed
- Add record flush logging in PredictionModel on each data point write
- Wire prune() call into service.rs tick loop (every ~12h via counter)
- Add .sisyphus/ to .gitignore
…to tick loop
Debug logging (Task 1):
- Add per-tick debug log in PredictionModel::record() showing data point number with CPU max, network throughput, disk I/O, and UTC hour bucket
- Add debug log when prune() is called on each service tick
- Wire model.prune(history_length) into service.rs tick loop (safe due to daily deduplication in HistoryLog::prune())
Documentation (Task 2):
- README.md: add 'Predictive cooldown' bullet to Key Features list
- docs/configuration.md: add [prediction] section with full config table, update example TOML block and See Also links
- docs/prediction-model.md: new comprehensive guide covering data collection, hour-of-day histogram building, scoring algorithm, confidence scaling, pruning mechanics, configuration tuning, and debug log reference
- mkdocs.yml + docs/index.md: add navigation links to prediction model doc
Manual QA verified with RUST_LOG=debug dry-run showing all three log types:
…nterval, fix flaky date test
- Fix stale inline comment in docs/configuration.md example TOML (update_interval description now matches actual behavior)
- Auto-enforce prediction.update_interval >= root update_interval via std::cmp::max; emit warn! when correction is applied so operators notice misconfiguration
- Rename debug log field 'samples=N' to 'accumulated_ticks=N' for clarity in model.rs flush logging
- Add two multi-tick averaging tests: arithmetic mean verification across flush boundaries and GPU per-slot averaging with varying GPU counts, both with descriptive comments explaining expected values and flush timing
- Fix flaky test_history_entry_date_extraction to use Utc::now() instead of Local::now(), matching entry_date()'s UTC implementation
- Update AGENTS.md comment policy under Core Principles
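The interval auto-correction described in this commit can be sketched roughly as below. This is an illustration only: the function name enforce_min_interval and its return shape are hypothetical, and the real code emits warn! where this sketch returns a flag.

```rust
use std::time::Duration;

/// Hypothetical helper: clamps the prediction interval so it is never
/// shorter than the root update interval. The boolean reports whether a
/// correction was applied (the point at which a warn! would be emitted).
pub fn enforce_min_interval(
    prediction_interval: Duration,
    root_interval: Duration,
) -> (Duration, bool) {
    let effective = std::cmp::max(prediction_interval, root_interval);
    (effective, effective != prediction_interval)
}
```

Returning the flag instead of logging inside the helper keeps the clamp easy to unit-test without a tracing subscriber.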
…nt-config and inhibition fallback
- Replace hardcoded CPU/network/disk thresholds in prediction model with single 'inhibited' boolean from service.rs threshold logic. This removes three unnecessary config fields (cpu_high_threshold, network_high_threshold, disk_high_threshold) and uses the actual inhibition state computed per-tick.
- Fix --print-config: was ignoring -c flag and always using merged defaults. Now respects single config file path when provided.
- Fix inhibition fallback: rewrite InhibitionState::acquire() to use a clean retry pattern via SleepInhibitor::acquire_with_fallback(). Removes buggy code that made two redundant D-Bus calls on auth error (creating duplicate inhibitors).
- Upgrade TimeKey from single hour-of-day dimension to three dimensions: year, week_of_year, seconds_into_week for seasonal/monthly/weekday patterns.
- Fix clippy warnings: redundant closure, unnecessary cast, clone-on-copy, manual RangeInclusive::contains (4 errors total).
- Update prediction-model.md documentation to reflect TimeKey representation and simplified inhibition-based scoring.
… fallback to auth-only errors
- Fix critical bug: score_inhibition_rate() ±3600s proximity search now constrains by year and ±1 week of year to prevent historical data from last year contaminating current predictions.
- Narrow inhibition D-Bus fallback: only falls back on auth-related errors (interactive authentication, Access denied), not all failures. Non-auth errors propagate unchanged without masking real infrastructure issues.
- Remove dead code: hour_component() and day_of_week() methods from TimeKey were never called anywhere in the codebase.
- Add 5 new unit tests for TimeKey struct and prediction scoring path.
…auth error patterns
Add linear_day() helper for correct end-of-year boundary handling
in score_inhibition_rate(). Expand is_auth_error() to catch additional
polkit error strings ("not authorized", "not authenticated").
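A minimal sketch of the auth-error matching described in these two commits, using only the four strings the commit messages name. The &str signature and the case-insensitive comparison are assumptions; the real is_auth_error() may inspect a D-Bus error type instead.

```rust
/// Sketch of an auth-error classifier. Only these substrings are taken
/// from the commit messages; matching case-insensitively is an assumption.
pub fn is_auth_error(message: &str) -> bool {
    const PATTERNS: [&str; 4] = [
        "interactive authentication",
        "access denied",
        "not authorized",
        "not authenticated",
    ];
    let lower = message.to_lowercase();
    PATTERNS.iter().any(|p| lower.contains(p))
}
```

Anything that does not match is treated as a non-auth failure and would propagate unchanged rather than triggering the fallback.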
…ek to f64
Remove 'Running history pruning' debug line — prune() already logs
at info level when files are actually removed. Remove 'Metrics exceed
threshold, checking inhibition status' debug line — state transitions
are logged at INFO level ('Sleep inhibited:', 'Releasing sleep').
Change TimeKey.seconds_into_week from i64 to f64 for millisecond
precision (0–604799.999s). Implement Eq + Hash manually via bit-level
equality since f64 doesn't derive these traits; deterministic integer
arithmetic ensures exact equality for HashMap key compatibility.
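For reviewers, the manual Eq + Hash approach can be sketched as follows. The struct here is a hypothetical stand-in: only seconds_into_week being f64 comes from the commit message, the other field types and both helper functions are assumptions.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for TimeKey; year and week_of_year types are assumed.
#[derive(Debug, Clone, Copy)]
pub struct TimeKey {
    pub year: i32,
    pub week_of_year: u32,
    pub seconds_into_week: f64,
}

impl PartialEq for TimeKey {
    fn eq(&self, other: &Self) -> bool {
        self.year == other.year
            && self.week_of_year == other.week_of_year
            // Compare raw bit patterns instead of using float `==`.
            && self.seconds_into_week.to_bits() == other.seconds_into_week.to_bits()
    }
}

impl Eq for TimeKey {}

impl Hash for TimeKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.year.hash(state);
        self.week_of_year.hash(state);
        // Hash the same bits that eq() compares, keeping Eq and Hash consistent.
        self.seconds_into_week.to_bits().hash(state);
    }
}

/// Hypothetical helper: derive the f64 from integer milliseconds so that
/// identical inputs always produce identical bit patterns.
pub fn seconds_from_millis(ms_into_week: u64) -> f64 {
    ms_into_week as f64 / 1000.0
}

/// Hypothetical helper: count occurrences per key, showing HashMap use.
pub fn count_keys(keys: &[TimeKey]) -> HashMap<TimeKey, usize> {
    let mut counts = HashMap::new();
    for key in keys {
        *counts.entry(*key).or_insert(0) += 1;
    }
    counts
}
```

One consequence of bit-level equality is that 0.0 and -0.0 are distinct keys, and a NaN equals itself when the bits match. That is acceptable only as long as the value is always produced by the same deterministic arithmetic.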
…d-only home
Replace RuntimeDirectory (tmpfs, lost on reboot) with StateDirectory=rouser-data to provide a persistent writable directory at /var/lib/rouser-data. Set XDG_DATA_HOME=/var/lib/rouser-data so the history log writes there when running as a systemd service with ProtectHome=read-only; /var/lib is outside /home and survives reboots.
Bug #1: 'Predictive cooldown extension' info log fired on every tick while extended cooldown was active because predicted_additional_time was already set from a previous tick. Added check for predicted_additional_time.is_zero() so the message only logs once per transition into below-threshold state, matching how 'Sleep inhibited' logs only fire on state transitions.
Bug #2: Predictive cooldown extension had no effect: inhibition was released after base cooldown_duration (10s) instead of respecting the predicted +1028s extension. The release logic checked plain cooldown_duration first and released before reaching the predictive branch. Replaced two-branch logic with single path using std::cmp::max(cooldown_duration, predicted_additional_time) so the prediction always extends (not replaces) the base cooldown period.
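The single-path release check from the Bug #2 fix might look roughly like this. Both function names and the below_threshold_for parameter are hypothetical; only the std::cmp::max(cooldown_duration, predicted_additional_time) expression is taken from the commit message.

```rust
use std::time::Duration;

/// The wait before releasing inhibition is the larger of the base
/// cooldown and the predicted time, so a prediction can lengthen the
/// wait but never shorten it below the base cooldown.
pub fn effective_cooldown(
    cooldown_duration: Duration,
    predicted_additional_time: Duration,
) -> Duration {
    std::cmp::max(cooldown_duration, predicted_additional_time)
}

/// Hypothetical release check: true once metrics have stayed below
/// threshold for at least the effective cooldown.
pub fn should_release(
    below_threshold_for: Duration,
    cooldown_duration: Duration,
    predicted_additional_time: Duration,
) -> bool {
    below_threshold_for >= effective_cooldown(cooldown_duration, predicted_additional_time)
}
```

With the values from the bug report (10s base, 1028s predicted), release no longer happens at 10s; with a zero prediction the behavior is unchanged from the base cooldown.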
…erpolation
Add backward-compatible rate-of-change (delta) fields to HistoryEntry:
- elapsed_since_last_ns, cpu_delta_per_sec, network/disk/gpu deltas per sec
- compute_deltas() method for computing consecutive entry differences
- XDG_STATE_HOME migration with /tmp fallback using PID-based unique path and 0700 permissions to minimize TOCTOU risk on shared systems
Add gap detection (fill_gaps) that inserts synthetic zero-value entries when computer is shut down or sleeping, preventing prediction model overfitting on active-period data only. Uses GAP_THRESHOLD_NS=5min / FILL_INTERVAL_NS=30s.
Ensure sorted file reading by date ascending with monotonic timestamp ordering via BTreeMap iteration + sort_by_key after loading all files.
Improve service.rs cooldown_extension_applied flag to prevent redundant prediction queries and add base+extension breakdown in release logging.
Update documentation for XDG_STATE_HOME, prediction model, systemd service. Add AGENTS.md note about state directory migration breaking change.
…nd signals
Delta fields were previously dead code: computed struct fields existed but the prediction model never consumed them. This fix:
1. Tracks last flushed entry metrics to enable actual delta computation
2. Calls HistoryEntry::compute_deltas() when flushing snapshots (not just tests)
3. Adds TrendSignal scoring that normalizes CPU/network rate-of-change into a 0.5-1.4x multiplier on the base inhibition score for trend-aware predictions
4. Updates prediction-model.md with documentation for delta features, gap handling, and trend-aware scoring sections
5. Fixes test to use is_root=false for portable XDG_STATE_HOME writes in tests
Regression tests verify deltas are computed in production flush path.
…lenames
Previously files without valid YYYYMMDD dates were skipped with only a warning. Now they are read and grouped using their filesystem modification timestamp as the sort key, ensuring no history data is lost from old-format or corrupted backup files in the history directory.
…allback
On Linux, std::fs provides no safe way to access file birth/creation times without unsafe syscalls. Since AGENTS.md prohibits introducing unsafe code without explicit instruction, modification time is used as the best available proxy: historical log files are typically not modified after initial writes.
…tic records
Real history entries pushed after fill_gaps() retained stale delta values referencing their original predecessor. Now compute_deltas() is called against the actual predecessor in the filled sequence.
…ecomputation test
TrendSignal::compute() now divides network delta sum by only the count of entries with valid network deltas (net_samples), matching how CPU averages are computed. Previously divided by total entry count n, which diluted the average when some entries had None network deltas.
Add integration test verifying that real entries after gap-filled synthetic records have their deltas correctly recomputed against zero-value predecessors.
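The averaging fix can be illustrated with a small standalone helper. The name mean_of_present is hypothetical and this is not the body of TrendSignal::compute(); it only shows dividing by the number of valid samples rather than by the total entry count.

```rust
/// Average only the deltas that are present. Dividing by the number of
/// Some values (net_samples in the commit message) avoids diluting the
/// mean when some entries carry no network delta.
pub fn mean_of_present(values: &[Option<f64>]) -> Option<f64> {
    let (sum, count) = values
        .iter()
        .flatten()
        .fold((0.0_f64, 0_usize), |(s, c), v| (s + v, c + 1));
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}
```

For the input [Some(4.0), None, Some(8.0), None] this yields 6.0, whereas dividing by the total entry count of 4 would have produced 3.0.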
Replace last_elapsed >= 1_000_000_000 && last_elapsed <= FILL_INTERVAL_NS with (1_000_000_000..=FILL_INTERVAL_NS).contains(&last_elapsed).
- Work in branches only (commits to main forbidden without explicit instruction)
- Remove --config flag from ExecStart since ConfigLoader::load_merged() handles auto-discovery of /etc/rouser/config.toml + ~/.config/rouser/config.toml
The systemd service was updated to drop the --config flag since ConfigLoader::load_merged() handles auto-discovery. Update all four ExecStart example references in this doc to match.
…iority chain
Phase 1 initializes tracing at DEBUG (or explicit RUST_LOG/CLI override) so auto-install logs during config load are captured.
Phase 2 reconfigures the log level using resolve_tracing_log_level() which follows the exact priority chain: CLI -l flag > RUST_LOG env var > config.log_level > 'info'.
Uses tracing_subscriber::reload::Layer for runtime filter swapping via .modify() instead of requiring a fresh subscriber install. This avoids panics when another global subscriber already exists (e.g., from PAM).
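The priority chain can be sketched with plain Option chaining. This is an assumption-laden illustration: the function name and parameters below are hypothetical, and the actual signature of resolve_tracing_log_level() is not shown in this PR text.

```rust
/// Resolve the effective log level following the documented order:
/// CLI -l flag > RUST_LOG env var > config.log_level > "info".
/// Each argument is None when that source is not set.
pub fn resolve_log_level_sketch(
    cli_flag: Option<&str>,
    rust_log_env: Option<&str>,
    config_level: Option<&str>,
) -> String {
    cli_flag
        .or(rust_log_env)
        .or(config_level)
        .unwrap_or("info")
        .to_string()
}
```

A caller would typically pass std::env::var("RUST_LOG").ok().as_deref() for the second argument. Note that this sketch treats an empty string as a set value; whether the real code does the same is not stated.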
…ed trend window
Remove 5 delta fields (elapsed_since_last_ns, cpu_delta_per_sec, network_delta_per_sec, disk_delta_per_sec, gpu_deltas_per_sec) from HistoryEntry serialization. Compute deltas on-the-fly at prediction time using a standalone EntryDeltas::compute() method that takes consecutive entries and calculates per-second rates.
Remove hard-coded GAP_THRESHOLD_NS (5min) and FILL_INTERVAL_NS (30s) constants. Make fill_gaps() a public configurable function using the [prediction] update_interval config value for both threshold and interval. Synthetic zero-value entries are now in-memory only: added at prediction time, never flushed to disk.
Replace '20 most recent entries' hard-coded count with timestamp-based window: all entries where timestamp >= current_time - max_extension_time. This ensures consistent temporal coverage regardless of tick frequency.
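A rough sketch of the two mechanisms in this commit, gap filling driven by a single interval and a timestamp-based trend window. The Entry struct, its fields, and both signatures are hypothetical simplifications; the real HistoryEntry carries more metrics.

```rust
/// Simplified stand-in for a history entry (fields are assumed).
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub timestamp_ns: u64,
    pub cpu: f64,
    pub synthetic: bool,
}

/// Insert zero-value entries wherever two consecutive entries are more
/// than `update_interval_ns` apart. The same value serves as both the gap
/// threshold and the fill spacing. The result lives in memory only.
pub fn fill_gaps(entries: &[Entry], update_interval_ns: u64) -> Vec<Entry> {
    if update_interval_ns == 0 {
        return entries.to_vec();
    }
    let mut out: Vec<Entry> = Vec::with_capacity(entries.len());
    for entry in entries {
        if let Some(prev_ts) = out.last().map(|p| p.timestamp_ns) {
            if entry.timestamp_ns.saturating_sub(prev_ts) > update_interval_ns {
                let mut t = prev_ts + update_interval_ns;
                while t < entry.timestamp_ns {
                    out.push(Entry { timestamp_ns: t, cpu: 0.0, synthetic: true });
                    t = t.saturating_add(update_interval_ns);
                }
            }
        }
        out.push(entry.clone());
    }
    out
}

/// Keep every entry with timestamp >= now - max_extension_time.
pub fn trend_window(entries: &[Entry], now_ns: u64, max_extension_ns: u64) -> Vec<&Entry> {
    let cutoff = now_ns.saturating_sub(max_extension_ns);
    entries.iter().filter(|e| e.timestamp_ns >= cutoff).collect()
}
```

Because the window is defined by time rather than by entry count, a faster tick rate yields more entries in the window but the same temporal coverage.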
…ehavior
Update docs/prediction-model.md: replace hard-coded '>5 minutes' and '30-second intervals' with references to [prediction].update_interval config. Remove delta fields storage table: deltas are now computed on-the-fly at prediction time, not stored in history files. Replace '20 most recent entries' description with timestamp-based window using max_extension_time. Clarify that synthetic gap-filled entries exist only in memory during prediction.
…iting period
Previously predict_cooldown() ran only once per inhibited-to-below-threshold transition, then the computed extension was static for the entire remaining cooldown. Now it is re-evaluated on every tick while metrics stay below threshold, allowing the extension to increase or decrease based on current trends (minimum 0 via Duration::ZERO).
Changes:
- Added spike guard: skip re-evaluation when should_inhibit is true
- Moved predict_cooldown() into the below-threshold waiting block for per-tick re-evaluation during active cooldown
- Removed !cooldown_extension_applied guard from transition logic
- Info log on first non-zero extension, debug log on subsequent changes
…d unreachable spike guard
Oracle review identified:
- cooldown_extension_applied was written 3 times but never read — dead code
from the old per-transition guard that was replaced by tick-based re-evaluation
- Spike guard (if should_inhibit { return }) inside the below-threshold block
could never trigger since metrics_below_threshold_since implies not inhibiting
Removes: struct field, constructor init, spike guard, all assignments. No
behavioral change — purely dead code cleanup.
Current issues
Feature request: Calculate overall system average and max GPU usage
The CPU usage is calculated per-core (and frequency-weighted per-core), but the final metrics used in the inhibition decision are: aggregate average CPU usage (total usage divided by number of cores), and maximum individual CPU core usage. Rouser should apply the same to the GPU usage:
The config file should be refactored to look like this:
[metrics.gpu]
per_gpu_threshold = 33.3 # GPU usage threshold (percentage)
total_threshold = 50.0
ema_alpha = 0.7 # EMA smoothing factor
^ In that scenario: any individual GPU can trigger inhibition if it reports usage over 33.3%, and inhibition can be triggered if the overall GPU usage on the system is above 50%. For a system with two GPUs, this means one can be 100% busy and the other idle (0%), or both hovering around 50% usage.
The inhibition decision code, history file format, and prediction model should be refactored to replace where they use the individual GPU usage metrics with the two new metrics: total (average) system GPU usage and maximum per-GPU usage.
Benefit: Should GPUs be added or removed from the system, the history file structure is preserved. Currently, adding or removing a GPU would change the size of the entries and mean that previous entries could no longer be used to train the prediction model.
It is still helpful to enumerate individual GPUs and report their usage in the debug output as is happening now, so don't remove that. It serves as a helpful diagnostic to show how they each contribute to the final metrics.
Keep inhibited_timekeys in sync when records are flushed so predictions reflect current data instead of stale startup snapshot. Add an in-memory rolling window (recent_entries) for trend analysis during cooldown periods, eliminating costly disk reads on every predict_cooldown() call. Fix double-prediction bug where the transition block overwrote the fresh prediction computed inside the cooldown block with a potentially zero value from stale historical data.
…esholds
Replace single gpu.threshold config with dual-threshold system:
- per_gpu_threshold (default 15%): triggers inhibition if any single GPU exceeds it
- total_threshold (default 15%): triggers inhibition if system-wide GPU average exceeds it
- Both use OR logic — either threshold being exceeded inhibits sleep
Key changes:
- New GpuAggregate struct in metrics/gpu.rs with from_gpus/from_values constructors
- Replace HistoryEntry.gpu_usages Vec<f64> with GpuSnapshot { per_gpu_max, total_average }
for consistent history format regardless of GPU count
- ThresholdManager::should_inhibit() takes &GpuAggregate instead of &[f64]
- Updated config/rouser.toml: [metrics.gpu].threshold → per_gpu_threshold + total_threshold
- Simplified EntryDeltas: removed gpu_deltas_per_sec vector field (aggregates suffice)
- Added #[allow(clippy::too_many_arguments)] to HistoryEntry::new() (8 params, consistent pattern)
92 tests pass. 0 failed.
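A minimal sketch of the aggregate and the OR-based threshold check, for orientation. The field names mirror GpuSnapshot { per_gpu_max, total_average } from the commit message; assuming GpuAggregate uses the same fields, and the exceeds() method, are guesses for illustration, not the code in metrics/gpu.rs.

```rust
/// Aggregate GPU usage whose shape does not depend on GPU count.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub struct GpuAggregate {
    pub per_gpu_max: f64,
    pub total_average: f64,
}

impl GpuAggregate {
    /// Build from per-GPU usage percentages; empty input yields 0.0 / 0.0.
    pub fn from_values(values: &[f64]) -> Self {
        if values.is_empty() {
            return Self::default();
        }
        let per_gpu_max = values.iter().copied().fold(f64::MIN, f64::max);
        let total_average = values.iter().sum::<f64>() / values.len() as f64;
        Self { per_gpu_max, total_average }
    }

    /// Hypothetical check: OR logic, so either threshold being exceeded
    /// is enough to inhibit sleep.
    pub fn exceeds(&self, per_gpu_threshold: f64, total_threshold: f64) -> bool {
        self.per_gpu_max > per_gpu_threshold || self.total_average > total_threshold
    }
}
```

Since the average can never exceed the maximum, the total threshold only changes the outcome when it is set lower than the per-GPU threshold.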
…l average
Update [metrics.gpu] section to reflect new configuration structure:
- Replace single threshold with per_gpu_threshold and total_threshold keys
- Document OR logic for both thresholds
- Update example config and best practices section
- gpu-usage-measurement.md: replace single threshold example with per_gpu_threshold + total_threshold config, document OR logic for sleep inhibition decisions
- metrics-overview.md: expand Aggregation Strategy section to cover both per-device and system-wide average thresholds, explain GpuSnapshot history format independence from GPU count
- scratch/007-fixes-and-aggregate-gpu-metrics.md: update outdated 'What's NOT Done' entry (docs/configuration.md already committed in 887f39f)
… fix all stale doc references
Change defaults from 15/15 to more conservative values that reduce false-positive sleep inhibition during moderate multi-GPU workloads.
Source-of-truth updates (AGENTS.md rule: always update config.toml first):
- src/config.rs: default_gpu_threshold() → 25.0, default_gpu_total_threshold() → 40.0
- config/rouser.toml: per_gpu_threshold = 25.0, total_threshold = 40.0
Documentation fixes (replaced all stale single-threshold format with dual):
- configuration.md: example + table defaults (15→25, 15→40)
- gpu-usage-measurement.md: config example values
- metrics-overview.md: Aggregation Strategy section expanded for dual thresholds
- averaging.md: 6 GPU threshold examples across all configs + Per-GPU EMA text
- developer-guide.md: code example uses GpuAggregate with both thresholds
- installation.md: 3 GPU config blocks updated (default, workstation, gaming)
- systemd-user-service.md: default service config GPU section
Test assertion in src/config.rs test_defaults() also updated.
… in Default impls
Remove all fn default_*() helper functions from config.rs since config/rouser.toml is the source of truth. Replace serde defaults with bare #[serde(default)] and hardcode values in explicit Default trait impls. Metrics struct now uses #[derive(Default)]. Update AGENTS.md Configuration Conventions to document this pattern.
Replace #[serde(default = "default_what")] with bare #[serde(default)] on InhibitionConfig.what field. The Default impl already provides the same value, making default_what() dead code. Also fix CONTRIBUTING.md and docs/developer-guide.md to document the new convention.
Fix three hardcoded Default trait impl values that didn't match config/rouser.toml: duration_threshold 30→5s, cooldown_duration 60→10s, exclude_device_prefixes empty→full list. Also update test_timing_defaults to assert correct TOML-matching values.
Add per_gpu_max and total_average to the main Metrics debug log line so operators can see the exact values used for inhibition decisions. Also adds 4 integration tests validating has_gpus() consistency with enumerate_gpus(), driver type recognition, and empty/valid card detection.
…PU edge cases Add 8 unit tests covering GpuAggregate::from_values() and from_gpus(): empty input returns defaults (0.0), single GPU yields identical max/average values, two+ GPUs compute correct max and mean, and from_gpus results match from_values for identical data.
Add per-GPU max and total average GPU usage to the 'Flushed averaged snapshot' debug message so operators can see whether GPUs contributed to a flush event without needing to parse per-device logs.
Include gpu_delta_per_gpu_max and gpu_delta_total_average in rate-of-change calculations. Update TrendSignal to average GPU trends alongside CPU, network, and disk for more complete trend-aware cooldown prediction.
Address all user corrections: GPU aggregate metrics in snapshots, gap-filled entries as valid idle states (not filtered), disk and GPU deltas included in trend calculations. Document new unsupervised NG-RC reservoir computing approach replacing histogram-based TimeKey matching.
…md updates
- docs/prediction-todo.md: 19 task tracker with architecture decision record for NG-RC reservoir computing (irithyll crate), dependency analysis, effort estimates, and implementation notes per AGENTS.md constraints.
- AGENTS.md: add Prediction Model Refactoring section referencing the TODO file, documenting TimeKey deprecation rationale, feature vectors, unsupervised learning approach, gap-filled entry handling, GPU deltas, and planned config fields.
Add ml_hidden_dim (default 16) and ml_delay_buffer_size (default 8) config options for the NG-RC reservoir computing model. Update Cargo.toml with irithyll v9.9 dependency using serde-bincode feature flag. Sync defaults across Cargo.toml, config/rouser.toml, src/config.rs, docs/configuration.md, and tests.
…pipeline
Introduce src/prediction/ml_model.rs containing FeatureVector, NormalizationStats, MlPredictor structs. Implements unsupervised streaming learning via irithyll's NG-RC reservoir computing architecture. Includes Welford's online algorithm for running statistics, checkpoint persistence, and comprehensive test coverage.
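For readers unfamiliar with it, Welford's online algorithm can be sketched as below. RunningStats is a hypothetical name and layout, not the NormalizationStats struct from ml_model.rs; the normalize() helper is likewise an assumption about how the statistics might be used.

```rust
/// Running mean and variance computed in a single pass.
#[derive(Debug, Clone, Default)]
pub struct RunningStats {
    count: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the current mean
}

impl RunningStats {
    /// Fold one observation into the running statistics.
    pub fn update(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        let delta2 = x - self.mean;
        self.m2 += delta * delta2;
    }

    pub fn mean(&self) -> f64 {
        self.mean
    }

    /// Sample variance (n - 1 denominator); None with fewer than 2 samples.
    pub fn variance(&self) -> Option<f64> {
        if self.count < 2 {
            None
        } else {
            Some(self.m2 / (self.count - 1) as f64)
        }
    }

    /// Z-score of `x`; returns 0.0 while the variance is undefined or zero.
    pub fn normalize(&self, x: f64) -> f64 {
        match self.variance() {
            Some(v) if v > 0.0 => (x - self.mean) / v.sqrt(),
            _ => 0.0,
        }
    }
}
```

The appeal for a streaming model is that no past samples need to be stored: three numbers are enough to normalize each incoming feature.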

Summary
Implements #18 — Predictive cooldown based on system usage patterns. rouser now learns from historical CPU/network activity across days/weeks and dynamically extends the post-idle cooldown duration when patterns indicate likely continued active use, capped at a configurable maximum extension time.
Changes
Feature implementation (feat(prediction): add predictive cooldown based on historical usage patterns)
- prediction module with history (binary log) and model (statistical predictor) submodules
- … (history.log.YYYYMMDD) under XDG data dir or /var/lib/rouser/ for root
Debug logging (feat(prediction): add debug logging and prune tracking to prediction model)
- info! log on startup showing loaded historical data points
- debug! log per record() call with metric values and hour-of-day bucket
- debug!/info! logs in HistoryLog::prune() for each file removed and summary of pruned count
Documentation (docs: add prediction feature docs, update config reference)
- docs/configuration.md: added [prediction] section table documenting all 3 config keys (update_interval, history_length, max_extension_secs)
- docs/prediction-model.md: comprehensive guide explaining how the prediction model works (data collection, hour-of-day analysis, scoring algorithm, confidence calculation, and configuration tuning)
Security audit (chore: add security review for prediction module)
- … history.log.* pattern files
Config alignment (refactor(prediction): use Duration type for max_extension_time)
- std::time::Duration with humantime_serde parsing
- max_extension_secs: u64 → max_extension_time: Duration (default 1h)
Testing
All checks pass:
- cargo fmt --check: clean
- cargo clippy --all-targets -- -D warnings: zero warnings/errors
- cargo test --all-targets: 148 tests pass (74 lib + 72 binary, 2 ignored hardware-specific)
Config Example
Manual QA Notes
- RUST_LOG=debug rouser --dry-run to see prediction model initialization and per-tick logging
- $XDG_DATA_HOME/rouser/history.log.*.log files exist with date-partitioned data
- history_length = "1d" in config and observing debug logs on subsequent runs