feat(prediction): add predictive cooldown with historical usage patterns #19

Open
owaindjones wants to merge 52 commits into main from feat/predictive-cooldown

@owaindjones
Owner

Summary

Implements #18 — Predictive cooldown based on system usage patterns. rouser now learns from historical CPU/network activity across days/weeks and dynamically extends the post-idle cooldown duration when patterns indicate likely continued active use, capped at a configurable maximum extension time.

Changes

Feature implementation (feat(prediction): add predictive cooldown based on historical usage patterns)

  • New prediction module with history (binary log) and model (statistical predictor) submodules
  • Time-aware hour-of-day analysis: tracks per-hour high-activity counts for CPU (>50%), network (>10 Mbps), and disk (>5 MB/s)
  • Linear score interpolation to configurable extension range when transitioning from inhibited → below-threshold state
  • Binary serialization via bincode v2 with date-partitioned files (history.log.YYYYMMDD) under XDG data dir or /var/lib/rouser/ for root
  • Automatic pruning of old history files (every ~12h, configurable retention)
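
The linear interpolation bullet can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: it assumes the score is normalized to [0, 1] and that the extension range starts at zero, and `interpolate_extension` is a hypothetical name.

```rust
use std::time::Duration;

/// Hypothetical helper: map a prediction score onto an extension between
/// zero and `max_extension`. The score is clamped to [0, 1]; a non-finite
/// score is treated as 0 (an assumption, not confirmed by the PR).
fn interpolate_extension(score: f64, max_extension: Duration) -> Duration {
    let clamped = if score.is_finite() {
        score.clamp(0.0, 1.0)
    } else {
        0.0
    };
    // Linear scaling: score 0.0 gives no extension, score 1.0 gives the cap.
    max_extension.mul_f64(clamped)
}
```

Clamping before scaling keeps the result inside the configured cap even if the scoring step ever produces an out-of-range value.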

Debug logging (feat(prediction): add debug logging and prune tracking to prediction model)

  • info! log on startup showing loaded historical data points
  • debug! log per record() call with metric values and hour-of-day bucket
  • debug!/info! logs in HistoryLog::prune() for each file removed and summary of pruned count
  • Flush logging when model records a new snapshot to the history buffer

Documentation (docs: add prediction feature docs, update config reference)

  • README.md: added "Predictive cooldown" to Key Features list with brief description
  • docs/configuration.md: added [prediction] section table documenting all 3 config keys (update_interval, history_length, max_extension_time)
  • New docs/prediction-model.md: comprehensive guide explaining how the prediction model works — data collection, hour-of-day analysis, scoring algorithm, confidence calculation, and configuration tuning

Security audit (chore: add security review for prediction module)

  • Path validation: all file paths derived from XDG spec or config constants (no user input)
  • Bincode deserialization safe: length-prefixed format prevents buffer overread; truncated entries logged as warnings and skipped
  • Prune function validates YYYYMMDD filename format before processing; only matches history.log.* pattern files
  • No shell execution, symlink following, or world-writable permissions in history directory creation (0755)
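
The filename check described for the prune function might look something like the sketch below. It assumes only the `history.log.YYYYMMDD` naming stated above; `history_date_suffix` is a hypothetical helper, and the month/day range check is a deliberately loose illustration rather than full calendar validation.

```rust
/// Hypothetical helper: return the YYYYMMDD suffix of a
/// `history.log.YYYYMMDD` file name, or None if the name does not match.
fn history_date_suffix(file_name: &str) -> Option<&str> {
    let suffix = file_name.strip_prefix("history.log.")?;
    // Exactly eight ASCII digits; anything else is ignored by the pruner.
    let well_formed = suffix.len() == 8 && suffix.bytes().all(|b| b.is_ascii_digit());
    if !well_formed {
        return None;
    }
    // Loose plausibility check on month and day (not a full calendar check).
    let month: u32 = suffix[4..6].parse().ok()?;
    let day: u32 = suffix[6..8].parse().ok()?;
    ((1..=12).contains(&month) && (1..=31).contains(&day)).then_some(suffix)
}
```

Rejecting anything that is not an exact match means unrelated files in the history directory are never candidates for deletion.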

Config alignment (refactor(prediction): use Duration type for max_extension_time)

  • Standardized all prediction timing fields to std::time::Duration with humantime_serde parsing
  • Renamed max_extension_secs: u64 → max_extension_time: Duration (default 1h)

Testing

All checks pass:

  • cargo fmt --check — clean
  • cargo clippy --all-targets -- -D warnings — zero warnings/errors
  • cargo test --all-targets — 148 tests: 146 pass (74 lib + 72 binary), 2 ignored (hardware-specific)

Config Example

[prediction]
update_interval = "30s"          # How often to record a data point
history_length = "30d"           # Keep this much historical data
max_extension_time = "1h"        # Maximum additional cooldown extension

Manual QA Notes

  • Run RUST_LOG=debug rouser --dry-run to see prediction model initialization and per-tick logging
  • After running for several days, check that date-partitioned $XDG_DATA_HOME/rouser/history.log.* files exist (one per day, named history.log.YYYYMMDD)
  • Verify pruning by setting history_length = "1d" in config and observing debug logs on subsequent runs

owaindjones added 13 commits May 1, 2026 20:13
…atterns

Learns from daily system metric snapshots to dynamically extend the
idle cooldown duration before releasing sleep inhibition. Uses a
time-aware statistical model that scores CPU and network activity by
hour-of-day, with a configurable max extension capped at 60 seconds.

Key changes:
- New prediction:: module with binary history log (bincode v2) using
  date-partitioned files under XDG_DATA_HOME or /var/lib/rouser
- PredictionModel scores historical patterns and predicts additional
  cooldown seconds when metrics drop below threshold
- Service.rs wires recording into tick() loop and applies predictions
  during cooldown transitions with info-level logging
- Config adds [prediction] section with max_extension_secs (default 60s)
- All clippy warnings resolved, tests pass (74+74 across lib/bin)
…ign all prediction fields

- Replace u64 max_extension_secs with humantime_serde-parsed
  Duration (default 1h) across config, model, and service layers
- Rename CooldownPrediction.additional_seconds to additional_time
  as std::time::Duration for consistency with other timing fields
- Update DataManager.predicted_extension_secs → predicted_additional_time
- Add pruning debug logging in HistoryLog when files are removed
- Add record flush logging in PredictionModel on each data point write
- Wire prune() call into service.rs tick loop (every ~12h via counter)
- Add .sisyphus/ to .gitignore
…to tick loop

Debug logging (Task 1):
- Add per-tick debug log in PredictionModel::record() showing data point
  number with CPU max, network throughput, disk I/O, and UTC hour bucket
- Add debug log when prune() is called on each service tick
- Wire model.prune(history_length) into service.rs tick loop (safe due to
  daily deduplication in HistoryLog::prune())

Documentation (Task 2):
- README.md: add 'Predictive cooldown' bullet to Key Features list
- docs/configuration.md: add [prediction] section with full config table,
  update example TOML block and See Also links
- docs/prediction-model.md: new comprehensive guide covering data collection,
  hour-of-day histogram building, scoring algorithm, confidence scaling,
  pruning mechanics, configuration tuning, and debug log reference
- mkdocs.yml + docs/index.md: add navigation links to prediction model doc

Manual QA verified with RUST_LOG=debug dry-run showing all three log types:
…nterval, fix flaky date test

- Fix stale inline comment in docs/configuration.md example TOML
  (update_interval description now matches actual behavior)
- Auto-enforce prediction.update_interval >= root update_interval via
  std::cmp::max; emit warn! when correction is applied so operators
  notice misconfiguration
- Rename debug log field 'samples=N' to 'accumulated_ticks=N' for
  clarity in model.rs flush logging
- Add two multi-tick averaging tests: arithmetic mean verification
  across flush boundaries and GPU per-slot averaging with varying
  GPU counts, both with descriptive comments explaining expected
  values and flush timing
- Fix flaky test_history_entry_date_extraction to use Utc::now()
  instead of Local::now(), matching entry_date()'s UTC implementation
- Update AGENTS.md comment policy under Core Principles
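
The interval enforcement in this commit can be sketched as below. This is an illustration under assumptions: `effective_prediction_interval` is a hypothetical name, and the `warn!` call mentioned in the commit is left to the caller so the sketch stays dependency-free.

```rust
use std::time::Duration;

/// Hypothetical helper: clamp the prediction interval so it is never shorter
/// than the root update interval. The boolean reports whether a correction
/// was applied, so the caller can emit a warning.
fn effective_prediction_interval(root: Duration, prediction: Duration) -> (Duration, bool) {
    let effective = std::cmp::max(root, prediction);
    (effective, effective != prediction)
}
```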
…nt-config and inhibition fallback

- Replace hardcoded CPU/network/disk thresholds in prediction model with
  single 'inhibited' boolean from service.rs threshold logic. This removes
  three unnecessary config fields (cpu_high_threshold, network_high_threshold,
  disk_high_threshold) and uses the actual inhibition state computed per-tick.
- Fix --print-config: was ignoring -c flag and always using merged defaults.
  Now respects single config file path when provided.
- Fix inhibition fallback: rewrite InhibitionState::acquire() to use a clean
  retry pattern via SleepInhibitor::acquire_with_fallback(). Removes buggy
  code that made two redundant D-Bus calls on auth error (creating duplicate
  inhibitors).
- Upgrade TimeKey from single hour-of-day dimension to three dimensions:
  year, week_of_year, seconds_into_week for seasonal/monthly/weekday patterns.
- Fix clippy warnings: redundant closure, unnecessary cast, clone-on-copy,
  manual RangeInclusive::contains (4 errors total).
- Update prediction-model.md documentation to reflect TimeKey representation
  and simplified inhibition-based scoring.
… fallback to auth-only errors

- Fix critical bug: score_inhibition_rate() ±3600s proximity search now
  constrains by year and ±1 week of year to prevent historical data from
  last year contaminating current predictions.

- Narrow inhibition D-Bus fallback: only falls back on auth-related errors
  (interactive authentication, Access denied), not all failures. Non-auth
  errors propagate unchanged without masking real infrastructure issues.

- Remove dead code: hour_component() and day_of_week() methods from TimeKey
  were never called anywhere in the codebase.

- Add 5 new unit tests for TimeKey struct and prediction scoring path.
…auth error patterns

Add linear_day() helper for correct end-of-year boundary handling
in score_inhibition_rate(). Expand is_auth_error() to catch additional
polkit error strings ("not authorized", "not authenticated").
…ek to f64

Remove 'Running history pruning' debug line — prune() already logs
at info level when files are actually removed. Remove 'Metrics exceed
threshold, checking inhibition status' debug line — state transitions
are logged at INFO level ('Sleep inhibited:', 'Releasing sleep').

Change TimeKey.seconds_into_week from i64 to f64 for millisecond
precision (0–604799.999s). Implement Eq + Hash manually via bit-level
equality since f64 doesn't derive these traits; deterministic integer
arithmetic ensures exact equality for HashMap key compatibility.
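
A manual `Eq + Hash` implementation over the bit pattern of an `f64` could look like the sketch below. The field types and the `count_keys` helper are assumptions for illustration; only the field names come from the commit messages.

```rust
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Field names follow the commit message; the exact types are assumed.
#[derive(Debug, Clone, Copy)]
struct TimeKey {
    year: i32,
    week_of_year: u32,
    seconds_into_week: f64,
}

impl PartialEq for TimeKey {
    fn eq(&self, other: &Self) -> bool {
        // Compare the raw bits so equality agrees with the Hash impl below.
        self.year == other.year
            && self.week_of_year == other.week_of_year
            && self.seconds_into_week.to_bits() == other.seconds_into_week.to_bits()
    }
}

impl Eq for TimeKey {}

impl Hash for TimeKey {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.year.hash(state);
        self.week_of_year.hash(state);
        self.seconds_into_week.to_bits().hash(state);
    }
}

/// Hypothetical helper showing the key used in a HashMap.
fn count_keys(keys: &[TimeKey]) -> HashMap<TimeKey, u32> {
    let mut counts = HashMap::new();
    for key in keys {
        *counts.entry(*key).or_insert(0) += 1;
    }
    counts
}
```

One consequence of bit-level equality is that `0.0` and `-0.0` are distinct keys, so this only behaves well when keys are always produced by the same deterministic arithmetic, as the commit message describes.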
@owaindjones owaindjones force-pushed the feat/predictive-cooldown branch from 220a4d8 to e5be9a2 Compare May 2, 2026 08:32
owaindjones added 16 commits May 2, 2026 10:09
…d-only home

Replace RuntimeDirectory (tmpfs, lost on reboot) with
StateDirectory=rouser-data to provide a persistent writable directory at
/var/lib/rouser-data. Set XDG_DATA_HOME=/var/lib/rouser-data so the
history log writes there when running as systemd service with
ProtectHome=read-only — /var/lib is outside /home and survives reboots.
Bug #1: 'Predictive cooldown extension' info log fired on every tick
while extended cooldown was active because predicted_additional_time
was already set from a previous tick. Added check for
predicted_additional_time.is_zero() so the message only logs once per
transition into below-threshold state, matching how 'Sleep inhibited'
logs only fire on state transitions.

Bug #2: Predictive cooldown extension had no effect — inhibition was
released after base cooldown_duration (10s) instead of respecting the
predicted +1028s extension. The release logic checked plain
cooldown_duration first and released before reaching the predictive
branch. Replaced two-branch logic with single path using
std::cmp::max(cooldown_duration, predicted_additional_time) so the
prediction always extends (not replaces) the base cooldown period.
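
The single-path release check described for Bug #2 can be sketched as follows. The function names are hypothetical; only `cooldown_duration`, `predicted_additional_time`, and the use of `std::cmp::max` come from the commit message.

```rust
use std::time::Duration;

/// Hypothetical helper: how long metrics must stay below threshold before
/// inhibition is released, using the max() rule from the commit message.
fn required_quiet_time(cooldown_duration: Duration, predicted_additional_time: Duration) -> Duration {
    std::cmp::max(cooldown_duration, predicted_additional_time)
}

/// Hypothetical helper: release only once the quiet period has elapsed.
fn should_release(
    below_threshold_for: Duration,
    cooldown_duration: Duration,
    predicted_additional_time: Duration,
) -> bool {
    below_threshold_for >= required_quiet_time(cooldown_duration, predicted_additional_time)
}
```

Note that `max` selects the larger of the two durations rather than summing them, so under this rule the wait is never shorter than the base cooldown but is also not base plus prediction.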
…erpolation

Add backward-compatible rate-of-change (delta) fields to HistoryEntry:
- elapsed_since_last_ns, cpu_delta_per_sec, network/disk/gpu deltas per sec
- compute_deltas() method for computing consecutive entry differences
- XDG_STATE_HOME migration with /tmp fallback using PID-based unique path
  and 0700 permissions to minimize TOCTOU risk on shared systems

Add gap detection (fill_gaps) that inserts synthetic zero-value entries when
computer is shut down or sleeping, preventing prediction model overfitting
on active-period data only. Uses GAP_THRESHOLD_NS=5min / FILL_INTERVAL_NS=30s.

Ensure sorted file reading by date ascending with monotonic timestamp ordering
via BTreeMap iteration + sort_by_key after loading all files.

Improve service.rs cooldown_extension_applied flag to prevent redundant
prediction queries and add base+extension breakdown in release logging.

Update documentation for XDG_STATE_HOME, prediction model, systemd service.
Add AGENTS.md note about state directory migration breaking change.
…nd signals

Delta fields were previously dead code — computed struct fields existed but
the prediction model never consumed them. This fix:

1. Tracks last flushed entry metrics to enable actual delta computation
2. Calls HistoryEntry::compute_deltas() when flushing snapshots (not just tests)
3. Adds TrendSignal scoring that normalizes CPU/network rate-of-change into a
   0.5-1.4x multiplier on the base inhibition score for trend-aware predictions
4. Updates prediction-model.md with documentation for delta features, gap handling,
   and trend-aware scoring sections
5. Fixes test to use is_root=false for portable XDG_STATE_HOME writes in tests

Regression tests verify deltas are computed in production flush path.
…lenames

Previously files without valid YYYYMMDD dates were silently skipped with a
warning. Now they are read and grouped by their filesystem modification
timestamp as sort key, ensuring no history data is lost from old-format or
corrupted backup files in the history directory.
…allback

On Linux, std::fs provides no safe way to access file birth/creation times
without unsafe syscalls. Since AGENTS.md prohibits introducing unsafe code
without explicit instruction, modification time is used as the best available
proxy — historical log files are typically not modified after initial writes.
…tic records

Real history entries pushed after fill_gaps() retained stale delta
values referencing their original predecessor. Now compute_deltas() is
called against the actual predecessor in the filled sequence.
…ecomputation test

TrendSignal::compute() now divides network delta sum by only the count
of entries with valid network deltas (net_samples), matching how CPU
averages are computed. Previously divided by total entry count n, which
diluted the average when some entries had None network deltas.
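
The averaging fix amounts to dividing by the number of present samples instead of the total entry count. A minimal sketch, with `mean_of_present` as a hypothetical stand-in for the relevant part of `TrendSignal::compute()`:

```rust
/// Hypothetical helper: mean of the deltas that are present. Entries with
/// `None` are skipped and do not count toward the divisor.
fn mean_of_present(deltas: &[Option<f64>]) -> Option<f64> {
    let (sum, count) = deltas
        .iter()
        .flatten()
        .fold((0.0_f64, 0_usize), |(s, c), d| (s + *d, c + 1));
    // No valid samples means there is no meaningful average.
    (count > 0).then(|| sum / count as f64)
}
```

For `[Some(2.0), None, Some(4.0)]` this yields 3.0, whereas dividing by the total entry count of 3 would yield 2.0.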

Add integration test verifying that real entries after gap-filled synthetic
records have their deltas correctly recomputed against zero-value predecessors.
Replace last_elapsed >= 1_000_000_000 && last_elapsed <= FILL_INTERVAL_NS
with (1_000_000_000..=FILL_INTERVAL_NS).contains(&last_elapsed).
- Work in branches only (commits to main forbidden without explicit instruction)
- Remove --config flag from ExecStart since ConfigLoader::load_merged() handles
  auto-discovery of /etc/rouser/config.toml + ~/.config/rouser/config.toml
The systemd service was updated to drop the --config flag since
ConfigLoader::load_merged() handles auto-discovery. Update all four
ExecStart example references in this doc to match.
…iority chain

Phase 1 initializes tracing at DEBUG (or explicit RUST_LOG/CLI override)
so auto-install logs during config load are captured. Phase 2 reconfigures
the log level using resolve_tracing_log_level() which follows the exact
priority chain: CLI -l flag > RUST_LOG env var > config.log_level > 'info'.

Uses tracing_subscriber::reload::Layer for runtime filter swapping via
.modify() instead of requiring a fresh subscriber install. This avoids
panics when another global subscriber already exists (e.g., from PAM).
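
The priority chain can be expressed as a chain of `Option::or` calls. This sketch only illustrates the ordering; the real `resolve_tracing_log_level()` signature is not shown in this PR, so the function name and parameters here are assumptions.

```rust
/// Hypothetical sketch of the priority chain:
/// CLI -l flag > RUST_LOG env var > config.log_level > "info".
fn resolve_log_level_sketch(
    cli_flag: Option<&str>,
    rust_log_env: Option<&str>,
    config_level: Option<&str>,
) -> String {
    cli_flag
        .or(rust_log_env)
        .or(config_level)
        .unwrap_or("info")
        .to_string()
}
```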
…ed trend window

Remove 5 delta fields (elapsed_since_last_ns, cpu_delta_per_sec,
network_delta_per_sec, disk_delta_per_sec, gpu_deltas_per_sec) from
HistoryEntry serialization. Compute deltas on-the-fly at prediction time
using a standalone EntryDeltas::compute() method that takes consecutive
entries and calculates per-second rates.

Remove hard-coded GAP_THRESHOLD_NS (5min) and FILL_INTERVAL_NS (30s)
constants. Make fill_gaps() a public configurable function using the
[prediction] update_interval config value for both threshold and interval.
Synthetic zero-value entries are now in-memory only — added at prediction
time, never flushed to disk.

Replace '20 most recent entries' hard-coded count with timestamp-based
window: all entries where timestamp >= current_time - max_extension_time.
This ensures consistent temporal coverage regardless of tick frequency.
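
In-memory gap filling and the timestamp-based window might be sketched as below. The `Entry` struct, its fields, and both function signatures are simplified assumptions; the actual `HistoryEntry` carries more metrics, and input is assumed to be sorted by timestamp.

```rust
// Simplified stand-in for HistoryEntry (assumed shape, one metric only).
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    timestamp_ns: u64,
    cpu: f64,
    synthetic: bool,
}

/// Insert zero-value synthetic entries every `interval_ns` wherever two
/// consecutive entries are more than `interval_ns` apart. The result is
/// meant to live in memory only and never be written to disk.
fn fill_gaps(entries: &[Entry], interval_ns: u64) -> Vec<Entry> {
    let mut out: Vec<Entry> = Vec::with_capacity(entries.len());
    for entry in entries {
        if interval_ns > 0 {
            if let Some(prev_ts) = out.last().map(|e| e.timestamp_ns) {
                let mut t = prev_ts.saturating_add(interval_ns);
                while t < entry.timestamp_ns {
                    out.push(Entry { timestamp_ns: t, cpu: 0.0, synthetic: true });
                    t = t.saturating_add(interval_ns);
                }
            }
        }
        out.push(entry.clone());
    }
    out
}

/// Entries with timestamp >= now - max_extension, for the trend window.
fn recent_window(entries: &[Entry], now_ns: u64, max_extension_ns: u64) -> &[Entry] {
    let cutoff = now_ns.saturating_sub(max_extension_ns);
    let start = entries.partition_point(|e| e.timestamp_ns < cutoff);
    &entries[start..]
}
```

Because the window is defined by time rather than by entry count, its coverage stays the same whether the tick interval is 5 seconds or 30 seconds.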
…ehavior

Update docs/prediction-model.md: replace hard-coded '>5 minutes' and '30-second
intervals' with references to [prediction].update_interval config. Remove delta
fields storage table — deltas are now computed on-the-fly at prediction time, not
stored in history files. Replace '20 most recent entries' description with
timestamp-based window using max_extension_time. Clarify that synthetic gap-filled
entries exist only in memory during prediction.
…iting period

Previously predict_cooldown() ran only once per inhibited-to-below-threshold
transition, then the computed extension was static for the entire remaining
cooldown. Now it is re-evaluated on every tick while metrics stay below
threshold, allowing the extension to increase or decrease based on current
trends (minimum 0 via Duration::ZERO).

Changes:
- Added spike guard: skip re-evaluation when should_inhibit is true
- Moved predict_cooldown() into the below-threshold waiting block for
  per-tick re-evaluation during active cooldown
- Removed !cooldown_extension_applied guard from transition logic
- Info log on first non-zero extension, debug log on subsequent changes
…d unreachable spike guard

Oracle review identified:
- cooldown_extension_applied was written 3 times but never read — dead code
  from the old per-transition guard that was replaced by tick-based re-evaluation
- Spike guard (if should_inhibit { return }) inside the below-threshold block
  could never trigger since metrics_below_threshold_since implies not inhibiting

Removes: struct field, constructor init, spike guard, all assignments. No
behavioral change — purely dead code cleanup.
@owaindjones
Owner Author

owaindjones commented May 4, 2026

Current issues

  • History files are loaded every time the cooldown is (re)calculated. While in the "cooldown" state, that means all history is read from disk on every tick. History should be loaded from file only once, at startup, to train the prediction model. Updating the model at runtime should not require reloading all history from scratch: some form of online training should update the model iteratively in memory as each snapshot is logged.

    • Important note: gaps in the data need to be filled on the fly. They should be detected at runtime, when snapshots are logged and the model is updated, and filled with synthetic data at that point (remembering not to write the synthetic data to disk).
  • The predicted cooldown is always reported as the very specific value +1028.571428571s, and I have not seen it change at all in the journalctl logs for the latest commit. This makes me suspect something in the calculation is fixed; it may be taking the [prediction] max_extension_time and not doing anything with the actual prediction.

  • The predicted cooldown does not appear to affect when inhibition is released (the extended cooldown value is not applied). We still see this in the logs: "Releasing sleep inhibition: all metrics below threshold for 10.04917252s", and systemd-inhibit confirms inhibition is dropped much sooner than 1028 seconds (about 17 minutes).
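
As a purely arithmetic observation (not a diagnosis of the code), the logged value is 2/7 of one hour, which would be consistent with a score that stays fixed near 0.2857 if max_extension_time is at its default of 1h. The sketch below only checks that arithmetic; `fraction_of_max` is a hypothetical helper.

```rust
use std::time::Duration;

/// Hypothetical helper: what fraction of `max` does `observed` represent?
/// Assumes `max` is non-zero.
fn fraction_of_max(observed: Duration, max: Duration) -> f64 {
    observed.as_secs_f64() / max.as_secs_f64()
}
```

Calling it with the observed 1028.571428571s and a 3600s maximum gives a ratio within rounding error of 2/7.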

@owaindjones
Owner Author

Feature request: Calculate overall system average and max GPU usage

The CPU usage is calculated per-core (and frequency-weighted per-core), but the final metrics used in the inhibition decision are: aggregate average CPU usage (total usage divided by number of cores), and maximum individual CPU core usage.

Rouser should apply the same to the GPU usage:

  • Calculate usage independently for each GPU, including vendor-specific frequency-weighting and usage calculations, as is happening now

  • But for the final metrics, instead of using the individual GPU usages, calculate these two metrics:

    • Aggregate average GPU usage (sum of GPU usage of all GPUs, divided by number of GPUs)
    • Maximum individual GPU usage

The config file should be refactored to look like this:

[metrics.gpu]
per_gpu_threshold = 33.3      # GPU usage threshold (percentage)
total_threshold = 50.0
ema_alpha = 0.7       # EMA smoothing factor

^ In that scenario: Any individual GPU can trigger inhibition if they report usage over 33.3%, and inhibition can be triggered if the overall GPU usage on the system is above 50%. For a system with two GPUs, this means one can be 100% busy and the other idle (0%), or both hovering around 50% usage.

The inhibition decision code, history file format, and prediction model should be refactored to replace where they use the individual GPU usage metrics with the two new metrics: total (average) system GPU usage and maximum per-GPU usage.

Benefit: Should GPUs be added or removed from the system, the history file structure is preserved. Currently, adding or removing a GPU would change the size of the entries and mean that previous entries could no longer be used to train the prediction model.

It is still helpful to enumerate individual GPUs and report their usage in the debug output as is happening now, so don't remove that. It serves as a helpful diagnostic to show how they each contribute to the final metrics.
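
The two proposed metrics and the OR decision could be sketched like this. The struct name and field names echo those used later in this PR (`GpuAggregate`, `per_gpu_max`, `total_average`), but the layout and the `exceeds` method are assumptions for illustration.

```rust
// Assumed layout; only the metric names are taken from the PR.
#[derive(Debug, Clone, Copy, PartialEq, Default)]
struct GpuAggregate {
    per_gpu_max: f64,
    total_average: f64,
}

impl GpuAggregate {
    /// Build the aggregate from per-GPU usage percentages.
    /// An empty slice (no GPUs) yields zeros.
    fn from_values(usages: &[f64]) -> Self {
        if usages.is_empty() {
            return Self::default();
        }
        let per_gpu_max = usages.iter().copied().fold(0.0, f64::max);
        let total_average = usages.iter().sum::<f64>() / usages.len() as f64;
        Self { per_gpu_max, total_average }
    }

    /// Hypothetical check: either threshold being exceeded inhibits sleep.
    fn exceeds(&self, per_gpu_threshold: f64, total_threshold: f64) -> bool {
        self.per_gpu_max > per_gpu_threshold || self.total_average > total_threshold
    }
}
```

With two GPUs at 100% and 0%, the aggregate is a maximum of 100 and an average of 50, so the 33.3% per-GPU threshold triggers inhibition on its own. The aggregate has the same shape for any GPU count, which is what keeps history entries comparable when GPUs are added or removed.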

owaindjones added 23 commits May 7, 2026 09:12
Keep inhibited_timekeys in sync when records are flushed so predictions
reflect current data instead of stale startup snapshot. Add an in-memory
rolling window (recent_entries) for trend analysis during cooldown periods,
eliminating costly disk reads on every predict_cooldown() call.

Fix double-prediction bug where the transition block overwrote the fresh
prediction computed inside the cooldown block with a potentially zero value
from stale historical data.
…esholds

Replace single gpu.threshold config with dual-threshold system:
- per_gpu_threshold (default 15%): triggers inhibition if any single GPU exceeds it
- total_threshold (default 15%): triggers inhibition if system-wide GPU average exceeds it
- Both use OR logic — either threshold being exceeded inhibits sleep

Key changes:
- New GpuAggregate struct in metrics/gpu.rs with from_gpus/from_values constructors
- Replace HistoryEntry.gpu_usages Vec<f64> with GpuSnapshot { per_gpu_max, total_average }
  for consistent history format regardless of GPU count
- ThresholdManager::should_inhibit() takes &GpuAggregate instead of &[f64]
- Updated config/rouser.toml: [metrics.gpu].threshold → per_gpu_threshold + total_threshold
- Simplified EntryDeltas: removed gpu_deltas_per_sec vector field (aggregates suffice)
- Added #[allow(clippy::too_many_arguments)] to HistoryEntry::new() (8 params, consistent pattern)

92 tests pass. 0 failed.
…l average

Update [metrics.gpu] section to reflect new configuration structure:
- Replace single threshold with per_gpu_threshold and total_threshold keys
- Document OR logic for both thresholds
- Update example config and best practices section
- gpu-usage-measurement.md: replace single threshold example with per_gpu_threshold
  + total_threshold config, document OR logic for sleep inhibition decisions
- metrics-overview.md: expand Aggregation Strategy section to cover both
  per-device and system-wide average thresholds, explain GpuSnapshot history
  format independence from GPU count
- scratch/007-fixes-and-aggregate-gpu-metrics.md: update outdated 'What's NOT
  Done' entry (docs/configuration.md already committed in 887f39f)
… fix all stale doc references

Change defaults from 15/15 to more conservative values that reduce
false-positive sleep inhibition during moderate multi-GPU workloads.

Source-of-truth updates (AGENTS.md rule: always update config.toml first):
- src/config.rs: default_gpu_threshold() → 25.0, default_gpu_total_threshold() → 40.0
- config/rouser.toml: per_gpu_threshold = 25.0, total_threshold = 40.0

Documentation fixes — replaced all stale single-threshold format with dual:
- configuration.md: example + table defaults (15→25, 15→40)
- gpu-usage-measurement.md: config example values
- metrics-overview.md: Aggregation Strategy section expanded for dual thresholds
- averaging.md: 6 GPU threshold examples across all configs + Per-GPU EMA text
- developer-guide.md: code example uses GpuAggregate with both thresholds
- installation.md: 3 GPU config blocks updated (default, workstation, gaming)
- systemd-user-service.md: default service config GPU section

Test assertion in src/config.rs test_defaults() also updated.
… in Default impls

Remove all fn default_*() helper functions from config.rs since
config/rouser.toml is the source of truth. Replace serde defaults with
bare #[serde(default)] and hardcode values in explicit Default trait
impls. Metrics struct now uses #[derive(Default)].

Update AGENTS.md Configuration Conventions to document this pattern.
Replace #[serde(default = "default_what")] with bare #[serde(default)]
on InhibitionConfig.what field. The Default impl already provides the
same value, making default_what() dead code. Also fix CONTRIBUTING.md
and docs/developer-guide.md to document the new convention.
Fix three hardcoded Default trait impl values that didn't match
config/rouser.toml: duration_threshold 30→5s, cooldown_duration
60→10s, exclude_device_prefixes empty→full list. Also update
test_timing_defaults to assert correct TOML-matching values.
Add per_gpu_max and total_average to the main Metrics debug log line
so operators can see the exact values used for inhibition decisions.

Also adds 4 integration tests validating has_gpus() consistency with
enumerate_gpus(), driver type recognition, and empty/valid card detection.
…PU edge cases

Add 8 unit tests covering GpuAggregate::from_values() and
from_gpus(): empty input returns defaults (0.0), single GPU yields
identical max/average values, two+ GPUs compute correct max and mean,
and from_gpus results match from_values for identical data.
Add per-GPU max and total average GPU usage to the 'Flushed averaged
snapshot' debug message so operators can see whether GPUs contributed
to a flush event without needing to parse per-device logs.
Include gpu_delta_per_gpu_max and gpu_delta_total_average in rate-of-change
calculations. Update TrendSignal to average GPU trends alongside CPU, network,
and disk for more complete trend-aware cooldown prediction.
Address all user corrections: GPU aggregate metrics in snapshots, gap-filled
entries as valid idle states (not filtered), disk and GPU deltas included in
trend calculations. Document new unsupervised NG-RC reservoir computing approach
replacing histogram-based TimeKey matching.
…md updates

- docs/prediction-todo.md: 19 task tracker with architecture decision record
  for NG-RC reservoir computing (irithyll crate), dependency analysis, effort
  estimates, and implementation notes per AGENTS.md constraints.
- AGENTS.md: add Prediction Model Refactoring section referencing the TODO
  file, documenting TimeKey deprecation rationale, feature vectors, unsupervised
  learning approach, gap-filled entry handling, GPU deltas, and planned config
  fields.
Add ml_hidden_dim (default 16) and ml_delay_buffer_size (default 8) config
options for the NG-RC reservoir computing model. Update Cargo.toml with irithyll
v9.9 dependency using serde-bincode feature flag. Sync defaults across
Cargo.toml, config/rouser.toml, src/config.rs, docs/configuration.md, and tests.
…pipeline

Introduce src/prediction/ml_model.rs containing FeatureVector, NormalizationStats,
MlPredictor structs. Implements unsupervised streaming learning via irithyll's
NG-RC reservoir computing architecture. Includes Welford's online algorithm for
running statistics, checkpoint persistence, and comprehensive test coverage.
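
For reference, Welford's online algorithm for running mean and variance can be written as below. This is a generic sketch of the technique, not the `NormalizationStats` code from this commit; `RunningStats` and its methods are hypothetical names.

```rust
/// Generic running statistics using Welford's online algorithm.
#[derive(Debug, Default, Clone)]
struct RunningStats {
    count: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the current mean
}

impl RunningStats {
    /// Fold one observation into the running mean and M2.
    fn update(&mut self, x: f64) {
        self.count += 1;
        let delta = x - self.mean;
        self.mean += delta / self.count as f64;
        // Uses the updated mean, which is what keeps the update stable.
        self.m2 += delta * (x - self.mean);
    }

    /// Sample variance; None until at least two observations are seen.
    fn variance(&self) -> Option<f64> {
        (self.count > 1).then(|| self.m2 / (self.count - 1) as f64)
    }

    /// Z-score of `x`; returns 0.0 while the variance is unknown or zero.
    fn normalize(&self, x: f64) -> f64 {
        match self.variance() {
            Some(v) if v > 0.0 => (x - self.mean) / v.sqrt(),
            _ => 0.0,
        }
    }
}
```

Each update is constant time and needs no stored history, which suits a model that learns from one snapshot per tick.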
@owaindjones
Owner Author

YOLO
