Handle Telegram rate limits without stuck turns and harden bot restarts after binary upgrades#9
Conversation
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
WalkthroughAdds a Telegram ChangesRetry Turn + Instance Management
sequenceDiagram
participant User as User / Telegram
participant App as App Handler
participant Store as Store (DB)
participant Queue as Turn Queue
participant Processor as Turn Processor
User->>App: /retry 42
App->>Store: retry_request_for_turn(42, session_key, user_id)
alt Not retryable
Store-->>App: None
App->>User: "Turn is not retryable"
else Retryable
Store-->>App: TurnRequest { prompt, review_mode, ... }
App->>Queue: enqueue_turn(TurnRequest)
Queue-->>Processor: process_turn()
alt Pre-Codex rate limit
Processor->>App: pre-Codex send fails w/ retry-after
App->>User: Send delayed retry notice with rate_limit_retry_keyboard
User->>App: presses Retry button (cmd:/retry <turn_id>)
else Success
Processor->>User: assistant response
end
end
Estimated Code Review Effort🎯 4 (Complex) | ⏱️ ~40 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/app/turns.rs (1)
537-549:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftBackoff on a new chunk can permanently suppress the final reply.
If
send_message()for a later chunk hits 429,defer_after_retry_after()setsedit_backoff_untilandflush()returns early. The laterfinish(true)call hits the same guard and also returnsOk(()), so the turn no longer fails, but the remaining output may never be delivered unless the user manually runs/copy.Also applies to: 578-593, 669-676
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/app/turns.rs`:
- Around line 32-45: The Err branch of latest_limits_snapshot_from_shared(...)
is aborting the entire turn (clearing cancel_slot, cancel.cancel(), awaiting
chat_action_task, calling finish_pre_codex_turn_failure and finish_turn_cleanup)
which hides prompts; change this to recover gracefully by setting limits_inline
to None (or an appropriate empty banner), remove the calls that cancel and
return (do not call finish_pre_codex_turn_failure or finish_turn_cleanup), and
instead log the error (or record it) so the turn proceeds to Codex; keep
references to cancel_slot, cancel, chat_action_task only if you must clean up
non-fatal resources but do not abort the turn flow in the
latest_limits_snapshot_from_shared error path and let format_limits_inline be
optional.
In `@src/commands.rs`:
- Around line 228-230: Help text for the "/retry" command is inaccurate
(mentions only "failed" turns) while the command supports retrying both failed
and cancelled turns; update the text passed to text_help for the "/retry" match
arm and the corresponding bot-command description (the other occurrence around
the handler near line 266) to say "failed or cancelled" (e.g., "Usage: /retry
<turn_id>\n\nRetries a failed or cancelled turn without attachments.") so users
know both states are eligible.
In `@src/store.rs`:
- Around line 448-488: retry_request_for_turn is rebuilding retryable turns with
attachments: vec![] which drops any original attachments; modify the function to
query the persisted attachments for the given turn_id (e.g., select attachment
metadata/path from the attachments table where turn_id = ?) and populate
TurnRequest.attachments with that list instead of always using an empty vec, so
the returned TurnRequest preserves the original files when reconstructing a
failed/cancelled turn.
- Around line 57-74: The recovery currently clears all `running` turns and
`busy` sessions on `Store::open()` via `recover_interrupted_turns()` which can
corrupt a live instance; change the logic so recovery only touches state that is
stale (e.g., last_updated/heartbeat older than a configurable cutoff) and do not
clear anything until this process first claims ownership of the DB (implement a
DB-backed instance/lock row or an exclusive SQLite advisory lock in
`Store::open()` to prove this process is sole owner). Update
`recover_interrupted_turns()` to accept/compute a cutoff timestamp and only mark
turns/sessions older than that as recovered, and alter `Store::open()` to
acquire the lock/claim, verify no other live instance (or that other instance is
stale by the same cutoff) before calling `recover_interrupted_turns()`, leaving
`seed_admins()` and `audit()` behavior unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 82b35fa5-6216-4779-9934-a9f013550dc5
📒 Files selected for processing (8)
README.mdREADME.ru.mdsrc/app.rssrc/app/support.rssrc/app/tests.rssrc/app/turns.rssrc/commands.rssrc/store.rs
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (4)
src/store.rs (2)
784-791: 💤 Low valueStale cutoff parsing is sensible; consider a max bound.
unwrap_or(120)with a> 0filter is good. Worth adding a sanity upper bound (e.g., a few hours) to prevent a typo likeTELECODEX_INSTANCE_STALE_SECONDS=8640000from effectively disabling recovery while still passing the> 0check. Optional.♻️ Optional refinement
fn stale_instance_cutoff() -> String { + const MAX_STALE_SECONDS: i64 = 24 * 60 * 60; let seconds = std::env::var("TELECODEX_INSTANCE_STALE_SECONDS") .ok() .and_then(|value| value.parse::<i64>().ok()) - .filter(|value| *value > 0) + .filter(|value| *value > 0 && *value <= MAX_STALE_SECONDS) .unwrap_or(120); (Utc::now() - Duration::seconds(seconds)).to_rfc3339() }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/store.rs` around lines 784 - 791, The stale_instance_cutoff function currently accepts any positive TELECODEX_INSTANCE_STALE_SECONDS value; add a sanity upper bound to avoid extremely large values (e.g., limit to a few hours) by clamping the parsed seconds to a max (e.g., MAX_STALE_SECONDS) before using it to compute the cutoff. Update stale_instance_cutoff to parse the env var as now, then apply a min(parsed_seconds, MAX_STALE_SECONDS) (or similar clamp) so typos like huge numbers cannot disable recovery; reference TELECODEX_INSTANCE_STALE_SECONDS and the stale_instance_cutoff function when implementing the clamp.
700-706: 💤 Low valueDrop-based release is fine but explicitly best-effort.
Dropusesconn.get_mut()which fails if the mutex is poisoned (panicked while holding the lock), andstd::process::exitskips destructors entirely. That's why theRestartBotpath insrc/app.rsexplicitly callsrelease_instance_lock()beforeexit(0)— good belt-and-braces. Consider a brief doc comment here pointing out that callers needing deterministic release must callrelease_instance_lock()explicitly.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/store.rs` around lines 700 - 706, Add a brief doc comment above the Drop impl for Store explaining that Store::drop uses conn.get_mut() and performs a best-effort release via release_instance_lock_for, which can fail if the mutex is poisoned and will be skipped by std::process::exit; callers that need deterministic release must call release_instance_lock() explicitly (see the RestartBot path in RestartBot/Restart code for an example). Ensure the comment names the methods release_instance_lock_for and release_instance_lock and mentions the conn.get_mut() failure mode so future maintainers know to call the explicit release before exiting.src/app/support.rs (2)
200-215: ⚡ Quick win
restart_executablesilently returns a non-existent path.If
current_exe()doesn’t exist and either the path has no(deleted)suffix or the stripped replacement is also missing, this falls through toOk(exe)and returns the original (non-existent) path. The downstreamCommand::new(exe).spawn().context("failed to spawn restarted bot")?will then fail with a generic spawn error, hiding the real cause (binary not found).Fail fast with a clear error so the user sees why the restart can't proceed.
♻️ Suggested fix
fn restart_executable() -> Result<PathBuf> { let exe = std::env::current_exe().context("failed to resolve current executable")?; if exe.exists() { return Ok(exe); } let exe_text = exe.as_os_str().to_string_lossy(); if let Some(stripped) = exe_text.strip_suffix(" (deleted)") { let replacement = PathBuf::from(stripped); if replacement.exists() { return Ok(replacement); } } - Ok(exe) + bail!( + "restart executable not found on disk: {} (original binary may have been removed without a replacement)", + exe.display() + ) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/app/support.rs` around lines 200 - 215, The function restart_executable currently may return a non-existent PathBuf; update restart_executable to fail fast when no existing executable is found: after obtaining exe from std::env::current_exe() (and after the existing checks for the " (deleted)" suffix and the replacement PathBuf), if neither exe nor replacement exists return an Err with a clear context message (e.g., "executable not found for restart" including the attempted path text) instead of falling through to Ok(exe); ensure the error uses the same error type/context mechanism used elsewhere (the Result/anyhow::Context pattern) so callers (e.g., Command::new(exe).spawn()) receive a descriptive failure.
196-198: 💤 Low valueMatch
SYSTEMD_EXEC_PIDagainststd::process::id()to identify directly-managed processes.
SYSTEMD_EXEC_PIDis inherited by all descendant processes from a systemd-managed unit. A bare presence check will prevent restart even when the bot is run as a grandchild of a systemd-managed process (e.g., shell script launched by systemd). Comparing the environment value against the current process ID ensures only directly-managed instances skip restart.♻️ Optional refinement
-fn is_systemd_managed_process() -> bool { - std::env::var_os("SYSTEMD_EXEC_PID").is_some() -} +fn is_systemd_managed_process() -> bool { + let Some(value) = std::env::var_os("SYSTEMD_EXEC_PID") else { + return false; + }; + value + .to_str() + .and_then(|raw| raw.parse::<u32>().ok()) + .map(|pid| pid == std::process::id()) + .unwrap_or(false) +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/app/support.rs` around lines 196 - 198, The current is_systemd_managed_process() only checks presence of SYSTEMD_EXEC_PID; change it to read the SYSTEMD_EXEC_PID env var and compare it to the current process id (std::process::id()) so it returns true only when SYSTEMD_EXEC_PID == current PID; update the function to parse the env value to an integer (u32/usize as appropriate), handle parse errors by treating them as non-matching, and return false when the env var is absent or does not equal the current process id.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/app.rs`:
- Around line 299-309: The heartbeat loop currently logs any error from
heartbeat_instance() and keeps running, which allows split-brain when the DB
lock is lost; modify run_instance_heartbeat_loop so that when
self.shared.store.heartbeat_instance() returns an Err whose message indicates
the lock loss (exact text "Telecodex database instance lock was lost" from
heartbeat_instance), it triggers a graceful shutdown instead of continuing:
e.g., send a shutdown signal via the existing CancellationToken or oneshot in
shared (or call the existing shutdown path) so the main tokio::select! handles
cleanup and exit; for other errors keep logging and continue. Ensure you
reference run_instance_heartbeat_loop and heartbeat_instance and use the shared
shutdown mechanism rather than calling process::exit directly.
---
Nitpick comments:
In `@src/app/support.rs`:
- Around line 200-215: The function restart_executable currently may return a
non-existent PathBuf; update restart_executable to fail fast when no existing
executable is found: after obtaining exe from std::env::current_exe() (and after
the existing checks for the " (deleted)" suffix and the replacement PathBuf), if
neither exe nor replacement exists return an Err with a clear context message
(e.g., "executable not found for restart" including the attempted path text)
instead of falling through to Ok(exe); ensure the error uses the same error
type/context mechanism used elsewhere (the Result/anyhow::Context pattern) so
callers (e.g., Command::new(exe).spawn()) receive a descriptive failure.
- Around line 196-198: The current is_systemd_managed_process() only checks
presence of SYSTEMD_EXEC_PID; change it to read the SYSTEMD_EXEC_PID env var and
compare it to the current process id (std::process::id()) so it returns true
only when SYSTEMD_EXEC_PID == current PID; update the function to parse the env
value to an integer (u32/usize as appropriate), handle parse errors by treating
them as non-matching, and return false when the env var is absent or does not
equal the current process id.
In `@src/store.rs`:
- Around line 784-791: The stale_instance_cutoff function currently accepts any
positive TELECODEX_INSTANCE_STALE_SECONDS value; add a sanity upper bound to
avoid extremely large values (e.g., limit to a few hours) by clamping the parsed
seconds to a max (e.g., MAX_STALE_SECONDS) before using it to compute the
cutoff. Update stale_instance_cutoff to parse the env var as now, then apply a
min(parsed_seconds, MAX_STALE_SECONDS) (or similar clamp) so typos like huge
numbers cannot disable recovery; reference TELECODEX_INSTANCE_STALE_SECONDS and
the stale_instance_cutoff function when implementing the clamp.
- Around line 700-706: Add a brief doc comment above the Drop impl for Store
explaining that Store::drop uses conn.get_mut() and performs a best-effort
release via release_instance_lock_for, which can fail if the mutex is poisoned
and will be skipped by std::process::exit; callers that need deterministic
release must call release_instance_lock() explicitly (see the RestartBot path in
RestartBot/Restart code for an example). Ensure the comment names the methods
release_instance_lock_for and release_instance_lock and mentions the
conn.get_mut() failure mode so future maintainers know to call the explicit
release before exiting.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ce95d40e-9759-42fc-b38e-cd303b076b57
📒 Files selected for processing (5)
src/app.rssrc/app/support.rssrc/app/turns.rssrc/commands.rssrc/store.rs
🚧 Files skipped from review as they are similar to previous changes (2)
- src/commands.rs
- src/app/turns.rs
This comment was marked as outdated.
This comment was marked as outdated.
ChatThere are 3 ways to chat with CodeRabbit:
CodeRabbit commands
Other keywords and placeholders
CodeRabbit configuration file (
|
|
I pushed a fix in commit 030f147, please review it. |
|
Closing this PR for now. I want to do another pass on the implementation and send the related changes together as one combined PR, so the review surface is cleaner. Thanks for the review so far. |
Summary
This PR fixes a failure mode where a Telegram
429 Too Many Requestsresponse could leave a Telecodexsession stuck as
busy.What changed:
runningturns asfailed;sessions.busy.failed;busy;retry_afterbackoff, send a user-facing notice.Retryinline button and/retry <turn_id>command for failed/cancelled turns withoutattachments.
Context
This came from real usage with multiple long-running Telecodex topics. One topic got stuck as
busy:/ statusshowed it as busy, but/stopsaid there was no active turn. The root cause was a Telegram 429while creating the initial placeholder message, before Codex had actually started.
Telegram's Bot FAQ recommends respecting per-chat/group/global limits and the Bot API returns
retry_afterfor 429 responses, so the fix follows that signal instead of retrying immediately.Testing
cargo testcargo clippy --all-targets --all-features -- -D warningscargo build --releaseFollow-up after local rollout testing:
/restart_botwhen the running process points to an old replaced binary inode (current_exe()ending up as a deleted executable after
cargo build --release)./restart_botnow exits cleanly and lets the configured service restart policy bringthe bot back instead of spawning a detached duplicate process.
shutdown signal.