fix: prevent session message loss in LocalManagedAgent._persist_state#1727
Conversation
Use update_session_metadata (locked read-modify-write) instead of get_session + _save_session, which could overwrite messages written concurrently via add_message on another store instance. Adds regression test mirroring session store race coverage from #1724. Co-authored-by: Mervin Praison <MervinPraison@users.noreply.github.com>
|
@coderabbitai review |
|
/review |
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more β On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
β Actions performedReview triggered.
|
|
Important Review skippedBot user detected. To trigger a single review, invoke the βοΈ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
π WalkthroughWalkthroughThis PR fixes session state persistence in ChangesSession Metadata Persistence Fix
Estimated code review effortπ― 2 (Simple) | β±οΈ ~10 minutes Possibly related PRs
Poem
π₯ Pre-merge checks | β 5β Passed checks (5 passed)
βοΈ Tip: You can configure your own custom pre-merge checks in the settings. β¨ Finishing Touchesπ§ͺ Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@copilot Do a thorough review of this PR. Read ALL existing reviewer comments above from Qodo, Coderabbit, and Gemini first β incorporate their findings. Review areas:
|
Agent-Logs-Url: https://github.com/MervinPraison/PraisonAI/sessions/4d08ce0a-1b7b-4e50-b841-6a08416f2f53 Co-authored-by: MervinPraison <454862+MervinPraison@users.noreply.github.com>
Completed a full review and incorporated all available reviewer signals. Action taken (fd788e9): I fixed a backward-compatibility gap in Findings:
CI note: multiple workflow runs reported |
|
@claude You are the FINAL architecture reviewer. If the branch is under MervinPraison/PraisonAI (not a fork), you are able to make modifications to this branch and push directly. SCOPE: Focus ONLY on Python packages (praisonaiagents, praisonai). Do NOT modify praisonai-rust or praisonai-ts. Read ALL comments above from Gemini, Qodo, CodeRabbit, and Copilot carefully before responding. Phase 1: Review per AGENTS.md
Phase 2: FIX Valid Issues Phase 3: Final Verdict |
|
Claude finished @MervinPraison's task ββ View job Final Architecture Review βPhase 1: Architecture Review per AGENTS.md
Phase 2: Fix Valid Issues
Phase 3: Final Verdict
Comprehensive Architecture Reviewβ Protocol-Driven Design Compliance
β Backward Compatibility Verified
β Performance Impact Assessment
β Multi-Agent Safety & Async Safety
β Previous Reviewer Findings Addressed
β Test Coverage Adequate
FINAL VERDICT: β APPROVEDThis PR successfully fixes a critical race condition in session message persistence while maintaining full architectural compliance and backward compatibility. All reviewer feedback has been addressed, and the implementation follows PraisonAI's protocol-driven design principles. Recommendation: Merge this PR - the fix is minimal, focused, and addresses a real correctness issue without introducing any architectural debt. |
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more β On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
Greptile SummaryThis PR fixes a race condition in
Confidence Score: 3/5The core race-condition fix is correct, but the chosen helper skips None-valued fields, meaning a cleared compute instance ID is not written back to disk and will be incorrectly restored on the next session resume. The switch to update_session_metadata correctly solves the message-loss race. However, that helper's if value is None: continue guard introduces a new defect: when _compute_instance_id is set to None after a shutdown and _persist_state is called again, the old instance ID stays in metadata. _restore_state then re-hydrates the freed ID, leaving the agent pointing at a resource that no longer exists. The test added does not cover this path. src/praisonai/praisonai/integrations/managed_local.py β specifically the interaction between None-valued state fields and the update_session_metadata helper's skip logic. Important Files Changed
Sequence DiagramsequenceDiagram
participant A as LocalManagedAgent
participant S as DefaultSessionStore (reader)
participant D as Disk (session.json)
participant W as Concurrent Writer
Note over A,W: Old (racy) path
A->>S: get_session(id) β stale copy
W->>D: add_message("second") [atomic write]
A->>D: _save_session(stale copy) [overwrites "second"]
Note over A,W: New (fixed) path
A->>S: "update_session_metadata(id, **state)"
S->>D: FileLock acquired
S->>D: _load_session_from_disk β fresh copy (has "second")
S->>D: mutate metadata only
S->>D: atomic write (messages preserved)
S->>D: FileLock released
Reviews (1): Last reviewed commit: "fix: preserve legacy _persist_state fall..." | Re-trigger Greptile |
| if hasattr(store, "update_session_metadata"): | ||
| store.update_session_metadata(self._session_id, **state) |
There was a problem hiding this comment.
Stale
compute_instance_id after instance shutdown
DefaultSessionStore.update_session_metadata silently skips any field whose value is None (see if value is None: continue in store.py:541). This means if a compute instance is freed mid-session β setting self._compute_instance_id = None at line 1043 β the subsequent _persist_state() call won't overwrite the old instance ID that is already in the metadata. On the next _restore_state(), line 547 (if meta.get("compute_instance_id"): self._compute_instance_id = ...) will restore the stale ID, leaving the agent referencing a freed compute resource.
The same applies to compute_provider when self._compute is None. Previously the raw session.metadata.update(state) explicitly stored None, clearing the stale value. The new path no longer does so.
Bug and impact
LocalManagedAgent._persist_state()could silently delete chat messages from the JSON session store when metadata was persisted while another writer (or anotherDefaultSessionStoreinstance) had added messages viaadd_message().This affects managed/local agent runs that call
_persist_state()after each turn while the gateway, TUI, or another process also writes to the same session file.Root cause
The DefaultSessionStore path used the same unsafe pattern fixed in #1724 for other mutators:
get_session()β stale in-memory copymetadata_save_session()β full session overwrite on diskIf a concurrent
add_message()ran between steps 1 and 3, those messages were lost.Fix
Use
update_session_metadata()(locked read-modify-write from disk) instead ofget_session()+_save_session().Validation
test_local_managed_persist_state_preserves_messagesreproducing the two-instance race from session store tests.Summary by CodeRabbit
Bug Fixes
Tests