Skip to content

Upload session state is not persisted; server restart causes 404 on chunk PUT and orphaned files on disk #116

@dobby-coder

Description

@dobby-coder

Filed as a follow-up to postguard-website#117 — that issue is about the website surfacing the failure better; this one is about the underlying persistence gap on the cryptify side.

What is wrong

Store keeps all upload session state in memory (HashMap<String, Arc<Mutex<FileState>>> in src/store.rs:37). There is no on-disk persistence and no startup hydration. On any cryptify process restart — deploy, panic, OOM, container reschedule — every in-progress upload session is lost.

Two visible symptoms follow:

  1. Chunk PUTs return 404 forever after restart. upload_chunk looks up the uuid via store.get(uuid) and returns Ok(None) (HTTP 404) when the session is missing (src/main.rs:284-287). The client has no way to recover; from the client's point of view this is indistinguishable from a typo'd uuid.

  2. On-disk files leak. upload_init creates data_dir/<uuid> (src/main.rs:101) and stores a 14-day expiry on the in-memory FileState. After a restart, the on-disk file remains but no FileState references it any more, and the purge task in src/store.rs:152-173 only walks the in-memory expirations map. There is no separate disk reaper, so the file sits there until manually cleaned.

Source: src/store.rs · src/main.rs upload_chunk · src/main.rs upload_init

Why it matters

  • For a single-replica deployment (current postguard-ops setup), every cryptify deploy or crash silently kills any user mid-upload. Ruben's report on postguard-website#117 is consistent with a deploy-time eviction.
  • It also blocks resume support (suggested fix Perform encryption in a Web/Service Worker #4 on website#117) and horizontal scaling — a multi-replica deployment would need shared state regardless.

Possible designs

In rough order of effort:

  1. Sidecar JSON per upload. On init, write data_dir/<uuid>.meta.json with the FileState fields. On chunk PUT, update uploaded + cryptify_token in that file (already syncing the disk file anyway). On startup, scan data_dir/ and rebuild the HashMap. Cheapest change, single-replica only.

  2. SQLite file in data_dir. Same shape as Add code from rust backend #1 but transactional and easier to query for the purge task. Still single-replica.

  3. PostgreSQL. Reuses the postguard-business stack. Required if cryptify ever runs more than one replica.

The 14-day expires field already exists on FileState, so any of these can drive the purge of both the metadata and the disk file from a single source of truth.

Out of scope here

  • Client-side retry / better error message — covered by postguard-website#117.
  • Resume support API — depends on this issue being fixed first.

Suggested next step

Confirm the diagnosis in production: pick a 404 chunk PUT from logs and check whether its timestamp lines up with a cryptify restart, and whether data_dir/<that-uuid> still exists on the host. If both are true, scope the fix to design #1 above and I can have a draft PR ready.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingrustPull requests that update Rust code

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions