
Conversation

@achsanalfitra

Pull Request to Issue #427

Key info:

  • CAS-based registry
  • Hashed using MD5 (which is already an indirect dependency via sqlx)
  • Eviction on get and contains_key (similar to the memory cache)

Note:

  • Uses DateTime to parse the serialized expiry into Timeout (this could be improved by implementing From for Timeout)
  • No retry policy implemented
  • No entry cap implemented

@github-actions github-actions bot added the C-lib Crate: cot (main library crate) label Jan 6, 2026
};

if expiry.is_expired(None) {
return Ok(false);
Contributor

should we not remove the file here (early on), right when we find out the file has expired, making this a single source of truth?

Author

Yes, that would be more efficient. Initially, I was aiming to match the eviction policy where only read() and contains_key() may delete. I guess it would be better to eagerly remove the file once it has expired.
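The eager-eviction idea discussed here could look roughly like this — a minimal std-only sketch, where `get_if_fresh` and its signature are hypothetical illustrations, not the PR's actual API:

```rust
use std::fs;
use std::path::Path;
use std::time::SystemTime;

// Sketch: if the stored expiry is already in the past, delete the cache
// file immediately and report a miss, making the expiry check the single
// source of truth for removal.
fn get_if_fresh(path: &Path, expiry: SystemTime) -> std::io::Result<Option<Vec<u8>>> {
    if expiry <= SystemTime::now() {
        // Eagerly remove the expired entry; ignore "already gone" races.
        let _ = fs::remove_file(path);
        return Ok(None);
    }
    Ok(Some(fs::read(path)?))
}
```

The deletion result is deliberately ignored, since a concurrent reader may have already evicted the same entry.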

let mut buffer = Vec::new();

// advances cursor by the expiry header offset
file.seek(SeekFrom::Start(8))
Contributor

We seem to seek here again after doing so when we parse the expiry. Does that not make this seek redundant?

Author

Thinking about it again, I agree

Member

I don't agree. Firstly, the parse_expiry function resets the cursor back to the beginning (as it should), so this is needed. Secondly, this should stay here for correctness purposes; we can't assume the cursor on the file will always be at the beginning.

}

#[cfg(test)]
mod tests {
Contributor

we should also update

pub(super) fn fn_to_cache_test(test_fn: &ItemFn) -> TokenStream {
as well as
pub struct TestCache {
with a File variant to set up integration tests for the file store

@ElijahAhianyo
Contributor

For the failing nightly tests, I suspect it might be broken upstream. I have that pinned in my PR (

_RUST_NIGHTLY: &rust_nightly nightly-2025-11-11

), which should make the CI deterministic; that should get merged soon.

Member

@m4tx m4tx left a comment

Thank you for your patience waiting for my review! This looks very good already, but there are a few things I'd like to see clarified before we merge.

use std::path::Path;

use chrono::{DateTime, Utc};
use md5::{Digest, Md5};
Member

Maybe it's worth using sha2 instead? I know md5 is already in our indirect dependencies - but sha2 is a direct dependency (used in auth), and probably less prone to collisions. After all, it doesn't matter that much, but it might be worth sticking to one algorithm as broadly as possible.

Author

Got it!

Member

Actually, for caching purposes we don't need a cryptographically secure hashing algorithm; we should focus on a fast one instead. I'd suggest using BLAKE3 for this case - it has an official Rust implementation. If we trust their benchmarks, it's much faster than SHA-2.

Member

@m4tx m4tx Jan 14, 2026

@seqre only if we (or any of our dependencies) use it already. I don't see much value in having the fastest function for hashing the cache keys (which will typically be small).

Member

After looking at this again, I think I agree we should use blake3 for this. Currently we use:

  • md5 for MySQL and PostgreSQL (both behind feature flags)
  • blake2 for hashing passwords (not behind feature flags, but maybe should be)
  • sha2 for hashing static file contents (not behind feature flag)

I think it's okay to use blake3 here, and we can update the static file content hashing to use blake3, too, to avoid extra dependencies.

Author

@m4tx sha2 and blake2 are smaller than blake3, though.

The upside is that blake3 offers more features and is more performant than the rest (as they claim). I agree that this can be beneficial long term if we reuse the dependency in other places, such as the static file content hashing.

Member

I think it's fine to use blake3 even if it's bigger than blake2. With much better performance it'll be a big benefit for the static files. At the end of the day, changing the hashing algorithm when needed is trivial.

key: &str,
) -> CacheStoreResult<Option<(tokio::fs::File, std::path::PathBuf)>> {
let key_hash = FileStore::create_key_hash(key);
let path = self.dir_path.join(&key_hash);
Member

What do we do in case there's a collision when hashing?

Author

Currently, there's no collision resolution. My ideas are to embed the real key name in the file as a header (this is simpler, but incurs an extra syscall only when the stored name doesn't match), or to implement a jump table and sync it to a file whenever a new hash is pushed.

Member

I don't think we really need collision discovery unless it'd be simple to implement and fast at runtime. If we switch to another algorithm, the collision resistance is pretty high - for BLAKE3, for example, it's 2^128. At that level of resistance, finding any collision is a feat, and crafting a specific collision for an attack seems almost impossible.

let data = serde_json::to_string(&value)
.map_err(|e| FileCacheStoreError::Serialize(Box::new(e)))?;

let mut buffer: Vec<u8> = Vec::with_capacity(8 + data.len());
Member

Let's extract the magic number to a named constant to clarify what it means.

And since we essentially create a custom binary format, it might be worth documenting this at the module level.
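A minimal sketch of what the suggestion might look like — the constant name and the exact layout are assumptions (using a plain little-endian u64 header rather than the PR's actual serialization):

```rust
// Assumed on-disk entry layout, documented at module level:
//
//   bytes 0..8   expiry as a little-endian u64 Unix timestamp
//   bytes 8..    serialized JSON payload
const EXPIRY_HEADER_LEN: usize = 8;

// Build a cache entry: fixed-size expiry header followed by the payload.
fn encode_entry(expiry_unix_secs: u64, payload: &[u8]) -> Vec<u8> {
    let mut buffer = Vec::with_capacity(EXPIRY_HEADER_LEN + payload.len());
    buffer.extend_from_slice(&expiry_unix_secs.to_le_bytes());
    buffer.extend_from_slice(payload);
    buffer
}

// Split an entry back into (expiry, payload); None if it's too short.
fn decode_entry(buffer: &[u8]) -> Option<(u64, &[u8])> {
    let header: [u8; EXPIRY_HEADER_LEN] =
        buffer.get(..EXPIRY_HEADER_LEN)?.try_into().ok()?;
    Some((u64::from_le_bytes(header), &buffer[EXPIRY_HEADER_LEN..]))
}
```

With the constant named, call sites like `Vec::with_capacity(8 + data.len())` and `seek(SeekFrom::Start(8))` can both reference `EXPIRY_HEADER_LEN` instead of a bare `8`.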

Author

Got it!

if let Ok(meta) = entry.metadata().await
&& meta.is_file()
{
total_size += meta.len();
Member

This will return the number of bytes; not the number of entries, no?

Author

Yes, this would return the total bytes. I was under the impression that approx_size depends on the cache type to track its quantity unit. If this is changed to the number of entries, should this byte-aggregation function be kept around for future use (maybe for monitoring)?

Member

I wouldn't say so, until we add functionality for that for all cache stores, it would be just dead code. We can always recreate it!

Member

@seqre seqre left a comment

Thank you for your contribution, it's a great start! There are some things that need changing before we merge it, though.

//! let key = "example_key".to_string();
//! let value = serde_json::json!({"data": "example_value"});
//!
//! store.insert(key.clone(), value.clone(), Default::default()).await.unwrap();
Member

nit: I'd change it to Timeout::default() so that people reading the docs know what the 3rd argument is.

use crate::config::Timeout;
use crate::error::error_impl::impl_into_cot_error;

const ERROR_PREFIX: &str = "file based cache store error:";
Member

nit:

Suggested change
const ERROR_PREFIX: &str = "file based cache store error:";
const ERROR_PREFIX: &str = "file-based cache store error:";

Comment on lines 56 to 64
#[error("{ERROR_PREFIX} file dir creation error: {0}")]
DirCreation(Box<dyn std::error::Error + Send + Sync>),

/// An error occured during temp file creation
#[error("{ERROR_PREFIX} file temp file creation error: {0}")]
TempFileCreation(Box<dyn std::error::Error + Send + Sync>),

/// An error occured during write/stream file
#[error("{ERROR_PREFIX} file io error: {0}")]
Member

nit: "file" is already included in ERROR_PREFIX

Suggested change
#[error("{ERROR_PREFIX} file dir creation error: {0}")]
DirCreation(Box<dyn std::error::Error + Send + Sync>),
/// An error occured during temp file creation
#[error("{ERROR_PREFIX} file temp file creation error: {0}")]
TempFileCreation(Box<dyn std::error::Error + Send + Sync>),
/// An error occured during write/stream file
#[error("{ERROR_PREFIX} file io error: {0}")]
#[error("{ERROR_PREFIX} dir creation error: {0}")]
DirCreation(Box<dyn std::error::Error + Send + Sync>),
/// An error occured during temp file creation
#[error("{ERROR_PREFIX} temp file creation error: {0}")]
TempFileCreation(Box<dyn std::error::Error + Send + Sync>),
/// An error occured during write/stream file
#[error("{ERROR_PREFIX} io error: {0}")]

Comment on lines 137 to 142
store.create_dir_sync_root()?;

Ok(store)
}

fn create_dir_sync_root(&self) -> CacheStoreResult<()> {
Member

nit: I think sync should be a suffix to show it's another version of an existing async function

Suggested change
store.create_dir_sync_root()?;
Ok(store)
}
fn create_dir_sync_root(&self) -> CacheStoreResult<()> {
store.create_dir_root_sync()?;
Ok(store)
}
fn create_dir_root_sync(&self) -> CacheStoreResult<()> {

&self,
file: &mut tokio::fs::File,
) -> CacheStoreResult<Option<Value>> {
if !self.parse_expiry(file).await? {
Member

I'm not fully convinced parse_expiry is the best name; I had to spend a few seconds parsing this line mentally. It's what the function does technically, but logically it checks whether the file has expired. I wonder if it would be better to rename it to check_expiry, is_expired, or similar.

Ok((temp_file, temp_path))
}

async fn file_open(
Member

nit: maybe open_file_for_reading instead? It seems more aligned with what it does, or at least says more about what it specifically does.

}

async fn contains_key(&self, key: &str) -> CacheStoreResult<bool> {
let Ok(Some(mut file_tuple)) = self.file_open(key).await else {
Member

Please destructure the tuple here with pattern matching into its specific parts, so you don't have to use .0 and .1 below.
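The destructuring being asked for can be sketched like this — `file_open_stub` is a hypothetical std-only stand-in for the real `file_open`, used just to show the pattern:

```rust
use std::path::{Path, PathBuf};

// Hypothetical stand-in mimicking file_open's Result<Option<tuple>> shape.
fn file_open_stub(key: &str) -> Result<Option<(Vec<u8>, PathBuf)>, ()> {
    if key.is_empty() {
        return Ok(None);
    }
    Ok(Some((key.as_bytes().to_vec(), PathBuf::from(key))))
}

fn contains_key(key: &str) -> bool {
    // Destructure in the let-else binding: names `file` and `path`
    // replace the opaque `.0` / `.1` accesses in the body.
    let Ok(Some((file, path))) = file_open_stub(key) else {
        return false;
    };
    !file.is_empty() && path == Path::new(key)
}
```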

@github-actions github-actions bot added C-cli Crate: cot-cli (issues and Pull Requests related to Cot CLI) A-ci Area: CI (Continuous Integration) labels Jan 17, 2026
@achsanalfitra
Author

Hi everyone, thank you for the feedback. And... sorry for the wait!

I think that is all, excluding

  • Integration tests (considering that it'd be too complicated to add them here now?)
  • Reimplementation of the hashing/collision-handling mechanism

@achsanalfitra achsanalfitra requested review from m4tx and seqre January 17, 2026 12:54
Member

@m4tx m4tx left a comment

This looks good - however, I've identified one major problem: I think we need to lock the files before writing to avoid data races and broken cache entries. Please have a look at my comment for more details.

In addition to that, please ensure the change can be cleanly merged into the master branch - at the moment it seems to make some changes in unrelated files (looks like 2721b6a hasn't properly made it into your branch?).

I think after these are fixed, this will be ready to merge!

#[non_exhaustive]
pub enum FileCacheStoreError {
/// An error occured during directory creation
#[error("{ERROR_PREFIX} dir creation error: {0}")]
Member

Suggested change
#[error("{ERROR_PREFIX} dir creation error: {0}")]
#[error("{ERROR_PREFIX} directory creation error: {0}")]

nitpick: clarity

DirCreation(Box<dyn std::error::Error + Send + Sync>),

/// An error occured during temp file creation
#[error("{ERROR_PREFIX} temp file creation error: {0}")]
Member

Suggested change
#[error("{ERROR_PREFIX} temp file creation error: {0}")]
#[error("{ERROR_PREFIX} temporary file creation error: {0}")]

nitpick

TempFileCreation(Box<dyn std::error::Error + Send + Sync>),

/// An error occured during write/stream file
#[error("{ERROR_PREFIX} io error: {0}")]
Member

Suggested change
#[error("{ERROR_PREFIX} io error: {0}")]
#[error("{ERROR_PREFIX} I/O error: {0}")]

nitpick

file.sync_data()
.await
.map_err(|e| FileCacheStoreError::Io(Box::new(e)))?;
tokio::fs::rename(file_path, self.dir_path.join(&key_hash))
Member

Should we close the file before renaming?

Author

I think fs currently has a no-op flush() on Windows and Unix. And since we use write_all(), which will not return mid-flight, this is safe to call because all write operations complete before rename(). Are there other cases we're missing here?

&self,
key_hash: &str,
) -> CacheStoreResult<(tokio::fs::File, std::path::PathBuf)> {
let temp_path = self.dir_path.join(format!("{key_hash}.{TEMPFILE_SUFFIX}"));
Member

One thing I'm worried about is that this can lead to data races. Since we are using a multithreaded runtime (but this would also apply to a case where we would have multiple workers set up with the same cache directory), it's possible that we'll have two competing threads trying to write to the same file, resulting in a garbled output.

I see that Django seems to be locking the files exclusively before attempting to write to them - perhaps we should implement a similar mechanism using something like fs4?

Author

I think if we lock the file, there's still a TOCTOU race condition on rename(): if the lock is released before the rename completes, another thread can start loading new data.

What about retrying on create_new() instead, since that method returns an error if the file already exists? I'll test this approach right away.
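The create_new() idea can be sketched with std alone — the function name and error handling here are hypothetical illustrations, not the PR's code:

```rust
use std::fs::OpenOptions;
use std::io::{ErrorKind, Write};
use std::path::Path;

// Sketch: create the temp file with create_new(true), which fails with
// AlreadyExists if another writer currently owns that temp path. The
// caller can treat Ok(false) as "someone else is writing" and retry
// or skip.
fn try_write_temp(temp_path: &Path, data: &[u8]) -> std::io::Result<bool> {
    match OpenOptions::new().write(true).create_new(true).open(temp_path) {
        Ok(mut file) => {
            file.write_all(data)?;
            Ok(true) // we won the race and wrote the entry
        }
        Err(e) if e.kind() == ErrorKind::AlreadyExists => Ok(false),
        Err(e) => Err(e),
    }
}
```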

Author

@achsanalfitra achsanalfitra Jan 20, 2026

I've tested the create_new() approach. It works, but with a catch:

There may be orphaned .tmp files left behind by a crash mid-write. Therefore, create_dir_root_sync() was modified to clean all the leftovers out of the directory.

The only problem arises if separate processes (not threads) write cache entries into the same cache directory, since cleanup on sync() would delete an ongoing write from another process.

Edit: on second thought, I'll try the temp-file locking approach

Author

Update: the locking approach on the temp file works, but it can lead to race conditions on rename(). That's manageable by making the file cache best-effort, which would mirror Django's approach.

However, using this approach, the store initialization/cleanup won't corrupt mid-flight writes.

Which one should we implement?

Note: try_lock_shared() for cleaning up orphaned .tmp files requires MSRV 1.89 to be stable. Alternatively, we can use try_lock_exclusive(), which doesn't corrupt the write but sometimes removes the data pre-rename if it grabs the file first, effectively making the entry disappear.

@achsanalfitra
Author

I've tested 0adfa9e on Windows and it seems to work properly. The tests also pass under Miri (Ubuntu build).

The problem was that I didn't check the boolean result of try_lock_exclusive(), which on Windows returns Ok(false) if the lock is held by another thread. Also fixed an ACCESS_DENIED issue when renaming a locked file.

Hopefully, this will work now.

Member

@m4tx m4tx left a comment

Looks good overall. I'll need to take a longer look (hopefully tomorrow), but I'm not expecting to find anything major that needs to be changed.

Regarding the CI license check - you're free to add the BSD-2-Clause license to the cargo-deny config.

Just one question about the implementation: my understanding is that we could just as well skip using temporary files, and achieve the same thing with lock_shared/lock_exclusive. Temporary files just give us potentially better performance, because reading will never lock. Is that correct? (just to clarify, the current approach is fine; I just want to make sure it's clear it's not the only valid one)

email_address.workspace = true
fake = { workspace = true, optional = true, features = ["derive", "chrono"] }
form_urlencoded.workspace = true
fs4.workspace = true
Member

Since cache is an optional feature, this could be an optional dependency, too.
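One way the suggestion might look in the crate's Cargo.toml — a sketch only; the exact feature name and wiring are assumptions:

```toml
# Make fs4 optional and enable it only when the (assumed) cache feature is on.
[dependencies]
fs4 = { workspace = true, optional = true }

[features]
cache = ["dep:fs4"]
```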

async-trait.workspace = true
axum = { workspace = true, features = ["http1", "tokio"] }
backtrace.workspace = true
blake3.workspace = true
Member

Technically this should also be an optional dependency, but since we plan to switch static files to use blake3 instead of sha2, it's fine to leave this as-is.

//! provide a simple asynchronous interface for putting, getting, and managing
//! cached values, optionally with expiration policies.
pub mod file;
Member

@m4tx m4tx Jan 23, 2026

For a full implementation, please take a look at config.rs to see whether it also needs to be modified, so that users can easily switch between cache implementations in the config TOML file. This will also require a change here:

_ => {

@codecov

codecov bot commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 84.58150% with 70 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
cot/src/cache/store/file.rs | 84.58% | 37 Missing and 33 partials ⚠️

Flag | Coverage Δ
rust | 90.24% <84.58%> (-0.14%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines | Coverage Δ
cot/src/cache/store.rs | 100.00% <ø> (ø)
cot/src/cache/store/file.rs | 84.58% <84.58%> (ø)

... and 1 file with indirect coverage changes


@achsanalfitra
Author

Yes, I agree. We could use lock_shared/lock_exclusive on a single file and skip the swap entirely. However, in a multithreaded runtime, lock_exclusive can lead to deadlocks, which pushes the design toward try_lock_exclusive with retries. At that point, we are effectively implementing a retry-based contention loop.

With the swap-based approach, readers never need to acquire a lock, and contention is naturally pushed onto writers instead. I think this approach potentially gives us better performance and efficiency (due to minimal locking during reads) when the caching workload is read-heavy.
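The swap-based write described above can be sketched with std alone — names are hypothetical, and the real implementation also handles locking and fsync:

```rust
use std::fs;
use std::path::Path;

// Sketch: write the full entry to a temp file, then atomically rename it
// over the final path. Readers only ever open the final path, so they
// never observe a partial write and never need a lock.
fn write_entry_swap(dir: &Path, key_hash: &str, data: &[u8]) -> std::io::Result<()> {
    let temp_path = dir.join(format!("{key_hash}.tmp"));
    let final_path = dir.join(key_hash);
    fs::write(&temp_path, data)?;
    // Within one filesystem, rename replaces the target in a single step.
    fs::rename(&temp_path, &final_path)
}
```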

