Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
The changes to the EvictionMap to add removal callbacks introduced a lot of memory allocations, async locks and dynamic dispatch. This trashes the performance of the EvictionMap. Fix the implementation to avoid all of the indirection through generics and move callbacks to outside of the locks to avoid deadlocks and issues with contention. Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
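The idea of invoking removal callbacks only after the lock is released can be sketched as below. This is a minimal illustration, not the actual EvictionMap: `SimpleEvictingMap`, its fields and its methods are hypothetical names, and reading "through generics" as "store the callback as a generic type parameter rather than a boxed trait object" is an assumption.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical, simplified stand-in for an eviction map (not NativeLink's type).
pub struct SimpleEvictingMap<F: Fn(&str) + Send + Sync> {
    state: Mutex<HashMap<String, Vec<u8>>>,
    // Stored as a concrete generic type, so calls are statically dispatched
    // and no Box<dyn Fn> allocation is needed.
    on_remove: F,
}

impl<F: Fn(&str) + Send + Sync> SimpleEvictingMap<F> {
    pub fn new(on_remove: F) -> Self {
        Self {
            state: Mutex::new(HashMap::new()),
            on_remove,
        }
    }

    pub fn insert(&self, key: &str, value: Vec<u8>) {
        self.state.lock().unwrap().insert(key.to_string(), value);
    }

    /// Removes `key` and returns whether it was present.
    pub fn remove(&self, key: &str) -> bool {
        // The guard lives only inside this block, so the lock is released
        // before the callback runs.
        let removed = {
            let mut state = self.state.lock().unwrap();
            state.remove(key).is_some()
        };
        if removed {
            // Called without holding the lock: a slow callback cannot block
            // other users of the map.
            (self.on_remove)(key);
        }
        removed
    }

    pub fn len(&self) -> usize {
        self.state.lock().unwrap().len()
    }
}
```

Because the guard is dropped at the end of the inner block, a callback that takes other locks cannot deadlock against the map's own mutex.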
* Update dependency hermetic_cc_toolchain to v4
* Fix bazelrc for new hermetic_cc_toolchain

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
When execution is complete, there's a large amount of IO still to be done. In the meantime a new action could be starting. A previous attempt to implement this was quite complex and caused panics. This implementation uses a very simple mechanism which only executes on success and keeps track of which operations have been notified on the scheduler. This massively simplifies things. Fixes TraceMachina#1903 Co-authored-by: Chris Staite <chris@yourdreamnet.co.uk>
When there are multiple schedulers in high availability mode then the workers can end up communicating with the wrong scheduler and state can get confused. Modify the communications such that there is a single bi-directional gRPC stream between the client and the worker. This way they will always be one-to-one mapped even with a load balancer in front. Co-authored-by: Chris Staite <chris@yourdreamnet.co.uk>
Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
By default there's no timeout for the GCS store connect or read. We've seen a number of nodes lock up due to this when executing in a fresh GKE Pod. Add timeouts, and create the client manually so that the internal retry mechanism in reqwest is removed, since we have our own.
* Sweep forgotten client operation IDs
* add helpful log
We take pains not to attempt to fetch the zero byte digest from the GCS store, so we should avoid uploading it too. The stdout is usually zero bytes and uploading loads of them causes a lot of 429 errors. Add a check for the zero byte digest, skip the upload for it, and return an error if the data supplied for it is not actually zero bytes.
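A minimal sketch of such a check, under stated assumptions: `Digest`, `UploadDecision` and `plan_upload` are hypothetical names rather than the GCS store's real API, and only the SHA-256 empty-input hash is handled.

```rust
/// Hypothetical digest descriptor: hex hash plus declared size in bytes.
pub struct Digest {
    pub hash: String,
    pub size_bytes: u64,
}

/// SHA-256 of the empty input.
pub const ZERO_SHA256: &str =
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855";

#[derive(Debug, PartialEq, Eq)]
pub enum UploadDecision {
    /// Nothing to send: the zero byte digest is never stored remotely.
    Skip,
    /// Proceed with a normal upload.
    Upload,
}

/// Decides whether `data` for `digest` should be uploaded at all.
pub fn plan_upload(digest: &Digest, data: &[u8]) -> Result<UploadDecision, String> {
    if digest.hash == ZERO_SHA256 {
        // The zero byte digest must describe, and carry, no bytes.
        if digest.size_bytes != 0 || !data.is_empty() {
            return Err(format!(
                "zero byte digest used with {} declared and {} actual bytes",
                digest.size_bytes,
                data.len()
            ));
        }
        return Ok(UploadDecision::Skip);
    }
    Ok(UploadDecision::Upload)
}
```

Skipping the request entirely for the empty blob is what avoids the burst of tiny uploads that the commit message associates with 429 errors.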
Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
The remove_callbacks in evicting_map has its own Mutex even though a Mutex on State is already required to reach it. This doesn't cause an issue currently because nothing calls add_remove_callback without holding the Mutex on State, but it could cause issues in the future. Remove the unnecessary Mutex.
There are multiple use cases where we don't want a fast-slow store to persist to one of the stores in some direction. For example, worker nodes do not want to store build results on the local filesystem, just with the upstream CAS. Another case would be the re-use of the prod action cache in a dev environment, but not vice-versa. This PR introduces options to the fast-slow store which default to the existing behaviour, but allow customisation of each side of the fast-slow store to persist only on get operations, only on put operations, or to be read only. Fixes TraceMachina#1577
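The per-side options described above could be modelled roughly as follows. This is only a sketch: `Direction`, its variant names and `FastSlowDirections` are hypothetical and may not match the configuration names the PR actually introduces.

```rust
/// Hypothetical per-side persistence setting for a fast-slow store.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum Direction {
    /// Persist on both get and put operations (the existing behaviour).
    #[default]
    Both,
    /// Persist only when data is put.
    PutOnly,
    /// Persist only when data is populated as a side effect of a get.
    GetOnly,
    /// Never persist to this side.
    ReadOnly,
}

impl Direction {
    pub fn persists_on_put(self) -> bool {
        matches!(self, Direction::Both | Direction::PutOnly)
    }

    pub fn persists_on_get(self) -> bool {
        matches!(self, Direction::Both | Direction::GetOnly)
    }
}

/// Hypothetical pair of settings, one for each side of the store.
#[derive(Clone, Copy, Debug, Default)]
pub struct FastSlowDirections {
    pub fast: Direction,
    pub slow: Direction,
}
```

Under this reading, the worker example would set the fast (local filesystem) side to `GetOnly` so that uploaded build results go only to the upstream CAS, while the defaults keep both sides at `Both`.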
Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
If a lot of files are being created at the same time, the populator may be attempting to write to a DropCloseWriterHalf while the DropCloseReaderHalf is blocked waiting for a file system semaphore. These semaphores are held by the FileSlot in FileSystemStore while it populates the temp file. This can lead to a deadlock where the readers are holding semaphores on the paths that don't have FileSlots and the ones that do have FileSlots don't have download semaphores. Ensure that the reader has the first chunk of data before taking the FileSlot. If the reader has started then we assume that we have all the permits we need to finish the file, therefore it should be safe to take the FileSlot semaphore.
…na#2018) Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
…raceMachina#2194)

* Fix Fast slow store Not Found error by returning failed precondition
* Add tests for CAS NotFound to FailedPrecondition conversion

  Tests the conditional that converts NotFound errors containing "not found in either fast or slow store" to FailedPrecondition, and verifies other NotFound errors still return InternalError.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Marcus <marcuseagan@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
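The conversion described in that commit message can be sketched as below. `Code` and `convert_not_found` are hypothetical names for illustration and are not the project's real error types; only the matched message text comes from the commit message.

```rust
/// Hypothetical subset of status codes relevant to this conversion.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Code {
    FailedPrecondition,
    Internal,
}

/// Message fragment that marks a blob missing from both sides of the store.
pub const MISSING_IN_BOTH: &str = "not found in either fast or slow store";

/// Maps the message of a NotFound error to the code returned to the caller.
pub fn convert_not_found(message: &str) -> Code {
    if message.contains(MISSING_IN_BOTH) {
        // The blob is gone from both stores, so report a failed precondition.
        Code::FailedPrecondition
    } else {
        // Any other NotFound error is still reported as an internal error.
        Code::Internal
    }
}
```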
* feat: add json schema generation for CasConfig via schemars --------- Co-authored-by: pegasust <pegasucksgg@gmail.com>
* Add debug info to connection manager queues
* Don't need all the logging for update_action_result
* Add CAS speed check
# Conflicts: # nativelink-scheduler/src/simple_scheduler_state_manager.rs # nativelink-store/src/redis_store.rs # nativelink-util/BUILD.bazel # Conflicts: # nativelink-store/src/redis_store.rs # nativelink-util/BUILD.bazel
# Conflicts: # Cargo.lock
# Conflicts: # nativelink-worker/src/local_worker.rs
# Conflicts: # nativelink-worker/src/local_worker.rs # nativelink-worker/src/running_actions_manager.rs
# Conflicts: # nativelink-store/src/fast_slow_store.rs
commit 14f92b8
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date: Sun Jan 25 12:07:20 2026 +0000

    Introduce batchInterval and batchDebounce.

commit 12d839c
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date: Sun Jan 25 12:06:45 2026 +0000

    Introduce batch notify and assign actions.

commit 7393a00
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date: Sun Jan 25 12:06:12 2026 +0000

    Use RWLock instead of a single Mutex in MemoryAwaitedActionDb.

commit 8cfeade
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date: Sun Jan 25 12:05:37 2026 +0000

    Fix batch matching by allowing the same worker to accept multiple jobs.

commit 7cd29c9
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date: Fri Jan 23 12:15:40 2026 +0000

    Introduce batch worker matching.
No description provided.