Feature/1.0.0 rc4 patched router#2

Open
rchshld wants to merge 151 commits into main from feature/1.0.0-rc4-patched-router

rchshld commented Mar 20, 2026

No description provided.

MarcusSorealheis and others added 30 commits October 15, 2025 15:02
)

Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
The changes to the EvictionMap to add removal callbacks introduced a lot of memory allocations, async locks and dynamic dispatch.  This trashes the performance of the EvictionMap.

Fix the implementation to avoid all of the indirection through generics, and move the callbacks outside of the locks to avoid deadlocks and contention.

Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
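
The idea of moving removal callbacks outside the lock can be sketched as follows. This is a minimal illustration, not the project's EvictionMap: `TinyEvictingMap`, its fields, and the use of plain `fn` pointers and a synchronous `std::sync::Mutex` are all assumptions made for the example.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical state guarded by a single lock.
struct State {
    next_seq: u64,
    // key -> (insertion sequence number, value)
    items: HashMap<String, (u64, Vec<u8>)>,
}

/// Hypothetical, heavily simplified evicting map (not the real type).
pub struct TinyEvictingMap {
    max_items: usize,
    state: Mutex<State>,
    // Plain function pointers: no boxing and no dynamic dispatch.
    callbacks: Vec<fn(&str)>,
}

impl TinyEvictingMap {
    pub fn new(max_items: usize, callbacks: Vec<fn(&str)>) -> Self {
        Self {
            max_items,
            state: Mutex::new(State { next_seq: 0, items: HashMap::new() }),
            callbacks,
        }
    }

    pub fn len(&self) -> usize {
        self.state.lock().unwrap().items.len()
    }

    /// Inserts a value and returns the keys that were evicted to make room.
    pub fn insert(&self, key: &str, value: Vec<u8>) -> Vec<String> {
        // Step 1: mutate the map and collect evicted keys while locked.
        let evicted: Vec<String> = {
            let mut state = self.state.lock().unwrap();
            let seq = state.next_seq;
            state.next_seq += 1;
            state.items.insert(key.to_string(), (seq, value));

            let mut evicted = Vec::new();
            while state.items.len() > self.max_items {
                // Find the oldest entry by insertion sequence number.
                let oldest = state
                    .items
                    .iter()
                    .min_by_key(|(_, (seq, _))| *seq)
                    .map(|(k, _)| k.clone());
                match oldest {
                    Some(k) => {
                        state.items.remove(&k);
                        evicted.push(k);
                    }
                    None => break,
                }
            }
            evicted
        }; // The lock is released here.

        // Step 2: run the callbacks with no lock held, so a callback
        // that touches the map again cannot deadlock against us.
        for key in &evicted {
            for callback in &self.callbacks {
                callback(key);
            }
        }
        evicted
    }
}
```

The linear scan for the oldest entry keeps the sketch short; a real cache would track age with a proper LRU structure.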
* Update dependency hermetic_cc_toolchain to v4

* Fix bazelrc for new hermetic_cc_toolchain

---------

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>
When execution is complete, there's a large amount of IO still to be done.  In the meantime, a new action could be starting.

A previous attempt to implement this was quite complex and caused panics.  This implementation uses a very simple mechanism which only executes on success and keeps track of which operations have been notified on the scheduler.  This massively simplifies things.

Fixes TraceMachina#1903

Co-authored-by: Chris Staite <chris@yourdreamnet.co.uk>
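
One way to picture the "notify once, and only on success" bookkeeping described above is a set of operation IDs. The sketch below is an assumption-laden illustration: `ExecutionCompleteTracker` and its methods are hypothetical names and do not come from the change itself.

```rust
use std::collections::HashSet;

/// Hypothetical tracker of operations for which an
/// "execution complete" notification has already been sent.
pub struct ExecutionCompleteTracker {
    notified: HashSet<String>,
}

impl ExecutionCompleteTracker {
    pub fn new() -> Self {
        Self { notified: HashSet::new() }
    }

    /// Returns true only the first time a successful operation is seen,
    /// meaning the caller should notify the scheduler now.
    pub fn should_notify(&mut self, operation_id: &str, succeeded: bool) -> bool {
        if !succeeded {
            // Failures are not notified early in this sketch.
            return false;
        }
        // `insert` returns false if the ID was already present.
        self.notified.insert(operation_id.to_string())
    }

    /// Drops the record once the operation has fully finished.
    pub fn forget(&mut self, operation_id: &str) {
        self.notified.remove(operation_id);
    }
}
```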
When there are multiple schedulers in high availability mode, the workers can end up communicating with the wrong scheduler and state can get confused.

Modify the communications such that there is a single bi-directional gRPC stream between the client and the worker.  This way they will always be one-to-one mapped even with a load balancer in front.

Co-authored-by: Chris Staite <chris@yourdreamnet.co.uk>
Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
By default there's no timeout for the GCS store connect or read.  We've seen a number of nodes lock up due to this when executing in a fresh GKE Pod.

Add timeouts and, by creating the client manually, remove the internal reqwest retry mechanism, since we have our own.
* Sweep forgotten client operation IDs

* add helpful log
We take pains not to attempt to fetch the zero byte digest from the GCS store, so we should avoid uploading it too.  The stdout is usually zero bytes and uploading loads of them causes a lot of 429 errors.

Add a check for the zero-byte digest, and return an error if data with that digest is not zero bytes long.
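
A rough sketch of such a check is shown below. It assumes SHA-256 digests (the constant is the well-known SHA-256 hash of empty input); `plan_upload` and `UploadDecision` are hypothetical names used only for illustration.

```rust
/// SHA-256 of the empty byte string.
pub const EMPTY_SHA256: &str =
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855";

#[derive(Debug, PartialEq, Eq)]
pub enum UploadDecision {
    /// Zero-byte blob: nothing to send to the backing store.
    Skip,
    /// Any other blob: upload as usual.
    Upload,
}

/// Hypothetical pre-upload check: skip the zero-byte digest, and
/// reject a claimed empty digest whose size is not zero.
pub fn plan_upload(hash_hex: &str, size_bytes: u64) -> Result<UploadDecision, String> {
    if hash_hex.eq_ignore_ascii_case(EMPTY_SHA256) {
        if size_bytes != 0 {
            return Err(format!(
                "digest is the zero-byte hash but size is {size_bytes} bytes"
            ));
        }
        return Ok(UploadDecision::Skip);
    }
    Ok(UploadDecision::Upload)
}
```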
Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
The remove_callbacks in evicting_map has a Mutex even though a Mutex on State is already required.  Although this doesn't cause an issue currently, because nothing calls add_remove_callback without a Mutex on State, this could cause issues in the future.

Remove the unnecessary Mutex.
There are multiple use cases where we don't want a fast-slow store to
persist to one of the stores in some direction.  For example, worker
nodes do not want to store build results on the local filesystem, just
with the upstream CAS.  Another case would be the re-use of prod action
cache in a dev environment, but not vice-versa.

This PR introduces options to the fast-slow store which default to the
existing behaviour, but allow customisation of each side of the
fast-slow store to persist only on get operations, only on put
operations, or to be read only.

Fixes TraceMachina#1577
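
The per-side options described above can be modelled as a small enum. This is a sketch under assumptions: the `Direction` type, its variant names, and `put_targets` are illustrative and may not match the names used in the actual configuration.

```rust
/// Hypothetical per-side persistence setting for a fast-slow store.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Direction {
    /// Existing behaviour: persist on both get and put operations.
    Both,
    /// Persist only when data is pulled through on a get.
    Get,
    /// Persist only on put (update) operations.
    Update,
    /// Never write to this side.
    ReadOnly,
}

impl Direction {
    /// Should a put operation write to this side?
    pub fn persists_on_put(self) -> bool {
        matches!(self, Direction::Both | Direction::Update)
    }

    /// Should data fetched from the other side on a get be copied here?
    pub fn persists_on_get(self) -> bool {
        matches!(self, Direction::Both | Direction::Get)
    }
}

/// Lists which sides a put operation would write to.
pub fn put_targets(fast: Direction, slow: Direction) -> Vec<&'static str> {
    let mut targets = Vec::new();
    if fast.persists_on_put() {
        targets.push("fast");
    }
    if slow.persists_on_put() {
        targets.push("slow");
    }
    targets
}
```

In the worker case from the description, setting the fast (local filesystem) side to `Get` and the slow (upstream CAS) side to `Both` would send build results upstream only, while still caching downloaded inputs locally.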
Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
If a lot of files are being created at the same time, the populator may be attempting to write to a DropCloseWriterHalf while the DropCloseReaderHalf is blocked waiting for a file system semaphore.  These are held by the FileSlot in FileSystemStore while it populates the temp file.  This can lead to a deadlock where the readers are holding semaphores on the paths that don't have FileSlots and the ones that do have FileSlots don't have download semaphores.

Ensure that the reader has the first chunk of data before taking the FileSlot.  If the reader starts then we assume that we have all the permits we need to finish the file, therefore it should be safe to take the FileSlot semaphore.
…na#2018)

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
amankrx and others added 28 commits February 28, 2026 10:54
…raceMachina#2194)

* Fix Fast slow store Not Found error by returning failed precondition

* Add tests for CAS NotFound to FailedPrecondition conversion

Tests the conditional that converts NotFound errors containing
"not found in either fast or slow store" to FailedPrecondition,
and verifies other NotFound errors still return InternalError.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
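
The conversion that these tests exercise can be sketched roughly as below. `Code`, `StoreError`, and `client_facing_code` are hypothetical stand-ins for the project's error types, and passing other codes through unchanged is an assumption; the message only states the behaviour for NotFound errors.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Code {
    NotFound,
    FailedPrecondition,
    Internal,
    Unavailable,
}

/// Hypothetical, simplified error type.
pub struct StoreError {
    pub code: Code,
    pub message: String,
}

/// Substring that marks a blob missing from both stores.
pub const MISSING_BLOB_MARKER: &str = "not found in either fast or slow store";

/// Maps a store error to the code reported to the client.
pub fn client_facing_code(err: &StoreError) -> Code {
    match err.code {
        // A blob missing from both stores becomes FailedPrecondition.
        Code::NotFound if err.message.contains(MISSING_BLOB_MARKER) => {
            Code::FailedPrecondition
        }
        // Any other NotFound is still reported as an internal error.
        Code::NotFound => Code::Internal,
        // Assumption: every other code is passed through untouched.
        other => other,
    }
}
```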

---------

Co-authored-by: Marcus <marcuseagan@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* feat: add json schema generation for CasConfig via schemars

---------

Co-authored-by: pegasust <pegasucksgg@gmail.com>
* Add debug info to connection manager queues
* Don't need all the logging for update_action_result
* Add CAS speed check
# Conflicts:
#	nativelink-scheduler/src/simple_scheduler_state_manager.rs
#	nativelink-store/src/redis_store.rs
#	nativelink-util/BUILD.bazel

# Conflicts:
#	nativelink-store/src/redis_store.rs
#	nativelink-util/BUILD.bazel
# Conflicts:
#	Cargo.lock
# Conflicts:
#	nativelink-worker/src/local_worker.rs
# Conflicts:
#	nativelink-worker/src/local_worker.rs
#	nativelink-worker/src/running_actions_manager.rs
# Conflicts:
#	nativelink-store/src/fast_slow_store.rs
commit 14f92b8
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date:   Sun Jan 25 12:07:20 2026 +0000

    Introduce batchInterval and batchDebounce.

commit 12d839c
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date:   Sun Jan 25 12:06:45 2026 +0000

    Introduce batch notify and assign actions.

commit 7393a00
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date:   Sun Jan 25 12:06:12 2026 +0000

    Use RWLock instead of single Mutex in MemoryAwaitedActionDb.

commit 8cfeade
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date:   Sun Jan 25 12:05:37 2026 +0000

    Fix batch matching by allowing the same worker to accept multiple jobs.

commit 7cd29c9
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date:   Fri Jan 23 12:15:40 2026 +0000

    Introduce batch worker matching.