Feature/1.0.0 rc4 patched router by rchshld · Pull Request #2 · joomcode/nativelink

rchshld · 2026-03-20T12:26:13Z

No description provided.


The changes to the EvictionMap that added removal callbacks introduced many memory allocations, async locks, and dynamic dispatch, which severely degraded the EvictionMap's performance. Fix the implementation by using generics to avoid the indirection, and move the callbacks outside of the locks to avoid deadlocks and contention.

Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>
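The two ideas in this commit message (a generic callback instead of dynamic dispatch, and callbacks that run after the lock is released) can be illustrated with a minimal sketch. It is not NativeLink's `EvictingMap`: the type `SketchEvictionMap`, its FIFO eviction policy, and the use of a synchronous `std::sync::Mutex` are all assumptions made for illustration.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

/// Hypothetical, simplified stand-in for an eviction map: a FIFO of keys
/// with a fixed capacity. This is NOT NativeLink's actual `EvictingMap`.
pub struct SketchEvictionMap<K, F: Fn(&K)> {
    state: Mutex<VecDeque<K>>,
    max_items: usize,
    // The callback is a generic parameter, so calls to it are statically
    // dispatched and no `Box<dyn Fn>` is needed.
    on_remove: F,
}

impl<K, F: Fn(&K)> SketchEvictionMap<K, F> {
    pub fn new(max_items: usize, on_remove: F) -> Self {
        Self {
            state: Mutex::new(VecDeque::new()),
            max_items,
            on_remove,
        }
    }

    /// Inserts `key`, evicting the oldest keys while over capacity.
    /// Returns the number of keys that were evicted.
    pub fn insert(&self, key: K) -> usize {
        // Phase 1: mutate the state under the lock and only collect
        // the keys that were evicted.
        let evicted: Vec<K> = {
            let mut state = self.state.lock().unwrap();
            state.push_back(key);
            let mut evicted = Vec::new();
            while state.len() > self.max_items {
                if let Some(old) = state.pop_front() {
                    evicted.push(old);
                }
            }
            evicted
        }; // The lock guard is dropped here.

        // Phase 2: run the callbacks with the lock released, so a slow
        // callback cannot block other users of the map.
        for key in &evicted {
            (self.on_remove)(key);
        }
        evicted.len()
    }

    /// Number of keys currently held.
    pub fn len(&self) -> usize {
        self.state.lock().unwrap().len()
    }
}
```

Collecting the evicted keys first costs one small `Vec`, but it keeps the critical section short and means a callback never runs while the state lock is held.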

* Update dependency hermetic_cc_toolchain to v4
* Fix bazelrc for new hermetic_cc_toolchain

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Co-authored-by: Tom Parker-Shemilt <tom@tracemachina.com>

When execution is complete, there is still a large amount of IO to be done. In the meantime, a new action could be starting. A previous attempt to implement this was quite complex and caused panics. This implementation uses a very simple mechanism that only executes on success and keeps track of which operations have been notified on the scheduler. This massively simplifies things.

Fixes TraceMachina#1903

Co-authored-by: Chris Staite <chris@yourdreamnet.co.uk>

When there are multiple schedulers in high availability mode, workers can end up communicating with the wrong scheduler and state can become confused. Modify the communications so that there is a single bi-directional gRPC stream between the client and the worker. This way they are always mapped one-to-one, even with a load balancer in front.

Co-authored-by: Chris Staite <chris@yourdreamnet.co.uk>


By default there is no timeout for the GCS store connect or read. We've seen a number of nodes lock up because of this when executing in a fresh GKE Pod. Add timeouts and, by creating the client manually, remove reqwest's internal retry mechanism, since we have our own.


We take pains not to attempt to fetch the zero byte digest from the GCS store, so we should avoid uploading it too. The stdout is usually zero bytes, and uploading many of them causes a lot of 429 errors. Add a check for the zero byte digest, and return an error if it is not actually zero bytes.
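A check of this kind might look like the sketch below. It assumes SHA-256 digests (the hex constant is the SHA-256 of empty input); the `Digest` struct and `should_skip_upload` function are hypothetical names for illustration and are not taken from the NativeLink GCS store.

```rust
/// SHA-256 of the empty input, as lowercase hex.
pub const EMPTY_SHA256_HEX: &str =
    "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855";

/// Hypothetical digest type for this sketch: a hex hash plus a size.
pub struct Digest<'a> {
    pub hash_hex: &'a str,
    pub size_bytes: u64,
}

/// Decides whether an upload can be skipped.
///
/// * `Ok(true)`: the digest is the zero byte digest, nothing to upload.
/// * `Ok(false)`: an ordinary digest, upload as usual.
/// * `Err(_)`: the hash is the empty hash, but the declared size or the
///   payload length is not zero, which indicates a bug in the caller.
pub fn should_skip_upload(digest: &Digest<'_>, payload_len: u64) -> Result<bool, String> {
    if !digest.hash_hex.eq_ignore_ascii_case(EMPTY_SHA256_HEX) {
        return Ok(false);
    }
    if digest.size_bytes != 0 || payload_len != 0 {
        return Err(format!(
            "zero byte digest with non-zero data: size_bytes={}, payload_len={}",
            digest.size_bytes, payload_len
        ));
    }
    Ok(true)
}
```

Skipping the request entirely, rather than sending an empty object, is what avoids the rate limiting described above.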


The remove_callbacks in evicting_map has its own Mutex even though a Mutex on State is already required. This doesn't currently cause an issue, because nothing calls add_remove_callback without holding the Mutex on State, but it could cause issues in the future. Remove the unnecessary Mutex.

There are multiple use cases where we don't want a fast-slow store to persist to one of the stores in some direction. For example, worker nodes do not want to store build results on the local filesystem, only in the upstream CAS. Another case is the re-use of a prod action cache in a dev environment, but not vice versa. This PR introduces options for the fast-slow store that default to the existing behaviour but allow each side of the store to be customised to persist only on get operations, only on put operations, or to be read only.

Fixes TraceMachina#1577
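One way to model such per-side options is an enum with a pair of predicates, sketched below. The name `SketchDirection`, its variants, and the method names are assumptions for illustration; the option names actually introduced by the PR are not shown in this commit message.

```rust
/// Hypothetical per-side persistence option for a fast-slow store.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub enum SketchDirection {
    /// Existing behaviour: persist on both put and get operations.
    #[default]
    Both,
    /// Persist only when data is put (updated) through the store.
    Update,
    /// Persist only when data is populated as a side effect of a get.
    Get,
    /// Never persist to this side.
    ReadOnly,
}

impl SketchDirection {
    /// Should a put operation write to this side?
    pub fn persists_on_put(self) -> bool {
        matches!(self, Self::Both | Self::Update)
    }

    /// Should a get operation populate this side?
    pub fn persists_on_get(self) -> bool {
        matches!(self, Self::Both | Self::Get)
    }
}
```

Under these assumptions, the worker example would set the fast (local filesystem) side to `Get`, so downloads are cached locally but build results are only written upstream, and the dev environment example would set the prod-backed side to `ReadOnly`.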


If many files are being created at the same time, the populator may be attempting to write to a DropCloseWriterHalf while the DropCloseReaderHalf is blocked waiting for a file system semaphore. These semaphores are held by the FileSlot in FileSystemStore while it populates the temp file. This can lead to a deadlock in which the readers hold semaphores on the paths that don't have FileSlots, and the paths that do have FileSlots don't have download semaphores. Ensure that the reader has the first chunk of data before taking the FileSlot. If the reader has started, we assume that we have all the permits we need to finish the file, so it should be safe to take the FileSlot semaphore.


…raceMachina#2194)

* Fix Fast slow store Not Found error by returning failed precondition
* Add tests for CAS NotFound to FailedPrecondition conversion

Tests the conditional that converts NotFound errors containing "not found in either fast or slow store" to FailedPrecondition, and verifies other NotFound errors still return InternalError.

Co-authored-by: Marcus <marcuseagan@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
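The conversion described by these tests can be sketched as a small mapping function. This is one reading of the commit message rather than the actual change: the `Code` enum and `map_cas_error_code` are hypothetical, and the branch that maps other NotFound errors to an internal error is inferred only from the sentence about the tests.

```rust
/// Hypothetical subset of gRPC-style status codes for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Code {
    NotFound,
    FailedPrecondition,
    Internal,
}

/// Message fragment quoted in the commit message.
pub const MISSING_IN_BOTH: &str = "not found in either fast or slow store";

/// Maps a store error code to the code reported to the client.
pub fn map_cas_error_code(code: Code, message: &str) -> Code {
    match code {
        // A blob missing from both stores becomes FailedPrecondition.
        Code::NotFound if message.contains(MISSING_IN_BOTH) => Code::FailedPrecondition,
        // Assumption from the test description: any other NotFound is
        // still reported as an internal error.
        Code::NotFound => Code::Internal,
        // Everything else passes through unchanged.
        other => other,
    }
}
```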


commit 14f92b8
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date: Sun Jan 25 12:07:20 2026 +0000

Introduce batchInterval and batchDebounce.

commit 12d839c
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date: Sun Jan 25 12:06:45 2026 +0000

Introduce batch notify and assign actions.

commit 7393a00
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date: Sun Jan 25 12:06:12 2026 +0000

Use RwLock instead of single Mutex in MemoryAwaitedActionDb.

commit 8cfeade
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date: Sun Jan 25 12:05:37 2026 +0000

Fix batch matching by allowing same worker accept multiple jobs.

commit 7cd29c9
Author: Dmitrii Kostyrev <dkostyrev@joom.com>
Date: Fri Jan 23 12:15:40 2026 +0000

Introduce batch worker matching.
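The Mutex-to-RwLock change in commit 7393a00 follows a common pattern: lookups take a shared read lock, and only mutations take the exclusive write lock. The sketch below shows that pattern with `std::sync::RwLock`; `SketchActionDb`, its `u32` state, and the choice of the standard library lock are assumptions, and nothing here is taken from `MemoryAwaitedActionDb` itself.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical in-memory table keyed by operation id.
pub struct SketchActionDb {
    actions: RwLock<HashMap<String, u32>>,
}

impl SketchActionDb {
    pub fn new() -> Self {
        Self {
            actions: RwLock::new(HashMap::new()),
        }
    }

    /// Lookups take the shared lock, so readers on different threads
    /// do not serialise behind each other.
    pub fn get(&self, operation_id: &str) -> Option<u32> {
        self.actions.read().unwrap().get(operation_id).copied()
    }

    /// Mutations take the exclusive lock.
    pub fn set(&self, operation_id: &str, state: u32) {
        self.actions
            .write()
            .unwrap()
            .insert(operation_id.to_string(), state);
    }
}
```

The trade-off is that writers must wait for all readers to finish, so the pattern pays off mainly when reads dominate.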


MarcusSorealheis and others added 30 commits October 15, 2025 15:02

comment legacy Dockerfile test (TraceMachina#1983)

6316b55

Remove folders with bad permissions (TraceMachina#1980)

5e487f3

Make all tests in running_actions_manager_test serial (TraceMachina#1984)

41cdd9c

Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>

Update Rust crate relative-path to v2 (TraceMachina#1985)

997feb4

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

Build toolchain-examples (TraceMachina#1971)

2d08aba

fixed cost docs (TraceMachina#1986)

aab10ee

Replace derivative with derive_more (TraceMachina#1989)

9f39700

Explicitly separate state locks and awaits (TraceMachina#1991)

930b352

Require default-features=false (TraceMachina#1993)

0146c34

Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>

Sweep forgotten client operation IDs (TraceMachina#1965)

9fcf5b1

* Sweep forgotten client operation IDs
* add helpful log

Fix clippy::cast_possible_truncation (TraceMachina#1423)

b050976

Add Rust test to RBE work (TraceMachina#1992)

e01079b

Unify all the service setups with a macro (TraceMachina#1996)

e46b5c7

Pin various dependencies (mostly Docker images) (TraceMachina#1990)

29c3dc4

Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>

Update Swatinem/rust-cache digest to 9416228 (TraceMachina#2004)

15c747e

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

Release NativeLink v0.7.4 (TraceMachina#2005)

0cc8e5d

Buck2 integration test (TraceMachina#1828)

1296a3a

fix: guard shutting down in scheduler while SIGTERM (TraceMachina#2012)

1708859

Co-authored-by: Marcus Eagan <marcuseagan@gmail.com>

fix: scheduler shutdown not guarded (TraceMachina#2015)

552a1cd

Fixes TraceMachina#2012

Release NativeLink v0.7.5 (TraceMachina#2016)

016cd50

chore(deps): update swatinem/rust-cache digest to a84bfdc (TraceMachina#2018)

d5ea603

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

amankrx and others added 28 commits February 28, 2026 10:54

Add json schema (TraceMachina#2193)

d926c47

* feat: add json schema generation for CasConfig via schemars

Co-authored-by: pegasust <pegasucksgg@gmail.com>

Prevent retry loop large uploads (TraceMachina#2195)

2a2ca64

Only display Baggage enduser.id when identity is present (TraceMachina#2197)

86b86e1

Fix Redis to reconnect in Sentinel (Chris Staite) (TraceMachina#2190)

8783134

Release NativeLink v1.0.0-rc3 (TraceMachina#2198)

bdf3f9d

remove free cloud user (TraceMachina#2199)

c7109f6

Handle correctly subscription messages (TraceMachina#2201)

2ea428b

Upgrade curl to 8.5.0-2ubuntu10.8 (TraceMachina#2204)

36a8238

Add debug info to connection manager queues (TraceMachina#2188)

6b6efcf

* Add debug info to connection manager queues
* Don't need all the logging for update_action_result
* Add CAS speed check

empty find_missing_blobs can return immediately (TraceMachina#2217)

dad870a

Release NativeLink v1.0.0-rc4

69db8a6

Introduce custom metrics.

f91e606

# Conflicts:
#   nativelink-scheduler/src/simple_scheduler_state_manager.rs
#   nativelink-store/src/redis_store.rs
#   nativelink-util/BUILD.bazel
# Conflicts:
#   nativelink-store/src/redis_store.rs
#   nativelink-util/BUILD.bazel

Introduce balanced channel in ginepro.

167af83

# Conflicts:
#   Cargo.lock

Fix instance name parsing.

a64e2a0

Introduce execution_completion_behaviour: one_shot_always for workers.

dbef3ff

# Conflicts:
#   nativelink-worker/src/local_worker.rs

Allow parsing execution_completion_behaviour from environment variable.

5052e01

Rewrite worker metrics using OTEL.

69547f4

# Conflicts:
#   nativelink-worker/src/local_worker.rs
#   nativelink-worker/src/running_actions_manager.rs

FastSlowStoreMetrics as OTEL.

1348c3c

# Conflicts:
#   nativelink-store/src/fast_slow_store.rs

Introduce MetricsStore.

45b6e86

Introduce eviction count.

28912a1

Remove action digest and worker id from analytics.

5447630

Make store_operation_duration histogram more narrow.

194b5d0

Introduce store_size metric.

b0a7c23

Inner_store in metrics store should return underlying store.

39b5d05

Existence cache should invalidate if get_part from underlying store fails.

4b9d0cd

Introduce scheduler router

fb72423

rchshld had a problem deploying to production March 20, 2026 12:26 — with GitHub Actions Error

rchshld wants to merge 151 commits into main from feature/1.0.0-rc4-patched-router


12 participants
