
Conversation


@OhadRubin commented Dec 17, 2025

Summary

(TL;DR: I got maxtext working as a backend with a lot of models and a lot of sharding options, and I'm splitting it into multiple PRs. If we could integrate this one, I could add my maxtext backend, which would allow a lot of flexibility.)
Introduces a clean separation between engine (orchestration) and backend (computation) in TinkerEngine.

  • Engine handles: DB operations, request validation, data extraction, file I/O, orchestration
  • Backend handles: Model state, JAX/Flax computation, gradient accumulation, optimizer updates

New files in backends/

  • backend.py - AbstractBackend interface (sketched below)
  • jax.py - JaxBackend implementation (extracted from engine.py)
  • utils.py - Shared utilities (log_timing, pad, pad_batch)
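
For illustration, here is a minimal sketch of what an AbstractBackend along these lines might look like. Apart from create_model (discussed later in this thread) and the operations named in the test plan, the method names and signatures are assumptions, not the actual tx API:

```python
from abc import ABC, abstractmethod
from typing import Any


class AbstractBackend(ABC):
    """Owns model state and computation; the engine only orchestrates."""

    @abstractmethod
    def create_model(self, model_id: str, config: dict) -> None:
        """Initialize model state (and its optimizer) for a LoRA adapter."""

    @abstractmethod
    def forward_backward(self, batch: Any) -> dict:
        """Run a forward/backward pass over a PreparedModelPassBatch."""

    @abstractmethod
    def optim_step(self, model_id: str) -> None:
        """Apply accumulated gradients via the model's optimizer."""

    @abstractmethod
    def sample(self, batch: Any) -> dict:
        """Generate completions for a PreparedSampleBatch."""

    @abstractmethod
    def save_checkpoint(self, model_id: str, path: str) -> None:
        """Extract checkpoint data for the engine to upload."""

    @abstractmethod
    def load_checkpoint(self, model_id: str, path: str) -> None:
        """Insert checkpoint data downloaded by the engine."""
```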

New types

  • PreparedModelPassBatch - Batch data for forward/backward ops
  • PreparedSampleBatch - Batch data for sampling ops (both types are sketched below)
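
A hypothetical shape for these two types, assuming plain dataclasses; the actual fields in types.py may differ:

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PreparedModelPassBatch:
    model_ids: list[str]      # which adapter each row belongs to
    input_ids: np.ndarray     # [batch, seq] token ids, padded
    targets: np.ndarray       # [batch, seq] labels for the loss
    loss_weights: np.ndarray  # [batch, seq] per-token weights / mask


@dataclass
class PreparedSampleBatch:
    model_ids: list[str]
    prompts: list[list[int]]  # unpadded prompt token ids
    max_tokens: list[int]     # per-request generation budget
    temperatures: list[float]
```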

Test plan

  • Verify forward/backward batch processing produces identical results
  • Verify sampling produces identical results
  • Verify checkpoint save/load works correctly
  • Verify optimizer step applies gradients correctly

This is a purely structural refactor - no functional changes to computation logic. Line-by-line comparison confirms identical behavior.

🤖 Generated with Claude Code

Introduces a clean separation between engine (orchestration) and backend (computation):

**New files in `backends/`:**
- `backend.py`: AbstractBackend interface defining the contract
- `native.py`: NativeBackend implementation (extracted from engine.py)
- `utils.py`: Shared utilities (log_timing, pad, pad_batch)
- `__init__.py`: Module exports

**Engine responsibilities (engine.py):**
- Database operations (futures, checkpoints)
- Request validation (`_filter_valid_requests`)
- Data extraction from requests (`_prepare_model_pass_batch`, `_prepare_sample_batch`)
- File I/O (checkpoint download/upload)
- Orchestration of batch processing

**Backend responsibilities (native.py):**
- Model initialization and state management
- JAX/Flax computation (forward, backward, gradient accumulation; see the sketch after this list)
- Optimizer creation and updates
- Checkpoint data extraction/insertion
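
As a toy illustration of the accumulate-then-step pattern the backend owns (this is not the actual tx code; loss_fn is a stand-in for the real model):

```python
import jax
import jax.numpy as jnp
import optax

optimizer = optax.adamw(1e-4)


def loss_fn(params, batch):
    preds = batch["x"] @ params["w"]  # stand-in for the real forward pass
    return jnp.mean((preds - batch["y"]) ** 2)


@jax.jit
def accumulate(params, grad_acc, batch):
    # Forward/backward for one micro-batch; add into the running grad sum.
    grads = jax.grad(loss_fn)(params, batch)
    return jax.tree_util.tree_map(jnp.add, grad_acc, grads)


@jax.jit
def optim_step(params, opt_state, grad_acc, num_batches):
    # Average the accumulated grads, then apply a single optimizer update.
    grads = jax.tree_util.tree_map(lambda g: g / num_batches, grad_acc)
    updates, opt_state = optimizer.update(grads, opt_state, params)
    return optax.apply_updates(params, updates), opt_state


params = {"w": jnp.zeros((4, 1))}
opt_state = optimizer.init(params)
grad_acc = jax.tree_util.tree_map(jnp.zeros_like, params)
```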

**New types in types.py:**
- `PreparedModelPassBatch`: Batch data for forward/backward ops
- `PreparedSampleBatch`: Batch data for sampling ops

This is a purely structural refactor - no functional changes to computation logic.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@OhadRubin changed the title Refactor TinkerEngine to use backend architecture → [tx] Refactor TinkerEngine to use backend architecture Dec 17, 2025

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-executed refactoring of the TinkerEngine to use a backend architecture. The clean separation between the engine for orchestration and the backend for computation greatly improves the code's structure, clarity, and maintainability. The new AbstractBackend interface is well-defined, and the NativeBackend correctly encapsulates the JAX/Flax computation logic. The changes in TinkerEngine make it much cleaner and easier to follow. I've identified one potential high-severity issue regarding a division by zero and a couple of medium-severity suggestions for improving code clarity.

- Added `metrics` property in `TinkerEngine` for backward compatibility with backend metrics.
- Introduced `configure_adapter` method in `AbstractBackend` to streamline LoRA adapter configuration.
- Updated `NativeBackend` to implement the new `configure_adapter` method, replacing the previous `update_adapter_config` call.

These changes improve the modularity and maintainability of the codebase while ensuring compatibility with existing metrics functionality.
OhadRubin and others added 3 commits December 20, 2025 07:04
- Replace create_optimizer + configure_adapter with register_model
- Move optimizer storage from engine to backend (self.optimizers dict)
- Remove optimizer params from checkpoint methods
- Add min_seq_len param to round_up_seq_len calls

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Refactor backend to manage optimizers internally
added use_eager_sharding and nan-prevention

@OhadRubin commented Dec 20, 2025

@pcmoritz can you take a look? You said you wanted to add pytorch compat at some point. I got another backend in a separate PR with maxtext working with a lot of advanced sharding options (got a 100k sequence length working on a TPU v5p-8 with 30B-A3B).
(piglatin.py works with this, btw; I haven't tested other things yet)

@pcmoritz added the tx label Dec 22, 2025

@pcmoritz TODO: Port FSDP changes back into the native backend

@pcmoritz

/gemini review


@gemini-code-assist bot left a comment


Code Review

This pull request is a significant and well-executed refactoring that introduces a backend architecture to TinkerEngine. The separation of concerns between the engine (orchestration) and the backend (computation) is clear and greatly improves the codebase's structure and maintainability. The new AbstractBackend interface is well-defined, and the NativeBackend correctly encapsulates the JAX/Flax computation logic.

I've identified a critical issue where request validation was inadvertently dropped for batch processing methods, which could lead to crashes. I've also found a medium-severity issue regarding a potential division-by-zero in gradient calculation.

Overall, this is an excellent structural improvement. Once the identified issues are addressed, this will be a great addition to the codebase.

pcmoritz and others added 3 commits December 26, 2025 07:44
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@pcmoritz

/gemini review


@gemini-code-assist bot left a comment


Code Review

This pull request is a significant and well-executed refactoring of TinkerEngine. It introduces a clean separation between the engine (orchestration) and a new backend layer (computation) by creating an AbstractBackend and a NativeBackend implementation. This change greatly improves the modularity and maintainability of the codebase, making it easier to add new backends in the future as mentioned in the PR description. The logic has been correctly moved from the engine to the native backend, and the engine's role is now clearly focused on orchestration tasks like DB operations and request handling. The introduction of PreparedModelPassBatch and PreparedSampleBatch types for communication between the engine and backend is a good design choice. While described as a purely structural refactor, I noticed it also includes a subtle but important bug fix in the loss calculation to prevent division by zero, which is a great improvement. I've added a couple of suggestions for further minor performance and robustness enhancements. Overall, this is excellent work that significantly improves the architecture.

pcmoritz and others added 2 commits December 27, 2025 01:16
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@pcmoritz

/gemini review


@gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant and well-executed structural refactoring by separating the TinkerEngine's orchestration logic from the computation logic, which is now handled by a new backend architecture. The introduction of AbstractBackend and NativeBackend creates a clean separation of concerns, making the system more modular and extensible. The related test and benchmark files have been updated correctly to reflect these architectural changes. My review comments focus on improving the robustness of the new NativeBackend by making a utility function safe against division-by-zero and replacing assert statements with proper ValueError exceptions for input validation. Overall, this is a high-quality refactoring that greatly improves the codebase's structure.


@pcmoritz commented Dec 27, 2025

@OhadRubin Thanks a lot for your PR! I did some refactoring, mostly making the AbstractBackend more canonical by removing the jax dependency and streamlining the methods to line up closely with the tinker specification (in particular, I removed the adapter_index handling from the engine, since different backends might handle it differently). This gives more flexibility to the backend implementation; e.g., we would like to add a backend implemented using skyrl-train as part of the ongoing SkyRL tinkerification effort. If backends want to share this code, we can do so via a utilities file?

Do you want to have a look at the changes and let me know if that works for you (e.g. if you can implement #788 with this API) or if you have any suggestions for improvement? Thanks again a lot for the contribution; this will also help make the jax backend multi-node, which I'm planning to work on soon :)

@OhadRubin

@pcmoritz Looks good! I'll align my draft PR to it.

A few questions:

  • Config standardization: my maxtext draft has maxtext_config_str with some duplicate args (LoRA rank, sharding). Worth thinking about how to unify?

  • register_model → create_model, but unregister_model was removed without LoRA zeroing. Intentional? If not, that's a bug! Related: should the backend have a has_capacity interface and manage its own slots, or should the engine handle this? My use case is a single machine with max_loras=1; I had LRU eviction so I don't need to restart when switching models (see the sketch after this list).

  • For multiple backends on different machines - do you see tx adding a layer between engine and backend to track which node has which model?
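
A minimal sketch of the LRU eviction idea from the second question, assuming the backend tracks its own adapter slots; none of these names exist in the tx codebase:

```python
from collections import OrderedDict


class AdapterSlots:
    """Evict the least-recently-used adapter when all slots are taken."""

    def __init__(self, max_loras: int = 1):
        self.max_loras = max_loras
        self._slots: OrderedDict[str, int] = OrderedDict()  # model_id -> slot

    def acquire(self, model_id: str, load_fn, unload_fn) -> int:
        if model_id in self._slots:
            self._slots.move_to_end(model_id)  # mark as recently used
            return self._slots[model_id]
        if len(self._slots) >= self.max_loras:
            evicted, slot = self._slots.popitem(last=False)  # drop LRU adapter
            unload_fn(evicted)  # e.g. zero out its LoRA weights
        else:
            slot = len(self._slots)
        load_fn(model_id, slot)
        self._slots[model_id] = slot
        return slot
```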


@pcmoritz commented Dec 27, 2025

Thanks for the feedback!

  • For the configuration, let's introduce a --backend flag (a string that selects the backend; maybe "jax" for the current native backend, "maxtext" for the backend you implemented, and "skyrl(-train)" for the SkyRL-train backend) and a --backend-config flag that holds the config for the backend and is up to the backend to interpret. I feel like the best approach would be to standardize on JSON, with each backend defining a pydantic type to parse and validate the config (see the sketch after this list). I can augment the current PR with that and also rename the "native" backend to "jax".

  • For register / create model and cleanup: Cleanup is currently not implemented, but can be done since we have the health check pings from the client; when the client is not active any more, we can destroy the model. In a follow-up PR we can add a delete_model or destroy_model function to the backend that does this. It is better to do this as a follow-up, though, to keep this one mostly a refactoring PR.

  • For multiple models / multiple nodes: What I had in mind is a 1:1 correspondence between engines and base models (e.g. each engine hosts a single base model), and each engine can be multi-node to shard the model. We will need to implement support for connecting the API server to multiple engines (this will require code changes), and there will need to be a way to orchestrate multiple engines, e.g. with K8s or Ray (that one will mostly require some helper scripts and documentation).
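
A minimal sketch of the proposed --backend / --backend-config flow, assuming JSON strings validated by a per-backend pydantic model; the config fields here are illustrative (they echo sample_max_num_sequences and use_eager_sharding from elsewhere in this thread):

```python
import json

from pydantic import BaseModel


class JaxBackendConfig(BaseModel):
    """Config the "jax" backend would define and validate itself."""
    sample_max_num_sequences: int = 0
    use_eager_sharding: bool = False


def load_backend_config(backend: str, raw: str) -> BaseModel:
    # Each backend registers its own pydantic type; the engine stays agnostic.
    registry = {"jax": JaxBackendConfig}
    return registry[backend].model_validate(json.loads(raw))


# e.g. --backend jax --backend-config '{"sample_max_num_sequences": 8}'
config = load_backend_config("jax", '{"sample_max_num_sequences": 8}')
```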

Let me know if you have thoughts about these things, in the meantime I'll go ahead and rename the native backend to jax and introduce the --backend and --backend-config flag :)

@pcmoritz

I pushed a first version of the refactor, it will need a little more refinement though, I'll get to that in a few hours :)

@pcmoritz

/gemini review


@gemini-code-assist bot left a comment


Code Review

This pull request is a well-executed refactoring that introduces a clean separation between the TinkerEngine and a new backend architecture. The introduction of AbstractBackend and the JaxBackend implementation successfully decouples orchestration from computation, which will make the system more modular and extensible. The configuration has also been nicely refactored to move backend-specific settings into a backend_config dictionary.

I have a couple of suggestions for further improvement regarding adapter index allocation and a leaky abstraction in the engine, but overall this is a high-quality structural change.

Comment on lines +306 to +309

```python
# TODO: This leaks the abstraction by accessing backend-specific config.
# We should find a better way to handle this going forward.
if isinstance(self.backend, JaxBackend) and self.backend.config.sample_max_num_sequences > 0:
    batchable = batchable[: self.backend.config.sample_max_num_sequences]
```
@gemini-code-assist bot commented (severity: medium):

As the TODO mentions, this is a leaky abstraction. To resolve this, you could move the batch capping logic into the backend.

Here's a potential approach:

1. Add a method to AbstractBackend to allow backends to cap the batch size:

```python
# In tx/tinker/backends/backend.py
class AbstractBackend(ABC):
    ...
    def cap_sample_batch(self, requests: list) -> list:
        """Cap the number of sample requests in a batch according to backend constraints."""
        return requests
```

2. Override this in JaxBackend:

```python
# In tx/tinker/backends/jax.py
class JaxBackend(AbstractBackend):
    ...
    def cap_sample_batch(self, requests: list) -> list:
        if self.config.sample_max_num_sequences > 0:
            return requests[: self.config.sample_max_num_sequences]
        return requests
```

3. Update TinkerEngine.find_batchable_sample to use this new backend method, removing the isinstance check:

```python
# In tx/tinker/engine.py
...
for op in sample_ops:
    ...
    if ...:
        batchable.append(op)

batchable = self.backend.cap_sample_batch(batchable)

return {str(f.request_id): ... for f in batchable}
```

This would remove the isinstance check and knowledge of backend-specific configuration from the engine.

@pcmoritz replied:

This is not necessarily better, since then this leaks into every single backend. I would anticipate that most non-JaxBackend backends will use #568 anyway. Maybe the right solution going forward will be to remove sampling from the backend, but we will revisit this once we have at least one more backend.

@OhadRubin replied:

@pcmoritz
Yeah, I think it's fine; in my maxtext backend I'm using vllm by default anyway.
That's why I originally called it the native backend, because it's the only one that gets to leak abstractions haha.
P.S. I will have my maxtext backend ported over to the new interface in the next week or so.
Btw, I did have to add LoRA support to the maxtext implementation, so right now it depends on my fork, but I will open a PR to the maxtext repo and see how it goes.

@pcmoritz merged commit 351524c into NovaSky-AI:main Dec 29, 2025
4 checks passed