[tx] DeepseekV3 implementation #889

tanmaysachan · 2026-01-16T08:44:08Z

[WIP]
Addresses #865

Model outline from PyTorch -> JAX
parity checks
Infer/benchmark

gemini-code-assist

Code Review

This pull request introduces the JAX implementation for the DeepseekV3 model. The implementation is comprehensive and covers the model's unique features like Multi-Head Latent Attention and Mixture of Experts with shared experts. The code is well-structured.

My review focuses on a critical bug that will prevent the model from running, along with some suggestions to improve maintainability by reducing code duplication and avoiding magic numbers. Addressing these points will make the implementation more robust and easier to maintain.

gemini-code-assist · 2026-01-16T08:47:39Z

skyrl-tx/tx/models/deepseekv3.py

+        # Precompute RoPE frequencies
+        # qk_rope_head_dim = config.qk_rope_head_dim
+        # original_seq_len = getattr(config, "original_seq_len", config.max_position_embeddings)
+        # rope_factor = getattr(config, "rope_factor", 1.0)
+        # beta_fast = getattr(config, "beta_fast", 32)
+        # beta_slow = getattr(config, "beta_slow", 1)
+
+        # TODO: Swap out like llama's rope?
+        # self.freqs_cis = precompute_freqs_cis(
+        #     dim=qk_rope_head_dim,
+        #     max_seq_len=config.max_position_embeddings,
+        #     rope_theta=config.rope_theta,
+        #     original_seq_len=original_seq_len,
+        #     rope_factor=rope_factor,
+        #     beta_fast=beta_fast,
+        #     beta_slow=beta_slow,
+        # )


This block for precomputing RoPE frequencies is commented out, but self.freqs_cis is used in DeepseekV3Model.__call__ at line 571. This will raise an AttributeError at runtime.

Looking at the DeepseekV3MLA implementation, the freqs_cis parameter is not used. Instead, apply_rope is called, which computes the frequencies on the fly.

To fix this, you should remove the freqs_cis parameter from the entire call chain, as it appears to be unused. This involves:

Removing freqs_cis: jax.Array from the signature of DeepseekV3MLA.__call__.

Removing freqs_cis: jax.Array from the signature of DeepseekV3DecoderLayer.__call__.

Removing the freqs_cis=self.freqs_cis argument from the layer() call within DeepseekV3Model.__call__.

This will resolve the crash and align the code with the current apply_rope implementation. You can then address the TODO about swapping the RoPE implementation in a separate change.
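The call chain after the three removals above could look roughly like the following sketch. Every class and function name here (SketchMLA, SketchDecoderLayer, SketchModel, rope_on_the_fly) is a hypothetical stand-in, not code from the PR. In particular, the real apply_rope signature is not shown in this thread, so the split-half rotation and the theta default are assumptions made only to keep the example runnable.

```python
import jax.numpy as jnp


def rope_on_the_fly(x, positions, theta=10000.0):
    """Rotate x of shape [batch, seq, heads, dim] by position-dependent angles.

    Hypothetical stand-in for the repository's apply_rope; its real signature
    and pairing convention (split-half here) are assumptions.
    """
    half = x.shape[-1] // 2
    # Frequencies are derived here on every call, so nothing is precomputed.
    inv_freq = 1.0 / (theta ** (jnp.arange(half, dtype=jnp.float32) / half))
    angles = positions[..., None].astype(jnp.float32) * inv_freq  # [batch, seq, half]
    cos = jnp.cos(angles)[:, :, None, :]
    sin = jnp.sin(angles)[:, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return jnp.concatenate([x1 * cos - x2 * sin, x2 * cos + x1 * sin], axis=-1)


class SketchMLA:
    def __call__(self, x, positions):  # no freqs_cis parameter
        return rope_on_the_fly(x, positions)


class SketchDecoderLayer:
    def __init__(self):
        self.self_attn = SketchMLA()

    def __call__(self, x, positions):  # no freqs_cis parameter
        return self.self_attn(x, positions)


class SketchModel:
    def __init__(self, num_layers):
        # No self.freqs_cis attribute is created, so none can be missing.
        self.layers = [SketchDecoderLayer() for _ in range(num_layers)]

    def __call__(self, x, positions):
        for layer in self.layers:
            x = layer(x, positions)  # previously also passed freqs_cis=self.freqs_cis
        return x


model = SketchModel(num_layers=2)
x = jnp.ones((1, 4, 2, 8))
positions = jnp.arange(4)[None, :]
out = model(x, positions)
```

Because the rotation depends only on positions, the model never needs to own a frequency table, which is why dropping the parameter is enough to remove the AttributeError.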

gemini-code-assist · 2026-01-16T08:47:39Z

skyrl-tx/tx/models/deepseekv3.py

+        )
+
+        # Bias only for specific model sizes (7168 hidden_size in original)
+        self.use_bias = config.hidden_size == 7168


Hardcoding the magic number 7168 to determine self.use_bias makes the code brittle and less maintainable. If a new model variant is introduced that also requires this bias, this line would need to be updated. A better approach would be to introduce a dedicated boolean flag in the DeepseekV3Config, such as use_router_bias, to control this behavior explicitly.
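A minimal sketch of the explicit-flag approach, using only the standard library. SketchConfig and SketchRouter are hypothetical names for illustration. The flag name use_router_bias is the one proposed in this comment, and the default of False is an assumption.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    hidden_size: int
    # Hypothetical flag; the name use_router_bias comes from the review suggestion.
    use_router_bias: bool = False


class SketchRouter:
    def __init__(self, config):
        # Read the explicit flag rather than comparing hidden_size to 7168.
        # getattr keeps configs that predate the flag working (bias off).
        self.use_bias = getattr(config, "use_router_bias", False)


large = SketchRouter(SketchConfig(hidden_size=7168, use_router_bias=True))
small = SketchRouter(SketchConfig(hidden_size=2048))
```

One trade-off to keep in mind: a config that lacks the flag would default to no bias, so any variant that currently relies on the 7168 check would need the flag set wherever its config is built or loaded.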

gemini-code-assist · 2026-01-16T08:47:39Z

skyrl-tx/tx/models/deepseekv3.py

+        # Bias only for specific model sizes (7168 hidden_size in original)
+        self.use_bias = config.hidden_size == 7168
+        if self.use_bias:
+            from tx.layers.util import Param


This import statement sits inside a conditional block within __init__. PEP 8 recommends placing imports at the top of the file, just after the module docstring, so that a module's dependencies are visible in one place and any import failure surfaces when the module loads rather than when this branch first runs. Unless the local import is deliberately working around a circular dependency, please move from tx.layers.util import Param to the top of the file with the other imports.

gemini-code-assist · 2026-01-16T08:47:39Z

skyrl-tx/tx/models/deepseekv3.py

+class DeepseekV3SharedMLP(nnx.Module):
+    """Always active shared experts."""
+


The DeepseekV3SharedMLP class is nearly identical to DeepseekV3MLP, with the only significant difference being the intermediate_size. This creates code duplication, which can make maintenance harder.

To improve this, consider refactoring them into a single, more generic MLP class (e.g., SwiGLU) that accepts intermediate_size as a parameter in its __init__ method. You could then instantiate this class with config.intermediate_size for the standard MLP and with the calculated shared_inter_dim for the shared MLP part.
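The refactor could be sketched as a single gated MLP whose width is a constructor argument. This is a framework-free illustration in plain JAX, not the PR's code: SwiGLUSketch, the 0.02 initialization scale, and the example sizes are all assumptions, and in the PR the class would presumably subclass nnx.Module and use its linear layers instead of raw weight arrays.

```python
import jax
import jax.numpy as jnp


class SwiGLUSketch:
    """Gated MLP: down(silu(gate(x)) * up(x)), parameterized by its width."""

    def __init__(self, hidden_size, intermediate_size, key):
        k_gate, k_up, k_down = jax.random.split(key, 3)
        scale = 0.02  # arbitrary init scale, chosen only for this example
        self.gate_proj = scale * jax.random.normal(k_gate, (hidden_size, intermediate_size))
        self.up_proj = scale * jax.random.normal(k_up, (hidden_size, intermediate_size))
        self.down_proj = scale * jax.random.normal(k_down, (intermediate_size, hidden_size))

    def __call__(self, x):
        gated = jax.nn.silu(x @ self.gate_proj) * (x @ self.up_proj)
        return gated @ self.down_proj


key_dense, key_shared = jax.random.split(jax.random.PRNGKey(0))

# Same class, two widths: one for the standard MLP, one for the shared experts.
# The numbers stand in for config.intermediate_size and shared_inter_dim.
dense = SwiGLUSketch(hidden_size=16, intermediate_size=64, key=key_dense)
shared = SwiGLUSketch(hidden_size=16, intermediate_size=32, key=key_shared)

x = jnp.ones((2, 16))
dense_out = dense(x)
shared_out = shared(x)
```

With the width passed in, the shared-expert path differs from the standard MLP only at the call site, so a fix to the forward pass lands in one place.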

Initialize the structure

4d31abf

gemini-code-assist bot reviewed Jan 16, 2026

tanmaysachan added 2 commits January 17, 2026 01:19

simplify MLP

6fbf660

adjust for huggingface naming conventions

8a6feac

pcmoritz added the tx label Jan 17, 2026

2 participants
