Context Parallelism #67

oliverkinch · 2025-11-13T09:04:19Z

Implements context parallelism (CP) for non-MoE models. CP for MoE models will follow in a separate PR.

Fixes #31.
#38 becomes redundant once this PR is merged.

oliverkinch · 2025-11-13T09:06:47Z

From train.py we have

if parallel_dims.cp_enabled: # the following is necessary for CP w/ flex attention
    from torch.distributed.tensor.experimental._attention import _set_cp_global_var, _DispatchMode, _cp_options

    # set_rotate_method("alltoall")  # alltoall or allgather (only allgather for flex)
    _set_cp_global_var("cp_shard_dim", 2)
    # _cp_options.enable_load_balance = True  # no load balancing for flex
    torch.distributed.tensor.experimental._attention._dispatch_mode = (
        _DispatchMode.TORCH_FUNCTION
    )

_set_cp_global_var is only available in torch 2.9.0, but when I pin that version the code crashes as soon as .backward() is called. Is _set_cp_global_var actually necessary?
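Because `_set_cp_global_var` is a private helper that may be missing from some torch releases, one option is to look it up defensively instead of importing it unconditionally. The sketch below only illustrates that pattern and is not code from this PR: `optional_attr` and `configure_cp_shard_dim` are hypothetical names, and whether skipping the call is safe for CP with flex attention is exactly the open question above.

```python
import importlib


def optional_attr(module_name: str, attr_name: str):
    """Return module_name.attr_name, or None if the module or attribute is missing.

    Hypothetical helper; not part of torch or this repository.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr_name, None)


def configure_cp_shard_dim(shard_dim: int = 2) -> bool:
    """Call the private torch setter only when it exists; report whether it ran."""
    setter = optional_attr(
        "torch.distributed.tensor.experimental._attention", "_set_cp_global_var"
    )
    if setter is None:
        return False  # helper not available in this torch build
    setter("cp_shard_dim", shard_dim)
    return True
```

The boolean return value makes it easy to log which code path was taken when comparing torch versions.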

oliverkinch · 2025-11-13T09:47:48Z

Is there a problem with the FLASH_ATTENTION backend? The call below fails with it but works with MATH.

with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    return F.scaled_dot_product_attention(q, k, v, is_causal=True, scale=scale)

torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors: call_function <built-in function scaled_dot_product_attention>(*(FakeTensor(..., device='cuda:0', size=(1, 16, 4096, 192), dtype=torch.bfloat16,
           grad_fn=<TransposeBackward0>), FakeTensor(..., device='cuda:0', size=(1, 16, 4096, 192), dtype=torch.bfloat16,
           grad_fn=<TransposeBackward0>), FakeTensor(..., device='cuda:0', size=(1, 16, 4096, 128), dtype=torch.bfloat16,
           grad_fn=<TransposeBackward0>)), **{'is_causal': True, 'scale': 0.07216878364870322}): got RuntimeError('No available kernel. Aborting execution.')
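The shapes in the traceback may point at the cause: q and k have head dimension 192 while v has 128. As far as I know, the flash backend of `scaled_dot_product_attention` requires q, k, and v to share the same last dimension, which would leave no eligible kernel once the backend is restricted to FLASH_ATTENTION. Treat that as an assumption to verify against the installed torch version. A minimal pre-flight check on the shapes, where `flash_head_dims_compatible` is a hypothetical helper:

```python
from typing import Sequence


def flash_head_dims_compatible(
    q_shape: Sequence[int], k_shape: Sequence[int], v_shape: Sequence[int]
) -> bool:
    """True when q, k, and v share the same head (last) dimension."""
    return q_shape[-1] == k_shape[-1] == v_shape[-1]


# Shapes copied from the traceback: (batch, heads, seq_len, head_dim).
q = (1, 16, 4096, 192)
k = (1, 16, 4096, 192)
v = (1, 16, 4096, 128)
print(flash_head_dims_compatible(q, k, v))  # False
```

If the mismatch is indeed the reason, passing a list of backends to `sdpa_kernel` (for example FLASH_ATTENTION plus MATH) would let such layers fall back instead of aborting.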

oliverkinch · 2025-11-13T09:59:27Z

@rlrs Context parallelism now runs for gemma and llama

New DCP script for models whose context length has been extended with YaRN

oliverkinch · 2025-12-10T13:12:43Z

I have now also included the changes related to YaRN in this PR, see d36078d

Blue: The base model.
Orange: The base model with its context window extended from 4k to 32k using YaRN, without any additional training.
Green: The same YaRN-extended model, further trained for 1,000 steps on long-context data (wiki_expanded).
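For readers unfamiliar with YaRN, the 4k to 32k extension corresponds to a scale factor of 8. The sketch below follows the "NTK-by-parts" formulation from the YaRN paper and is not necessarily what d36078d implements: the function names, `head_dim=128`, `base=10000`, and the ramp bounds `alpha=1`, `beta=32` are assumptions chosen for illustration.

```python
import math


def yarn_inv_freq(head_dim=128, base=10000.0, scale=8.0,
                  orig_ctx=4096, alpha=1.0, beta=32.0):
    """RoPE inverse frequencies after YaRN-style NTK-by-parts interpolation."""
    inv_freq = []
    for i in range(0, head_dim, 2):
        theta = base ** (-i / head_dim)      # original inverse frequency
        wavelength = 2 * math.pi / theta
        r = orig_ctx / wavelength            # full rotations inside the original context
        # Ramp: 0 means interpolate fully (divide by scale), 1 means leave untouched.
        gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))
        inv_freq.append((1 - gamma) * theta / scale + gamma * theta)
    return inv_freq


def yarn_attention_scale(scale=8.0):
    """Factor 0.1 * ln(s) + 1 that the paper applies to the rotary cos/sin tables."""
    return 0.1 * math.log(scale) + 1.0


freqs = yarn_inv_freq()
print(len(freqs), freqs[0], yarn_attention_scale())
```

High-frequency dimensions keep their original values while the lowest frequencies are divided by the scale factor, which is why the extended model can be usable before any further training and then improves with long-context fine-tuning.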

oliverkinch · 2026-01-28T07:39:30Z

All the commands below run as expected. @rlrs, is there anything else we should test before merging?

Training
python -m torch.distributed.run --standalone --nproc-per-node 2 train.py jobs/gemma-cp2
python -m torch.distributed.run --standalone --nproc-per-node 2 train.py jobs/llama-cp2

Comparing a Maester DCP checkpoint against a Hugging Face model

python compare_models.py \
  --job-config jobs/munin-32k/config.json \
  --checkpoint-dir jobs/munin-32k/checkpoints/step-1000 \
  --hf-model oliverkinch/munin-32k-step-1000 \
  --num-prompts 0 \
  --dataset data/wiki-expanded-hf \
  --dataset-samples 4 \
  --dataset-max-length 512

This produces output such as:

PROMPT 0: '# Frankrig\n\nFrankrig (fransk: "France"), officielt Den Franske Republik (fransk:'
Tokenized length: 512
Logit max abs diff:  7.428315e+00
Logit mean abs diff: 1.034052e-01
HF loss:       0.775544 (ppl=2.172)
Maester loss:  0.771280 (ppl=2.163)

Are these differences acceptable?
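To help interpret these numbers, here is a sketch of how such metrics are typically computed. It assumes, without having checked compare_models.py, that the loss is the mean per-token negative log-likelihood in nats and that perplexity is its exponential; the reported values are consistent with that. `logit_diff_stats` and `perplexity` are hypothetical helper names.

```python
import math


def logit_diff_stats(a, b):
    """Max and mean absolute difference between two equal-length logit lists."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return max(diffs), sum(diffs) / len(diffs)


def perplexity(mean_nll):
    """Perplexity as exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(mean_nll)


print(f"{perplexity(0.775544):.3f}")  # 2.172 (HF)
print(f"{perplexity(0.771280):.3f}")  # 2.163 (Maester)

# Relative loss gap between the two implementations.
print(f"{(0.775544 - 0.771280) / 0.775544:.2%}")  # 0.55%
```

The max absolute difference is driven by a single logit, whereas the mean and the loss summarize all positions, so the loss gap of roughly 0.55% is the more stable of the two signals.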

YaRN conversion

python -u scripts/convert/llama/from_dcp_yarn.py \
  jobs/munin-32k/checkpoints \
  /tmp/munin-open-7b-pt-export \
  --name step-1000 \
  --base danish-foundation-models/munin-open-7b-pt

CP (9a57ae2)

oliverkinch added 2 commits November 13, 2025 10:51

Unique name (9d50ff9)

Compatibility with current train code (6485e46)

oliverkinch marked this pull request as ready for review November 13, 2025 09:59

oliverkinch requested a review from rlrs November 13, 2025 09:59

YaRN implementation (d36078d)

New DCP script for models whose context length has been extended with YaRN

Fix docstring (0794d7c)
