MFU tracking: `pad_to_multiple_of` path inflates useful-work Σ(Lᵢ²) by the mock-sequence contribution #1561

## Summary

When the THD collator's `pad_to_multiple_of` option is active (used primarily for FP8/FP4 shape alignment), the collator appends a mock pad sequence at the end of the batch and **mutates `cu_seq_lens_q` in place** to describe the appended layout. No `cu_seq_lens_q_padded` key is written in this path, and intentionally so: that key is reserved for TE's per-sequence CP zigzag-divisibility padding semantics. As a result, the `_attn_work_from_batch` helper in `perf_logger` cannot distinguish useful work (real docs) from hardware work (real docs + mock pad) when `pad_to_multiple_of` is used.

In that path, `train/mfu_pct` (useful) and `train/mfu_padded_pct` (hardware) collapse to the same value, and both include the mock sequence's `remainder²` as if it were real work.
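The effect can be reproduced in a few lines. This is a minimal sketch, not the collator's code: `append_mock_pad` and `sum_sq_lengths` are hypothetical stand-ins, plain Python lists replace the tensors a real batch would hold, and `remainder` is assumed to be the number of pad tokens needed to reach the next multiple.

```python
def sum_sq_lengths(cu_seq_lens):
    """Return Σ(Lᵢ²) over the sequences described by cumulative boundaries."""
    return sum((end - start) ** 2 for start, end in zip(cu_seq_lens, cu_seq_lens[1:]))


def append_mock_pad(cu_seq_lens, pad_to_multiple_of):
    """Hypothetical stand-in for the collator's padding step.

    Appends one mock sequence so the total token count becomes a multiple of
    `pad_to_multiple_of`, mutating `cu_seq_lens` in place. Returns the mock
    sequence length (0 if the batch was already aligned).
    """
    total = cu_seq_lens[-1]
    remainder = -total % pad_to_multiple_of  # tokens needed to reach the next multiple
    if remainder:
        cu_seq_lens.append(total + remainder)  # in-place mutation, as described above
    return remainder


cu_seq_lens_q = [0, 5, 12, 21]                # three real docs of length 5, 7, 9
useful = sum_sq_lengths(cu_seq_lens_q)        # real docs only
remainder = append_mock_pad(cu_seq_lens_q, pad_to_multiple_of=8)
reported = sum_sq_lengths(cu_seq_lens_q)      # what a logger reading cu_seq_lens_q now sees

# The reported total exceeds the useful total by exactly remainder².
print(useful, remainder, reported, reported - useful == remainder ** 2)
```

After the mutation, nothing in `cu_seq_lens_q` marks the last entry as mock, so any reader of that key alone sees one extra sequence.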

## Quantitative impact

`remainder < pad_to_multiple_of` by construction. Typical alignment values for FP8/MXFP8/NVFP4 are `{8, 16, 32}`, so the extra Σ(Lᵢ²) contribution is bounded by `pad_to_multiple_of²` ∈ `{64, 256, 1024}`. Real batch totals are on the order of 10⁷–10⁹ (e.g. ESM-2 at `mbs=26 × S=1022`: `26 × 1022² ≈ 2.7e7`), so even in the worst case (`1024 / 2.7e7 ≈ 4e-5`) the inflation is on the order of 10⁻⁵ of the real total, and it shrinks for larger batches or smaller alignment values. This is **below any measurement noise we resolve in practice**. The mock sequence's labels are `-100` (excluded from loss), so it makes no gradient contribution. The small amount of attention compute FA executes on the mock block *is* real hardware work, just not useful work.
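As a quick check of the bound, the sketch below computes the worst-case relative inflation for each alignment value against the ESM-2 example total. It assumes the largest possible mock sequence, `pad_to_multiple_of - 1` tokens, and uses only the figures quoted above; the variable names are illustrative.

```python
# Real useful-work total for the ESM-2 example: 26 sequences of 1022 tokens.
real_total = 26 * 1022 ** 2

# Worst-case relative inflation per alignment value, assuming the mock
# sequence takes its maximum length of (pad_to_multiple_of - 1) tokens.
inflation = {}
for pad_to_multiple_of in (8, 16, 32):
    worst_extra = (pad_to_multiple_of - 1) ** 2
    inflation[pad_to_multiple_of] = worst_extra / real_total

for pad_to_multiple_of, rel in inflation.items():
    print(f"pad_to_multiple_of={pad_to_multiple_of:>2}: relative inflation <= {rel:.2e}")
```

For this batch the largest alignment value gives a ratio of a few times 10⁻⁵, and the ratio falls further as the real total grows toward the upper end of the quoted range.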

## Affected code

All four MFU-tracking recipes share this behavior:

| Recipe | Collator site |
|---|---|
| `bionemo-recipes/recipes/esm2_native_te/` | via enforced copy of `bionemo-recipes/models/esm2/collator.py` (`_pt_pad_to_multiple_of` at line 871; `_pad_batch_to_multiple_of` at line 187) |
| `bionemo-recipes/recipes/llama3_native_te/` | byte-identical copy of the ESM-2 collator (`collator.py`) |
| `bionemo-recipes/recipes/opengenome2_llama_native_te/` | byte-identical copy of the ESM-2 collator (`collator.py`) |
| `bionemo-recipes/recipes/codonfm_native_te/` | inline `CodonTHDCollator` in `dataset.py:243-344`; same semantics plus a defensive chunking path that splits a single over-`max_seq_length` pad into multiple mock sequences (currently unreachable since `remainder < pad_to_multiple_of ≤ 32 ≪ max_seq_length`) |

The logger site that reads `cu_seq_lens_q` (and is therefore affected) is `_attn_work_from_batch` in each recipe's `perf_logger.py`.

## Status

**Known limitation, not fixing for now** per discussion on #1548. Impact is on the order of 10⁻⁵ or smaller, well below measurement resolution, and this mode is only exercised when `pad_to_multiple_of` is explicitly set (FP8/FP4 workflows). Inline `NOTE` comments have been added at the `cu_seq_lens_q` read sites in all four `perf_logger.py` files pointing at this issue.

## If we revisit

The constraint is that `cu_seq_lens_q_padded` **must not** be populated from `_pt_pad_to_multiple_of` — that key is reserved for `pad_sequences_to_be_divisible_by` (TE's in-sequence CP padding, consumed by `pad_thd_sequences_for_cp`). Confirmed by @pstjohn and @jomitchellnv on #1548:

> For "in sequence padding" we would use `cu_seqlens_padded` but I don't think we want that just for padding the remainder of a sequence vector (T in the THD).

Two candidate fixes, both avoiding `cu_seq_lens_q_padded`:

**Option A (preferred):** Collator records `batch["num_real_seqs"]` = number of real sequences before any mock-pad append. The perf_logger slices `cu_seq_lens_q` to the first `num_real_seqs` deltas for useful-work Σ(Lᵢ²); keeps the full `cu_seq_lens_q` for hardware-work Σ(Lᵢ²). Handles codonfm's multi-mock chunking for free.
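A minimal sketch of how the logger side of Option A might look, assuming the collator has already written `num_real_seqs`. The function name `attn_work_option_a` is hypothetical, plain lists stand in for tensors, and the fallback to "all sequences are real" when the key is absent is an assumption rather than settled behavior.

```python
def attn_work_option_a(batch):
    """Return (useful, hardware) Σ(Lᵢ²) using a `num_real_seqs` count.

    Hypothetical sketch: `batch["cu_seq_lens_q"]` is a list of cumulative
    boundaries; `batch["num_real_seqs"]` is the number of real sequences
    recorded before any mock-pad append.
    """
    cu = batch["cu_seq_lens_q"]
    lengths = [end - start for start, end in zip(cu, cu[1:])]
    # Assumed fallback: without the key, treat every sequence as real.
    num_real = batch.get("num_real_seqs", len(lengths))
    useful = sum(length ** 2 for length in lengths[:num_real])
    hardware = sum(length ** 2 for length in lengths)
    return useful, hardware


# Three real docs (5, 7, 9 tokens) followed by one mock sequence of 3 tokens.
single_mock = {"cu_seq_lens_q": [0, 5, 12, 21, 24], "num_real_seqs": 3}
# Same real docs followed by two mock sequences, as in a chunked pad.
multi_mock = {"cu_seq_lens_q": [0, 5, 12, 21, 24, 32], "num_real_seqs": 3}

print(attn_work_option_a(single_mock))
print(attn_work_option_a(multi_mock))
```

Because the slice is taken by count, the same code covers one mock sequence or several without any extra branching.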

**Option B:** Collator stashes the pre-mutation `cu_seq_lens_q` under a new key (e.g. `cu_seq_lens_q_real`). Logger prefers that key when present. Functionally equivalent but duplicates a small tensor per batch.

Option A is marginally simpler (one int scalar) and generalizes cleanly to "N real + M mock" cases.

## References

- PR #1548 — review thread on `bionemo-recipes/recipes/esm2_native_te/README.md:387`
- `_pt_pad_to_multiple_of`: `bionemo-recipes/models/esm2/collator.py:871`
- `_pad_batch_to_multiple_of`: `bionemo-recipes/models/esm2/collator.py:187`
- `CodonTHDCollator.__call__`: `bionemo-recipes/recipes/codonfm_native_te/dataset.py:270-344`
- `_attn_work_from_batch`: each recipe's `perf_logger.py` (ESM-2 at `:114`, llama3 at `:112`, og2 at `:120`, codonfm at `:115`)
