
MFU tracking: pad_to_multiple_of path inflates useful-work Σ(Lᵢ²) by the mock-sequence contribution #1561

@gagank1

Summary

When the THD collator's pad_to_multiple_of option is active (used primarily for FP8/FP4 shape alignment), the collator appends a mock pad sequence at the end of the batch and mutates cu_seq_lens_q in place to describe the appended layout. No cu_seq_lens_q_padded key is written in this path, and intentionally cannot be, since that key is reserved for TE's per-sequence padding for CP zigzag divisibility. As a result, the perf_logger's _attn_work_from_batch helper cannot distinguish useful work (real docs) from hardware work (real docs + mock pad) when pad_to_multiple_of is used.

In that path, train/mfu_pct (useful) and train/mfu_padded_pct (hardware) collapse to the same value, and both include the mock sequence's remainder² as if it were real work.
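To make the collapse concrete, here is a minimal sketch of how Σ(Lᵢ²) behaves once a mock sequence is appended. It uses plain Python lists in place of tensors, and the helper names `sum_sq_lengths` and `append_mock_pad` are hypothetical stand-ins for illustration, not the actual collator or perf_logger code.

```python
def sum_sq_lengths(cu_seq_lens):
    """Sum of squared sequence lengths from cumulative boundaries."""
    lengths = [b - a for a, b in zip(cu_seq_lens, cu_seq_lens[1:])]
    return sum(n * n for n in lengths)


def append_mock_pad(cu_seq_lens, pad_to_multiple_of):
    """Append one mock sequence so the token total is a multiple of the alignment.

    Assumption: mirrors the behavior described in this issue, not the real collator.
    """
    total = cu_seq_lens[-1]
    remainder = -total % pad_to_multiple_of  # always < pad_to_multiple_of
    if remainder == 0:
        return list(cu_seq_lens)
    return list(cu_seq_lens) + [total + remainder]


real = [0, 5, 12, 21]               # three real docs of length 5, 7, 9
padded = append_mock_pad(real, 8)   # one mock sequence of length 3 is appended

useful = sum_sq_lengths(real)       # real docs only
hardware = sum_sq_lengths(padded)   # real docs + mock pad

# A logger that only sees the padded boundaries can compute `hardware`
# but has nothing from which to recover `useful`.
print(useful, hardware, hardware - useful)
```

The difference between the two sums is exactly remainder², which is the quantity both metrics currently absorb.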

Quantitative impact

By construction, remainder < pad_to_multiple_of. Typical alignment values for FP8/MXFP8/NVFP4 are {8, 16, 32}, so the extra Σ(Lᵢ²) contribution is bounded by pad_to_multiple_of² ∈ {64, 256, 1024}. Real batch totals are on the order of 10⁷–10⁹ (e.g. ESM-2 at mbs=26 × S=1022: 26 × 1022² ≈ 2.7e7), so the inflation is on the order of 10⁻⁵ of the real total or smaller (about 4×10⁻⁵ in the worst case of this example, 1024 / 2.7e7), below any measurement noise we resolve in practice. The mock sequence's labels are -100 (excluded from loss), so it contributes no gradient; the small amount of attention compute FA executes on the mock block is real hardware work, just not useful work.
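The bound can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted above (mbs=26, S=1022, alignments 8/16/32); the variable names are illustrative.

```python
# Real Σ(Lᵢ²) for the ESM-2 example: 26 sequences of length 1022.
real_total = 26 * 1022 ** 2

ratios = {}
for pad_to_multiple_of in (8, 16, 32):
    # remainder < pad_to_multiple_of, so remainder² < pad_to_multiple_of².
    upper_bound = pad_to_multiple_of ** 2
    ratios[pad_to_multiple_of] = upper_bound / real_total
    print(f"align={pad_to_multiple_of:2d}  bound={upper_bound:4d}  "
          f"relative inflation < {ratios[pad_to_multiple_of]:.2e}")
```

For this batch the relative inflation stays below roughly 4×10⁻⁵ even at alignment 32, and it shrinks further as the real total grows toward 10⁹.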

Affected code

All four MFU-tracking recipes share this behavior:

| Recipe | Collator site |
| --- | --- |
| bionemo-recipes/recipes/esm2_native_te/ | via enforced copy of bionemo-recipes/models/esm2/collator.py (_pt_pad_to_multiple_of at line 871; _pad_batch_to_multiple_of at line 187) |
| bionemo-recipes/recipes/llama3_native_te/ | byte-identical copy of the ESM-2 collator (collator.py) |
| bionemo-recipes/recipes/opengenome2_llama_native_te/ | byte-identical copy of the ESM-2 collator (collator.py) |
| bionemo-recipes/recipes/codonfm_native_te/ | inline CodonTHDCollator in dataset.py:243-344; same semantics plus a defensive chunking path that splits a single over-max_seq_length pad into multiple mock sequences (currently unreachable since remainder < pad_to_multiple_of ≤ 32 ≪ max_seq_length) |

The logger site that reads cu_seq_lens_q (and is therefore affected) is _attn_work_from_batch in each recipe's perf_logger.py.

Status

Known limitation, not fixing for now per discussion on #1548. Impact is on the order of 10⁻⁵ or smaller, well below measurement resolution, and this mode is only exercised when pad_to_multiple_of is explicitly set (FP8/FP4 workflows). Inline NOTE comments have been added at the cu_seq_lens_q read sites in all four perf_logger.py files pointing at this issue.

If we revisit

The constraint is that cu_seq_lens_q_padded must not be populated from _pt_pad_to_multiple_of — that key is reserved for pad_sequences_to_be_divisible_by (TE's in-sequence CP padding, consumed by pad_thd_sequences_for_cp). Confirmed by @pstjohn and @jomitchellnv on #1548:

For "in sequence padding" we would use cu_seqlens_padded but I don't think we want that just for padding the remainder of a sequence vector (T in the THD).

Two candidate fixes, both avoiding cu_seq_lens_q_padded:

Option A (preferred): Collator records batch["num_real_seqs"] = number of real sequences before any mock-pad append. The perf_logger slices cu_seq_lens_q to the first num_real_seqs deltas for useful-work Σ(Lᵢ²); keeps the full cu_seq_lens_q for hardware-work Σ(Lᵢ²). Handles codonfm's multi-mock chunking for free.
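A rough sketch of what Option A could look like on the logger side, using plain lists instead of tensors. The function name `attn_work_option_a` and the fallback behavior when the key is absent are assumptions for illustration; only the `num_real_seqs` key comes from the proposal above.

```python
def attn_work_option_a(batch):
    """Return (useful, hardware) Σ(Lᵢ²) for a THD batch.

    Assumption: batch["cu_seq_lens_q"] is a list of cumulative boundaries and
    batch["num_real_seqs"], when present, counts the real (non-mock) sequences.
    """
    cu = batch["cu_seq_lens_q"]
    lengths = [b - a for a, b in zip(cu, cu[1:])]
    hardware = sum(n * n for n in lengths)

    num_real = batch.get("num_real_seqs")
    if num_real is None:
        # No mock-pad information: fall back to today's behavior.
        return hardware, hardware
    useful = sum(n * n for n in lengths[:num_real])
    return useful, hardware


# Three real docs (5, 7, 9) plus a single mock sequence of length 3.
single_mock = {"cu_seq_lens_q": [0, 5, 12, 21, 24], "num_real_seqs": 3}
# Same real docs, but the pad is chunked into two mock sequences (2 and 1).
multi_mock = {"cu_seq_lens_q": [0, 5, 12, 21, 23, 24], "num_real_seqs": 3}

print(attn_work_option_a(single_mock))
print(attn_work_option_a(multi_mock))
```

Because the slice keeps only the first num_real_seqs deltas, the same code covers one mock sequence or several without any special casing.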

Option B: Collator stashes the pre-mutation cu_seq_lens_q under a new key (e.g. cu_seq_lens_q_real). Logger prefers that key when present. Functionally equivalent but duplicates a small tensor per batch.
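For comparison, Option B might be sketched as follows, again with lists in place of tensors. The helpers `pad_batch_option_b` and `attn_work_option_b` are hypothetical; `cu_seq_lens_q_real` is the example key name suggested above, not an existing field.

```python
def pad_batch_option_b(batch, pad_to_multiple_of):
    """Append a mock pad sequence, keeping a copy of the pre-mutation boundaries."""
    cu = batch["cu_seq_lens_q"]
    remainder = -cu[-1] % pad_to_multiple_of
    if remainder:
        batch["cu_seq_lens_q_real"] = list(cu)  # copy BEFORE mutating in place
        cu.append(cu[-1] + remainder)
    return batch


def attn_work_option_b(batch):
    """Return (useful, hardware) Σ(Lᵢ²), preferring the stashed real boundaries."""
    def sum_sq(cu):
        return sum((b - a) ** 2 for a, b in zip(cu, cu[1:]))

    hardware = sum_sq(batch["cu_seq_lens_q"])
    real = batch.get("cu_seq_lens_q_real")
    useful = sum_sq(real) if real is not None else hardware
    return useful, hardware


batch = pad_batch_option_b({"cu_seq_lens_q": [0, 5, 12, 21]}, 8)
print(batch["cu_seq_lens_q"], batch["cu_seq_lens_q_real"])
print(attn_work_option_b(batch))
```

The copy is what makes this variant slightly heavier: every padded batch carries a second boundary list alongside the mutated one.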

Option A is marginally simpler (one int scalar) and generalizes cleanly to "N real + M mock" cases.

References

  • PR Generalize MFU/FLOPs module across recipes with log_mfu training hook #1548 — review thread on bionemo-recipes/recipes/esm2_native_te/README.md:387
  • _pt_pad_to_multiple_of: bionemo-recipes/models/esm2/collator.py:871
  • _pad_batch_to_multiple_of: bionemo-recipes/models/esm2/collator.py:187
  • CodonTHDCollator.__call__: bionemo-recipes/recipes/codonfm_native_te/dataset.py:270-344
  • _attn_work_from_batch: each recipe's perf_logger.py (ESM-2 at :114, llama3 at :112, og2 at :120, codonfm at :115)
