Add RoPE to ConformerEncoder#15714
Conversation
|
/ok to test 44348a9 |
pzelasko
left a comment
There was a problem hiding this comment.
Thanks for the PR. The overall direction looks reasonable to me, and I'm leaving this as a non-blocking review comment. I don't see the global defaults changing for existing configs (self_attention_model remains rel_pos, use_pytorch_sdpa remains False), but I found a few compatibility issues that should be addressed before merge.
-
RoPE skips the existing pre-encoder scale/dropout contract
nemo/collections/asr/modules/conformer_encoder.py, around the new RoPE branch inforward_internal.Existing
rel_pos/abs_pospaths callself.pos_enc(x=audio_signal, ...), which appliesxscaleanddropout_pre_encoder. The new RoPE path setspos_emb = Noneand leavesaudio_signaluntouched, whileRotaryPositionalEncodingonly rotates Q/K later in attention. As a result,xscaling=Trueanddropout_pre_encodersilently become no-ops forself_attention_model='rope'.I would keep the Q/K rotation inside
RoPEMultiHeadAttention, but preserve the existing encoder input transform semantics by applying the same input scale/dropout once before the layer stack. -
The public
change_attention_model()wrapper can corrupt the saved confignemo/collections/asr/parts/mixins/mixins.py, in theupdate_configblock.The encoder resolves
Nonevalues to the current attention model/context, but the mixin writes the raw arguments back intoself.cfg.encoder. For example, callingchange_attention_model(rotary_fraction=0.5)on an existing RoPE model can leavecfg.encoder.self_attention_model = Noneandcfg.encoder.att_context_size = None, and it skips copying the updated RoPE params becauseself_attention_model != 'rope'.The top-level config should mirror the resolved values from
self.encoderafter the call, e.g.self.encoder.self_attention_model,self.encoder.att_context_size, and the resolved encoder RoPE fields. -
change_attention_model('rope')does not preserveuse_bias=Falsenemo/collections/asr/modules/conformer_encoder.py, in the RoPEnew_attn = RoPEMultiHeadAttention(...)conversion path.The constructor path forwards
use_biasintoConformerLayer/RoPEMultiHeadAttention, but the conversion helper creates RoPE attention withoutuse_bias, so it defaults toTrue. A bias-free model converted to RoPE silently gains randomly initialized bias parameters and no longer matches its config.Please pass through
getattr(self._cfg, 'use_bias', True)or an equivalent stored value, and add a regression test for converting ause_bias=Falseencoder to RoPE.
|
/claude review |
There was a problem hiding this comment.
Good addition — the implementation is clean, the RoPE math is correct (split-half rotation with duplicated cos/sin), and the test suite is thorough (translation invariance, norm preservation, reference equivalence, streaming parity).
One issue flagged inline:
- dropout_pre_encoder silently ignored for RoPE: For rel_pos and abs_pos, self.pos_enc(x=audio_signal, ...) applies dropout_pre_encoder to the audio signal. For the rope path, this call is skipped entirely, so any configured dropout_pre_encoder > 0 has no effect. This could surprise users switching between attention models or using change_attention_model.
Co-authored-by: Copilot <copilot@github.com> Signed-off-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
Co-authored-by: Copilot <copilot@github.com> Signed-off-by: MahmoudAshraf97 <hassouna97.ma@gmail.com>
f8edc20 to
ba98801
Compare
|
@pzelasko I implemented fixes for all three issues mentioned
as for changing the defaults, I can do that, do I have to also change note, the |
Important
The
Update branchbutton must only be pressed in very rare occassions.An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.
What does this PR do ?
Adds RoPE as a positional encoding method to ConformerEncoder, it enables the use of torch SDPA while maintaining accuracy
This is a least-friction implementation, there is still path for more improvements (such as flex attention and fused qkv proj) but I preferred to not diverge from other conformer submodules
Collection: asr
Motivation
Since #9590, SDPA caused loss explosion in training when Relative Positional Embeddings were used, investigation showed that SDPA had issues with gradients when attention additive biases were used
as per this paper, RoPe has the same baseline or better performance than RelPos and trains faster because of its ability to use SDPA
Results
I tested two Fastconformer medium models, everything was constant except the positional encoding method (RoPE vs RelPos) and attention implementation (SDPA vs NeMo), both models were randomly initialized, although I forgot to fix the seed.
One takeaway is that when the two models were initialized from a pretrained checkpoint except for the positional encoding parameters, RelPos converged faster, implying that RoPE attention geometry is different than RelPos and not necessarily transferable
Improvements
The Rope was around 6% faster E2E, and used less memory which allowed for higher fused batch size for the RNNT loss and better compute utilization. The training time was dominated by RNNT loss calculation, so it diminishes any benefits from SDPA or
torch.compileRNNT loss compute time is data-dependent, that's why the step time is much lower at the beginning of the run when the model hasn't learned anything yet, once it starts to generate tokens, the time goes up
Changelog
Usage
self_attention_model: ropeto encoder configGitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items you can still open "Draft" PR.
Who can review?
cc @pzelasko
Additional Information