Update modernbert-base-pretrain.yaml with a few comments #213
ahxxm wants to merge 1 commit into AnswerDotAI:pretraining_documentation from
Conversation
This PR adds a few comments to the config and enables a few monitors.
Hello, I am not sure these kinds of comments belong in a config file; no other config would have them. As for what's blocking the doc branch from being merged: mainly that we have a lot to do, are trying to focus on blocking issues, and want to make sure everything is correct before merging (as well as adding what is still missing, notably for reproducing all the experiments).
At least "legacy reasons" deserves some explaination? Agree that sequence_packing one As for |
|
A slightly related question about reproduction: did you pre-tokenize all samples? I found that resuming training from a checkpoint takes much longer than expected, and I guess that's because it has to tokenize the text training dataset again in order to deterministically (hopefully) skip the exact tokens already used for training -- which might deserve a section in the README? The dataset format determines the unit choice of
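For context, here is a rough sketch of what pre-tokenizing ahead of time could look like, assuming a HuggingFace dataset and the mosaicml-streaming MDS shard format that a Composer dataloader can resume from without re-reading the raw text; every name, path, and the single-column layout below are placeholders, not the repo's actual data pipeline.

```python
# Minimal sketch only: dataset name, tokenizer checkpoint, output path, and the
# single-column layout are placeholders, not the project's actual conversion script.
import numpy as np
from datasets import load_dataset
from streaming import MDSWriter  # mosaicml-streaming
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")  # placeholder checkpoint
dataset = load_dataset("my_text_dataset", split="train", streaming=True)  # placeholder dataset

# Write token ids to MDS shards once, so a resumed run only has to skip shard
# entries instead of re-tokenizing the whole text dataset deterministically.
with MDSWriter(out="pretok-shards/", columns={"input_ids": "bytes"}, compression="zstd") as out:
    for sample in dataset:
        ids = tokenizer(sample["text"], truncation=False)["input_ids"]
        out.write({"input_ids": np.asarray(ids, dtype=np.uint32).tobytes()})
```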
Changes
Add a few comments and enable a few monitors.
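For reference, a hedged sketch of what "enable a few monitors" could map to on the Python side, assuming the config's monitor entries are built into standard Composer callbacks; the actual keys and arguments in modernbert-base-pretrain.yaml may differ.

```python
# Sketch only: assumes the YAML monitors correspond to standard Composer callbacks;
# the real config may enable a different set or use different arguments.
from composer.callbacks import LRMonitor, MemoryMonitor, SpeedMonitor

callbacks = [
    SpeedMonitor(window_size=100),  # throughput averaged over a sliding window of batches
    LRMonitor(),                    # logs the current learning rate each step
    MemoryMonitor(),                # logs GPU memory statistics
]
# These would then be passed to composer.Trainer(callbacks=callbacks, ...).
```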
Discussions
Wondering what's blocking this doc branch from being merged.
Tests
Not needed? I tested with my own dataset.