This PR creates a child class of `CLTTrainer` called `CLTTrainerRay` that enables training the model with Ray. It primarily modifies the `train()` function to report metrics to Ray and to use Ray's approach to checkpointing. It also adds a new script in the root dir called `tune_clt_local_ray.py` that mirrors `train_clt_local.py` but runs `CLTTrainerRay` inside a function that a Ray Tuner can use to run training runs in parallel for different hyperparameter combinations.
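For context, here is a minimal sketch of the reporting pattern described above, assuming Ray 2.x's `ray.train.report` / `Checkpoint` API. The trainer object and its helper methods (`train_step`, `save_checkpoint`) are hypothetical stand-ins for the CLT internals, not the actual codebase:

```python
# Sketch only: how a Ray-aware train loop can report metrics and checkpoints.
import tempfile

from ray import train
from ray.train import Checkpoint


def train_fn(config: dict) -> None:
    trainer = ...  # hypothetical: construct the CLT trainer from the trial's config
    for step in range(config["training_steps"]):
        loss = trainer.train_step(step)  # one optimization step (hypothetical helper)
        metrics = {"step": step, "loss": float(loss)}

        if step % config["checkpoint_interval"] == 0:
            # Write the model and activation store state to a temp dir, then
            # let Ray copy it into the trial's working dir as a checkpoint.
            with tempfile.TemporaryDirectory() as tmpdir:
                trainer.save_checkpoint(tmpdir)  # hypothetical helper
                train.report(metrics, checkpoint=Checkpoint.from_directory(tmpdir))
        else:
            train.report(metrics)
```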
When `tune_clt_local_ray.py` is run, it creates a working dir in the specified `args.output_dir`. Within it, Ray creates a working dir for each hyperparameter combination (a "trial" in Ray terminology), named after that trial's hyperparameters. In each trial's working dir, checkpoints of the model and the activation store at that training step are saved every `args.checkpoint_interval` steps. Reported metrics for each trial can be viewed in `progress.csv` and `result.json` in the trial's working dir.
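Concretely, the on-disk layout ends up looking roughly like this (a hypothetical example; the trial names below are illustrative and depend on Ray Tune's naming and the hyperparameters searched):

```text
clt_train_local_<timestamp>/            # created under args.output_dir
├── train_fn_00000_lr=0.0001/           # one trial, named by its hyperparameters
│   ├── checkpoint_000000/              # model + activation store at that step
│   ├── progress.csv                    # reported metrics, one row per report
│   └── result.json                     # the same metrics as JSON lines
└── train_fn_00001_lr=0.0003/
    └── ...
```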
Some notes:

- the `hps` dict in `tune_clt_local_ray.py` must be manually edited (a hypothetical sketch of its shape follows this list)
- `CLTTrainerRay` …
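A hypothetical sketch of what the `hps` dict and the Tuner wiring in `tune_clt_local_ray.py` could look like; the keys, search values, and paths here are illustrative assumptions, not the actual contents of the script:

```python
# Sketch only: wiring a search space into a Ray Tuner.
import os

from ray import tune
from ray.train import RunConfig
from ray.tune import TuneConfig, Tuner


# Edit this dict by hand to change the hyperparameter search space.
hps = {
    "learning_rate": tune.grid_search([1e-4, 3e-4]),
    "sparsity_lambda": tune.choice([1e-3, 1e-2]),
}


def train_fn(config: dict) -> None:
    # Would construct a CLTTrainerRay from `config` and run its train loop,
    # which reports metrics and checkpoints back to Ray (see the sketch above).
    ...


tuner = Tuner(
    train_fn,
    param_space=hps,
    # One trial per hyperparameter combination, limited to --n-workers at a time.
    tune_config=TuneConfig(max_concurrent_trials=2),
    # Trial dirs (checkpoints, progress.csv, result.json) land under output_dir.
    run_config=RunConfig(storage_path=os.path.abspath("./clt_train_local")),
)
results = tuner.fit()
```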
This code was tested by first generating activations for 1000 tokens from the `monology/pile-uncopyrighted` dataset for the `EleutherAI/pythia-70m` model. Then the script was run as follows:

```bash
python tune_clt_local_ray.py --activation-path ./tutorial_activations_local_1k_pythia/EleutherAI/pythia-70m/pile-uncopyrighted_train --model-name EleutherAI/pythia-70m --num-features 2048 --training-steps 5 --n-workers 2
```

To visualize results, you can use TensorBoard like so:
```bash
tensorboard --logdir ./clt_train_local_1746647195
```