
Generic tracker API and implementation of Aimstack tracker #89

Merged · 7 commits merged from generic_tracker into foundation-model-stack:main on May 9, 2024

Conversation

dushyantbehl (Contributor)

Description of the change

  1. This PR adds a tracker API with Aimstack as the default tracker. This is a simple plug-and-play architecture that can support multiple trackers.
  2. The tracker config is now taken from command-line arguments (making it easier for any automation to pass tracker arguments).
  3. With the new API I have added support to track any additional metrics of interest.
  4. As an example, I have added a single line to track model_load_time in Aim, possibly fixing "Add support for collecting metrics programmatically" #33
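The plug-and-play design described above can be sketched as a small registry of tracker backends. This is a hypothetical illustration, not the actual code in tuning/trackers; the names `Tracker`, `FileLoggingTracker`, `TRACKER_REGISTRY`, `get_tracker`, and the `track(...)` signature are assumptions.

```python
import time

class Tracker:
    """Hypothetical base interface every tracker backend implements."""
    def track(self, metric, name, stage="additional_metrics"):
        raise NotImplementedError

class FileLoggingTracker(Tracker):
    """Stand-in default backend that records metrics in memory."""
    def __init__(self):
        self.logged = []
    def track(self, metric, name, stage="additional_metrics"):
        self.logged.append((stage, name, metric))

# A registry makes trackers pluggable: a new backend (e.g. Aimstack) just
# registers a constructor under the name passed on the command line (--tracker aim).
TRACKER_REGISTRY = {"file_logger": FileLoggingTracker}

def get_tracker(name):
    try:
        return TRACKER_REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown tracker: {name}")

# Example: time model loading and report it through the active tracker,
# mirroring the one-line model_load_time tracking the PR describes.
tracker = get_tracker("file_logger")
start = time.time()
# ... model loading would happen here ...
tracker.track(metric=time.time() - start, name="model_load_time")
```

With this shape, adding a new backend is one registry entry plus one subclass, and training code only ever calls `tracker.track(...)`.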

Related issue number

  1. Add support for registering custom AIM metadata #34
  2. Add support for collecting metrics programmatically #33

How to verify the PR

Example of how the new API can be invoked:

```shell
torchrun --nnodes=1 --nproc_per_node=2 --master_port=1234 tuning/sft_trainer.py \
  --tokenizer_name_or_path ${MODEL_PATH} \
  --model_name_or_path ${MODEL_PATH} \
  --data_path ${DATA_PATH} \
  --use_peft \
  --bf16 True \
  --output_dir ${OUTPUT_PATH} \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 2000 \
  --save_total_limit 1 \
  --learning_rate 2e-5 \
  --weight_decay 0. \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --fsdp "full_shard auto_wrap" \
  --fsdp_config tuning/config/fsdp_config.json \
  --response_template "\n### Response:" \
  --dataset_text_field "output" \
  --tracker aim \
  --aim_repo /data/aim \
  --experiment sft-llama7b-test
```

Was the PR tested

  • I have added at least one unit test for every new method added.
  • I have ensured all unit tests pass.

@Ssukriti Ssukriti requested a review from kmehant May 2, 2024 22:38
@dushyantbehl (Contributor, Author)

@Ssukriti @alex-jw-brooks some more changes and fixes as requested by @kmehant

PR is ready to be reviewed. @tharapalanivel I will run some more e2e tests and bump you, if you have any comments in the meantime on the design, happy to take them.

@tharapalanivel (Collaborator)

@dushyantbehl this is great, thank you! One thing we should be mindful of is that the image build process still works and picks up the new changes. For this we should either change build/launch_training.py to use sft_trainer.main(), or move the tracker/callback logic to build/utils.process_launch_training_args() so launch_training.py will use it automatically.

Images can be built off of the branch and tested locally to check that everything works as expected; you can find docs for this here. I know this might involve mocking/patching a few things, but can we look into whether we can add any meaningful tests for this please? Let me know if you have any questions, thanks!
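The first option suggested above — having the image entrypoint delegate to the trainer's own main() — can be sketched as follows. This is a hedged illustration; the real build/launch_training.py and sft_trainer entrypoints may look different, and `sft_trainer_main` here is a stand-in, not the repo's actual function.

```python
import sys

def sft_trainer_main(argv):
    # Stand-in for tuning.sft_trainer.main(): in the real repo this would
    # parse args, set up the configured tracker, and run training. Here it
    # just records what it was invoked with.
    return {"invoked_with": list(argv)}

def launch_training(argv=None):
    # Instead of re-implementing argument handling in the image entrypoint
    # (and silently missing new flags like --tracker), forward everything
    # to the trainer's own entrypoint so new options are picked up for free.
    argv = sys.argv[1:] if argv is None else argv
    return sft_trainer_main(argv)

result = launch_training(["--tracker", "aim", "--aim_repo", "/data/aim"])
```

The design benefit is that there is a single place where arguments are interpreted, so the image build cannot drift out of sync with the trainer.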

@dushyantbehl (Contributor, Author)

Thanks @tharapalanivel, that should be easy to change.

@dushyantbehl dushyantbehl force-pushed the generic_tracker branch 3 times, most recently from 06ccfc8 to dbd175b on May 8, 2024 13:12
@dushyantbehl dushyantbehl mentioned this pull request May 8, 2024
@dushyantbehl dushyantbehl requested a review from kmehant May 8, 2024 14:41
Co-authored-by: Sukriti Sharma <[email protected]>
Signed-off-by: Dushyant Behl <[email protected]>
@Ssukriti (Collaborator) left a comment
Comments are only on the default values of arguments.

By default I want to ensure existing behavior is retained: any users who call the train() or main() function via the command line should see log files like before. So the file logger has to be the default for train() and through the command line. Aim should be used only if it is explicitly passed.
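The backward-compatibility behavior requested here can be sketched with a config whose default preserves the pre-PR file logging. The dataclass name `TrackerArguments`, the field name `tracker`, and `resolve_tracker` are illustrative assumptions, not the actual definitions in tuning/config/tracker_configs.py.

```python
from dataclasses import dataclass

@dataclass
class TrackerArguments:
    # Defaulting to the file logger means users who call train()/main()
    # without --tracker keep seeing log files exactly as before this PR.
    tracker: str = "file_logger"

def resolve_tracker(args: TrackerArguments) -> str:
    # Aim is selected only when explicitly requested; anything else falls
    # back to the existing file-logging behavior.
    return "aim" if args.tracker == "aim" else "file_logger"
```

With this default, `--tracker aim` is strictly opt-in and omitting the flag is indistinguishable from the pre-PR behavior.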

@kmehant kmehant merged commit 27289f3 into foundation-model-stack:main May 9, 2024
6 checks passed
@dushyantbehl dushyantbehl deleted the generic_tracker branch May 9, 2024 12:41
@dushyantbehl dushyantbehl restored the generic_tracker branch May 9, 2024 13:11
@dushyantbehl dushyantbehl deleted the generic_tracker branch December 20, 2024 05:56