
Generic tracker API and implementation of Aimstack tracker #89

Merged · 7 commits merged from generic_tracker into foundation-model-stack:main on May 9, 2024

Conversation

dushyantbehl (Contributor)

Description of the change

  1. This PR adds a tracker API with Aimstack as the default tracker. This is a simple plug-and-play architecture that can support multiple trackers.
  2. The tracker config is now taken from command-line arguments (making it easier for any automation to pass tracker arguments).
  3. With the new API I have added support to track any additional metrics of interest.
  4. As an example, I have added a single line to track model_load_time in Aim, possibly fixing "Add support for collecting metrics programmatically" #33
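The plug-and-play design described above can be sketched as a small registry of tracker backends. This is a hypothetical illustration, not the actual code in tuning/trackers; the names `Tracker`, `FileLoggingTracker`, `TRACKER_REGISTRY`, `get_tracker`, and the `track(...)` signature are assumptions.

```python
import time

class Tracker:
    """Hypothetical base interface every tracker backend implements."""
    def track(self, metric, name, stage="additional_metrics"):
        raise NotImplementedError

class FileLoggingTracker(Tracker):
    """Stand-in default backend that records metrics in memory."""
    def __init__(self):
        self.logged = []
    def track(self, metric, name, stage="additional_metrics"):
        self.logged.append((stage, name, metric))

# A registry makes trackers pluggable: a new backend (e.g. Aimstack) just
# registers a constructor under the name passed on the command line (--tracker aim).
TRACKER_REGISTRY = {"file_logger": FileLoggingTracker}

def get_tracker(name):
    try:
        return TRACKER_REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown tracker: {name}")

# Example: time model loading and report it through the active tracker,
# mirroring the one-line model_load_time tracking the PR describes.
tracker = get_tracker("file_logger")
start = time.time()
# ... model loading would happen here ...
tracker.track(metric=time.time() - start, name="model_load_time")
```

With this shape, adding a new backend is one registry entry plus one subclass, and training code only ever calls `tracker.track(...)`.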

Related issue number

  1. Add support for registering custom AIM metadata #34
  2. Add support for collecting metrics programmatically #33

How to verify the PR

Example of how the new API can be invoked:

```shell
torchrun --nnodes=1 --nproc_per_node=2 --master_port=1234 tuning/sft_trainer.py \
  --tokenizer_name_or_path ${MODEL_PATH} \
  --model_name_or_path ${MODEL_PATH} \
  --data_path ${DATA_PATH} \
  --use_peft \
  --bf16 True \
  --output_dir ${OUTPUT_PATH} \
  --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 2000 \
  --save_total_limit 1 \
  --learning_rate 2e-5 \
  --weight_decay 0. \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --fsdp "full_shard auto_wrap" \
  --fsdp_config tuning/config/fsdp_config.json \
  --response_template "\n### Response:" \
  --dataset_text_field "output" \
  --tracker aim \
  --aim_repo /data/aim \
  --experiment sft-llama7b-test
```

Was the PR tested

  • I have added at least one unit test for every new method added.
  • I have ensured all unit tests pass.

@Ssukriti Ssukriti requested a review from kmehant May 2, 2024 22:38
@dushyantbehl (Contributor, Author)

@Ssukriti @alex-jw-brooks some more changes and fixes as requested by @kmehant

PR is ready to be reviewed. @tharapalanivel I will run some more e2e tests and bump you, if you have any comments in the meantime on the design, happy to take them.

@tharapalanivel (Collaborator)

@dushyantbehl this is great, thank you! One thing we should be mindful of is that the image build process still works and picks up the new changes. For this we should either change build/launch_training.py to use sft_trainer.main(), or move the tracker/callback logic to build/utils.process_launch_training_args() so launch_training.py will use it automatically.

Images can be built off of the branch and tested locally to check that everything works as expected; you can find docs for this here. I know this might involve mocking/patching a few things, but can we look into whether we can add any meaningful tests for this please? Let me know if you have any questions, thanks!
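The first option suggested above — having the image entrypoint delegate to the trainer's own main() — can be sketched as follows. This is a hedged illustration; the real build/launch_training.py and sft_trainer entrypoints may look different, and `sft_trainer_main` here is a stand-in, not the repo's actual function.

```python
import sys

def sft_trainer_main(argv):
    # Stand-in for tuning.sft_trainer.main(): in the real repo this would
    # parse args, set up the configured tracker, and run training. Here it
    # just records what it was invoked with.
    return {"invoked_with": list(argv)}

def launch_training(argv=None):
    # Instead of re-implementing argument handling in the image entrypoint
    # (and silently missing new flags like --tracker), forward everything
    # to the trainer's own entrypoint so new options are picked up for free.
    argv = sys.argv[1:] if argv is None else argv
    return sft_trainer_main(argv)

result = launch_training(["--tracker", "aim", "--aim_repo", "/data/aim"])
```

The design benefit is that there is a single place where arguments are interpreted, so the image build cannot drift out of sync with the trainer.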

@dushyantbehl (Contributor, Author)

Thanks @tharapalanivel, that should be easy to change.

@dushyantbehl dushyantbehl force-pushed the generic_tracker branch 3 times, most recently from 06ccfc8 to dbd175b on May 8, 2024 13:12
@dushyantbehl dushyantbehl mentioned this pull request May 8, 2024
@dushyantbehl dushyantbehl requested a review from kmehant May 8, 2024 14:41
Co-authored-by: Sukriti Sharma <[email protected]>
Signed-off-by: Dushyant Behl <[email protected]>
@Ssukriti (Collaborator) left a comment
Comments are only on the default values of arguments.

By default I want to ensure existing behavior is retained: any users who call the train() or main() function via the command line should see log files like before. So the file logger has to be the default for train() and through the command line. Aim should be used only if it is explicitly passed.
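The backward-compatibility behavior requested here can be sketched with a config whose default preserves the pre-PR file logging. The dataclass name `TrackerArguments`, the field name `tracker`, and `resolve_tracker` are illustrative assumptions, not the actual definitions in tuning/config/tracker_configs.py.

```python
from dataclasses import dataclass

@dataclass
class TrackerArguments:
    # Defaulting to the file logger means users who call train()/main()
    # without --tracker keep seeing log files exactly as before this PR.
    tracker: str = "file_logger"

def resolve_tracker(args: TrackerArguments) -> str:
    # Aim is selected only when explicitly requested; anything else falls
    # back to the existing file-logging behavior.
    return "aim" if args.tracker == "aim" else "file_logger"
```

With this default, `--tracker aim` is strictly opt-in and omitting the flag is indistinguishable from the pre-PR behavior.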

@kmehant kmehant merged commit 27289f3 into foundation-model-stack:main May 9, 2024
6 checks passed
@dushyantbehl dushyantbehl deleted the generic_tracker branch May 9, 2024 12:41
@dushyantbehl dushyantbehl restored the generic_tracker branch May 9, 2024 13:11
@dushyantbehl dushyantbehl deleted the generic_tracker branch December 20, 2024 05:56