[PyTorch] Adding TP overlap support for te.Linear with parallel_mode="column" #1343
90458d4 to 4e3e61a
/te-ci pytorch L1
Overall LGTM, pending CI.
    ub_overlap_ag: bool = False,
    ub_overlap_rs: bool = False,
    ub_bulk_dgrad: bool = False,
    ub_bulk_wgrad: bool = False,
    ub_name: Optional[str] = None,
We should seriously consider deprecating these UB options and just passing in a dict. The UB interface is unstable and will likely remain so for some time. A dict would be better for backward compatibility (old options can be reinterpreted) and forward compatibility (unknown options can be ignored). This would be especially helpful for Mcore integration.
For example, the operation-based API passes in UB options with a dict:
    userbuffers_options: Optional[dict[str, Any]] = None,
    assert not (self.ub_overlap_rs_fprop and self.ub_overlap_ag_fprop), "Internal TE error!"
    assert not (self.ub_overlap_ag_dgrad and self.ub_overlap_rs_dgrad), "Internal TE error!"
    assert not (
        self.ub_overlap_rs_dgrad and (self.ub_bulk_dgrad or self.ub_bulk_wgrad)
    ), "Internal TE error!"
More descriptive error messages would be helpful.
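As a rough sketch of what that could look like, the same three checks are written below as a standalone function with messages that name the conflicting flags. The function name `check_ub_overlap_flags` and the message wording are assumptions for illustration; the conditions themselves are copied from the assertions quoted above.

```python
def check_ub_overlap_flags(
    *,
    ub_overlap_rs_fprop: bool = False,
    ub_overlap_ag_fprop: bool = False,
    ub_overlap_ag_dgrad: bool = False,
    ub_overlap_rs_dgrad: bool = False,
    ub_bulk_dgrad: bool = False,
    ub_bulk_wgrad: bool = False,
) -> None:
    """Raise AssertionError with a specific message if the flags conflict."""
    assert not (ub_overlap_rs_fprop and ub_overlap_ag_fprop), (
        "Forward pass cannot overlap both reduce-scatter (ub_overlap_rs_fprop) "
        "and all-gather (ub_overlap_ag_fprop)."
    )
    assert not (ub_overlap_ag_dgrad and ub_overlap_rs_dgrad), (
        "DGRAD cannot overlap both all-gather (ub_overlap_ag_dgrad) "
        "and reduce-scatter (ub_overlap_rs_dgrad)."
    )
    assert not (ub_overlap_rs_dgrad and (ub_bulk_dgrad or ub_bulk_wgrad)), (
        "ub_overlap_rs_dgrad cannot be combined with bulk overlaps "
        "(ub_bulk_dgrad / ub_bulk_wgrad)."
    )
```

A failing call then reports which options clash instead of a generic internal error.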
LGTM, much needed
3951993 to 360c127
/te-ci pytorch L1
… in sequence-parallel Linear backward Signed-off-by: Alp Dener <[email protected]>
Signed-off-by: Alp Dener <[email protected]>
…dated unit tests Signed-off-by: Alp Dener <[email protected]>
for more information, see https://pre-commit.ci
…ons in te.Linear Signed-off-by: Alp Dener <[email protected]>
Signed-off-by: Alp Dener <[email protected]>
Signed-off-by: Alp Dener <[email protected]>
for more information, see https://pre-commit.ci
744a96f to 9adf99f
Description
te.Linear currently only supports TP overlap with parallel_mode="row", where it overlaps reduce-scatter in the forward pass and all-gather with dgrad in the backward pass. This PR adds new options to enable all-gather overlap in the forward pass, and reduce-scatter overlap with dgrad in the backward pass, when parallel_mode="column".
Fixes #1312
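The pairing described above can be summarized as a small lookup table. This is only a restatement of the description in code form; the names `TP_OVERLAP` and `overlapped_collective` are hypothetical and not part of Transformer Engine.

```python
# Which collective is overlapped in each pass, per the PR description.
TP_OVERLAP = {
    "row": {"forward": "reduce-scatter", "backward": "all-gather"},
    "column": {"forward": "all-gather", "backward": "reduce-scatter"},
}


def overlapped_collective(parallel_mode: str, phase: str) -> str:
    """Look up the collective overlapped for a parallel mode and pass."""
    try:
        return TP_OVERLAP[parallel_mode][phase]
    except KeyError:
        raise ValueError(
            f"unsupported combination: parallel_mode={parallel_mode!r}, phase={phase!r}"
        ) from None
```

The two modes are mirror images: whatever collective one mode overlaps in the forward pass, the other overlaps with dgrad in the backward pass.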
Type of change
Checklist: