Shader Execution Reordering (SER) proposal #277
Looks ready to take into the repo for further review. Before completing the PR, please pick the next free proposal number (at time of writing this comment, that would be 0021) and rename the file and the reference to it on line 4 appropriately.
I'll disclose that I'm not a DXR expert. A lot of my feedback is copy editing, trying to make the document easier to read. I feel a bit stronger about my thoughts on testability and DXIL op feedback.
@microsoft-github-policy-service agree company="NVIDIA"
It looks like there's a large number of unresolved conversations on this PR - these either need to be moved into issues or addressed in the spec before we can merge it.
Co-authored-by: Greg Roth <[email protected]>
I have picked proposal number 0025. All pending changes have been made. If anything is left to discuss or change I suggest we open separate issues after the merge.
I have picked proposal number 0025. All pending changes have been made. If anything is left to discuss or change I suggest we open separate issues after the merge.
This seems like a good approach to me, it's getting quite hard to keep track of everything in this PR! Thank you to everyone for all the detailed comments and responses!
I'm going to mark this as approved, but I think we'll still need to get all the conversations marked as resolved before the PR can be merged, and this repo is set to require two approvals as well.
For resolving the conversations we should follow up with the original authors and/or file issues (and link them in the conversation before resolving it) if necessary.
similarity of subsequent work being performed by threads. The resolution of
the hint is implementation-specific. If an implementation cannot resolve all
values of `CoherenceHint`, it is free to ignore an arbitrary number of least
significant bits. The thread ordering resulting from this call may be
We'd like to have a clearer description of how the hint maps to coherence.
Could we add a sentence saying that higher bits of the hint have more impact on coherence than lower ones?
E.g. consider 3 hints, and choose the most coherent pair of hints:
A = 0b100000
B = 0b111111
C = 0b011111
For example, here are 3 methods for extracting coherence, each defined through its opposite, a divergence metric:
1. If a higher bit carries more coherence than all the lower bits after it, the `A`, `B` pair is most coherent. Divergence is more or less xor(...).
2. If it is the arithmetic difference, then `A` and `C` are most coherent. Divergence is more or less abs(subtract(...)).
3. It could also be divergence = bitcount(xor), which tries to find the most similar bits. Then `B` and `C` are most coherent.
Method 3 doesn't cope well with "it is free to ignore an arbitrary number of least
significant bits", but methods 1 and 2 fit the definition well.
I'm mentioning this because we discuss this feature only as reordering and forget that, based on the ordering, the system needs to extract waves to execute.
Without understanding what the actual "distance" between hint codes is, that is hard. It would be good to have that understanding on both sides of the GPU.
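As a rough illustration of the three candidate metrics, here is a minimal Python sketch that evaluates each of them on the hints `A`, `B`, and `C` from this comment. The function names are made up for the example and nothing here describes how an implementation actually groups threads; it only checks which pair each metric would call "most coherent".

```python
from itertools import combinations

# Hint values from the example in this comment.
HINTS = {"A": 0b100000, "B": 0b111111, "C": 0b011111}

def xor_divergence(x, y):
    # Numerical value of the XOR: a differing high bit outweighs all lower bits.
    return x ^ y

def arithmetic_divergence(x, y):
    # Absolute arithmetic difference of the two hint values.
    return abs(x - y)

def bitcount_divergence(x, y):
    # Number of differing bits, regardless of their position.
    return bin(x ^ y).count("1")

def most_coherent_pair(metric, hints=HINTS):
    # Return the names of the pair with the smallest divergence under `metric`.
    pairs = combinations(sorted(hints), 2)
    return min(pairs, key=lambda p: metric(hints[p[0]], hints[p[1]]))

for metric in (xor_divergence, arithmetic_divergence, bitcount_divergence):
    print(metric.__name__, most_coherent_pair(metric))
```

Each metric picks a different pair, which is the ambiguity this comment is pointing at.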
I believe arithmetic difference is the most flexible and intuitive definition for users. This approach allows a hint to incorporate both hierarchical aspects (with more important elements in MSBs) and ranges of bits that represent closeness along some axis. I agree it should be spelled out.
Agreed that it makes sense to add some details on what we mean by a hierarchical hint, and to give an example.
However, I think the numerical value of the XOR makes more sense than arithmetic difference. We could also mention (numerical) sorting, even if the implementation doesn't explicitly sort. It coincides with lexicographical sorting by individual bits, which nicely matches the hierarchical interpretation.
Consider:
hint1 = 0b0000
hint2 = 0b0111
hint3 = 0b1000
If the bits in the hint are intended to be used hierarchically (as per my understanding), then `hint2` is closer to `hint1` than to `hint3`. That is reflected in the numerical value of the XOR: `hint2 ^ hint1 = 0b111`, while `hint2 ^ hint3 = 0b1111`. However, using arithmetic difference, `hint2` would be closer to `hint3`.
Edit: Fixed the example.
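The claim that numerical sorting coincides with lexicographical sorting by individual bits can be checked with a small Python sketch. The helper `bits_msb_first` and the fixed `WIDTH` of 4 bits are assumptions made for this illustration only.

```python
WIDTH = 4  # assumed hint width for this example

def bits_msb_first(value, width=WIDTH):
    # Expand a hint into a tuple of bits, most significant bit first.
    return tuple((value >> i) & 1 for i in range(width - 1, -1, -1))

hint1, hint2, hint3 = 0b0000, 0b0111, 0b1000
hints = [hint3, hint1, hint2]

# Sorting by numerical value and sorting bit by bit (MSB first) agree.
numeric_order = sorted(hints)
lexicographic_order = sorted(hints, key=bits_msb_first)

# The two distance notions disagree about which neighbor of hint2 is closer.
xor_to_hint1, xor_to_hint3 = hint2 ^ hint1, hint2 ^ hint3
arith_to_hint1, arith_to_hint3 = abs(hint2 - hint1), abs(hint2 - hint3)

print(numeric_order == lexicographic_order)
print(xor_to_hint1 < xor_to_hint3, arith_to_hint3 < arith_to_hint1)
```

The ordering is the same either way; only the notion of distance between neighbors differs.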
XOR prioritizes the strictly hierarchical use case. The problem with XOR is when a bit range represents similarity along some linear axis and we want to group based on closeness.
Take for example the following X,Y hint pairs along some linear axis of similarity:
X Y Arith XOR
5 0 5 5
5 1 4 4
5 2 3 7
5 3 2 6
5 4 1 1
5 5 0 0
5 6 1 3
5 7 2 2
5 8 3 13
5 9 4 12
In this use case, arithmetic difference produces a desirable metric but not XOR.
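The table above can be regenerated with a few lines of Python, which also makes it easy to try other values of X. This is only a sketch; `distance_rows` is a hypothetical helper name.

```python
def distance_rows(x, ys):
    # One row per Y: (X, Y, arithmetic difference, XOR value).
    return [(x, y, abs(x - y), x ^ y) for y in ys]

rows = distance_rows(5, range(10))

print("X Y Arith XOR")
for row in rows:
    print(*row)

# Y values ranked by each metric, closest to X = 5 first (ties keep Y order).
by_arith = [y for _, y, arith, _ in sorted(rows, key=lambda r: r[2])]
by_xor = [y for _, y, _, xor in sorted(rows, key=lambda r: r[3])]
```

Ranking by the XOR value places Y = 7 ahead of Y = 6, which is why it does not behave like closeness along a linear axis.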
I was interpreting the hints to have hierarchical semantics, but checking the spec again I couldn't find any explicit mention of that, so that's fine with me.
Again, good point that we should explain this.
For me, the most important point about the arithmetic metric is that it is not even good for arithmetic-based HW if that HW drops bits.
The implementation would have to do rounding when dropping bits.
E.g.
Original codes given by the app:
A 101000000
B 100111111
C 101111111
If the implementation drops 6 bits, it sees:
A 101______
B 100______
C 101______
For arithmetic, the closest pair of codes changes from A and B to A and C when the bits are dropped.
For XOR, it stays A and C.
Simply put, arithmetic is not stable under dropping of bits.
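A short Python sketch of this stability argument, assuming that "dropping" bits means truncating the low bits with a right shift (no rounding). The helper names are illustrative only.

```python
from itertools import combinations

# Codes from the example in this comment (9 bits each).
CODES = {"A": 0b101000000, "B": 0b100111111, "C": 0b101111111}

def arithmetic(x, y):
    return abs(x - y)

def xor(x, y):
    return x ^ y

def closest_pair(metric, codes, dropped_bits=0):
    # Truncate the low `dropped_bits` bits, then find the pair with the
    # smallest divergence under `metric`.
    def divergence(pair):
        x, y = (codes[name] >> dropped_bits for name in pair)
        return metric(x, y)
    return min(combinations(sorted(codes), 2), key=divergence)

for metric in (arithmetic, xor):
    print(metric.__name__,
          closest_pair(metric, CODES),
          closest_pair(metric, CODES, dropped_bits=6))
```

Under these assumptions the arithmetic metric changes its answer once 6 bits are dropped, while the XOR metric does not.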
I wonder if we could pass a literal mask that says at least what the structure of the hint looks like. For instance, if the app's hint has this structure:
struct HintStructA {
    uchar featureA : 1;
    uchar featureB : 1;
    uchar LittleEnum : 2;
    uchar featureD : 1;
    uchar LittleArithCode : 3;
};
then `ReorderThread`, instead of having an important-bit-count argument, could take a mask (a compile-time constant!) that marks the bit on which each field starts.
For the above `HintStructA`, it would be `0b11101100`.
When dropping bits, the implementation may then drop whole fields. The app can also express how it wants the hint to be interpreted, including the interpretations we are discussing above:
E.g.
arithmetic 8bit 0b10000000
xor 8bit 0b11111111
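To make the mask encoding concrete, here is a small Python sketch that derives such a mask from a list of field widths. It assumes fields are packed from the most significant bit downward in declaration order, which is what the example values imply but is not stated anywhere in the proposal; `field_start_mask` is a hypothetical helper, not an existing API.

```python
def field_start_mask(field_widths):
    # `field_widths` lists the bit width of each field, most significant
    # field first. The returned mask has a 1 at the top bit of every field.
    if any(width <= 0 for width in field_widths):
        raise ValueError("field widths must be positive")
    mask = 0
    top = sum(field_widths) - 1  # index of the hint's most significant bit
    for width in field_widths:
        mask |= 1 << top
        top -= width
    return mask

# HintStructA: featureA:1, featureB:1, LittleEnum:2, featureD:1, LittleArithCode:3
print(format(field_start_mask([1, 1, 2, 1, 3]), "#010b"))
# A single 8-bit arithmetic field, and eight independent 1-bit fields.
print(format(field_start_mask([8]), "#010b"))
print(format(field_start_mask([1] * 8), "#010b"))
```

Under this reading, a single field spanning all bits corresponds to the "arithmetic" interpretation and one field per bit corresponds to the "xor" interpretation.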
The biggest disconnect for me is how these concepts map to the implementation. As mentioned earlier, arithmetic order and XOR order are the same. The problem appears when we start talking about distances.
Distances are only useful if wave dispatch isn't strictly in order. We would have to extract the most similar hint pairs iteratively with a greedy approach. But that would leave the remaining work even less coherent, so it seems questionable as a global optimization. It's also unclear to me how such an algorithm would be parallelized and how hint distance would be weighted against thread occupancy.
If we specify the behavior as arithmetic or xor distance I'm afraid that apps will try strange mappings to convert between the two, wasting bits and being less efficient, for no gain on many implementations that don't care about distance.
Having a mask that indicates the start of each field is interesting and would allow the user to express intent exactly without introducing waste. If the mask is added by hand it may be confusing for developers, but having it generated from a struct may be a sweet spot. I think we would have to introduce a struct attribute for hints (similar to `[raypayload]`) which only allows bit fields.
I propose that we either go with @rdrabins' struct idea or that we don't specify anything about distances in the spec and instead talk about order. I have a slight preference for the latter since I don't see exactly how the information would be used in practice and since it adds complexity. The advantage of the struct is that it strictly gives more information about intent.
I have some concerns around using a struct with bitfields because we have a bunch of bugs related to the implementation of bitfields in HLSL 2021. I don't think we're in a position to have new APIs recommending or requiring the use of bitfields.
There's also a host of challenges that we didn't really plan for around bitfields because the layout of bitfields in C++ isn't defined (it varies by compiler and target architecture). For this reason HLSL bitfields can never really be used safely for passing data between the CPU and GPU.
We could define the bitfield layout in HLSL explicitly, although we didn't adequately specify this in HLSL 2021 so we may encounter issues with subtle differences between the DXIL and SPIR-V generation.
Regarding "distance" metrics:
We have some interesting discussions about different ways to compare hint values and compute distances, but as @rasmusnv points out:
The biggest disconnect for me is how these concepts map to the implementation.
Language specifying that higher bits of the hint value are generally considered more important than lower bits might help, but this likely needs qualifiers to temper expectations.
For instance, an implementation may combine buckets linearly to get to some size after sorting (as suggested by rasmusnv), which doesn't guarantee that groups of threads are reliably split along the highest-bit lines first, or even that the closest buckets by any "distance" metric are combined.
Example - selecting 16 threads from buckets sorted in ascending order:
A 0b0000 x 8
B 0b0100 x 8
C 0b0110 x 4
D 0b0111 x 4
A simple, greedy scheduler might schedule A&B together before scheduling C&D, when globally it appears that B&C&D should be combined instead, since they have the same highest bit, and are even closest in arithmetic space. If you use the concept of "distance", then in both XOR and in arithmetic space, B&C&D should be combined instead of A&B, yet that's not what this simple scheduler does.
I'm not saying this is how it should work, but I think this should be considered a valid implementation.
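The behavior of such a simple scheduler can be sketched in a few lines of Python. This is only an illustration of the example above, assuming buckets are sorted by hint value and then packed linearly into groups of a fixed size; `greedy_linear_groups` is a hypothetical name and no real implementation is implied.

```python
def greedy_linear_groups(buckets, group_size):
    # `buckets` is a list of (name, hint, thread_count) tuples. Buckets are
    # visited in ascending hint order and packed into groups of `group_size`
    # threads, splitting a bucket only when the current group is full.
    groups, current, room = [], [], group_size
    for name, _hint, count in sorted(buckets, key=lambda b: b[1]):
        while count > 0:
            taken = min(count, room)
            current.append((name, taken))
            count -= taken
            room -= taken
            if room == 0:
                groups.append(current)
                current, room = [], group_size
    if current:
        groups.append(current)  # last, possibly partially filled group
    return groups

BUCKETS = [("A", 0b0000, 8), ("B", 0b0100, 8), ("C", 0b0110, 4), ("D", 0b0111, 4)]
groups = greedy_linear_groups(BUCKETS, group_size=16)
print(groups)
```

With these inputs the first group mixes A and B even though B shares its highest set bit with C and D, which is the outcome described in the example.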
Given this, I suspect the language in the spec will need to remain somewhat vague to avoid reliance on any particular concept of "distance".
I suggest something like this:
In general, more significant bits of the hint value are more important than less significant bits for determining coherency, though scheduling behavior is implementation specific.
Regarding structured hints:
This thread started on the basis of an insufficiently clear definition of how the hint values impact scheduling coherence. If we cannot clearly define how more complex structured hint definitions map to behavior, the problem gets worse, not better. The way I see it, a more complex hint definition sets up additional expectations for users that aren't likely to be met by implementations.
@tex3d I committed your suggested change.
Co-authored-by: Radek Drabinski <[email protected]>
Design meeting: assigning to @derlemsft as the PM ushering this through.
Co-authored-by: Tex Riddell <[email protected]>
@@ -724,7 +724,10 @@ Parameter | Definition
 This variant of `ReorderThread` reorders threads based on a generic
 user-provided hint. Similarity of hint values should indicate expected
-similarity of subsequent work being performed by threads. The resolution of
+similarity of subsequent work being performed by threads. In general, more
We should remove the "In general". I realize you took this wording straight from @tex3d's comment, but the wording is incorrect. Saying "in general" means they are "usually" more important, but as a specification we should be explicit.
More significant bits of the hint value are more important than less significant bits for determining coherency. Specific scheduling behavior may vary by implementation.
I'm not sure how we resolve these issues without something vague, unless we prescribe only specific allowed interpretations.
- If they are always more important, that may imply XOR distance comparisons for grouping, but an arithmetic ordering may violate this expectation when grouping threads.
- They are more important in the sense that if bits are dropped, it will be some arbitrary number of low bits, making the lower bits less important.
- If they are not always more important, then what language do we use?
What about something like this:
Thread grouping may be based on an arithmetic ordering of threads by hint value, or some other method. Since an arbitrary number of lowest bits may be ignored, lower bits are less important than higher bits in the hint value. However, thread grouping is not required to minimize any notion of distance between hint values within groups of threads, since this would greatly complicate thread grouping implementations.
Then again, if all parties are happy with @llvm-beanz's addition of "Specific scheduling behavior may vary by implementation" covering the divergence from any notion of distance, we could just accept that.
Is there any situation where `1 << 12` would be a "less important" hint than `1 << 2`?
These are hints, so we don't specifically dictate what the implementation must do with the hint, but being clear about the ordering importance of bits seems like something we should all be able to agree on.
See my last comment on the other thread for an example of how threads with a different high bit could be grouped together while threads where the high bit is the same and only the lower bits are different are separated, when it could have grouped threads by the high bit value first instead. If the high bit is always more important for grouping threads, the behavior observed from a simple greedy sorting scheduler may be unexpected.
In any case, I already said we could go with your suggested edit if everyone is satisfied with that.
Add Shader Execution Reordering (SER) proposal for consideration.