-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shader Execution Reordering (SER) proposal #277
base: main
Are you sure you want to change the base?
Changes from 1 commit
6b05dbd
50bd7e0
4d8349f
cf1318e
a589481
ff837e5
1b915ca
ee455fe
496c699
76a74dc
1b353ea
8a2e45c
7a63102
d8f95b4
1f4f412
414fb44
05ca446
d60a3b8
a81e008
bb617df
f135b8c
a3ca3f9
bd31e5e
94d08fd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -724,7 +724,10 @@ Parameter | Definition | |
|
||
This variant of `ReorderThread` reorders threads based on a generic | ||
user-provided hint. Similarity of hint values should indicate expected | ||
similarity of subsequent work being performed by threads. The resolution of | ||
similarity of subsequent work being performed by threads. In general, more | ||
significant bits of the hint value are more important than less significant | ||
bits for determining coherency, though scheduling behavior is implementation | ||
specific. The resolution of | ||
the hint is implementation-specific. If an implementation cannot resolve all | ||
values of `CoherenceHint`, it is free to ignore an arbitrary number of least | ||
significant bits. The thread ordering resulting from this call may be | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We'd like to have a clearer description of how hit is mapping to coherence. e.g. consider 3 hints, and chose the most coherent pair of hint: For example 3 methods for extracting coherence by defining opposite divergence metric:
3 doesn't cope well with "it is free to ignore an arbitrary number of least I'm mentioning that because we discuss this feature only as reordering and forget that based on the ordering, the system needs to extract waves to execute. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I believe arithmetic difference is the most flexible and intuitive definition for users. This approach allows a hint to incorporate both hierarchical aspects (with more important elements in MSBs) and ranges of bits that represent closeness along some axis. I agree it should be spelled out. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Agreed that it makes sense to add some details what we mean with a hierarchical hint, and giving an example. However, I think the numerical value of the XOR makes more sense than arithmetic difference. We could also mention (numerical) sorting, even if the implementation doesn't explicitly sort. It coincides with lexicographical sorting by individual bits, which nicely matches the hierarchical interpretation. Consider:
If the bits in the hint are intended to be used hierarchically (as per my understanding), then Edit: Fixed the example. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. XOR prioritizes the strictly hierarchical use case. The problem with XOR is when a bit range represents similarity along some linear axis and we want to group based on closeness. Take for example the following X,Y hint pairs along some linear axis of similarity:
In this use case, arithmetic difference produces a desirable metric but not XOR. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was interpreting the hints to have hierarchical semantics, but checking the spec again I couldn't find any explicit mentioning of that, so fine with me. Again, good point that we should explain this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The most important thing about arithmetic is that for me its not even good for arithmetic based HW if it drops bits. e.g. For arithmetic the closest codes changes from A and B to A and C when bits are dropped. I wonder if we could pass literal with mask that says at least how the structure of hint looks like. for instance if hint by app is of structure: struct HintStructA {
uchar featureA : 1;
uchar featureB : 1;
uchar LittleEnum : 2;
uchar featureD : 1;
uchar LittleArithCode : 3;
}; than ReorderThreads instead of having important bit count argument could have mask (compile time constant!) that marks on which bit each field starts. Dropping bits impl may be dropping whole fields. App can express its wish of understanding the hint including the ones we are discussing above: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The biggest disconnect for me is how these concepts map to the implementation. As mentioned earlier, arithmetic order and xor order is the same. The problem appears when we start talking about distances. Distances are only useful if wave dispatch isn't strictly in order. We would have to extract the most similar hint pairs iteratively with a greedy approach. But that would leave the remaining work even less coherent, so it seems questionable as a global optimization. It's also unclear to me how such an algorithm would be parallelized and how hint distance would be weighted against thread occupancy. If we specify the behavior as arithmetic or xor distance I'm afraid that apps will try strange mappings to convert between the two, wasting bits and being less efficient, for no gain on many implementations that don't care about distance. Having a mask that indicates the start of each field is interesting and would allow the user to express intent exactly without introducing waste. If the mask is added by hand it may be confusing for developers, but having it generated from a struct may be a sweet spot. I think we would have to introduce a struct attribute for hints (similar to I propose that we either go with @rdrabins' struct idea or that we don't specify anything about distances in the spec and instead talk about order. I have a slight preference to the latter since I don't see exactly how the information would be used in practice and since it adds complexity. The advantage of the struct is that it strictly gives more information about intent. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have some concerns around using a struct with bitfields because we have a bunch of bugs related to the implementation of bitfields in HLSL 2021. I don't think we're in a position to have new APIs recommending or requiring the use of bitfields. There's also a host of challenges that we didn't really plan for around bitfields because the layout of bitfields in C++ isn't defined (it varies by compiler and target architecture). For this reason HLSL bitfields can never really be used safely for passing data between the CPU and GPU. We could define the bitfield layout in HLSL explicitly, although we didn't adequately specify this in HLSL 2021 so we may encounter issues with subtle differences between the DXIL and SPIR-V generation. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Regarding "distance" metrics: We have some interesting discussions about different ways to compare hint values and compute distances, but as @rasmusnv points out:
Language specifying that higher bits of the hint value are generally considered more important than lower bits might help, but this likely needs qualifiers to temper expectations. For instance, an implementation may combine buckets linearly to get to some size after sorting (as suggested by rasmusnv), which doesn't guarantee that groups of threads are reliably split along the highest-bit lines first, or even that the closest buckets by any "distance" metric are combined. Example - selecting 16 threads from buckets sorted in ascending order:
A simple, greedy schedular might schedule A&B together before scheduling C&D, when globally it appears that B&C&D should be combined instead, since they have the same highest bit, and are even closest in arithmetic space. If you use the concept of "distance", then in both XOR and in arithmetic space, B&C&D should be combined instead of A&B, yet that's not what this simple scheduler does. I'm not saying this is how it should work, but I think this should be considered a valid implementation. Given this, I suspect the language in the spec will need to remain somewhat vague to avoid reliance on any particular concept of "distance". I suggest something like this:
Regarding structured hints: This thread started on the basis of an insufficiently clear definition of how the hint values impact scheduling coherence. If we cannot clearly define how more complex structured hint definitions map to behavior, the problem gets worse, not better. The way I see it, a more complex hint definition sets up additional expectations for users that aren't likely to be met by implementations. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tex3d I committed your suggested change. |
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should remove the "In general". I realize you took this wording straight from @tex3d's comment, but the wording is incorrect. Saying "in general" means they are "usually" more important, but as a specification we should be explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how we resolve these issues without something vague, unless we prescribe only specific allowed interpretations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about something like this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then again, if all parties are happy with @llvm-beanz's addition of
Specific scheduling behavior may vary by implementation
covering the divergence from any notion of distance, we could just accept that.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any situation where
1 << 12
would be a "less important" hint than1 << 2
?These are hints, so we don't specifically dictate what the implementation must do with the hint, but being clear about the ordering importance of bits seems like something we should all be able to agree on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my last comment on the other thread for an example of how threads with a different high bit could be grouped together while threads where the high bit is the same and only the lower bits are different are separated, when it could have grouped threads by the high bit value first instead. If the high bit is always more important for grouping threads, the behavior observed from a simple greedy sorting scheduler may be unexpected.
In any case, I already said we could go with your suggested edit if everyone is satisfied with that.