cl_ext_buffer_device_address #1159
Conversation
any motivation to get this merged? Or anything else needed to discuss before merging this? Could also try to bring it up at the CL WG if needed. |
Yep, this is still being discussed in the WG. I personally think it's useful as is and shouldn't harm anything if merged as it even has 2 implementations now. |
Thanks @SunSerega |
Alright, and now the problem I found in #1171 is visible here because the |
Yes, this was the idea. I'll add a mention in the next update. |
Updated according to @karolherbst's comments. |
I've already implemented the extension fully in rusticl/mesa (including sharing the same address across devices) and I think it's fine; however, I'd still urge addressing the concerns I have for layered implementations implementing it on top of Vulkan. I already considered those constraints when implementing it, but I think it's better to give clients a way to query whether address sharing across multiple devices is supported or not.
@SunSerega thanks! |
@karolherbst I asked about this in the CL/memory WG. We need to submit CTS tests, and then this one might be good to go. Do you have good tests on the rusticl side we could use? The test in PoCL is quite basic (and needs to be updated), but it can be used as a starting point. |
I haven't written any more tests and only used the one in pocl. But I can probably help out with writing tests here. |
Thanks for this proposal!
The basic cl_mem buffer API doesn't enable access to the underlying raw pointers in the device memory, preventing its use in host side data structures that need pointer references to objects. This API adds a minimal increment on top of cl_mem that provides such capabilities. The version 0.1.0 is implemented in PoCL and rusticl for prototyping, but everything's still up for discussion. chipStar is the first client that uses the API.
Co-authored-by: Sun Serega <[email protected]>
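For readers skimming the thread, here is a minimal sketch of the intended usage shape. The entry point and type names (clSetKernelArgDevicePointerEXT, cl_mem_device_address_ext) appear later in this conversation; the exact flag spelling, the clGetMemObjectInfo query token, and the by-value argument signature are assumed here for illustration and may differ in the final spec.

#include <CL/cl.h>
#include <CL/cl_ext.h>   /* assumed to provide the *_EXT names used below */

/* Sketch only: create a buffer whose raw device address can be queried and
 * passed to a kernel instead of the cl_mem handle. */
static cl_int launch_with_device_pointer(cl_context ctx, cl_command_queue queue,
                                         cl_kernel kernel, size_t size, size_t gws)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_DEVICE_ADDRESS_EXT,
                                size, NULL, &err);
    if (err != CL_SUCCESS) return err;

    /* Query the raw device address of the buffer (query token assumed). */
    cl_mem_device_address_ext dev_addr = 0;
    err = clGetMemObjectInfo(buf, CL_MEM_DEVICE_ADDRESS_EXT,
                             sizeof(dev_addr), &dev_addr, NULL);
    if (err != CL_SUCCESS) return err;

    /* The address can live in host-side data structures and is passed to the
     * kernel as a raw pointer argument. */
    err = clSetKernelArgDevicePointerEXT(kernel, 0, dev_addr);
    if (err != CL_SUCCESS) return err;

    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
}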
Changed the CL_MEM_DEVICE_ADDRESS_EXT wording for multi-device cases to "all", not "any", covering the case where not all devices can ensure the same address across the context. In that case CL_INVALID_VALUE can be returned. Defined sub-buffer address computation to be 'base_addr + origin'. Added error conditions for clSetKernelExecInfo when the device doesn't support device pointers.
...and renamed them to CL_MEM_DEVICE_SHARED_ADDRESS_EXT and CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT. The first one guarantees the same address across all devices in the context, whereas the latter allows per-device addresses.
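To illustrate the renamed flags, a sketch of how a client might pick between them in a multi-device context; the flag names come from the commit message above, and the CL_INVALID_VALUE fallback follows the multi-device wording change described there.

#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Ask for one address that is identical on every device in the context; if the
 * implementation cannot guarantee that, fall back to per-device addresses. */
static cl_mem create_bda_buffer(cl_context ctx, size_t size, cl_int *err)
{
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_WRITE | CL_MEM_DEVICE_SHARED_ADDRESS_EXT,
                                size, NULL, err);
    if (buf == NULL && *err == CL_INVALID_VALUE) {
        buf = clCreateBuffer(ctx,
                             CL_MEM_READ_WRITE | CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT,
                             size, NULL, err);
    }
    return buf;
}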
Also made the enums globally unique.
Not really familiar with the build process and such, so I'm wondering how I'd get an updated |
Good question. I'm new to it too. @SunSerega ? |
My intuition tells me to expect the build workflow, with its single job called "Build spec artifacts", to produce artifacts. And even if it did, it would probably only be HTML and PDF artifacts, because header automation is in a separate repo: https://github.com/KhronosGroup/OpenCL-Headers But my stake here is the XML consumption for my own bindings; I've never used the C bindings, so I'll pass the baton to @bashbaug |
@pjaaskel do you have local changes? Because locally I had to change a few things to get things to compile:
diff --git a/xml/cl.xml b/xml/cl.xml
index d7bdb1a..47f52b0 100644
--- a/xml/cl.xml
+++ b/xml/cl.xml
@@ -3736,7 +3736,7 @@ server's OpenCL/api-docs repository.
<proto><type>cl_int</type> <name>clSetKernelArgDevicePointerEXT</name></proto>
<param><type>cl_kernel</type> <name>kernel</name></param>
<param><type>cl_uint</type> <name>arg_index</name></param>
- <param>const <type>void</type>* <name>arg_value</name></param>
+ <param>const <type>cl_mem_device_address_ext</type> <name>arg_value</name></param>
</command>
<command suffix="CL_API_SUFFIX__VERSION_2_0">
<proto><type>cl_int</type> <name>clSetKernelExecInfo</name></proto>
@@ -7200,6 +7200,9 @@ server's OpenCL/api-docs repository.
</require>
</extension>
<extension name="cl_ext_buffer_device_address" revision="0.9.1" supported="opencl" depends="CL_VERSION_3_0" provisional="true">
+ <require>
+ <type name="cl_mem_device_address_ext"/>
+ </require>
<require>
<command name="clSetKernelArgDevicePointerEXT"/>
</require>
@@ -7212,9 +7215,6 @@ server's OpenCL/api-docs repository.
<require comment="cl_kernel_exec_info">
<enum name="CL_KERNEL_EXEC_INFO_DEVICE_PTRS_EXT"/>
</require>
- <require>
- <type name="cl_mem_device_address_ext"/>
- </require>
</extension>
<extension name="cl_khr_command_buffer" revision="0.9.5" supported="opencl" depends="CL_VERSION_1_2" ratified="opencl" provisional="true">
<require>
and the CTS PR still uses |
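For reference, the generated C prototype under that cl.xml change would presumably end up looking roughly like this (modulo extension suffix macros):

/* Rough shape of what the header generator would emit for the patched entry. */
extern CL_API_ENTRY cl_int CL_API_CALL
clSetKernelArgDevicePointerEXT(cl_kernel kernel,
                               cl_uint arg_index,
                               const cl_mem_device_address_ext arg_value);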
The header generation scripts are in the headers repo, specifically: https://github.com/KhronosGroup/OpenCL-Headers/tree/main/scripts The CMake file for the headers defines two targets related to header generation: headers_generate and headers_copy. headers_generate generates the headers into a temporary output directory and headers_copy copies the generated headers into the right place. You will need to tell the header generation scripts where to find the spec XML file, which you can do using the CMake variable OPENCL_HEADERS_XML_PATH. If you want to do everything in one go, a good workflow is something like:
This generates the headers, copies them to the right place, builds all tests, and runs all tests. |
Now with the
The good thing about binding buffer objects via
Now that each device has its own address, one would have to synchronize the set address with the queues launching those kernels, potentially involving a lot of flushing when it wouldn't be necessary. Though using sub-buffers if one doesn't want to use the base address is also quite the annoying interface.
Maybe we need a
Or we just stick with sub-buffers after all...
EDIT: same issue with |
I noticed the same, that it's a bit cumbersome, but it maps to the HIP/CUDA programming model where one device is targeted at a time with the hipSetDevice() API.
Yes, I agree.
In practice it means that if you submit the kernel to multiple devices' command queues, you have to set the arg for each submission. The client can still pass the cl_mem handle or subbuffers using the core clSetKernelArg() API if they don't want to touch the per-device pointers or set them one at a time.
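Concretely, a sketch of that per-submission pattern (variable names are assumed; addr_dev0/addr_dev1 stand for the buffer's addresses on the respective devices):

#include <CL/cl.h>
#include <CL/cl_ext.h>

/* The device a raw pointer refers to is implied by the queue each enqueue
 * targets, so the argument is (re)set before every submission. */
static void launch_on_both_devices(cl_kernel kernel, size_t gws,
                                   cl_command_queue queue_dev0,
                                   cl_mem_device_address_ext addr_dev0,
                                   cl_command_queue queue_dev1,
                                   cl_mem_device_address_ext addr_dev1)
{
    clSetKernelArgDevicePointerEXT(kernel, 0, addr_dev0);
    clEnqueueNDRangeKernel(queue_dev0, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);

    clSetKernelArgDevicePointerEXT(kernel, 0, addr_dev1);
    clEnqueueNDRangeKernel(queue_dev1, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
}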
If you recall, we discussed and considered this option in the beginning, but it was not good enough for HIP/CUDA due to some subbuffer restrictions (alignment at least, but perhaps something else I forgot) and for the fact that some APIs prefer to use the raw pointer address. It might be doable if we lift those restrictions with the extension and then the runtime has to keep book of what raw address maps to which (sub)buffer when interfacing from APIs with raw pointers, but is that a much better option than this one? |
Oh right, implementations have to look at the kernel arguments set on
I think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.
|
I'm not aware. If you want to extend the buffer API, I think it should be a separate extension. Another problem, in addition to the fixed alignment, that I recall with the approach of (ab)using sub-buffers for implementing raw pointer passing is that sub-buffers have a size. We discussed using size-0 sub-buffers for this and it started to seem hacky, therefore we went with this raw pointer API plan, IIRC. I can try to dig up the old notes if you want to go back to the drawing board with this extension.
Do you mean that we could call clSetKernelArgBufferAddress() multiple times for each device holding the buffer? We could, but then again it doesn't differ much from calling it multiple times for each NDRange enqueue of that kernel (to different devices' queues), in which case the device is implied from the queue it is being pushed to. Perhaps I can just add a sentence that highlights the fact that the raw pointer's device is implied from the command queue the kernel is enqueued to? |
Oh right yeah, that's impossible to solve with sub-buffers properly without changing semantics.
It could allow you to skip calling
However it kinda matters more with this API, because it's probably not intended to be thread-safe as
Like I can see the pattern where an application simply binds all arguments once and then uses commands to update the contents of those arguments, never touching the kernel object itself. And they can do it from multiple threads concurrently and safely as long as they don't modify the kernel object, e.g. by calling
If you require to call |
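A sketch of the bind-once pattern described above (pthreads, helper names assumed): the kernel is fully bound up front and then enqueued from several threads without further modification, which is exactly what a mandatory per-enqueue device-pointer call would break.

#include <CL/cl.h>
#include <pthread.h>

/* Shared, already fully bound kernel; each thread owns its own queue. */
struct worker_args {
    cl_command_queue queue;
    cl_kernel        kernel;   /* not modified after setup */
    size_t           gws;
};

static void *worker(void *p)
{
    struct worker_args *w = (struct worker_args *)p;
    /* Safe as long as no thread calls clSetKernelArg*() on this kernel
     * concurrently; requiring a device-pointer call per enqueue would force
     * locking or per-thread kernel clones instead. */
    clEnqueueNDRangeKernel(w->queue, w->kernel, 1, NULL, &w->gws, NULL, 0, NULL, NULL);
    clFinish(w->queue);
    return NULL;
}

static void launch_concurrently(struct worker_args *a, struct worker_args *b)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, worker, a);
    pthread_create(&t1, NULL, worker, b);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
}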
I see. What if we simply declare the call thread safe here and add a note of the problem in the spec? Somehow I'm reluctant to add the explicit device argument to the argument setter function as it's not needed in the basic single device use case. |
My concern is that most implementations aren't set up to make it thread-safe easily, because it would require reworking how setting arguments is implemented, meaning a higher bar to implement this extension. In rusticl it's not an issue, because at the moment it is implemented in a thread-safe way, though I did consider dropping the thread-safety once I get to improve performance more and if it turns out to be a significant overhead in applications. Though one could argue that all the other
I think the one benefit of adding is that the runtime could verify whether the passed-in handle is indeed a valid one. It could also catch use-after-free cases where you set a handle, the buffer gets deallocated later on, and the same address becomes valid in a new allocation, hiding the bug pretty well. But if adding it is too much of an issue for chipstar (and other users) then we could also leave it out. I think it makes sense to wait and see what others think about this problem. I don't have any strong opinions here either way, just wanted to bring this issue up before the extension is finalized. |
I suspect the "others" do not have strong opinions here either, since the need for this extension originates from HIP/CUDA/chipStar. Somehow adding the device argument seems wrong, since so far cl_kernel has been device-independent. I suggest we just go with the current API, as it fulfills the original need and way of usage, and focus on USVM. |
FYI, there is some work ongoing to specify an alignment at buffer creation time, see internal MR 198.
Perhaps consider doing something like |
Good to know!
Hmm, if we change the clSetKernelArgDevicePointerEXT() pointer arg to a list of pointers, with a pointer per device in the context, it would match the buffer info query's return value, meaning we could pass that return value directly to this API. The annoying part here is the potential discrepancy with clBuildProgram's device list: the kernel might not have been built for all devices in the context, whereas buffers are by default associated with all devices. If we make the list passed to the arg setter match the list of devices passed to clBuildProgram, the client has to filter the pointers from the context superset. Likely not a big deal; I suspect the device list in build program is typically not a subset of the context's devices - a rarely if ever used feature. In this sense, Karol's proposal of setting the pointer separately for each device might actually be easier from the client perspective. But then again, I don't see the "thread safety" issue of the current proposal as a major one, since cl_kernels can be cloned etc. In the end I don't have strong preferences and would just go with the current simple option, but can also change it to whatever you prefer. FYI, Michal implemented the current 0.9.1 version in PoCL. |
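For comparison, the list-per-device variant floated above would amount to something like the following prototype (purely hypothetical, not part of the proposal):

/* Hypothetical alternative: one address per device in the context, in the
 * same order the buffer info query would return them. */
cl_int clSetKernelArgDevicePointerListEXT(cl_kernel kernel,
                                          cl_uint arg_index,
                                          cl_uint num_addresses,
                                          const cl_mem_device_address_ext *addresses);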
I forgot that |
Anyway, updated my implementation to 0.9.1 as well, though I still treat |
Another thing I was wondering about today: this extension interacts with SVM in a weird way. One can
Now with this extension one can do a
This extension could also just disallow this behavior if it sounds like too much of an edge case nobody is going to care about. Just wanted to point out that this is a real possibility, which implementations supporting both might have to face (as I am now). |
This is a good point. I'd just disallow this: if (CG) SVM is supported by the implementation, this extension is rather pointless, since it is supposed to be a simplification of CG SVM. |
...but having said that, if the implementation does support CG SVM, it might still support BDA for legacy/compatibility reasons and in that case the other behavior (the "device ptr" = SVM pointer) would make sense. Other opinions? |
Yeah... I mean, it shouldn't be hard for the impl to simply return the SVM pointer for those BDA allocations, because the cl_mem object wrapping an SVM allocation is probably 100x the amount of work compared to handling this edge case. The normal host_ptr path can have a different address on the GPU side (e.g. if the host memory couldn't be mapped into the GPU's VM), which I think this extension will also have to clarify, but this guarantee also doesn't exist in the core spec (unless it's an SVM allocation). |
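For concreteness, the SVM interaction being discussed looks roughly like this sketch (coarse-grained SVM wrapped via CL_MEM_USE_HOST_PTR; the flag and query token spellings are assumed for illustration):

#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Wrap a coarse-grained SVM allocation in a cl_mem, then query its device
 * address through this extension. */
static void svm_bda_interaction(cl_context ctx, size_t size)
{
    void *svm_ptr = clSVMAlloc(ctx, CL_MEM_READ_WRITE, size, 0);

    cl_int err;
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_USE_HOST_PTR | CL_MEM_DEVICE_ADDRESS_EXT,
                                size, svm_ptr, &err);

    cl_mem_device_address_ext dev_addr = 0;
    clGetMemObjectInfo(buf, CL_MEM_DEVICE_ADDRESS_EXT,
                       sizeof(dev_addr), &dev_addr, NULL);

    /* The open question: must dev_addr equal the SVM pointer, may it differ,
     * or is this combination simply disallowed? */

    clReleaseMemObject(buf);
    clSVMFree(ctx, svm_ptr);
}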
Right. I'll add the other option, returning the SVM pointer in this case, in the next specification revision. |