cl_ext_buffer_device_address #1159

Open: pjaaskel wants to merge 8 commits into main from the cl_ext_buffer_device_address branch
Conversation

@pjaaskel (Contributor) commented May 7, 2024

The basic cl_mem buffer API does not expose the underlying raw pointers in device memory, which prevents its use in host-side data structures that need pointer references to objects. This API adds a minimal increment on top of cl_mem that provides such capabilities.

Version 0.1.0 is implemented in PoCL and rusticl for prototyping, but everything is still up for discussion. chipStar is the first client that uses the API.
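
A minimal sketch of how a client might use the proposed API, assuming the names discussed in this PR (CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT, CL_MEM_DEVICE_ADDRESS_EXT, cl_mem_device_address_ext, clSetKernelArgDevicePointerEXT) and an existing context, kernel and size; the exact enums and semantics are still in flux across the 0.x revisions:

```c
#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Sketch only: assumes an existing cl_context context, cl_kernel kernel and size_t size. */
cl_mem_properties props[] = {
    CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT, CL_TRUE,  /* request a queryable per-device address */
    0
};
cl_int err = CL_SUCCESS;
cl_mem buf = clCreateBufferWithProperties(context, props, CL_MEM_READ_WRITE,
                                          size, NULL, &err);

/* Query the raw device address so it can be stored in host-side data structures. */
cl_mem_device_address_ext addr = 0;
err = clGetMemObjectInfo(buf, CL_MEM_DEVICE_ADDRESS_EXT, sizeof(addr), &addr, NULL);

/* Pass the raw device pointer (possibly plus an offset) as a kernel argument. */
err = clSetKernelArgDevicePointerEXT(kernel, 0, addr);
```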

@karolherbst (Contributor)

any motivation to get this merged? Or anything else needed to discuss before merging this? Could also try to bring it up at the CL WG if needed.

@pjaaskel (Contributor, Author)

> any motivation to get this merged? Or anything else needed to discuss before merging this? Could also try to bring it up at the CL WG if needed.

Yep, this is still being discussed in the WG. I personally think it's useful as is and shouldn't harm anything if merged, as it even has two implementations now.

@pjaaskel (Contributor, Author)

Thanks @SunSerega

@SunSerega (Contributor)

Alright, and now the problem I found in #1171 is visible here because the Presubmit workflow has been properly rerun.

@karolherbst (comment marked as resolved)

@pjaaskel (Contributor, Author) commented Sep 4, 2024

> One thing I'm wondering about is how clCreateSubBuffer should behave when being executed on a BDA memory object? Should it fail or succeed? I'm fine with either, just wondering if some of the behavior needs to be formalized.
>
> CL_MEM_DEVICE_PTR_EXT and CL_MEM_DEVICE_PTRS_EXT come to my mind, which should probably return the address + the sub-buffer offset, at least that's how I have implemented it so far.

Yes, this was the idea. I'll add a mention in the next update.
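
A hedged sketch of the sub-buffer behaviour agreed on here, assuming buf is a buffer created with the device-address property and using the later CL_MEM_DEVICE_ADDRESS_EXT query name; the expectation is that a sub-buffer's address is the parent address plus the sub-buffer origin:

```c
/* Sketch only: buf is an existing BDA buffer, err is a cl_int.
 * Note that the origin must satisfy the device's sub-buffer alignment rules. */
cl_buffer_region region = { 256 /* origin */, 1024 /* size */ };
cl_mem sub = clCreateSubBuffer(buf, CL_MEM_READ_WRITE,
                               CL_BUFFER_CREATE_TYPE_REGION, &region, &err);

cl_mem_device_address_ext base = 0, sub_addr = 0;
clGetMemObjectInfo(buf, CL_MEM_DEVICE_ADDRESS_EXT, sizeof(base), &base, NULL);
clGetMemObjectInfo(sub, CL_MEM_DEVICE_ADDRESS_EXT, sizeof(sub_addr), &sub_addr, NULL);
/* Expected: sub_addr == base + region.origin */
```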

@pjaaskel (Contributor, Author) commented Sep 9, 2024

Updated according to @karolherbst's comments.

@karolherbst (Contributor) left a comment

I've already implemented the extension fully in rusticl/mesa (including sharing the same address across devices) and I think it's fine; however, I'd still urge addressing the concerns I have about layered implementations built on top of Vulkan. I've already considered those constraints when implementing it, but I think it's better to give clients a way to query whether address sharing across multiple devices is supported or not.

pjaaskel force-pushed the cl_ext_buffer_device_address branch from b8df46b to 1931416 on September 24, 2024
@pjaaskel (Contributor, Author)

@SunSerega thanks!

@pjaaskel (Contributor, Author)

@karolherbst I asked about this in the CL/memory WG. We need to submit CTS tests, and then this might be good to go. Do you have good tests on the Rusticl side we could use? The test in PoCL is quite basic (and needs to be updated), but it can be used as a starting point.

@karolherbst (Contributor)

> @karolherbst I asked about this in the CL/memory WG. We need to submit CTS tests, and then this might be good to go. Do you have good tests on the Rusticl side we could use? The test in PoCL is quite basic (and needs to be updated), but it can be used as a starting point.

I haven't written any more tests and only used the one in pocl. But I can probably help out with writing tests here.

@kpet (Contributor) left a comment

Thanks for this proposal!

pjaaskel and others added 5 commits on December 6, 2024:

- The basic cl_mem buffer API doesn't enable access to the underlying raw pointers in the device memory, preventing its use in host side data structures that need pointer references to objects. This API adds a minimal increment on top of cl_mem that provides such capabilities. The version 0.1.0 is implemented in PoCL and rusticl for prototyping, but everything's still up for discussion. chipStar is the first client that uses the API.
- Changed the CL_MEM_DEVICE_ADDRESS_EXT wording for multi-device cases to "all", not "any", covering a case where not all devices can ensure the same address across the context. In that case CL_INVALID_VALUE can be returned. Defined sub-buffer address computation to be 'base_addr + origin'. Added error conditions for clSetKernelExecInfo when the device doesn't support device pointers.
- ...and renamed them to CL_MEM_DEVICE_SHARED_ADDRESS_EXT and CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT. The first one guarantees the same address across all devices in the context, whereas the latter allows per-device addresses.
pjaaskel force-pushed the cl_ext_buffer_device_address branch from 9ad04fb to edcff73 on December 6, 2024
pjaaskel force-pushed the cl_ext_buffer_device_address branch from ba243d4 to c99804d on December 12, 2024
@karolherbst (Contributor)

Not really familiar with the build process and such, so I'm wondering how I'd get an updated cl_ext.h file I could use to update my implementation?

@pjaaskel (Contributor, Author)

> Not really familiar with the build process and such, so I'm wondering how I'd get an updated cl_ext.h file I could use to update my implementation?

Good question. I'm new to it too. @SunSerega?

@SunSerega (Contributor)

My intuition tells me to expect the build workflow, with its single job called "Build spec artifacts", to produce artifacts. But it doesn't ¯\_(ツ)_/¯

And even if it did, it would probably only produce HTML and PDF artifacts, because the header automation is in a separate repo: https://github.com/KhronosGroup/OpenCL-Headers

But my stake here is XML consumption for my own bindings; I've never used the C bindings, so I'll pass the baton to @bashbaug.

@pjaaskel (Contributor, Author)

PR for CTS.

@karolherbst (Contributor)

@pjaaskel do you have local changes? Because locally I had to change a few things to get things to compile:

diff --git a/xml/cl.xml b/xml/cl.xml
index d7bdb1a..47f52b0 100644
--- a/xml/cl.xml
+++ b/xml/cl.xml
@@ -3736,7 +3736,7 @@ server's OpenCL/api-docs repository.
             <proto><type>cl_int</type>                                  <name>clSetKernelArgDevicePointerEXT</name></proto>
             <param><type>cl_kernel</type>                               <name>kernel</name></param>
             <param><type>cl_uint</type>                                 <name>arg_index</name></param>
-            <param>const <type>void</type>*                             <name>arg_value</name></param>
+            <param>const <type>cl_mem_device_address_ext</type>         <name>arg_value</name></param>
         </command>
         <command suffix="CL_API_SUFFIX__VERSION_2_0">
             <proto><type>cl_int</type>                                  <name>clSetKernelExecInfo</name></proto>
@@ -7200,6 +7200,9 @@ server's OpenCL/api-docs repository.
             </require>
         </extension>
         <extension name="cl_ext_buffer_device_address" revision="0.9.1" supported="opencl" depends="CL_VERSION_3_0" provisional="true">
+            <require>
+                <type name="cl_mem_device_address_ext"/>
+            </require>
             <require>
                 <command name="clSetKernelArgDevicePointerEXT"/>
             </require>
@@ -7212,9 +7215,6 @@ server's OpenCL/api-docs repository.
             <require comment="cl_kernel_exec_info">
                 <enum name="CL_KERNEL_EXEC_INFO_DEVICE_PTRS_EXT"/>
             </require>
-            <require>
-                <type name="cl_mem_device_address_ext"/>
-            </require>
         </extension>
         <extension name="cl_khr_command_buffer" revision="0.9.5" supported="opencl" depends="CL_VERSION_1_2" ratified="opencl" provisional="true">
             <require>

and the CTS PR still uses cl_mem_device_address_EXT not cl_mem_device_address_ext

@pjaaskel (Contributor, Author)

> @pjaaskel do you have local changes? Because locally I had to change a few things to get things to compile:

I didn't have local changes, but I didn't try to generate the header (I have yet to find out how), just built the docs. Fixed. Pinging @franz about the CTS for when he comes back to the office.

@bashbaug (Contributor)

> I didn't try to generate the header (I have yet to find out how)

The header generation scripts are in the headers repo, specifically:

https://github.com/KhronosGroup/OpenCL-Headers/tree/main/scripts

The CMake file for the headers defines two targets related to header generation: headers_generate and headers_copy. headers_generate generates the headers into a temporary output directory, and headers_copy copies the generated headers into the right place. You will need to tell the header generation scripts where to find the spec XML file, which you can do using the CMake variable OPENCL_HEADERS_XML_PATH.

If you want to do everything in one go, a good workflow is something like:

make headers_generate headers_copy all test -j

This generates the headers, copies them to the right place, builds all tests, and runs all tests.

@karolherbst (Contributor) commented Dec 18, 2024

Now with the CL_MEM_DEVICE_SHARED_ADDRESS option removed, I don't think the current interface of clSetKernelArgDevicePointerEXT makes much sense.

The good thing about binding buffer objects via clSetKernelArg or setting SVM pointers via clSetKernelArgSVMPointer is that you don't have to change the argument to enqueue the same kernel on a different device.

Now that each device has its own address, one would have to synchronize the set address with the queues launching those kernels, potentially involving a lot of flushing when it wouldn't be necessary.

Though using sub-buffers if one doesn't want to use the base address is also quite an annoying interface. Maybe we need a clSetKernelArgBuffer, which comes with some useful features, like an offset argument? Might also want to add a couple more things while at it.

Or we just stick with sub-buffers after all...

EDIT: same issue with CL_KERNEL_EXEC_INFO_DEVICE_PTRS_EXT

@pjaaskel (Contributor, Author)

> Now with the CL_MEM_DEVICE_SHARED_ADDRESS option removed, I don't think the current interface of clSetKernelArgDevicePointerEXT makes much sense.

I noticed the same, that it's a bit cumbersome, but it maps to the HIP/CUDA programming model where one device is targeted at a time with the hipSetDevice() API.

> The good thing about binding buffer objects via clSetKernelArg or setting SVM pointers via clSetKernelArgSVMPointer is that you don't have to change the argument to enqueue the same kernel on a different device.

Yes, I agree.

> Now that each device has its own address, one would have to synchronize the set address with the queues launching those kernels, potentially involving a lot of flushing when it wouldn't be necessary.

In practice it means that if you submit the kernel to multiple devices' command queues, you have to set the arg for each submission. The client can still pass the cl_mem handle or sub-buffers using the core clSetKernelArg() API if they don't want to touch the per-device pointers or set them one at a time.
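
A minimal sketch of that pattern, assuming addr_dev0/addr_dev1 are the per-device addresses of the same buffer and queue_dev0/queue_dev1 are queues on the two devices (illustrative names only):

```c
/* Sketch only: the raw-pointer argument is re-set before each enqueue,
 * since arguments are captured at clEnqueueNDRangeKernel call time. */
size_t gws = 1024;

clSetKernelArgDevicePointerEXT(kernel, 0, addr_dev0);
clEnqueueNDRangeKernel(queue_dev0, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);

clSetKernelArgDevicePointerEXT(kernel, 0, addr_dev1);
clEnqueueNDRangeKernel(queue_dev1, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
```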

> Though using sub-buffers if one doesn't want to use the base address is also quite an annoying interface. Maybe we need a clSetKernelArgBuffer, which comes with some useful features, like an offset argument? Might also want to add a couple more things while at it.
>
> Or we just stick with sub-buffers after all...

If you recall, we discussed and considered this option in the beginning, but it was not good enough for HIP/CUDA due to some sub-buffer restrictions (alignment at least, but perhaps something else I forgot) and because some APIs prefer to use the raw pointer address. It might be doable if we lift those restrictions with the extension, and then the runtime has to keep book of which raw address maps to which (sub-)buffer when interfacing from APIs with raw pointers, but is that a much better option than this one?

@karolherbst (Contributor)

> Now that each device has its own address, one would have to synchronize the set address with the queues launching those kernels, potentially involving a lot of flushing when it wouldn't be necessary.
>
> In practice it means that if you submit the kernel to multiple devices' command queues, you have to set the arg for each submission. The client can still pass the cl_mem handle or sub-buffers using the core clSetKernelArg() API if they don't want to touch the per-device pointers or set them one at a time.

Oh right, implementations have to look at the kernel arguments set at clEnqueueNDRangeKernel call time, not sure why I thought it's any different.

> Though using sub-buffers if one doesn't want to use the base address is also quite an annoying interface. Maybe we need a clSetKernelArgBuffer, which comes with some useful features, like an offset argument? Might also want to add a couple more things while at it.
> Or we just stick with sub-buffers after all...
>
> If you recall, we discussed and considered this option in the beginning, but it was not good enough for HIP/CUDA due to some sub-buffer restrictions (alignment at least, but perhaps something else I forgot) and because some APIs prefer to use the raw pointer address. It might be doable if we lift those restrictions with the extension, and then the runtime has to keep book of which raw address maps to which (sub-)buffer when interfacing from APIs with raw pointers, but is that a much better option than this one?

I think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.

clSetKernelArgBuffer could also get a device parameter added, and then the value is specific to the device. That might be feasible and usable? I think that also allows for better debuggability and error checking.

@pjaaskel (Contributor, Author)

> I think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.

I'm not aware of any. If you want to extend the buffer API, I think it should be a separate extension.

Another problem, in addition to the fixed alignment, that I recall with the approach of (ab)using sub-buffers for implementing the raw pointer passing is that sub-buffers have a size. We discussed using size-0 sub-buffers for this and it started to seem hacky, therefore we went with this raw pointer API plan, IIRC. I can try to dig up the old notes if you want to go back to the drawing board with this extension.

> clSetKernelArgBuffer could also get a device parameter added, and then the value is specific to the device. That might be feasible and usable? I think that also allows for better debuggability and error checking.

Do you mean that we could call clSetKernelArgBufferAddress() multiple times, once for each device holding the buffer? We could, but then again it doesn't differ much from calling it multiple times for each NDRange enqueue of that kernel (to different devices' queues), in which case the device is implied from the queue it is being pushed to. Perhaps I can just add a sentence that highlights the fact that the raw pointer's device is implied from the command queue the kernel is enqueued to?

@karolherbst (Contributor)

> I think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.
>
> I'm not aware of any. If you want to extend the buffer API, I think it should be a separate extension.
>
> Another problem, in addition to the fixed alignment, that I recall with the approach of (ab)using sub-buffers for implementing the raw pointer passing is that sub-buffers have a size. We discussed using size-0 sub-buffers for this and it started to seem hacky, therefore we went with this raw pointer API plan, IIRC. I can try to dig up the old notes if you want to go back to the drawing board with this extension.

Oh right, yeah, that's impossible to solve with sub-buffers properly without changing semantics.

> clSetKernelArgBuffer could also get a device parameter added, and then the value is specific to the device. That might be feasible and usable? I think that also allows for better debuggability and error checking.
>
> Do you mean that we could call clSetKernelArgBufferAddress() multiple times, once for each device holding the buffer? We could, but then again it doesn't differ much from calling it multiple times for each NDRange enqueue of that kernel (to different devices' queues), in which case the device is implied from the queue it is being pushed to. Perhaps I can just add a sentence that highlights the fact that the raw pointer's device is implied from the command queue the kernel is enqueued to?

It could allow you to skip calling clSetKernelArgBufferAddress when an argument won't have to change, but that might be niche enough not to matter.

However, it kind of matters more with this API, because it's probably not intended to be thread-safe, as clSetKernelArg isn't. And this extension might want to modify the "Multiple Host Threads" section as well. So if you have arguments which would be the same pointer for a given device, you could skip those calls if you set the pointer once for each device.

Like, I can see the pattern where an application simply binds all arguments once and then uses commands to update the content of those arguments, never touching the kernel objects themselves. And they can do that from multiple threads concurrently and safely as long as they don't modify the kernel object, e.g. by calling clSetKernelArg*.

If you require calling clSetKernelArgBufferAddress before each enqueue, applications will have to synchronize calls to it from multiple threads. Unless of course clSetKernelArgBufferAddress will be considered thread-safe, which then might make the problem a much smaller one.

@pjaaskel (Contributor, Author)

> Unless of course clSetKernelArgBufferAddress will be considered thread-safe, which then might make the problem a much smaller one.

I see. What if we simply declare the call thread-safe here and add a note about the problem in the spec? Somehow I'm reluctant to add the explicit device argument to the argument setter function, as it's not needed in the basic single-device use case.

@karolherbst (Contributor)

> I see. What if we simply declare the call thread-safe here and add a note about the problem in the spec? Somehow I'm reluctant to add the explicit device argument to the argument setter function, as it's not needed in the basic single-device use case.

My concern is that most implementations aren't set up to make it thread-safe easily, because it will require reworking how setting arguments is implemented, meaning it's a higher bar to implement this extension. In rusticl it's not an issue, because at the moment it is implemented in a thread-safe way, though I did consider dropping the thread-safety once I get to improve performance more, and if it's a significant overhead showing up in applications.

Though one could argue that all the other clSetKernelArg* calls are device-independent and the reason they don't have a device arg is that it was never needed, with clSetKernelArgBufferAddress just being the exception here. Though I don't think it's a huge issue to add it. Nor is it a huge issue not to.

I think the one benefit of adding it is that the runtime could verify whether the passed-in handle is indeed a valid one. It could also catch use-after-free cases where you set a handle but the buffer gets deallocated later on, and the same address becomes valid in a new allocation, hiding the bug pretty well.

But if adding it is too much of an issue for chipStar (and other users) then we could also not.

I think it makes sense to wait for what others think about this problem. I don't have any strong opinions here either way, just wanted to bring this issue up before the extension is finalized.

@pjaaskel (Contributor, Author)

I suspect the "others" do not have strong opinions here either, since the need for this extension originates from HIP/CUDA/chipStar. Somehow adding the device argument seems wrong, since so far cl_kernel has been device-independent. I suggest we just go with the current API, as it fulfills the original need and usage pattern, and focus on USVM.

@bashbaug (Contributor)

> I think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.

FYI, there is some work ongoing to specify an alignment at buffer creation time, see internal MR 198.

> clSetKernelArgBuffer could also get a device parameter added, and then the value is specific to the device. That might be feasible and usable? I think that also allows for better debuggability and error checking.

> Somehow I'm reluctant to add the explicit device argument to the argument setter function, as it's not needed in the basic single-device use case.

Perhaps consider doing something like clBuildProgram, which accepts a device list to specify which devices the program should be built for, but the device list can be NULL, in which case the program is built for all devices? I don't know if you'd need a device list or if a single argument would be sufficient.

@pjaaskel (Contributor, Author)

> I think for alignment it would be much better if we could simply specify the alignment at buffer creation time. Not sure if anything is being worked on in this regard though.
>
> FYI, there is some work ongoing to specify an alignment at buffer creation time, see internal MR 198.

Good to know!

> clSetKernelArgBuffer could also get a device parameter added, and then the value is specific to the device. That might be feasible and usable? I think that also allows for better debuggability and error checking.
>
> Somehow I'm reluctant to add the explicit device argument to the argument setter function, as it's not needed in the basic single-device use case.
>
> Perhaps consider doing something like clBuildProgram, which accepts a device list to specify which devices the program should be built for, but the device list can be NULL, in which case the program is built for all devices? I don't know if you'd need a device list or if a single argument would be sufficient.

Hmm, if we change the clSetKernelArgDevicePointerEXT() pointer arg to a list of pointers, with one pointer per device in the context, it would match the buffer info query's return value, meaning we could pass its return value directly to this API. The annoying part here is the potential discrepancy with clBuildProgram's device list: the kernel might not have been built for all devices in the context, whereas the buffers are by default associated with all devices. If we make the list passed to the arg setter match the list of devices passed to clBuildProgram, the client has to filter the pointers from the context superset. Likely not a big deal; I suspect the device list in clBuildProgram is rarely a strict subset of the context's devices - a rarely if ever used feature.

In this sense, Karol's proposal of setting the pointer separately for each device might actually be easier from the client perspective. But then again I don't see the "thread safety" issue of the current proposal as a major one, as cl_kernels can be cloned etc. In the end I don't have strong preferences and would just go with the current simple option, but can also change it to whatever you prefer.

FYI, Michal implemented the current 0.9.1 version in PoCL.

@karolherbst (Contributor)

I forgot that clCloneKernel existed... yeah that's definitely the cheapest option to get around the thread-safety problem.
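
A hedged sketch of that workaround, assuming illustrative addr_for_this_device, queue_for_this_device and gws variables: each thread or per-device submission path sets arguments on its own clone, so no clSetKernelArg* calls race on a shared cl_kernel:

```c
/* Sketch only: clone the kernel so argument setting stays thread-local. */
cl_int err = CL_SUCCESS;
cl_kernel clone = clCloneKernel(kernel, &err);

clSetKernelArgDevicePointerEXT(clone, 0, addr_for_this_device);
clEnqueueNDRangeKernel(queue_for_this_device, clone, 1, NULL, &gws, NULL,
                       0, NULL, NULL);
clReleaseKernel(clone);  /* arguments are captured at enqueue time */
```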

@karolherbst (Contributor) commented Dec 21, 2024

Anyway, updated my implementation to 0.9.1 as well, though I still treat CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT as a flag, because that's also what the current version of the CTS test is doing. Well.. it should also work as a property, but that's untested.

@karolherbst (Contributor)

Another thing I was wondering about today. This extension interacts with SVM in a weird way. One can clSVMAlloc memory and then pass it via a host_ptr into clCreateBuffer or clCreateBufferWithProperties. If CL_MEM_USE_HOST_PTR is also specified as a flag when doing so, the cl_mem object basically wraps the SVM allocation (the CTS has a couple of tests for this).

Now with this extension one can do a CL_MEM_USE_HOST_PTR buffer allocation with an SVM pointer and CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT set to CL_TRUE. That should mean that the addresses returned by CL_MEM_DEVICE_ADDRESS_EXT also match the SVM pointer, no? I wonder if this should explicitly be stated or not. And if the CTS should test for this.
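
A hedged sketch of that scenario, assuming an existing context and size; whether the queried address must equal the SVM pointer is exactly the open question raised here:

```c
/* Sketch only: wrap a coarse-grained SVM allocation in a BDA buffer. */
void *svm_ptr = clSVMAlloc(context, CL_MEM_READ_WRITE, size, 0);

cl_mem_properties props[] = { CL_MEM_DEVICE_PRIVATE_ADDRESS_EXT, CL_TRUE, 0 };
cl_int err = CL_SUCCESS;
cl_mem wrapped = clCreateBufferWithProperties(
    context, props, CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE, size, svm_ptr, &err);

cl_mem_device_address_ext dev_addr = 0;
clGetMemObjectInfo(wrapped, CL_MEM_DEVICE_ADDRESS_EXT,
                   sizeof(dev_addr), &dev_addr, NULL);
/* Open question: must dev_addr equal (cl_mem_device_address_ext)svm_ptr? */
```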

This extension could also just disallow this behavior if it sounds like too much of an edge case that nobody is going to care about.

Just wanted to point out that this is a real possibility, which implementations supporting both might have to face (as mine does now).

@pjaaskel (Contributor, Author) commented Jan 2, 2025

> This extension could also just disallow this behavior if it sounds like too much of an edge case that nobody is going to care about.
>
> Just wanted to point out that this is a real possibility, which implementations supporting both might have to face (as mine does now).

This is a good point. I'd just disallow this, since if (CG) SVM is supported by the implementation, then this extension is rather pointless, as it's supposed to be a simplification of CG SVM.

@pjaaskel (Contributor, Author) commented Jan 2, 2025

...but having said that, if the implementation does support CG SVM, it might still support BDA for legacy/compatibility reasons and in that case the other behavior (the "device ptr" = SVM pointer) would make sense. Other opinions?

@karolherbst (Contributor)

> ...but having said that, if the implementation does support CG SVM, it might still support BDA for legacy/compatibility reasons and in that case the other behavior (the "device ptr" = SVM pointer) would make sense. Other opinions?

Yeah, I mean it shouldn't be hard for the implementation to simply return the SVM pointer for those BDA allocations, because the cl_mem object wrapping an SVM allocation is probably 100x the amount of work compared to handling this edge case.

The normal host_ptr path can have a different address on the GPU side (e.g. if the host memory couldn't be mapped into the GPU's VM), which I think this extension will also have to clarify, but this guarantee also doesn't exist in the core spec (unless it's an SVM allocation).

@pjaaskel (Contributor, Author) commented Jan 3, 2025

Right. I'll add the other option, returning the SVM pointer in this case, in the next specification revision.
