This program demonstrates a simple implementation of the "SAXPY" kernel. The "S" stands for single-precision (i.e. `float`) and "AXPY" stands for the operation performed: $Y_i = aX_i + Y_i$.

The program proceeds as follows:
- A number of constants are defined to control the problem details and the kernel launch parameters.
- The two input vectors, $X$ and $Y$, are instantiated in host memory. $X$ is filled with an incrementing sequence starting from 1, whereas $Y$ is filled with ones.
- The necessary amount of device (GPU) memory is allocated and the elements of the input vectors are copied to the device memory.
- A trace message is printed to the standard output.
- The GPU kernel is launched with the previously defined arguments.
- The results are copied back to the host vector $Y$.
- The previously allocated device memory is freed.
- The first few elements of the result vector are printed to the standard output.
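
The steps above can be cross-checked on the host. The following is a minimal CPU-only reference sketch, not the GPU code of this example: it builds $X$ and $Y$ as described and applies $Y_i = aX_i + Y_i$ in a plain loop. The function names `saxpy_reference` and `expected_saxpy_result` are hypothetical, and the scalar $a$ is left as a parameter because this text does not state its value.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// CPU reference for SAXPY: y[i] = a * x[i] + y[i], in single precision.
// Hypothetical helper, used only to illustrate the expected results.
void saxpy_reference(float a, const std::vector<float>& x, std::vector<float>& y)
{
    assert(x.size() == y.size());
    for(std::size_t i = 0; i < x.size(); ++i)
    {
        y[i] = a * x[i] + y[i];
    }
}

// Builds the inputs as described above (X = 1, 2, 3, ...; Y = all ones)
// and returns the contents Y is expected to have after the operation.
std::vector<float> expected_saxpy_result(float a, std::size_t size)
{
    std::vector<float> x(size);
    std::iota(x.begin(), x.end(), 1.f); // incrementing sequence starting from 1
    std::vector<float> y(size, 1.f);    // all ones
    saxpy_reference(a, x, y);
    return y;
}
```

For instance, with $a = 2$ (a value picked here purely for illustration), `expected_saxpy_result(2.f, 4)` yields 3, 5, 7, 9, which is what the first printed elements of the result vector would be for that choice of $a$.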

- `hipMalloc` is used to allocate memory in the global memory of the device (GPU). This is usually necessary, since the kernels running on the device cannot access host (CPU) memory (unless it is device-accessible pinned host memory, see `hipHostMalloc`). Note that the memory returned is uninitialized.
- `hipFree` de-allocates device memory allocated by `hipMalloc`. Memory that is no longer used must be freed with this function to avoid resource leakage.
- `hipMemcpy` is used to transfer bytes between the host and the device memory in both directions. A call to it synchronizes the device with the host, meaning that all kernels queued before `hipMemcpy` will finish before the copying starts. The function returns once the copying has finished.
- `myKernelName<<<gridDim, blockDim, dynamicShared, stream>>>(kernelArguments)` queues the execution of the provided kernel on the device. It is asynchronous: the call may return before the execution of the kernel is finished. Its arguments are as follows:
  - The kernel (`__global__`) function to launch.
  - The number of blocks in the kernel grid, i.e. the grid size. It can have up to 3 dimensions.
  - The number of threads in each block, i.e. the block size. It can have up to 3 dimensions.
  - The amount of dynamic shared memory provided for the kernel, in bytes. Not used in this example.
  - The device stream on which the kernel is queued. In this example, the default stream is used.
  - All further arguments are passed to the kernel function. Note that built-in and simple (POD) types may be passed to the kernel, but complex ones (e.g. `std::vector`) usually cannot be.
- `hipGetLastError` returns the error code resulting from the previous operation.
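
To make the grid size and block size arguments more concrete, here is a CPU-only sketch of a one-dimensional launch. It assumes that the grid size is chosen by ceiling division and that each thread derives a global index as `blockIdx.x * blockDim.x + threadIdx.x`, guarded by a bounds check. This is a common pattern, but this text does not confirm that the example uses it. The names `grid_size_for` and `emulate_launch` are hypothetical, and no HIP API is called.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of blocks needed so that grid_size * block_size >= size
// (ceiling division). Hypothetical helper for illustration.
unsigned int grid_size_for(unsigned int size, unsigned int block_size)
{
    assert(block_size > 0);
    return size / block_size + (size % block_size != 0 ? 1u : 0u);
}

// CPU emulation of a 1D launch: every (block, thread) pair computes one
// global index, mirroring blockIdx.x * blockDim.x + threadIdx.x on the device.
// Returns how many times each element was visited.
std::vector<unsigned int> emulate_launch(unsigned int size, unsigned int block_size)
{
    std::vector<unsigned int> visits(size, 0u);
    const unsigned int grid_size = grid_size_for(size, block_size);
    for(unsigned int block_idx = 0; block_idx < grid_size; ++block_idx)
    {
        for(unsigned int thread_idx = 0; thread_idx < block_size; ++thread_idx)
        {
            const unsigned int global_idx = block_idx * block_size + thread_idx;
            // The last block may extend past the data; surplus threads do nothing.
            if(global_idx < size)
            {
                ++visits[global_idx];
            }
        }
    }
    return visits;
}
```

With 1000 elements and a block size of 256, `grid_size_for` yields 4 blocks (1024 threads in total), and the 24 surplus threads of the last block fail the bounds check, so every element is visited exactly once.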
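
The following HIP symbols appear in this example.

Device symbols:

- `threadIdx`, `blockIdx`, `blockDim`

Host symbols:

- `hipMalloc`
- `hipFree`
- `hipMemcpy`
- `hipMemcpyHostToDevice`
- `hipMemcpyDeviceToHost`
- `hipGetLastError`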