Concurrent kernels in OpenCL

Hello, I would like to know how I can execute two or more different kernels in parallel, at the same time, on the same GPU using OpenCL.
My main idea is to use two different kernels (kernel A and kernel B), but they need to use the same memory (I do not want to duplicate the data by creating a separate buffer for each kernel for the “a” and “b” pointers). So is there another way I can accomplish the dual execution with an efficient memory technique?
The codes of the kernels are the following:
Kernel A:


kernel void kernelA(global struct VectorStruct* a, int aLen0, global struct VectorStruct* b, int bLen0, global struct VectorStruct* c, int cLen0)
{
	int i = get_local_id(0);
	c[i].x = a[i].x + b[i].x;
}

Kernel B:


kernel void kernelB(global struct VectorStruct* a, int aLen0, global struct VectorStruct* b, int bLen0, global struct VectorStruct* d, int dLen0)
{
	int i = get_local_id(0);
	d[i].y = a[i].y + b[i].y;
}


The definition for the struct VectorStruct is the following:


struct VectorStruct
{
	int x;
	int y;
};


In the host code I have to create four pointers:
VectorStruct* a
VectorStruct* b
VectorStruct* c
VectorStruct* d
The pointers “a” and “b” hold the data that I will transfer to the GPU. The pointer “c” will store the results of kernel A, and the pointer “d” will store the results of kernel B.
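
This is roughly what I have in mind on the host side (just a sketch; names like `ctx`, `kernelA`, `kernelB` and the element count `n` are placeholders, and error checking is omitted). The point is that both kernels are given the same `cl_mem` handles for “a” and “b”, so the data would live on the GPU only once:

```c
#include <CL/cl.h>

struct VectorStruct { int x; int y; };

/* Sketch: assumes ctx, kernelA and kernelB were created earlier;
   n is the element count of the host arrays a and b. */
void set_shared_args(cl_context ctx, cl_kernel kernelA, cl_kernel kernelB,
                     struct VectorStruct *a, struct VectorStruct *b, cl_int n)
{
    cl_int err;
    size_t bytes = (size_t)n * sizeof(struct VectorStruct);

    /* One device buffer per array: both kernels receive the SAME
       cl_mem objects for a and b, so nothing is duplicated. */
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, a, &err);
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, bytes, b, &err);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);
    cl_mem bufD = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

    /* kernelA(a, aLen0, b, bLen0, c, cLen0) */
    clSetKernelArg(kernelA, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernelA, 1, sizeof(cl_int), &n);
    clSetKernelArg(kernelA, 2, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernelA, 3, sizeof(cl_int), &n);
    clSetKernelArg(kernelA, 4, sizeof(cl_mem), &bufC);
    clSetKernelArg(kernelA, 5, sizeof(cl_int), &n);

    /* kernelB(a, aLen0, b, bLen0, d, dLen0) reuses bufA and bufB */
    clSetKernelArg(kernelB, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernelB, 1, sizeof(cl_int), &n);
    clSetKernelArg(kernelB, 2, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernelB, 3, sizeof(cl_int), &n);
    clSetKernelArg(kernelB, 4, sizeof(cl_mem), &bufD);
    clSetKernelArg(kernelB, 5, sizeof(cl_int), &n);
}
```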

Hello hanarce.

I don’t see any problem here. clEnqueueNDRangeKernel is asynchronous by nature, so you just need to enqueue the second kernel right after the first one. Modern GPUs can run kernels in parallel, but you need to check whether that is the case for your device. It also depends on whether the required resources are available.

It also depends on whether they use the same resources (for example, if kernelA writes to a buffer that kernelB then reads, the runtime should not run them in parallel). A given vendor might make you jump through more hoops, such as requiring separate command queues for kernelA and kernelB.
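
A minimal sketch of the two-queue approach (assuming `ctx`, `dev`, `kernelA` and `kernelB` already exist and their arguments point at the shared buffers; error checking omitted):

```c
#include <CL/cl.h>

/* Sketch: two independent in-order queues on the same device, one
   kernel enqueued on each. The runtime is free to overlap them if
   the hardware and resource usage allow it. */
void run_on_two_queues(cl_context ctx, cl_device_id dev,
                       cl_kernel kernelA, cl_kernel kernelB, size_t n)
{
    cl_int err;
    cl_command_queue q1 = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);
    cl_command_queue q2 = clCreateCommandQueueWithProperties(ctx, dev, NULL, &err);

    clEnqueueNDRangeKernel(q1, kernelA, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueNDRangeKernel(q2, kernelB, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Wait for both queues to drain before touching the results. */
    clFinish(q1);
    clFinish(q2);

    clReleaseCommandQueue(q1);
    clReleaseCommandQueue(q2);
}
```

(On pre-2.0 platforms, `clCreateCommandQueue` plays the same role as `clCreateCommandQueueWithProperties` here.)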

This is a deceptively complex question :D.

The short answer is that some OpenCL implementations will support this, usually by using out-of-order queues, multiple in-order queues, or some combination of the two.
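
As a sketch of the out-of-order variant (assuming an OpenCL 2.0 platform and pre-existing `ctx`, `dev` and kernels; error checking omitted): commands on such a queue have no implicit ordering, so dependencies on the results must be expressed through events.

```c
#include <CL/cl.h>

/* Sketch: enqueue both kernels on one out-of-order queue. Without an
   implicit order, the implementation MAY run them concurrently, but
   nothing guarantees that it will. */
void run_out_of_order(cl_context ctx, cl_device_id dev,
                      cl_kernel kernelA, cl_kernel kernelB, size_t n)
{
    cl_int err;
    cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0
    };
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, props, &err);

    cl_event evA, evB;
    clEnqueueNDRangeKernel(q, kernelA, 1, NULL, &n, NULL, 0, NULL, &evA);
    clEnqueueNDRangeKernel(q, kernelB, 1, NULL, &n, NULL, 0, NULL, &evB);

    /* Any command that consumes the results must wait on the events
       explicitly, e.g. a read-back of c would put evA in its wait list. */
    cl_event both[] = { evA, evB };
    clWaitForEvents(2, both);

    clReleaseEvent(evA);
    clReleaseEvent(evB);
    clReleaseCommandQueue(q);
}
```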

The Execution Model chapter in the OpenCL spec describes the conditions under which kernel instances may run in parallel. Note that the OpenCL specification currently doesn’t guarantee that multiple kernel instances will run in parallel, but you can submit commands such that they may run in parallel.

https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_execution_model

I’d also recommend reading the Memory Model chapter in the OpenCL spec, which describes how to safely access memory from kernel instances that may run in parallel, particularly if multiple kernel instances will be writing to the same memory.

https://www.khronos.org/registry/OpenCL/specs/2.2/html/OpenCL_API.html#_memory_model

Good luck!

Interesting!

Is there a way to identify which devices actually support concurrent kernel execution?

I have been trying to use Intel multicore CPUs (which support CL 2.0) for a program that communicates via pipes, but it seems that one kernel has to complete execution before the others can start:
https://software.intel.com/en-us/forums/opencl/topic/559760
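
For reference, the closest capability check I have found is querying the device’s command-queue properties; as far as I can tell, out-of-order support is necessary but not sufficient for kernels to actually overlap:

```c
#include <CL/cl.h>

/* Sketch: returns nonzero if the device advertises out-of-order
   host command queues. This does NOT guarantee that two kernels
   will really execute concurrently on the device. */
int supports_out_of_order(cl_device_id dev)
{
    cl_command_queue_properties qp = 0;
    clGetDeviceInfo(dev, CL_DEVICE_QUEUE_PROPERTIES, sizeof(qp), &qp, NULL);
    return (qp & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) != 0;
}
```

(In OpenCL 2.0 the same query is also available under the name CL_DEVICE_QUEUE_ON_HOST_PROPERTIES.)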

Any comments on this?

Leo