OpenCL finer-grained parallelism

Hello! I am interested in using OpenCL on the new Intel Haswell by creating two kernels, one that runs on the CPU and one that runs on the GPU. The problem I am facing is that the two kernels need some synchronization points. I saw that there is a way of creating sync points, but only at the command-queue level. I am not interested in that, because then I would end up with a lot of code on the host side. I would like to have everything done at the kernel level. Haswell should make this easier, because the CPU and GPU are on the same die and share the memory hierarchy.

What I have done so far is create a common buffer shared by the two devices, which should synchronize on that buffer. However, something weird is going on: the GPU always sees what the CPU writes, but the CPU does not see what the GPU writes. Why is that? I am using clCreateBuffer with CL_MEM_USE_HOST_PTR. Shouldn’t the values produced by either the CPU or the GPU reside in the same location? I understand that some data might be held in caches and only written back to main memory when the kernel finishes, but I tried very large vectors and it still made no difference.

Can someone give me some advice or answers?

Thanks,
Doru

You need to use clEnqueueMapBuffer to get host access and then clEnqueueUnmapMemObject to give it back to the GPU.

You can use events to synchronize between CPU and GPU as needed.

You’ll have to wait for OpenCL 2.0 SVM implementations to have finer-grained memory sharing.

I see. Will OpenCL 2.0 SVM provide memory consistency between the CPU and GPU, or a mechanism to orchestrate the two of them?

Thanks for the information.

However, another idea came up, although it does not seem to work, and I was wondering why. The idea is as follows:
I will have a first kernel (the big kernel) that I launch on the GPU:

__kernel void do_work(…args…) {
    // code
    // sync point
    // code
    // sync point
    // code
}

Every time I need to sync the CPU with the GPU, I will launch another small kernel on the GPU that contains just the sync point, to unlock the code in the first big kernel.

To do that, I need two different command queues for the same GPU device: I launch the big kernel on the first command queue and the small one on the second. For the sync itself, I need an array into which I write 0 and 1 so that the pseudo-barrier has any effect. I am passing the same cl_mem object to both kernels. It should work, because the device has the same view of the memory in both kernels.

What am I doing wrong?

Thanks