OpenCL synchronization between work-groups

I am enqueuing an OpenCL kernel several times in a loop.

The kernel is enqueued within a loop in the host program. In each iteration we wait for the commands to complete (using clFinish), copy the current buffer values into another buffer (consider it the previous-values buffer), and then continue to the next iteration, where the kernel is enqueued again, as shown below.

loop n times
{
    1. call kernel (uses the *previous buffer values* for updating the current buffer values, which includes the 4 neighbours if we consider a 2D grid)
    2. wait for the kernel command to finish.
    3. copy the current buffer values to the previous buffer.
    4. wait for all commands to finish.
}
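In host code the loop is essentially the following (simplified; the queue, kernel, buffer, and size names here are placeholders for my actual variables):

for (int n = 0; n < NUM_ITERATIONS; ++n) {
    cl_int err;

    /* 1. Run the solver: reads p0 (previous values), writes p (current). */
    err = clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                                 globalSize, localSize, 0, NULL, NULL);

    /* 2. Wait for the kernel to complete. */
    clFinish(queue);

    /* 3. Copy the current values into the previous-values buffer. */
    err = clEnqueueCopyBuffer(queue, p, p0, 0, 0, bufferBytes, 0, NULL, NULL);

    /* 4. Wait for the copy to complete. */
    clFinish(queue);
}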

The values from the previous-values buffer are used to update the current buffer values. However, each value in the current buffer is only being updated from the neighbouring values (in the previous-values buffer from the previous iteration) that belong to the same work-group (wavefront) or to later work-groups; the corresponding values from earlier work-groups in the previous-values buffer are completely ignored. In theory, all the neighbours should be considered, including those that fall in earlier work-groups. After each clFinish, all the values in the current buffer have been updated, and only then do we copy these values into the previous-values buffer, so they should all be available in the next kernel call. My question is why this does not work as expected, even though the previous-values buffer is declared __global and is a read-only buffer, so we cannot assign values to it within the kernel.

Sorry for such a long explanation :slight_smile: , I wanted to make my problem clear. It becomes clearer when the attached kernel code is seen.

__kernel void ProjectionSolver(__global int* xGridSize, __global int* yGridSize,
                               __global int* zGridSize, __global float* p,
                               __global float* p0, __global float* div,
                               __global int* obstacles, float a, float c)
{
	/* Rebuild the global coordinate from group id * group size + local id. */
	int i  = get_local_id(0);
	int j  = get_local_id(1);
	int k  = get_local_id(2);
	int ig = get_group_id(0);
	int jg = get_group_id(1);
	int kg = get_group_id(2);
	int id = get_local_size(0);
	int jd = get_local_size(1);
	int kd = get_local_size(2);
	int igl = get_global_size(0);
	int jgl = get_global_size(1);
	int kgl = get_global_size(2);

	/* Linear index of this cell and of its six neighbours in the 3D grid. */
	int index          = (((ig*id)+i)*jgl*kgl)   + (((jg*jd)+j)*kgl)   + ((kg*kd)+k);
	unsigned int left  = (((ig*id)+i-1)*jgl*kgl) + (((jg*jd)+j)*kgl)   + ((kg*kd)+k);
	unsigned int right = (((ig*id)+i+1)*jgl*kgl) + (((jg*jd)+j)*kgl)   + ((kg*kd)+k);
	unsigned int up    = (((ig*id)+i)*jgl*kgl)   + (((jg*jd)+j+1)*kgl) + ((kg*kd)+k);
	unsigned int down  = (((ig*id)+i)*jgl*kgl)   + (((jg*jd)+j-1)*kgl) + ((kg*kd)+k);
	unsigned int front = (((ig*id)+i)*jgl*kgl)   + (((jg*jd)+j)*kgl)   + ((kg*kd)+k+1);
	unsigned int back  = (((ig*id)+i)*jgl*kgl)   + (((jg*jd)+j)*kgl)   + ((kg*kd)+k-1);

	float xPressure, yPressure, zPressure, currentPressure;
	float frontValue, backValue, upValue, downValue, rightValue, leftValue;
	int solid;

	if (obstacles[index] == 0)
	{
		solid = 0;

		/* x axis: at a boundary (tested with the local id i) use only the
		   in-bounds neighbour, otherwise sum both. */
		if (i-1 < 0)
		{
			xPressure = p0[right];
		}
		else if (i+1 > igl-1)
		{
			xPressure = p0[left];
		}
		else
		{
			xPressure  = p0[right] + p0[left];
			rightValue = p0[right];
			leftValue  = p0[left];
		}

		/* y axis, same pattern with the local id j. */
		if (j-1 < 0)
		{
			yPressure = p0[up];
		}
		else if (j+1 > jgl-1)
		{
			yPressure = p0[down];
		}
		else
		{
			yPressure = p0[down] + p0[up];
			upValue   = p0[up];
			downValue = p0[down];
		}

		/* z axis; note the upper bound uses zGridSize[0] rather than kgl. */
		if (k-1 < 0)
		{
			zPressure = p0[front];
		}
		else if (k+1 > zGridSize[0]-1)
		{
			zPressure = p0[back];
		}
		else
		{
			zPressure  = p0[front] + p0[back];
			frontValue = p0[front];
			backValue  = p0[back];
		}

		/* Jacobi-style update from the previous-iteration values. */
		currentPressure = (div[index] + (xPressure + yPressure + zPressure)) / c;
		p[index] = currentPressure;
	}
	else
	{
		solid = 1;
		currentPressure = 0;
		p[index] = currentPressure;
	}

	barrier(CLK_GLOBAL_MEM_FENCE);
}

Do you check the error code returned by each OpenCL function?
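Even something as blunt as this (illustrative macro, not from your code) catches most setup mistakes:

#include <stdio.h>
#include <stdlib.h>

#define CHECK_CL(err)                                          \
    do {                                                       \
        if ((err) != CL_SUCCESS) {                             \
            fprintf(stderr, "OpenCL error %d at %s:%d\n",      \
                    (int)(err), __FILE__, __LINE__);           \
            exit(1);                                           \
        }                                                      \
    } while (0)

/* usage */
cl_int err = clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                                    globalSize, localSize, 0, NULL, NULL);
CHECK_CL(err);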

Can you post the code that calls this kernel and handles the copies?

Why all the math using the work-group id and size, etc.? I'd think you could calculate the index of these values simply using get_global_id, as sketched below.
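Untested sketch of what I mean, reusing your variable names where they match (it produces the same linearization as your formula, since global id = group id * local size + local id when there's no global offset):

int x = get_global_id(0);
int y = get_global_id(1);
int z = get_global_id(2);
int ysize = get_global_size(1);
int zsize = get_global_size(2);

int index = (x * ysize + y) * zsize + z;

unsigned int left  = index - ysize * zsize;  /* (x-1, y, z) */
unsigned int right = index + ysize * zsize;  /* (x+1, y, z) */
unsigned int down  = index - zsize;          /* (x, y-1, z) */
unsigned int up    = index + zsize;          /* (x, y+1, z) */
unsigned int back  = index - 1;              /* (x, y, z-1) */
unsigned int front = index + 1;              /* (x, y, z+1) */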

Why the barrier at the end of the kernel? barrier() only synchronizes work-items within a single work-group, and at the very end of the kernel it's not doing anything except slowing you down.

If your goal is to iteratively run this simulation over and over, I’d suggest a double-buffer setup where you run from buffer 1 to buffer 2, then buffer 2 back to buffer 1, etc. Then you don’t need to copy buffers.
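Something along these lines (sketch only; argument indices 3 and 4 here are assumed from the p and p0 parameters in the kernel you posted):

cl_mem buf[2];  /* both created with CL_MEM_READ_WRITE */
/* ... create both buffers and fill buf[0] with the starting field ... */

for (int n = 0; n < NUM_ITERATIONS; ++n) {
    cl_mem src = buf[n % 2];        /* previous values (p0) */
    cl_mem dst = buf[(n + 1) % 2];  /* current values (p)   */
    clSetKernelArg(kernel, 3, sizeof(cl_mem), &dst);
    clSetKernelArg(kernel, 4, sizeof(cl_mem), &src);
    clEnqueueNDRangeKernel(queue, kernel, 3, NULL,
                           globalSize, localSize, 0, NULL, NULL);
    /* On an in-order queue the launches serialize by themselves,
       so no clFinish is needed inside the loop. */
}
clFinish(queue);  /* one sync at the end instead of per iteration */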

Avoid using the CPU to do any of the buffer management; if you do need a device-to-device copy, use clEnqueueCopyBuffer rather than reading the data back to the host.

Pay attention to what "read only" and "write only" mean; with OpenCL buffers this is often from the point of view of the GPU, not the CPU. I'd suggest starting with read/write buffers and only changing them after you get it working.
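For instance (host side; the context and size names are illustrative):

cl_int err;
cl_mem p  = clCreateBuffer(context, CL_MEM_READ_WRITE, bufferBytes, NULL, &err);
cl_mem p0 = clCreateBuffer(context, CL_MEM_READ_WRITE, bufferBytes, NULL, &err);

/* CL_MEM_READ_ONLY would only mean the *kernel* can't write p0;
   the host could still fill it with clEnqueueWriteBuffer. */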

The best way to develop OpenCL kernels is to start with something simple that works and then make it more complex. I’d pare back your kernel to the simplest thing that works and then start making it do what you need it to do.

Like the other poster said, it’s hard to debug this without the host code or the other kernel.

FYI he’s also asked it over here a few days earlier, and has more source + a ton of help.

http://devgurus.amd.com/message/1284667

One hopes he might summarise his findings should he get it to function.