Multiple NDRangeKernel calls: output becomes input

Hello,

I perform a 5-point stencil operation on a 2D array until a convergence criterion computed on that array is met. So I have multiple iterations (until convergence), and for each iteration I call the clEnqueueNDRangeKernel function to compute the new values of the 2D input array.

I wrote a first version that works, but its performance is very bad. Here is this first solution, inside the main loop of my code:

while (!convergence) {

    clEnqueueNDRangeKernel(...);

    // Read output buffer and put it into xoutput
    clEnqueueReadBuffer(..., xoutput_buffer, ..., xoutput, ...);

    // Read error buffer and put it into error
    clEnqueueReadBuffer(..., error_buffer, ..., error, ...);

    // Write the output array into the input buffer
    clEnqueueWriteBuffer(..., xinput_buffer, ..., xoutput, ...);

    // Set the input buffer as the input argument for the next NDRangeKernel call
    status = clSetKernelArg(kernel,
                            5,
                            sizeof(cl_mem),
                            (void *)&xinput_buffer);

    if ((error < epsilon) || (step > maxStep)) break;
}

where xinput_buffer is the buffer associated with the xinput array and xoutput_buffer is associated with the xoutput array.

The kernel code is :

__kernel void kernelStencil(const double diagx, const double diagy,
const double weightx, const double weighty,
const int size_x,
__global double* input_array,
__global double* output_array,
__global double* error_array)
{
int iy = get_global_id(0)+1;
int ix = get_global_id(1)+1;

    output_array[iy*(size_x+2)+ix] = input_array[(iy+1)*(size_x+2)+ix]
                                   + input_array[(iy-1)*(size_x+2)+ix]
                                   + input_array[iy*(size_x+2)+(ix+1)]
                                   + input_array[iy*(size_x+2)+(ix-1)]
                                   + input_array[iy*(size_x+2)+ix];
}

To get better performance, someone advised me to swap the input (xinput_buffer) and output (xoutput_buffer) buffers before the next call of clEnqueueNDRangeKernel.

So I did :

xinput_buffer = create buffer 1
xoutput_buffer = create buffer 2

clEnqueueWriteBuffer(…, xinput_buffer, …);
clEnqueueWriteBuffer(…, xoutput_buffer, …);

cl_mem *ptrInput = &xinput_buffer;
cl_mem *ptrOutput = &xoutput_buffer;

while (!convergence)
{
// Set argument 5 (input_array)
clSetKernelArg(…, 5, ptrInput, …);

// Set argument 6 (output_array)
clSetKernelArg(..., 6, ptrOutput, ...);
clEnqueueNDRangeKernel(...);

// swap buffers
cl_mem *ptrTmp = ptrInput;
ptrInput = ptrOutput;
ptrOutput = ptrTmp;

}

This method works, and I would like an explanation of this approach:

Initially, I thought that I only needed to set the output buffer as the input buffer for the next call of clEnqueueNDRangeKernel: this way, I pass the same buffer before calling the kernel code.

Here is this attempt:

while (!convergence) {

    clEnqueueNDRangeKernel(...);

    // Read error buffer and put it into error
    clEnqueueReadBuffer(..., error_buffer, ..., error, ...);

    // Set the output buffer as the input argument for the next NDRangeKernel call
    status = clSetKernelArg(kernel,
                            5,
                            sizeof(cl_mem),
                            (void *)&xoutput_buffer);
}

But in my kernel code, I compute the next value (the output value) by :

int iy = get_global_id(0)+1;
int ix = get_global_id(1)+1;
output_array[iy*(size_x+2)+ix] = function(input_array);

So I think there are conflicts between work-item threads in the stencil computation, i.e. some values are overwritten because I use the same buffer (actually the same pointer) for both the input and output arrays.

Finally, the solution is, in addition to setting the output buffer as the input buffer, to also set the input buffer as the output one before the call to clEnqueueNDRangeKernel, in order to avoid values being overwritten inside the kernel code.

That's why I need to swap: is this explanation right?

Any help is welcome

You are doing extensive buffer manipulation in the host code, and this is very likely the source of your performance problems. The host code should only initialize the buffers and read back the final results; it should not touch the data on every iteration. Use OpenCL kernels for that. You left out your buffer creation code, but I suspect you think buffers can only be read-only or write-only, when in fact they can be read-write (CL_MEM_READ_WRITE). You can create two read-write buffers and ping-pong which one is the source and which is the destination, instead of copying them through host code (which goes through PCIe and host memory to do the work, when it could be zero cost).