Hello,
I perform a 5-point stencil operation on a 2D array until getting a convergence computed on this 2D array. So I have multiple iterations (until convergence) and for each iteration, I am calling clEnqueueNDRangeKernel function to compute the new values of 2D input array.
I have done a version which works but performances are very bad : below this first solution into the main loop of my code :
while (!convergence) {
clEnqueueNDRangeKernel();
// Read output buffer and put it into xoutput
clEnqueueReadBuffer( xoutput_buffer, xoutput);
// Read error buffer and put it into error
clEnqueueReadBuffer( error_buffer, error);
// Write output array to input buffer
clEnqueueWriteBuffer( xinput_buffer, xoutput)
// put input buffer into input argument for next call of NDRangeKernel
status = clSetKernelArg(
kernel,
5,
sizeof(cl_mem),
(void*)&xinput_buffer);
}
if ((convergence<epsilon) || (step>maxStep)) break;
}
where xinput_buffer is the buffer associated to xinput array and xoutput_buffer is associated to xoutput array.
The kernel code is :
__kernel void kernelStencil(const double diagx, const double diagy,
const double weightx, const double weighty,
const int size_x,
__global double* input_array,
__global double* output_array,
__global double* error_array)
{
int iy = get_global_id(0)+1;
int ix = get_global_id(1)+1;
output_array[iy*(size_x+2)+ix] = input_array[(iy+1)(size_x+2)+ix];
+ input_array[(iy-1)(size_x+2)+ix];
+ input_array[iy*(size_x+2)+(ix+1)];
+ input_array[iy*(size_x+2)+(ix-1)];
+ input_array[iy*(size_x+2)+ix];
}
In order to get better performances, One advises me to swap the input (xinput_buffer) and output (output_buffer) before the next call of NDrangeKernel().
So I did :
xinput_buffer = create buffer 1
xoutput_buffer = create buffer 2
clEnqueueWriteBuffer(…, xinput_buffer, …);
clEnqueueWriteBuffer(…, xoutput_buffer, …);
cl_mem *ptrInput = &xinput_buffer;
cl_mem *ptrOutput = &xoutput_buffer;
while (!convergence)
{
// Set argument 5 (input_array)
clSetKernelArg(…, 5, ptrInput, …);
// Set argument 6 (output_array)
clSetKernelArg(..., 6, ptrOutput, ...);
clEnqueueNDRangeKernel(...);
// swap buffers
cl_mem *ptrTpm = ptrInput;
ptrInput = ptrOutput;
ptrOuput = ptrTmp;
}
This method works and I want to have an explanation about this approach :
Initially, I thought that I only needed to set output buffer to input buffer for the next call of NDRangeKernel :this way, I put the same buffer before calling kernel code ;
Here this attempt :
while (!convergence) {
clEnqueueNDRangeKernel();
// Read error buffer and put it into error
clEnqueueReadBuffer( error_buffer, error);
// put output_buffer into input argument for next call of NDRangeKernel
status = clSetKernelArg(
kernel,
5,
sizeof(cl_mem),
(void*)&xoutput_buffer);
}
But in my kernel code, I compute the next value (the output value) by :
int iy = get_global_id(0)+1;
int ix = get_global_id(1)+1;
output_array[iy(size_x+2)+ix] =function(input_array);
So I think there are conflicts between work-items threads for the stencil computation, i.e some values are overwritten because I use the same buffer (actually the same pointer for input and output array).
Finally, the solution is to, in addition to set output buffer to input buffer, to set also the input buffer to output one before the call of NDRangeKernel in order to avoid conflicts of overwritten values into the kernel code.
that’s why I need to swap : is this explanation right ?
Any help is welcome