Hello,
The input to my kernel is four 2D matrices, each containing 256x32 floats.
The size of the output is the same.
So on the host I call:
cl_uint dim = 2;                              /* work_dim is a cl_uint */
size_t global_offset[] = {0, 0};
size_t global_size[] = {4, 256 * 32};         /* 4 matrices x 8192 elements */
err = clEnqueueNDRangeKernel(queue, kernel, dim, global_offset,
                             global_size, NULL, 0, NULL, &prof_event);
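For reference, the launch geometry above works out to 4 * 8192 = 32768 work-items. The following is only a plain C sketch of that arithmetic, not the actual kernel: the names total_work_items and flat_index are made up for illustration, and the contiguous matrix layout is an assumption, since the kernel body is not shown.

```c
#include <assert.h>
#include <stddef.h>

/* Sizes taken from the post: 4 matrices of 256x32 floats. */
enum { N_MATRICES = 4, ROWS = 256, COLS = 32, ELEMS_PER_MATRIX = ROWS * COLS };

/* Total number of work-items launched by global_size = {4, 256 * 32}. */
size_t total_work_items(void)
{
    return (size_t)N_MATRICES * ELEMS_PER_MATRIX;
}

/* Hypothetical mapping from a 2D global id to a flat float index,
 * ASSUMING the matrices are stored one after another in the buffer.
 * gid0 stands in for get_global_id(0), gid1 for get_global_id(1). */
size_t flat_index(size_t gid0, size_t gid1)
{
    assert(gid0 < N_MATRICES && gid1 < ELEMS_PER_MATRIX);
    return gid0 * ELEMS_PER_MATRIX + gid1;
}
```

With this layout, work-item (1, 0) would touch the first float of the second matrix.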
I decided that each element of the output would be one work-item.
I am not sure that is wise.
The kernel function is:
__kernel void id_check(__global float *in,
__global float *out,
int n_in_matrices,
int n_out_matrices)
In order to make it run faster, I changed it to:
__kernel void id_check(__global float4 *in,
__global float4 *out,
int n_in_matrices,
int n_out_matrices)
Of course, I also changed the kernel code so that 4 elements are processed at a time.
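For reference, a float4 buffer covering the same data holds a quarter as many elements as the float buffer. The following is a plain C sketch of that count. The names float4_count and vectorized_global_size are hypothetical, and it assumes one work-item per float4, which is not confirmed by the kernel code (not shown here).

```c
#include <assert.h>
#include <stddef.h>

enum { VEC_WIDTH = 4 };   /* floats per float4 */

/* Number of float4 elements needed to cover n_floats floats.
 * n_floats must be a multiple of 4. */
size_t float4_count(size_t n_floats)
{
    assert(n_floats % VEC_WIDTH == 0);
    return n_floats / VEC_WIDTH;
}

/* Hypothetical second entry of global_size for one matrix,
 * ASSUMING each work-item handles exactly one float4. */
size_t vectorized_global_size(size_t rows, size_t cols)
{
    return float4_count(rows * cols);
}
```

Under that assumption, one 256x32 matrix would need 2048 work-items in the second dimension rather than 256 * 32 = 8192.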
In both cases I got the same results and the same processing time.
That does not make sense to me!
I expected the second version to run 4 times faster.
What should I change in the host code?
Thanks,
Zvika