Hello,
We wrote host code that runs a very simple OpenCL kernel.
The kernel is:
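__kernel void id_check(__global float *in, int n, __local float *out)
{
    int i;
    float4 a;
    for (i = 0; i < n; i++)
    {
        a = (float4)((float)i);  /* broadcast i to all four lanes */
        vstore4(a, i, in);       /* writes in[4*i] .. in[4*i+3] */
    }
}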
The host code runs the kernel as a single work-item:
size_t global_offset[] = {0};
size_t global_size[] = {1};
/* Enqueue kernel */
err = clEnqueueNDRangeKernel(queue, kernel, dim, global_offset, global_size, NULL, 0, NULL, &prof_event);
if(err < 0)
{
perror("Couldn't enqueue the kernel");
exit(1);
}
clFinish(queue);
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START,sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END,sizeof(time_end), &time_end, NULL);
The run time is time_end - time_start.
Is this the right way to measure the time?
The result we got is ~50 ns for writing 32 bits (1 float) to global memory.
This is very low: the rate is only ~610 Mb/s.
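To make the arithmetic behind these figures explicit, here is a minimal sketch of how the two timestamps turn into a per-write time and a rate. It is plain C with no OpenCL dependency: uint64_t stands in for cl_ulong, and the helper names ns_per_write and bits_per_second are made up for illustration, not part of any API. It also assumes the ~610 figure counts 2^20 bits per Mb.

```c
#include <stdint.h>

/* Profiling timestamps are 64-bit nanosecond counters (cl_ulong in OpenCL).
   uint64_t stands in for cl_ulong so this sketch builds without OpenCL headers. */

/* Average time per write in ns, given the start/end timestamps and the
   number of writes the kernel performed (hypothetical helper). */
static double ns_per_write(uint64_t time_start, uint64_t time_end, uint64_t writes)
{
    if (writes == 0 || time_end < time_start)
        return 0.0;
    return (double)(time_end - time_start) / (double)writes;
}

/* Rate in bits per second when `bits` are written every `ns` nanoseconds
   (hypothetical helper). */
static double bits_per_second(double bits, double ns)
{
    return ns > 0.0 ? bits * 1e9 / ns : 0.0;
}
```

For example, 100 writes over 5000 ns give ns_per_write = 50, and bits_per_second(32, 50) is 6.4e8 bits/s, which is about 610 after dividing by 2^20.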
How can we get better performance?
Is there a better way to write results back to global memory?
When we changed the kernel to write to local memory instead:
vstore4 (a,i,out);
the performance was almost the same. We expected writing to local memory to be about 100x faster than writing to global memory.
Thank you in advance,
Zvika