Low performance when writing results back to global memory

Hello,

We wrote a host program that runs a very simple OpenCL kernel.

The kernel is:

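__kernel void id_check(__global float *in, int n, __local float *out)
{
    /* A single work-item writes n float4 values, one per iteration. */
    for (int i = 0; i < n; i++)
    {
        float4 a = (float4)((float)i);  /* broadcast i to all four lanes */
        vstore4(a, i, in);              /* stores to in[4*i] .. in[4*i+3] */
    }
}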

The host code launches the kernel with a single work-item (global size 1), so it runs on a single core:

size_t global_offset[] = {0};
size_t global_size[] = {1};

/* Enqueue kernel */

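err = clEnqueueNDRangeKernel(queue, kernel, dim, global_offset, global_size, NULL, 0, NULL, &prof_event);
if (err != CL_SUCCESS)
{
    /* OpenCL reports failures through its return code, not errno */
    fprintf(stderr, "Couldn't enqueue the kernel (error %d)\n", err);
    exit(1);
}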

clFinish(queue);

/* Timestamps are cl_ulong values in nanoseconds; the queue must be created with CL_QUEUE_PROFILING_ENABLE. */
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(prof_event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);

The run time is time_end - time_start.
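For reference, here is a minimal sketch of how the two timestamps turn into a per-iteration figure. OpenCL profiling counters are reported in nanoseconds, so the difference is already in ns. The helper names elapsed_ns and ns_per_iteration are hypothetical, and uint64_t stands in for cl_ulong so the sketch compiles without the OpenCL headers.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed device time in nanoseconds between two profiling timestamps.
   uint64_t is an assumed stand-in for cl_ulong. */
static uint64_t elapsed_ns(uint64_t time_start, uint64_t time_end)
{
    assert(time_end >= time_start);
    return time_end - time_start;
}

/* Average time per loop iteration of the kernel (hypothetical helper).
   n is the loop count passed to the kernel. */
static double ns_per_iteration(uint64_t total_ns, int n)
{
    assert(n > 0);
    return (double)total_ns / (double)n;
}
```

Usage would be along the lines of ns_per_iteration(elapsed_ns(time_start, time_end), n). Note that this averages over the whole kernel run, so any fixed launch overhead inside the START..END window is spread across the n iterations.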

Is this the right way to measure the kernel's execution time?

The result we got is ~50 ns to write 32 bits (one float) to global memory.

This performance is very low. The rate is ~610 Mbit/s (32 bits per 50 ns, roughly 80 MB/s).
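The quoted rate can be checked with a short calculation. This is only a sketch of the arithmetic, assuming 4 bytes are written every 50 ns; bits_per_second is a hypothetical helper name.

```c
#include <assert.h>

/* Throughput in bits per second for `bytes` written in `ns` nanoseconds. */
static double bits_per_second(double bytes, double ns)
{
    assert(ns > 0.0);
    return bytes * 8.0 / (ns * 1e-9);
}
```

bits_per_second(4.0, 50.0) is about 6.4e8 bit/s. Dividing by 1048576 (2^20) gives roughly 610, which matches the figure above, and dividing by 8e6 gives 80 MB/s.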

How can we get better performance?

Is there a better way to write results back to global memory?

We also changed the store in the kernel to write to the local buffer instead:

vstore4 (a,i,out);

The performance was almost the same. We expected writing to local memory to be about 100x faster than writing to global memory.

Thank you in advance,

Zvika