Hi, I wrote an OpenCL kernel that computes the dot product of two double arrays. This is the code:
__kernel void evaluate_product(__global const double *pFirstArray, const int n,
                               __global const double *pSecondArray, __global double *pOutArray)
{
    int gid = get_global_id(0);
    int size = get_global_size(0);
    if (gid >= 0 && gid < size) {
        double output = 0.0;
        // Every work-item sums all n products and stores the result at its own index.
        for (int k = 0; k < n; k++)
            output += pFirstArray[k] * pSecondArray[k];
        pOutArray[gid] = output;
    }
}
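
For comparison, the sum each work-item computes corresponds to the plain C loop below. This is only a minimal sketch for checking the kernel's output on the host; the function name dot_reference and its parameter names are hypothetical and not part of the original host code.

```c
#include <stddef.h>

/* Hypothetical host-side reference: the same sum that each work-item
   of the kernel accumulates over the two input arrays. */
double dot_reference(const double *first, const double *second, size_t n)
{
    double output = 0.0;
    for (size_t k = 0; k < n; k++)
        output += first[k] * second[k];
    return output;
}
```

Every element of pOutArray should match this value up to floating-point rounding, since the device may order or fuse the multiply-adds differently.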

Why does this kernel take 30 ms on an NVIDIA GTX 260, while on an ATI Radeon HD 6900 it takes less than 10 ms?
Any ideas? Or is there some optimization I could use in the kernel for the NVIDIA card?