Best way to implement this?

To be clear up front: I do not expect anyone to do my work for me; I would just like some thoughts.

I have a kernel that needs to scan every item in an array of data (pseudocode):


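kernel void myKernel(
    global const float* arrayValues,
    global const float* arrayMult,
    global float* output,   // written by the kernel, so it cannot be const
    int numValues)          // OpenCL pointers carry no length; pass it explicitly
{
    int index = get_global_id(0);
    float value = 0.0f;     // float accumulator, matching the float inputs
    for (int i = 0; i < numValues; i++)
    {
        int x = algorithm;  // pseudocode placeholder for the index into arrayMult
        value += arrayMult[x] * arrayValues[i];
    }
    output[index] = value;
}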


So every work-item makes a lot of reads from global memory, and arrayValues is too large to fit into local memory. What would be the best way to approach this?

To be more specific (maybe it will help): I am running a small grid of weights over a larger grid of values. Each value is recalculated as the sum of (the values around it) * (the corresponding weights), but this makes so many reads from global memory that it runs slower on the GPU than on the CPU. Here is the code that I am using to call the kernel:


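// sizeIn is a size_t holding the number of floats in the input array (one work-item per element)
// global_work_offset and local_work_size are NULL, so the runtime picks the work-group size
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &sizeIn, NULL, 0, NULL, NULL);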
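
The call above passes NULL for local_work_size. If an explicit work-group size is supplied instead, OpenCL 1.x requires the global size to be a multiple of it, so the global size usually gets padded. A minimal sketch of that arithmetic follows; round_up_work_size is a hypothetical helper name, not part of the OpenCL API or of my code:

```c
#include <stddef.h>

/* Round the global work size up to the next multiple of the
   work-group size. Hypothetical helper, shown for illustration only. */
size_t round_up_work_size(size_t global, size_t local)
{
    size_t remainder = global % local;
    return remainder == 0 ? global : global + (local - remainder);
}
```

With a padded global size, the kernel would also need a guard at the top (for example, returning early when the work-item index is past the last element), because the extra work-items have no data to process.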

And here is the part of my kernel that is dragging down the performance:


__kernel void simple(
	global const float* input,
	global float* output,
	constant float* weightsIn,
	private int halfVal,
	private int numData,
	private int numWeights)

...

for(int yIn = boundsYLeft; yIn <= boundsYRight; yIn++)
{
	for(int xIn = boundsXLeft; xIn <= boundsXRight; xIn++)
	{
		// one constant-memory read and one global-memory read per weight
		weight += weightsIn[(xIn-boundsXLeft)+(yIn-boundsYLeft)*numWeights] * input[xIn+yIn*numData];
	}
}
output[index] = weight;
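
A plain-C reference version of this loop can be handy for checking the GPU output. The sketch below is not the actual kernel: it assumes the data grid is numData x numData, the weight grid is numWeights x numWeights with numWeights == 2 * halfVal + 1, and the bounds are the window around each element clamped to the grid edges (the bounds computation is elided above, so that part is a guess). weighted_sum_reference and clamp_int are hypothetical names.

```c
#include <stddef.h>

/* Clamp v into [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* CPU reference for the weighted-neighbourhood sum.
   ASSUMPTIONS: square numData x numData grid, square weight grid with
   numWeights == 2 * halfVal + 1, window clamped at the grid edges. */
void weighted_sum_reference(const float *input, float *output,
                            const float *weightsIn,
                            int halfVal, int numData, int numWeights)
{
    for (int y = 0; y < numData; y++) {
        for (int x = 0; x < numData; x++) {
            int boundsXLeft  = clamp_int(x - halfVal, 0, numData - 1);
            int boundsXRight = clamp_int(x + halfVal, 0, numData - 1);
            int boundsYLeft  = clamp_int(y - halfVal, 0, numData - 1);
            int boundsYRight = clamp_int(y + halfVal, 0, numData - 1);

            float weight = 0.0f;
            for (int yIn = boundsYLeft; yIn <= boundsYRight; yIn++) {
                for (int xIn = boundsXLeft; xIn <= boundsXRight; xIn++) {
                    /* Same indexing expression as the kernel loop. */
                    weight += weightsIn[(xIn - boundsXLeft) + (yIn - boundsYLeft) * numWeights]
                            * input[xIn + yIn * numData];
                }
            }
            output[x + y * numData] = weight;
        }
    }
}
```

Each output element costs up to numWeights * numWeights reads of input, and neighbouring elements re-read almost the same cells. That overlap is what any fix has to exploit.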

My problem is this for loop: it walks the smaller grid of weights and applies it to the large grid that is passed in, so every work-item reads each of its neighbouring values from global memory. Is there a way to make this more efficient, such as using global_work_size in place of a for loop, or somehow limiting the reads from global memory? Any ideas would be helpful.

Maybe prefetch or async_work_group_copy?
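
To make "limiting the reads from global memory" concrete, here is the tiling idea written as a plain-C emulation rather than real device code. Each tile of outputs first copies its block of the input plus a halo into a small buffer, then computes from that buffer only. On the device the buffer would live in local memory, stage 1 would be cooperative loads by the work-group (or async_work_group_copy) followed by barrier(CLK_LOCAL_MEM_FENCE), and stage 2 would be one work-item per output. Assumptions that are not in my kernel: TILE, MAX_HALF and weighted_sum_tiled are hypothetical names, the grid is numData x numData, numWeights == 2 * halfVal + 1, and cells outside the grid count as zero.

```c
#include <stddef.h>

#define TILE 16                         /* hypothetical work-group edge length */
#define MAX_HALF 4                      /* hypothetical upper bound on halfVal */
#define PADDED (TILE + 2 * MAX_HALF)    /* tile edge including the halo */

/* Plain-C emulation of local-memory tiling.
   ASSUMPTIONS: square grid, numWeights == 2 * halfVal + 1,
   out-of-grid neighbours treated as 0. */
void weighted_sum_tiled(const float *input, float *output,
                        const float *weightsIn,
                        int halfVal, int numData, int numWeights)
{
    float tile[PADDED * PADDED];        /* stands in for a local-memory buffer */
    int padded = TILE + 2 * halfVal;

    if (halfVal < 0 || halfVal > MAX_HALF || numWeights != 2 * halfVal + 1)
        return;                         /* outside what this sketch supports */

    for (int ty = 0; ty < numData; ty += TILE) {
        for (int tx = 0; tx < numData; tx += TILE) {
            /* Stage 1: read each needed input cell from "global" memory once. */
            for (int ly = 0; ly < padded; ly++) {
                for (int lx = 0; lx < padded; lx++) {
                    int gx = tx + lx - halfVal;
                    int gy = ty + ly - halfVal;
                    int inside = gx >= 0 && gx < numData && gy >= 0 && gy < numData;
                    tile[lx + ly * padded] = inside ? input[gx + gy * numData] : 0.0f;
                }
            }
            /* Stage 2: every output in the tile reads only the staged copy. */
            for (int ly = 0; ly < TILE && ty + ly < numData; ly++) {
                for (int lx = 0; lx < TILE && tx + lx < numData; lx++) {
                    float weight = 0.0f;
                    for (int wy = 0; wy < numWeights; wy++)
                        for (int wx = 0; wx < numWeights; wx++)
                            weight += weightsIn[wx + wy * numWeights]
                                    * tile[(lx + wx) + (ly + wy) * padded];
                    output[(tx + lx) + (ty + ly) * numData] = weight;
                }
            }
        }
    }
}
```

With TILE = 16 and halfVal = 1, a tile stages 18 * 18 = 324 input cells for 256 outputs, instead of 256 * 9 = 2304 separate global reads. Whether that actually beats the hardware cache on a given GPU is exactly what I am unsure about.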