To be clear up front: I don't expect anyone to do my work for me; I'd just like some thoughts.

I have a kernel in which every work-item needs to scan every element of an array of data (pseudocode):

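Code:

    __kernel void myKernel(
        __global const float* arrayValues,
        __global const float* arrayMult,
        __global float* output,            // written to, so it cannot be const
        const int count)                   // OpenCL pointers carry no .length
    {
        int index = get_global_id(0);
        float value = 0.0f;                // products are float, so accumulate in float
        for (int i = 0; i < count; i++) {
            int x = algorithm(index, i);   // hypothetical: stands in for the real index computation
            value += arrayMult[x] * arrayValues[i];
        }
        output[index] = value;
    }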
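When experimenting with different memory layouts, a serial CPU reference of what the kernel computes can be handy for checking GPU results. The sketch below is plain C, not OpenCL: the outer loop plays the role of `get_global_id(0)`, and `algorithm` is a hypothetical stand-in (here simply `(index + i) % multCount`) for the index computation the post leaves unspecified. The names `myKernelReference`, `multCount`, and `globalSize` are likewise illustrative assumptions.

```c
#include <assert.h>

/* Hypothetical stand-in for the post's unspecified "algorithm": any function
   of (work-item index, loop counter) returning a valid arrayMult position. */
static int algorithm(int index, int i, int multCount) {
    return (index + i) % multCount;
}

/* Serial CPU reference of myKernel: the outer loop plays the role of
   get_global_id(0), running each work-item one after another. */
void myKernelReference(const float *arrayValues, int count,
                       const float *arrayMult, int multCount,
                       float *output, int globalSize) {
    for (int index = 0; index < globalSize; index++) {
        float value = 0.0f;
        for (int i = 0; i < count; i++) {
            int x = algorithm(index, i, multCount);
            assert(x >= 0 && x < multCount);  /* catch out-of-range indices */
            value += arrayMult[x] * arrayValues[i];
        }
        output[index] = value;
    }
}
```

Because each output is summed in the same order `i = 0 .. count-1` as the kernel, any restructured version (for example one that loads `arrayValues` in chunks) should match this reference closely, up to the GPU's own floating-point behaviour.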

So every work-item makes `count` reads of `arrayValues` from global memory (plus the `arrayMult` reads), and `arrayValues` is too large to fit into local memory. What would be the best way to approach this?
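One common pattern, offered here as an assumption to try rather than a known answer, is to stream `arrayValues` through local memory in fixed-size tiles. Every work-item reads the same `arrayValues[i]`, so a work-group can cooperatively load one tile into `__local` memory, call `barrier(CLK_LOCAL_MEM_FENCE)`, let every work-item consume the tile, barrier again, and move on to the next tile. The C below only emulates that loop order on the CPU: the `tile` array stands in for `__local` memory, and there is no real parallelism. `TILE`, `GROUP_SIZE`, `algorithm`, and `myKernelTiled` are all hypothetical.

```c
#include <assert.h>

#define TILE 4         /* hypothetical tile size; on a device, what fits in __local memory */
#define GROUP_SIZE 2   /* hypothetical work-group size */

/* Hypothetical stand-in for the unspecified "algorithm" index computation. */
static int algorithm(int index, int i, int multCount) {
    return (index + i) % multCount;
}

/* CPU emulation of a tiled kernel: per work-group, copy TILE elements of
   arrayValues into `tile` (playing the role of __local memory), then let
   every work-item in the group consume that tile before the next load. */
void myKernelTiled(const float *arrayValues, int count,
                   const float *arrayMult, int multCount,
                   float *output, int globalSize) {
    float tile[TILE];
    for (int group = 0; group * GROUP_SIZE < globalSize; group++) {
        float value[GROUP_SIZE] = {0.0f};  /* one private accumulator per work-item */
        for (int base = 0; base < count; base += TILE) {
            int n = (count - base < TILE) ? count - base : TILE;
            /* Cooperative load: on a GPU each work-item would copy part of
               the tile, followed by barrier(CLK_LOCAL_MEM_FENCE). */
            for (int j = 0; j < n; j++)
                tile[j] = arrayValues[base + j];
            for (int lid = 0; lid < GROUP_SIZE; lid++) {
                int index = group * GROUP_SIZE + lid;
                if (index >= globalSize) break;  /* partial last group */
                for (int j = 0; j < n; j++) {
                    int x = algorithm(index, base + j, multCount);
                    assert(x >= 0 && x < multCount);
                    value[lid] += arrayMult[x] * tile[j];
                }
            }
            /* A second barrier would go here, before the tile is overwritten. */
        }
        for (int lid = 0; lid < GROUP_SIZE; lid++) {
            int index = group * GROUP_SIZE + lid;
            if (index < globalSize) output[index] = value[lid];
        }
    }
}
```

The intended payoff is one global read of each `arrayValues[i]` per work-group instead of one per work-item; it does nothing for the `arrayMult[x]` reads, whose pattern depends on `algorithm`. The tile size would need to satisfy `TILE * sizeof(float)` within the device's `CL_DEVICE_LOCAL_MEM_SIZE`, and OpenCL's built-in `async_work_group_copy` may be worth trying for the cooperative load.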