Hey Gamingdrake,
Your code is bound to fail if you use local memory for the input.
See the top of page 27 of the OpenCL 1.1 Specification (revision 44, June 1, 2011), available at http://www.khronos.org/registry/cl/. The table there states for local memory:
Host: Dynamic allocation, no access.
But you declare
local const float* input
for your input data. Since the host cannot access local memory, it cannot write your data into that array, so the kernel only sees whatever garbage happens to be there.
I am actually working on a very similar problem. I pass the input data to the device in global memory and then copy it into a separate array in local memory. For you, the kernel signature would look like this:
__kernel void simple(
global const float* input,
global float* output,
constant float* weightsIn,
private int numWeights,
private int numData,
local float* localArray)
{
However, since local memory is only accessible to the work-items of one work-group, you have to think about which data each work-item needs. I have not done this yet, but I will most likely follow the approach used in matrix multiplication examples that rely on local memory.
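To make the copy step concrete, here is a minimal sketch of how the body could start. I am assuming a 1-D NDRange, one input element per work-item, and a localArray that holds one float per work-item of the group. The loop at the end is only a placeholder (a weighted sum restricted to the group's own tile), not your actual calculation.

```c
// Sketch only: assumes a 1-D NDRange and localArray sized to
// get_local_size(0) floats. The final loop is a placeholder computation.
__kernel void simple(
    __global const float* input,
    __global float* output,
    __constant float* weightsIn,
    int numWeights,              // by-value arguments are private by default
    int numData,
    __local float* localArray)
{
    const int gid   = (int)get_global_id(0);
    const int lid   = (int)get_local_id(0);
    const int lsize = (int)get_local_size(0);

    // Each work-item copies one element of its group's tile from global
    // to local memory. Work-items past the end of the data store 0.
    localArray[lid] = (gid < numData) ? input[gid] : 0.0f;

    // Every work-item of the group must reach this barrier, so nobody
    // reads the tile before it is completely written.
    barrier(CLK_LOCAL_MEM_FENCE);

    // Safe to leave now: no further barriers follow.
    if (gid >= numData)
        return;

    // Placeholder: weighted sum over tile entries starting at lid,
    // clipped at the end of the tile (no data from neighbouring groups).
    float sum = 0.0f;
    for (int w = 0; w < numWeights && lid + w < lsize; ++w)
        sum += weightsIn[w] * localArray[lid + w];

    output[gid] = sum;
}
```

On the host side, the local buffer is not filled but only sized: clSetKernelArg(kernel, 5, localSize * sizeof(float), NULL), where 5 is the index of localArray in this signature and localSize is whatever work-group size you choose. If a work-item needs values that belong to a neighbouring group's tile, the copy step has to load extra border elements, which is exactly the "which data do I need" question.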
You are right that unrolling loops in a kernel usually makes it more efficient. But I don't think you will be able to unroll the two loops in the kernel you posted. Expressing the indexing through local or global IDs would be really complicated (if not impossible)…
So, to make your kernel work, you have to pass your input data in global memory, copy it into a separate array in local memory, and then do your calculation in the loops. I don't think there is any other way; if I'm wrong, it would be nice if somebody corrected me.