Hey guys,
I have been looking for answers for about a week now and can't find anything useful, so here goes.
I have a kernel that takes a global float* as an input parameter and another as an output. Because of the massive number of global memory accesses, the CPU runs the algorithm quicker than the GPU, and I need it the other way around. I tried passing in a local float* to hold temporary data copied from global to local memory, but it causes the code to error, and the output is exactly the same set of numbers as the last time I ran my program.
I tried this:
__kernel void simple(
    global const float* input1,  // input
    global float* input2,        // output
    constant float* input3,      // another input
    local float* tempArg,        // temp array
    private int numData,
    private int numData2)
{
    int index = get_global_id(0);
    ...
    // for testing purposes
    tempArg[index] = index;
    write_mem_fence(CLK_GLOBAL_MEM_FENCE);
    ...
    input2[index] = tempArg[index]; // this is where it breaks, giving me incorrect values
    // input2[index] = index;       // works if I don't have the local arg in the kernel parameters
}
Is it because I am running out of memory, or is something else wrong? I am trying to make it faster, but it just keeps giving me garbage values.
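One likely culprit, offered as a guess rather than a confirmed diagnosis: local memory is allocated per work-group, so indexing tempArg with get_global_id(0) runs past the end of the buffer for every work-group after the first. The sketch below indexes it with get_local_id(0) instead and swaps the global write fence for a work-group barrier. It assumes the host sizes the local argument to at least one float per work-item in a group, for example clSetKernelArg(kernel, 3, localSize * sizeof(float), NULL), where kernel and localSize are hypothetical host-side names. The names gid and lid are also introduced here for illustration.

```c
// Sketch only: same parameter list as the kernel above, so the host-side
// argument indices stay the same. The elided parts of the original are left out.
__kernel void simple(
    __global const float* input1,   // input
    __global float* input2,         // output
    __constant float* input3,       // another input
    __local float* tempArg,         // scratch; assumed to hold >= get_local_size(0) floats
    int numData,
    int numData2)
{
    int gid = get_global_id(0);  // unique across the whole NDRange
    int lid = get_local_id(0);   // 0 .. get_local_size(0) - 1 within one work-group

    // Local memory is private to a work-group, so index it with the local id.
    tempArg[lid] = (float)gid;

    // Work-group barrier: makes local-memory writes visible to the other
    // work-items in the group. Each work-item here reads back its own element,
    // so the barrier is illustrative; it becomes necessary once work-items
    // read elements written by their neighbours.
    barrier(CLK_LOCAL_MEM_FENCE);

    input2[gid] = tempArg[lid];
}
```

Because every work-item in a group has to reach the barrier, avoid returning early before it. If the local buffer has to be sized by the total number of work-items, it can exceed the device limit reported by CL_DEVICE_LOCAL_MEM_SIZE, in which case setting the argument or enqueueing the kernel would be expected to fail. A kernel that never ran would also explain why the output buffer still holds the previous run's numbers, so it is worth checking the return codes of clSetKernelArg and clEnqueueNDRangeKernel.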