Barrier and Array copying

Gamingdrake · June 21, 2011, 7:58am

Hello,

I am having a problem with the efficiency of my kernel due to too many global reads, so I would like to copy the global array into a “shared” array inside my kernel. The code I have does not work, though. How would I change it so that the barrier works properly and the assignment is correct? Currently, I am getting random values for the array.


//correctly getting index
local float* temp = new float[SIZE];
temp[index] = input[index]; //input == the array passed to the kernel
barrier(CLK_GLOBAL_MEM_FENCE);

Gamingdrake · June 21, 2011, 10:09am

I tried passing the array (float*) into the kernel as an empty array that had been initialized. It is not even giving me answers that make sense. I’m telling it to print out 15, but it just prints out what was there the last time I ran it. I think it is running out of local memory, because if I leave the local float* parameter out, it runs fine, but if I add it in, it breaks.

centershock · June 21, 2011, 11:39pm

Hey gamingdrake,

OpenCL cannot dynamically allocate local memory inside a kernel (there is no new operator). You either use a specific number for the size of the array, or #define the size above your kernel code.

Secondly, it is hard to get what you want with such small code fragments…

Hope that helped

Gamingdrake · June 23, 2011, 3:00pm

Again thank you for your answer. The reason I am keeping the code small is because I am technically not allowed to show the full code, due to a pending patent. So to keep things simple, I have a basic image convolution kernel,


__kernel void simple(
	local const float* input,
	global float* output,
	constant float* weightsIn,
	private int numWeights,
	private int numData)
{
	const int halfVal = numWeights/2;
	const int x = get_global_id(0);
	const int y = get_global_id(1);

	float weight = 0;
	
	for(int y2 = 0; y2 < numWeights; y2++) // for all weights y
	{
		for(int x2 = 0; x2 < numWeights; x2++) // for all weights x
		{
			weight += weightsIn[x2+y2*numWeights] * input[x+x2 + (y+y2)*numData];
		}
	}	
	
	output[x+y*numData] = weight;
}

and I am trying to make it more efficient. One way I know of doing it is to unroll the loops, but the code above fails. If I change the “local” to “global” then it works and prints out the correct values. If I keep it local, it fails completely, either printing out previous values or printing out values that don’t make sense.
I am running the kernel “p” times, where p is the number of squares I cut the image into. I then read the buffer back to the new image I want to make. The problem is that the weights work when the input is global, but don’t when I make it local. Is there any way I can make the input local without killing the kernel?

centershock · June 27, 2011, 12:07am

Hey Gamingdrake,

Your code will certainly fail when you use local memory for the input.

See the table at the top of page 27 of the OpenCL 1.1 Specification (revision 44, June 1, 2011), available from http://www.khronos.org/registry/cl/. For local memory it states:

Host: Dynamic allocation, no access.

But you use

local const float* input

for your input data. But since the host cannot access the input array, it does not contain any data (well, there might be data garbage in it).

Actually, I am working on a quite similar topic. I pass the input data to the device using global memory, and then copy the data to a separate array in local memory. For you, this would look like:

__kernel void simple(
   global const float* input,
   global float* output,
   constant float* weightsIn,
   private int numWeights,
   private int numData,
   local float* localArray)
{

However, since local memory is only accessible to the work-items of one work-group, you have to consider which data each work-item needs. I have not done it yet, but I will most likely do it similarly to the matrix multiplication approach that uses local memory.
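To make the question of which data a work-group needs concrete, here is a rough sketch in plain C (not kernel code) that sizes the region a square work-group would have to stage. It assumes the indexing of the kernel above, where work-item (x, y) reads columns x to x + numWeights - 1 and the same span of rows. The name tile, for the work-group side length, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Side length of the input region one square work-group reads: its own
 * tile plus the (numWeights - 1) extra rows/columns the filter reaches into.
 * "tile" is a hypothetical name for the work-group side length. */
static size_t staged_side(size_t tile, size_t numWeights)
{
    assert(tile > 0 && numWeights > 0);
    return tile + numWeights - 1;
}

/* Bytes of local memory needed to hold that region as floats. */
static size_t staged_bytes(size_t tile, size_t numWeights)
{
    size_t side = staged_side(tile, numWeights);
    return side * side * sizeof(float);
}
```

For example, a 16 x 16 work-group with a 5 x 5 filter would need a 20 x 20 region. The byte count can then be compared with the device limit reported by clGetDeviceInfo for CL_DEVICE_LOCAL_MEM_SIZE.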

You are right, unrolling loops in a kernel certainly makes it more efficient. But I don’t think you will be able to unroll the two loops in the kernel you showed. The indexing with the help of local or global IDs would be really complicated (if not impossible)…

So, to make your kernel work, you have to pass your input data in global memory, copy it to a separate array in local memory, and then do your calculation in the loops. I don’t think there is any other way. If I’m wrong, it would be nice if somebody corrected me.

Gamingdrake · June 28, 2011, 1:00pm

Thank you for your help, Centershock. I have been looking for answers to this problem for a while. I was able to unroll the loops by cutting the data up into another grid. It turns out copying it to local memory made it upwards of 700% slower, so I scrapped that idea. You helped me to understand local memory a little better though, so thank you again.