Using Local Memory with Large Data Buffer

Background Info: I'm working on an image debayering algorithm in OpenCL. Basically what it does is take single-channel image data captured by a camera (which can only record one color at each pixel) and interpolate the other two color values from the adjacent pixels, then store the result as a three-channel image. The particular image size I'm working with is 3280 by 4904, so a fairly large image. My original program used only global memory and processed a 2x2 square per kernel call. I have tons of memory accesses in the program and I was wondering if it is possible to use local memory to somehow improve the run time. Right now it runs in about 0.22 seconds; I'd like to get it below 0.1 seconds.
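Since the actual kernel isn't shown, here is only a rough sketch of the kind of global-memory, 2x2-per-work-item debayer kernel described above; the buffer names, the assumed RGGB pattern and the border handling are all illustrative, not the original code:

```c
/* Rough sketch only: one work-item debayers one 2x2 RGGB cell using plain
 * global-memory reads. Launch with a global size of roughly {cols/2, rows/2}.
 * All names and the RGGB layout are assumptions, not the poster's code.      */
__kernel void debayer_global(__global const float *bayer,  /* rows*cols, 1 channel     */
                             __global float *rgb,          /* rows*cols*3, interleaved */
                             const int cols,
                             const int rows)
{
    int x = get_global_id(0) * 2;   /* top-left column of the 2x2 cell */
    int y = get_global_id(1) * 2;   /* top-left row of the 2x2 cell    */
    if (x < 1 || y < 1 || x + 2 >= cols || y + 2 >= rows)
        return;                     /* skip the image border for simplicity */

    #define B(r, c) bayer[(r) * cols + (c)]   /* 1-channel indexing helper */

    /* R pixel at (y, x): green from the 4-neighbours, blue from the diagonals. */
    float r00 = B(y, x);
    float g00 = 0.25f * (B(y - 1, x) + B(y + 1, x) + B(y, x - 1) + B(y, x + 1));
    float b00 = 0.25f * (B(y - 1, x - 1) + B(y - 1, x + 1) + B(y + 1, x - 1) + B(y + 1, x + 1));

    /* G pixel at (y, x+1) on a red row: red left/right, blue above/below. */
    float g01 = B(y, x + 1);
    float r01 = 0.5f * (B(y, x) + B(y, x + 2));
    float b01 = 0.5f * (B(y - 1, x + 1) + B(y + 1, x + 1));

    /* G pixel at (y+1, x) on a blue row: blue left/right, red above/below. */
    float g10 = B(y + 1, x);
    float b10 = 0.5f * (B(y + 1, x - 1) + B(y + 1, x + 1));
    float r10 = 0.5f * (B(y, x) + B(y + 2, x));

    /* B pixel at (y+1, x+1): green from the 4-neighbours, red from the diagonals. */
    float b11 = B(y + 1, x + 1);
    float g11 = 0.25f * (B(y, x + 1) + B(y + 2, x + 1) + B(y + 1, x) + B(y + 1, x + 2));
    float r11 = 0.25f * (B(y, x) + B(y, x + 2) + B(y + 2, x) + B(y + 2, x + 2));

    int o = (y * cols + x) * 3;          /* interleaved RGB output index */
    rgb[o + 0] = r00; rgb[o + 1] = g00; rgb[o + 2] = b00;
    rgb[o + 3] = r01; rgb[o + 4] = g01; rgb[o + 5] = b01;
    o = ((y + 1) * cols + x) * 3;
    rgb[o + 0] = r10; rgb[o + 1] = g10; rgb[o + 2] = b10;
    rgb[o + 3] = r11; rgb[o + 4] = g11; rgb[o + 5] = b11;
    #undef B
}
```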

Problem
So essentially the problem I'm dealing with is that I have a data array that's way too large to read into local memory at once, and by doing it in a segmented fashion (as I am now) I'm going to end up rereading data that has already been read into local memory (each 2 by 2 square needs the surrounding 12 pixels to debayer the image, which results in an overlap). Plus, I can't think of a way to read the data into local memory in a coalesced fashion, since the data each work group needs is spread across multiple rows of image data. So in my situation, is local memory even a viable solution? I'm new to local memory and I'm sure I'm missing a lot, so any help would be greatly appreciated. I think part of my problem is that I don't really understand the connection between creating local memory objects and the work group size, if there even is one. Also, any other general ideas on how performance could be improved would be great. Here's the code.

OK, so this is my attempted implementation of the algorithm using local memory. This might be a little more helpful in figuring out where my thinking has gone wrong than the non-local-memory code.


Code Removed

It runs a little bit faster, but not much, and the output image is jagged (the non-local-memory version did not produce a jagged image).

Update
See my post below for updated information.

hi,

to clarify your questions regarding local memory: it's a fast part of memory that only the active local work group can access. In your example the global work size is {3280, 4904} and your local work group size (WGS) might then be {410, 1}. That means those 410 threads will share the same local memory. Local memory becomes more effective the more you reuse it. There is a way to calculate when local memory becomes worthwhile, but I don't know the cycle count for a global memory access. Let's take these values as an example:
Global mem access := 200 cycles
Local mem access := 30 cycles
With those numbers, reading the same value twice from global memory within one work group would slow your program down: two global reads cost about 2 × 200 = 400 cycles, while one global read plus two local accesses cost about 200 + 2 × 30 = 260 cycles. But it's not that simple, because global memory reads come in larger chunks (256 bytes on sm_1x, smaller on later architectures).
So you should create one local memory array, locdata[4][413], containing the data for all the work-items in your work group. Then you split the work into two parts (a rough sketch follows the list below):

  1. Each thread copies locdata[0][locID(0)+1], locdata[1][locID(0)+1], locdata[2][locID(0)+1] and locdata[3][locID(0)+1]. After that, locdata[0-4][0] and locdata[0-4][412-413] have to be copied somehow.
    LOCAL_MEM_BARRIER
  2. Each thread computes its results into private registers and, at the end, writes them out the way you do now.
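A minimal sketch of that two-stage pattern, assuming a local work group size of 410, a fixed-size locdata[4][413] tile and no image-border handling (all identifiers are illustrative, not the actual code):

```c
/* Two-stage local-memory load: cooperative fill, barrier, then compute.
 * NOTE: image-border clamping is omitted throughout to keep the sketch short;
 * a real kernel must clamp row0-1, gx-1 and gx+411 to valid indices.         */
__kernel void debayer_local_sketch(__global const float *bayer,
                                   __global float *rgb,
                                   const int cols)
{
    __local float locdata[4][413];

    int lid  = get_local_id(0);               /* 0 .. 409 with a 410-wide group     */
    int gx   = get_group_id(0) * 410 + lid;   /* image column handled by this thread */
    int row0 = get_group_id(1) * 2;           /* first of the two rows in this tile  */

    /* Stage 1: every thread loads its own column of the 4-row tile
     * (one halo row above, the two working rows, one halo row below). */
    for (int r = 0; r < 4; ++r)
        locdata[r][lid + 1] = bayer[(row0 - 1 + r) * cols + gx];

    /* Only one thread fills the left and right halo columns, otherwise every
     * thread would redundantly write the same values.                          */
    if (lid == 0) {
        for (int r = 0; r < 4; ++r) {
            locdata[r][0]   = bayer[(row0 - 1 + r) * cols + gx - 1];
            locdata[r][411] = bayer[(row0 - 1 + r) * cols + gx + 410];
            locdata[r][412] = bayer[(row0 - 1 + r) * cols + gx + 411];
        }
    }

    barrier(CLK_LOCAL_MEM_FENCE);   /* the "LOCAL_MEM_BARRIER" step above */

    /* Stage 2: interpolate from locdata into private registers and write the
     * result to global memory. The actual debayer math is omitted here; these
     * are placeholder values only.                                             */
    float r_out = locdata[1][lid + 1];
    float g_out = locdata[2][lid + 1];
    float b_out = 0.0f;
    int   o     = (row0 * cols + gx) * 3;
    rgb[o + 0] = r_out;
    rgb[o + 1] = g_out;
    rgb[o + 2] = b_out;
}
```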

And you should try to write your result vector with vstoren(). Build up a float8 plus a float4 with your result data locally, then use vstore8(float8, threadidx0) and vstore4(float4, threadidx0+8).
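For illustration, a tiny kernel showing the vstoren() call pattern, with made-up values and a made-up layout of 12 consecutive result floats per work-item:

```c
/* Sketch of the vstoren() idea only; buffer name, values and layout are mine.
 * Note that the offset argument of vstoren is counted in units of n elements,
 * so passing offset 0 and an already-offset pointer keeps the indexing simple. */
__kernel void vstore_sketch(__global float *result)
{
    int tid  = get_global_id(0);
    int base = tid * 12;                 /* first of this thread's 12 result floats */

    /* Pretend these are the 12 interpolated colour values for one 2x2 block. */
    float8 lo = (float8)(0.0f, 1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f);
    float4 hi = (float4)(8.0f, 9.0f, 10.0f, 11.0f);

    vstore8(lo, 0, result + base);       /* result[base]   .. result[base+7]  */
    vstore4(hi, 0, result + base + 8);   /* result[base+8] .. result[base+11] */
}
```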

Thanks, I think the local memory concept is starting to come together for me, but I'm a little confused about a few aspects.

  1. Where did you get the size [4][413]? If I'm on the same page, element 0 is for the 4 pieces of data before the first 2 by 2 square and element 411 is for the 4 points after the last 2 by 2 square, so are 412 to 413 padding?

  2. With the vector data types, is there a way I could group the data into sixes instead? The first six elements get stored at a specific offset in the global result array, and the next six get stored at a location that is offset by the value of cols. If I store them in groups of 8 and 4, the middle two values aren't going to be in the right locations in the global data array. At least, that is the way I'm understanding it. I know there is no float6, but is there a way around this problem? Or do I just have to use a local array of floats like "float locResult[12]"? Also, wouldn't my local result buffer have to be scaled according to my group size like the locData buffer, i.e. "float locResult[WGS][12]"? Or am I on the wrong track completely?

This is what I've got so far for my updated code. This assumes my local work group size is left at 64.


Code Removed

This code runs without distortion, but the image is a little off-color, with some artifacts (my guess is that comes from how I assigned elements [0-4][0]; the way I did it was wrong, but I'm still figuring out how to do it properly). The more important issue is that it only runs a little faster than the original. So my question is: am I accessing the memory in an inefficient way, and if so, how can I improve it?

Hi,

You can completely delete the last barrier, because it only stalls threads that have already finished. The barrier just means all threads have to get there before continuing, so your threads wait for all the others before terminating -> useless.

To your questions:
I would have split your problem into a LWS of 410x1 and a GWS of 3280x4904. This means every line in your image is divided into 4 local work groups. That might not be optimal because it puts a little more stress on the local memory; just give it a try.
Element size of 413 because thread 0 needs id-1, id, id+1 and id+2. So you need an array that is one pixel larger to the left and two pixels larger to the right (410 + 1 + 2 = 413), don't you?
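To make the GWS/LWS/local-memory connection concrete, the host-side setup could look roughly like this fragment (it assumes an already-created cl_kernel and cl_command_queue, and the argument index 4 for a __local buffer is purely an example):

```c
/* Host-side fragment: the local work size decides how many work-items share
 * one local-memory tile; the sizes below are the ones discussed in this thread. */
size_t global_ws[2] = {3280, 4904};  /* global work size                          */
size_t local_ws[2]  = {410, 1};      /* 410 work-items per group share the tile   */

/* Only needed if the kernel takes the tile as a __local float* argument instead
 * of declaring a fixed-size __local array; arg index 4 is just an example.       */
clSetKernelArg(kernel, 4, 4 * 413 * sizeof(cl_float), NULL);

clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_ws, local_ws, 0, NULL, NULL);
```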

There is no float6, so you could only use a float4 and a float2. I don't know if it's faster, but with a float4 OpenCL knows that 4 values are coming and doesn't have to wait until rIndex is computed. That could maybe speed things up a little.
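For example, writing the two six-float groups could look something like this (the offsets and names are my guess at the layout you described, not the actual code):

```c
/* Sketch of writing two groups of six floats, one image row apart, using a
 * float4 + float2 pair. Offsets and values are illustrative only.            */
__kernel void store_six_sketch(__global float *result, const int cols)
{
    int tid = get_global_id(0);
    int top = tid * 6;          /* start of the first six result floats (assumed) */
    int bot = top + cols;       /* second group sits one row further on (assumed) */

    float4 a = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
    float2 b = (float2)(5.0f, 6.0f);
    vstore4(a, 0, result + top);         /* result[top]   .. result[top+3] */
    vstore2(b, 0, result + top + 4);     /* result[top+4] .. result[top+5] */

    float4 c = (float4)(7.0f, 8.0f, 9.0f, 10.0f);
    float2 d = (float2)(11.0f, 12.0f);
    vstore4(c, 0, result + bot);
    vstore2(d, 0, result + bot + 4);
}
```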

If there is no speedup from your local memory, maybe you can move more computation into that kernel so the local memory gets reused more often.

One big problem is that you copy loc[0][0] with EACH THREAD. You need an if(lID == 0) around that!

Try skipping the second barrier too and directly copy the local memory values to result.

Nice weekend :slight_smile:

Thanks, you’ve been very helpful. I appreciate your time.

Why have you deleted your code?

I was required to.

Did you get a speedup?

With local memory I did get a bit of a speedup; I went from 2.0 to 1.6. Not as big as I was hoping, but better than nothing. Increasing the LWGS beyond 64 didn't seem to make things much faster. However, this might be partly because I have a fairly low-end GPU.