How to get neighborhood pixels with shared memory

Hello Guys,

I have some image manipulation algorithms that I have converted to OpenCL, the problem is that a lot of these algorithms uses masks, normally 3x3 masks, for example the sobel algorithm, now I’m leaning how to use local memory to share data between threads, my idea was to use this to get the neighborhood of the pixel from the local memory so I don’t need to load all the pixels from the global memory in every thread.
The problem is that I don’t know if this idea is possible with shared memory, because I can’t get some of the neighborhood from boundaries pixels of the work group, as some of these pixels will belong to another work group.

So… my question is if this is possible, and if so, how?

Thanks!

Sure it’s possible, but depending on the hardware might not help much - it adds some overheads. You need to do 2 loads for example, for the boundary data. If you’re reading vector data remember the local store is 32-bits, so it might need breaking into components to avoid bank conflicts.

I have tried a lot of variations, but basically they boil down to:


// for the case of loading 1 row of data for sharing amongst threads
work group size = 64x1
x = global x id
lx = local x id
// do the following 3 times, with y=gy-1, y=gy, y=gy+1
// x and y need to be clamped/mirrored before using it to read memory
if (lx < boundary extra pixels total)
 load the border pixels beyond the current workgroup size, save to local array
  i.e from x+64-1, y
fi
// then do this 3 times, saving each value in a different register
load the pixels for the rest of the workgroup, save to local array
  - i.e. from x-1, y
barrier(local);
// now 'this' thread already has the first column of data, and it only needs
// to read 6 values from local

This way each ‘thread’ already has one column of values in a register, which saves an additional local load. Doing a XxN where N>1 can help more too. i.e. you can do 4 rows of output for each 6 rows of loads if you have a worksize 64x4, rather than 1 row of output for 3 rows of loads. The loading gets more complex though.

Unfortunately I can’t remember with a small kernel like 3x3 whether it was worth it, i think for some hardware without array cache it might be for arrays. For images I think it ends up being a wash at this size.

Hello notzed, thanks for the reply, but unfortunely I can’t figure out what you mean in your example, Could you please clarify this example for me ?

I’ve tried somethings on my own too, but the perfomance I got from it was worse than just getting all neighborhood from global memory, here is the code:


  int local_x = get_local_id(0);
  int local_y = get_local_id(1);

  shared_input[pos] = 255.f * read_imagef(input, sampler, (int2) (x, y));
 
  if (local_x == 0)
  {
    shared_input[pos - 1] = 255.f * read_imagef(input, sampler, (int2) (x - 1, y));
    if (local_y == 0)
      shared_input[pos - 10 - 1] = 255.f * read_imagef(input, sampler, (int2) (x - 1, y - 1));
  }
 
  if (local_y == 0)
  {
    shared_input[pos - 10] = 255.f * read_imagef(input, sampler, (int2) (x, y));
    if (local_x == 7)
      shared_input[pos - 10 + 1] = 255.f * read_imagef(input, sampler, (int2) (x + 1, y - 1));
  }
 
 
  if (local_x == 7)
  {
    shared_input[pos + 1] = 255.f * read_imagef(input, sampler, (int2) (x + 1, y));
    if (local_y == 7)
      shared_input[pos + 10 + 1] = 255.f * read_imagef(input, sampler, (int2) (x + 1, y + 1));
  }
 
  if (local_y == 7)
  {
    shared_input[pos + 10] = 255.f * read_imagef(input, sampler, (int2) (x, y + 1));
    if (local_x == 0)
      shared_input[pos + 10 - 1] = 255.f * read_imagef(input, sampler, (int2) (x - 1, y + 1));
  }

As you can see, what I did was simply send the thread pixels to the local array and then verify if this thread pixels is in a work-group boundary, if true, it will send the pixels necessary pixels from the other work-groups too, the problem is that this code is very inneficient.

Thanks !

Well I was talking about array access, where you orient your data access differently to benefit from coalescing, and some hardware has no l1 cache at all. It wasn’t clear in your initial query - otherwise I would’ve just said not to bother.

image reads are cached (on any existing hardware), so for this sized problem you’re pretty unlikely to benefit from using LS, and it will almost certainly make it slower. If the problem gets too big the tiny texture cache becomes ineffective but for a 3x3 it works well.

Try using a local work size size of 16x16, if you’re not using too many registers this is usually (in my experience) more efficient than 8x8 (which you seem to be using).