Block edges updates

Hello,

I am working on an algorithm where each pixel needs its four neighbors.

So for each 1616 block of threads, I want to copy the global memory into 1818 shared memory blocks (This way pixels on block edges will have correct neighbors) :

        
        -> [z z z z z]
[x x x] -> [z x x x z]
[x x x] -> [z x x x z]
[x x x] -> [z x x x z]
        -> [z z z z z]

This is my simple code:


__kernel void test(__global float* u,  __local volatile float* uLocal)
{
    // Block index
    int bi = get_group_id(1);
    int bj = get_group_id(0);

    // Local coordinates
    int li = get_local_id(1)+1;
    int lj = get_local_id(0)+1;
	
	// Global coordinates
	int gi = get_global_id(1);
	int gj = get_global_id(0);

	// Local height and width
	int lHeight = get_local_size(1)+2;
	int lWidth  = get_local_size(0)+2;

	// Global height and width
	int gHeight = get_global_size(1);
	int gWidth  = get_global_size(0);

	if( (gi-1<0) || (gi+1>gHeight-1) || (gj-1<0) || (gj+1>gWidth-1) )
		return;

	uLocal[li*lWidth+lj] = u[gi*gWidth+gj];

	for(int i=0 ; i<N ; i++)
	{
		// put edges in local memory
		if(li-1 == 0)
			uLocal[(li-1)*lWidth+lj] = u[(gi-1)*gWidth+gj];
		else if(li+1 == lHeight-1)
			uLocal[(li+1)*lWidth+lj] = u[(gi+1)*gWidth+gj];

		if(lj-1 == 0)
			uLocal[li*lWidth+lj-1] = u[gi*gWidth+gj-1];
		else if(lj+1 == lWidth-1)
			uLocal[li*lWidth+lj+1] = u[gi*gWidth+gj+1];

		//do something with uLocal...
		uLocal[li*lWidth+lj] = 0.5*uLocal[li*lWidth+lj+1] - 0.5*uLocal[li*lWidth+lj-1];

		barrier(CLK_LOCAL_MEM_FENCE);

		u[gi*gWidth+gj] = uLocal[li*lWidth+lj];
		barrier(CLK_GLOBAL_MEM_FENCE);
	}
}

It does not work (even without the for loop).

It is different from the oclMedianFilter example (which also read and write edges), I really don’t understand the point of this complex code.

What should I do ??
Help !

Could you elaborate a bit on how the code doesn’t work? Do you have sample inputs and outputs? How are the outputs different from what you expected?

Well, I have 3 problems, but what I want to know is: is my code correct ? Why does the “edge update” differ from the oclMedianFilter example ?

On of my problem is that with my previous code, I get some vertical black lines between the blocks depending on which thread has been executed first.

My second problem:

I tried this:


__kernel void test2(__global float* u,  __local volatile float* uLocal)
{
        // same init...

	uLocal[li*lWidth+lj] = 0;

	if( bi == 2 && bj == 2 )
		uLocal[li*lWidth+lj] = u[gi*gWidth+gj];
	
	if( bi == 2 && bj == 4 )
	{
		uLocal[li*lWidth+lj] = u[gi*gWidth+gj];
		if(li-1 == 0)
			uLocal[(li-1)*lWidth+lj] = 1;
		else if(li+1 == lHeight-1)
			uLocal[(li+1)*lWidth+lj] = 1;

		if(lj-1 == 0)
			uLocal[li*lWidth+lj-1] = 1;
		else if(lj+1 == lWidth-1)
			uLocal[li*lWidth+lj+1] = 1;

		// do something....

		if(li-1 == 0)
			u[(gi-1)*gWidth+gj] = uLocal[(li-1)*lWidth+lj];
		else if(li+1 == lHeight-1)
			u[(gi+1)*gWidth+gj] = uLocal[(li+1)*lWidth+lj];

		if(lj-1 == 0)
			u[gi*gWidth+gj-1] =  uLocal[li*lWidth+lj-1];
		else if(lj+1 == lWidth-1)
			u[gi*gWidth+gj+1] = uLocal[li*lWidth+lj+1];
	}

	u[gi*gWidth+gj] = uLocal[li*lWidth+lj];
}

And the bottom edge of the second block does note appear.
(I know that this program is not a perfect example because execution paths diverge)

And my third problem concerns the for loop:
How to synchronize threads when I must update edges (the threads on blocks’ edges will have more work) in the for loop. The barrier() and if() else() statements are not compatible and the whole program is skipped when threads are not synchronized (the GPU does nothing, the execution time is almost 0).

Maybe I must remove the for loop from the kernel and put the kernel into a for loop, but I guess that would require many clEnqueueRead/WirteBuffer and slow down the algorithm.

Thanks for helping !
Arthur

Thanks for the additional info. Notice that you are changing the image in place. What measures are you taking to prevent a previous work-group from updating the image before you load the border pixels into local memory?

Let’s say that your work-groups are of size 3x3 and your original image looks like this:


a a a b b b
a a a b b b
a a a b b b

What you want is each work group to load a tile of data into local memory like this:


work-group 'a' wants to load:

x x x x x
x a a a b
x a a a b
x a a a b
x x x x x

work-group 'b' wants to load

x x x x x
a b b b x
a b b b x
a b b b x
x x x x x

However, with the code you’ve shown, when work-group a (3x3) is executed the image is updated and now looks like this:


c c c b b b
c c c b b b
c c c b b b

…so when it’s the turn of work-group ‘b’ it will read this in local memory:


x x x x x
c b b b x
c b b b x
c b b b x
x x x x x

instead of


x x x x x
a b b b x
a b b b x
a b b b x
x x x x x

oclMedianFilter possibly gets away with it because the values from ‘a’ and ‘c’ look very similar (it’s computing a median filter after all). However, in your example ‘a’ and ‘c’ will look very different and that’s why you notice the artifacts.

You will have to read from an image X and write into a different image Y if you wan to avoid the artifacts.

Thank you very much for this clear answer :slight_smile: !
If I understand correctly, this problem can not be fixed with barrier or mem_fence functions since it would only synchronize memory within blocks ?

Anyway I tried to write in another variable, that works well, thanks.

I still have the other problems though :wink:

If I understand correctly, this problem can not be fixed with barrier or mem_fence functions since it would only synchronize memory within blocks ?

That’s right. There’s no way in OpenCL to communicate between work-groups.

I’ll look at the other questions after work :slight_smile:

ahah, thank you :slight_smile:

For my loop problem, I guess there is no other way than executing the kernel n times (instead of having the for loop inside the kernel).

It works but it really slows down the program…

And I still don’t know why the local edges did not update (my second problem), I tried to write the result in another variable and that did not fix the problem…

It works but it really slows down the program…

Does it make such a big difference? What size is the image and how do you make the calls to repeat the operation multiple times?

And I still don’t know why the local edges did not update (my second problem)

It’s basically the same problem as your first one, right? The white borders you are drawing in work group (2,4) are overwritten by the other work groups. This code is causing the issue:


      if(li-1 == 0)
         u[(gi-1)*gWidth+gj] = uLocal[(li-1)*lWidth+lj];
      else if(li+1 == lHeight-1)
         u[(gi+1)*gWidth+gj] = uLocal[(li+1)*lWidth+lj];

      if(lj-1 == 0)
         u[gi*gWidth+gj-1] =  uLocal[li*lWidth+lj-1];
      else if(lj+1 == lWidth-1)
         u[gi*gWidth+gj+1] = uLocal[li*lWidth+lj+1];

If you want it to go away then comment the last line in the kernel:


//u[gi*gWidth+gj] = uLocal[li*lWidth+lj];

Yes that’s right, this is the same problem. I realized it when I woke up this morning :wink: