Can't write in buffer

wrx · March 16, 2011, 2:35am

Hello,

A simple write in buffer fail, host :


cl_mem gpu_buffer_in = clCreateBuffer (	context,
												CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
												sizeof(unsigned char)*SIZE,
												&cpu_buffer_in,
												&error);
...
	const size_t global_work_size[2] = {WIDTH, HEIGHT};

	// NDRANGEKERNEL
	result = clEnqueueNDRangeKernel (	command_queue,
										kernel,
										2,
										NULL,
										&global_work_size[0],
										NULL,
										0, NULL, NULL);

kernel :


__kernel void Closing(__global const unsigned char *a, __global unsigned char *c)
{
    int row = get_global_id(0);	//width
    int col = get_global_id(1);	//height

	a[row*WIDTH + col] = 100 ;	// writing problem
	c[row*WIDTH+col] = a[row*WIDTH + col];
}

The problem occurs when I use :
a[row*WIDTH + col] = /something/ 100 ;

(c[row*WIDTH+col] = /something/ 100; ) works well.

Did you have any experience and knowledges about this problem?
Thanks!

wrx · March 16, 2011, 4:32am

I thought that was :


__kernel void Closing(__global const unsigned char *a, __global unsigned char *c)
{...}

but


__kernel void Closing(__global unsigned char *a, __global unsigned char *c)

doesn’t work too.

wrx · March 16, 2011, 4:58am

I think it works without const.

Sorry, for the post. Maybe can be helpful.

david.garcia · March 16, 2011, 11:07am

Notice also that “row” and “col” are inverted.

get_global_id(0) will return values between 0 and WIDTH -1, so it should be assigned to “col”.

wrx · March 17, 2011, 1:12am

Yes, you’re right. thanks. It works cause height and width have same size.

Furthermore, I think I can continue here, cause the code than I exposed was a part of the problem.
I’m working on dilate, erode, closing, opening filter. Dilatation and erosion works well.But closing operation give partial non-treated (or something else) pixels lines:

If I re-execute the same code, the new image have other lines with the problem :

writing to buffer (CL_MEM_READ_WRITE) is done after barrier:


__kernel void Closing(__global unsigned char *a, __global unsigned char *c)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
	unsigned char pixel,neighborhood[9];

//*
	if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
	{
		neighborhood[0]= a[(row-1)*WIDTH + col-1];
		neighborhood[1]= a[(row-1)*WIDTH + col];
		neighborhood[2]= a[(row-1)*WIDTH + col+1];

		neighborhood[3]= a[row*WIDTH + col-1];
		neighborhood[4]= a[row*WIDTH + col];
		neighborhood[5]= a[row*WIDTH + col+1];

		neighborhood[6]= a[(row+1)*WIDTH + col-1];
		neighborhood[7]= a[(row+1)*WIDTH + col];
		neighborhood[8]= a[(row+1)*WIDTH + col+1];

		pixel = Max(neighborhood, 9);	//dilate
	}
	else
	{
		pixel = a[row*WIDTH+col];
	}
//*/

	barrier(CLK_GLOBAL_MEM_FENCE);
	a[row*WIDTH + col] = pixel ;	// writing problem ? or barrier problem or maybe not here
	barrier(CLK_GLOBAL_MEM_FENCE);

//*
	pixel = a[row*WIDTH+col];

	if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
	{
		neighborhood[0]= a[(row-1)*WIDTH + col-1];
		neighborhood[1]= a[(row-1)*WIDTH + col];
		neighborhood[2]= a[(row-1)*WIDTH + col+1];

		neighborhood[3]= a[row*WIDTH + col-1];
		neighborhood[4]= a[row*WIDTH + col];
		neighborhood[5]= a[row*WIDTH + col+1];

		neighborhood[6]= a[(row+1)*WIDTH + col-1];
		neighborhood[7]= a[(row+1)*WIDTH + col];
		neighborhood[8]= a[(row+1)*WIDTH + col+1];

		pixel = Min(neighborhood, 9);	//erode

	}
	else
	{
		pixel = a[row*WIDTH+col];
	}
//*/
	c[row*WIDTH+col] = pixel;
}

Did you have an opinion on this issue?
Thanks!

wrx · March 17, 2011, 7:27am

I tested it several time and the problem occurs when I write something to buffer.

For exemple, the following code create correct image:


__kernel void Dilate(__global unsigned char *a, __global unsigned char *c)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
   unsigned char pixel,neighborhood[9];

//*
   if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
   {
      neighborhood[0]= a[(row-1)*WIDTH + col-1];
      neighborhood[1]= a[(row-1)*WIDTH + col];
      neighborhood[2]= a[(row-1)*WIDTH + col+1];

      neighborhood[3]= a[row*WIDTH + col-1];
      neighborhood[4]= a[row*WIDTH + col];
      neighborhood[5]= a[row*WIDTH + col+1];

      neighborhood[6]= a[(row+1)*WIDTH + col-1];
      neighborhood[7]= a[(row+1)*WIDTH + col];
      neighborhood[8]= a[(row+1)*WIDTH + col+1];

      pixel = Max(neighborhood, 9);   //dilate
   }
   else
   {
      pixel = a[row*WIDTH+col];
   }
//*/

//   a[row*WIDTH + col] = pixel ;   // writing problem ? or barrier problem or maybe not here
   c[row*WIDTH+col] = pixel;
}

If I decomment :


//   a[row*WIDTH + col] = pixel ;

, it create the problem cited previously.

david.garcia · March 17, 2011, 3:51pm

What you are seeing is correct. In other words, the algorithm is doing exactly what you told it to do.

Notice that in the boundary between two work-group, your algorithm is not doing what you think it’s doing. After one work-group has finished executing, the pixels in that region of the image have already been altered. When another work-group reads from those pixels, the values are not what they were originally in the image.

In other words, as long as you try to do the image operation in place you will see artifacts. What you should do instead is to read from one image and write into a different image. Then the artifacts will disappear.

wrx · March 18, 2011, 12:53am

Notice that in the boundary between two work-group, your algorithm is not doing what you think it’s doing. After one work-group has finished executing, the pixels in that region of the image have already been altered. When another work-group reads from those pixels, the values are not what they were originally in the image.

Isn’t barrier supposed to synchronize work-item in work-group together, or it does it on work-group internally?

In other words, as long as you try to do the image operation in place you will see artifacts. What you should do instead is to read from one image and write into a different image. Then the artifacts will disappear.

I tried to do before and I tried now, with 3th argument in kernel like (brief code):


__kernel void Closing(__global unsigned char *a,__global unsigned char *b, __global unsigned char *c)
{
   ....
   pixel = dilated_pixel(...);  // reading "a" buffer, like doing previous post
   b[row*WIDTH + col] = pixel ;
   barrier(CLK_GLOBAL_MEM_FENCE);
   pixel = eroded_pixel(...); // reading "b" buffer
   c[row*WIDTH + col] = pixel ;
}

, and same problem occurs.

The following code create black image, probably because of big size which can’t be allocate :


__kernel void Closing(__global unsigned char *a, __global unsigned char *c)
{
   ...
   unsigned char b[HEIGHT*WIDTH];
   pixel = dilated_pixel(...);  // reading "a" buffer, like doing previous post
   b[row*WIDTH + col] = pixel ;
   barrier(CLK_GLOBAL_MEM_FENCE);
   pixel = eroded_pixel(...); // reading "b" buffer
   c[row*WIDTH + col] = pixel ;
}

wrx · March 18, 2011, 1:03am

You can no longer edit or delete that post.

-> P.S.: Should I do dilate, then erode operation one after another in 2 step , I mean transferring datas to GPU 2 times, first for dilate, second for erode?

wrx · March 18, 2011, 7:04am

So, the way to do by send datas to GPU 2 times works. My goal was to do it in 1 time on the kernel. If this is the only way to do that , it’s OK for me. Else I’m open to suggestions.

Thank you for guiding me.

david.garcia · March 18, 2011, 3:42pm

Okay, since the explanation didn’t work I will show one simple way to make it work (there are better ways):


__kernel void Closing_first(const __global unsigned char *in,
                                __global unsigned char *out)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    unsigned char pixel,neighborhood[9];

   // Read from input image
   if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
   {
      neighborhood[0]= in[(row-1)*WIDTH + col-1];
      neighborhood[1]= in[(row-1)*WIDTH + col];
      neighborhood[2]= in[(row-1)*WIDTH + col+1];
      ...
      pixel = Max(neighborhood, 9);   //dilate
   }
   else
   {
      pixel = in[row*WIDTH+col];
   }
   
   // Write into destination image.
   // Do not write into input image! You would destroy it and cause artifacts.
   out[row*WIDTH + col] = pixel ;
}


__kernel void Closing_second(const __global unsigned char *in,
                                __global unsigned char *out)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    unsigned char pixel,neighborhood[9];

   pixel = in[row*WIDTH+col];

   // Read from input image
   if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
   {
      neighborhood[0]= in[(row-1)*WIDTH + col-1];
      neighborhood[1]= in[(row-1)*WIDTH + col];
      neighborhood[2]= in[(row-1)*WIDTH + col+1];
      ...

      pixel = Min(neighborhood, 9);   //erode
   }
   else
   {
      pixel = in[row*WIDTH+col];
   }

   // Write into destination image.
   // Do not write into input image! You would destroy it and cause artifacts.
   out[row*WIDTH+col] = pixel;
}

From the API side you have to enqueue two kernels: one for Closing_first and then another for closing_second. Make sure that the output from Closing_first becomes the input of Closing_second.

Once you get that version working, you can try a better algorithm using local memory. The idea is to store the intermediate result in local memory instead of having two enqueue two kernels.

wrx · March 21, 2011, 1:36am

Yes, that’s exactly what I meant by “transfer datas to GPU 2 times” and I did exactly what you wrote, and it had worked well.

Once you get that version working, you can try a better algorithm using local memory. The idea is to store the intermediate result in local memory instead of having two enqueue two kernels.

For now, I try to do some example on global memory only. Moreover, I have a real question about this, generally:

, should I do lot of example on global memory first for understand how kernels work, then improve them by working on local memory?
or, should I work on local memory as soon as possible, and not waste time. ?

(I can ask the question on more appropriate thread ?)

david.garcia · March 21, 2011, 5:31am

should I do lot of example on global memory first for understand how kernels work, then improve them by working on local memory?

Yes, that’s what I would do. It’s easier to write a first version that works using global memory only. Also, it may be fast enough for what you need.

This is a general rule for writing software that has worked well in the past: always start with the smallest and simplest version of your program that you can imagine. After you get that one working you can add features and complexity later.

wrx · March 21, 2011, 6:44am

OK, thank you!

And I’ll remember (this kind of) trick you told me with local memory when I’ll try to improve my code.