Can't write in buffer

Hello,

A simple write in buffer fail, host :


cl_mem gpu_buffer_in = clCreateBuffer (	context,
												CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
												sizeof(unsigned char)*SIZE,
												&cpu_buffer_in,
												&error);
...
	const size_t global_work_size[2] = {WIDTH, HEIGHT};

	// NDRANGEKERNEL
	result = clEnqueueNDRangeKernel (	command_queue,
										kernel,
										2,
										NULL,
										&global_work_size[0],
										NULL,
										0, NULL, NULL);

kernel :


__kernel void Closing(__global const unsigned char *a, __global unsigned char *c)
{
    int row = get_global_id(0);	//width
    int col = get_global_id(1);	//height

	a[row*WIDTH + col] = 100 ;	// writing problem
	c[row*WIDTH+col] = a[row*WIDTH + col];
}

The problem occurs when I use :
a[row*WIDTH + col] = /something/ 100 ;

(c[row*WIDTH+col] = /something/ 100; ) works well.

Did you have any experience and knowledges about this problem?
Thanks!

I thought that was :


__kernel void Closing(__global const unsigned char *a, __global unsigned char *c)
{...}

but


__kernel void Closing(__global unsigned char *a, __global unsigned char *c)

doesn’t work too.

I think it works without const.

Sorry, for the post. Maybe can be helpful.

Notice also that “row” and “col” are inverted.

get_global_id(0) will return values between 0 and WIDTH -1, so it should be assigned to “col”.

Yes, you’re right. thanks. It works cause height and width have same size.

Furthermore, I think I can continue here, cause the code than I exposed was a part of the problem.
I’m working on dilate, erode, closing, opening filter. Dilatation and erosion works well.But closing operation give partial non-treated (or something else) pixels lines:

If I re-execute the same code, the new image have other lines with the problem :

writing to buffer (CL_MEM_READ_WRITE) is done after barrier:


__kernel void Closing(__global unsigned char *a, __global unsigned char *c)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
	unsigned char pixel,neighborhood[9];

//*
	if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
	{
		neighborhood[0]= a[(row-1)*WIDTH + col-1];
		neighborhood[1]= a[(row-1)*WIDTH + col];
		neighborhood[2]= a[(row-1)*WIDTH + col+1];

		neighborhood[3]= a[row*WIDTH + col-1];
		neighborhood[4]= a[row*WIDTH + col];
		neighborhood[5]= a[row*WIDTH + col+1];

		neighborhood[6]= a[(row+1)*WIDTH + col-1];
		neighborhood[7]= a[(row+1)*WIDTH + col];
		neighborhood[8]= a[(row+1)*WIDTH + col+1];

		pixel = Max(neighborhood, 9);	//dilate
	}
	else
	{
		pixel = a[row*WIDTH+col];
	}
//*/

	barrier(CLK_GLOBAL_MEM_FENCE);
	a[row*WIDTH + col] = pixel ;	// writing problem ? or barrier problem or maybe not here
	barrier(CLK_GLOBAL_MEM_FENCE);

//*
	pixel = a[row*WIDTH+col];

	if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
	{
		neighborhood[0]= a[(row-1)*WIDTH + col-1];
		neighborhood[1]= a[(row-1)*WIDTH + col];
		neighborhood[2]= a[(row-1)*WIDTH + col+1];

		neighborhood[3]= a[row*WIDTH + col-1];
		neighborhood[4]= a[row*WIDTH + col];
		neighborhood[5]= a[row*WIDTH + col+1];

		neighborhood[6]= a[(row+1)*WIDTH + col-1];
		neighborhood[7]= a[(row+1)*WIDTH + col];
		neighborhood[8]= a[(row+1)*WIDTH + col+1];

		pixel = Min(neighborhood, 9);	//erode

	}
	else
	{
		pixel = a[row*WIDTH+col];
	}
//*/
	c[row*WIDTH+col] = pixel;
}

Did you have an opinion on this issue?
Thanks!

I tested it several time and the problem occurs when I write something to buffer.

For exemple, the following code create correct image:


__kernel void Dilate(__global unsigned char *a, __global unsigned char *c)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
   unsigned char pixel,neighborhood[9];

//*
   if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
   {
      neighborhood[0]= a[(row-1)*WIDTH + col-1];
      neighborhood[1]= a[(row-1)*WIDTH + col];
      neighborhood[2]= a[(row-1)*WIDTH + col+1];

      neighborhood[3]= a[row*WIDTH + col-1];
      neighborhood[4]= a[row*WIDTH + col];
      neighborhood[5]= a[row*WIDTH + col+1];

      neighborhood[6]= a[(row+1)*WIDTH + col-1];
      neighborhood[7]= a[(row+1)*WIDTH + col];
      neighborhood[8]= a[(row+1)*WIDTH + col+1];

      pixel = Max(neighborhood, 9);   //dilate
   }
   else
   {
      pixel = a[row*WIDTH+col];
   }
//*/

//   a[row*WIDTH + col] = pixel ;   // writing problem ? or barrier problem or maybe not here
   c[row*WIDTH+col] = pixel;
}

If I decomment :


//   a[row*WIDTH + col] = pixel ;

, it create the problem cited previously.

What you are seeing is correct. In other words, the algorithm is doing exactly what you told it to do.

Notice that in the boundary between two work-group, your algorithm is not doing what you think it’s doing. After one work-group has finished executing, the pixels in that region of the image have already been altered. When another work-group reads from those pixels, the values are not what they were originally in the image.

In other words, as long as you try to do the image operation in place you will see artifacts. What you should do instead is to read from one image and write into a different image. Then the artifacts will disappear.

Notice that in the boundary between two work-group, your algorithm is not doing what you think it’s doing. After one work-group has finished executing, the pixels in that region of the image have already been altered. When another work-group reads from those pixels, the values are not what they were originally in the image.

Isn’t barrier supposed to synchronize work-item in work-group together, or it does it on work-group internally?

In other words, as long as you try to do the image operation in place you will see artifacts. What you should do instead is to read from one image and write into a different image. Then the artifacts will disappear.

I tried to do before and I tried now, with 3th argument in kernel like (brief code):


__kernel void Closing(__global unsigned char *a,__global unsigned char *b, __global unsigned char *c)
{
   ....
   pixel = dilated_pixel(...);  // reading "a" buffer, like doing previous post
   b[row*WIDTH + col] = pixel ;
   barrier(CLK_GLOBAL_MEM_FENCE);
   pixel = eroded_pixel(...); // reading "b" buffer
   c[row*WIDTH + col] = pixel ;
}

, and same problem occurs.

The following code create black image, probably because of big size which can’t be allocate :


__kernel void Closing(__global unsigned char *a, __global unsigned char *c)
{
   ...
   unsigned char b[HEIGHT*WIDTH];
   pixel = dilated_pixel(...);  // reading "a" buffer, like doing previous post
   b[row*WIDTH + col] = pixel ;
   barrier(CLK_GLOBAL_MEM_FENCE);
   pixel = eroded_pixel(...); // reading "b" buffer
   c[row*WIDTH + col] = pixel ;
}

You can no longer edit or delete that post.

-> P.S.: Should I do dilate, then erode operation one after another in 2 step , I mean transferring datas to GPU 2 times, first for dilate, second for erode?

So, the way to do by send datas to GPU 2 times works. My goal was to do it in 1 time on the kernel. If this is the only way to do that , it’s OK for me. Else I’m open to suggestions.

Thank you for guiding me. :wink:

Okay, since the explanation didn’t work I will show one simple way to make it work (there are better ways):


__kernel void Closing_first(const __global unsigned char *in,
                                __global unsigned char *out)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    unsigned char pixel,neighborhood[9];

   // Read from input image
   if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
   {
      neighborhood[0]= in[(row-1)*WIDTH + col-1];
      neighborhood[1]= in[(row-1)*WIDTH + col];
      neighborhood[2]= in[(row-1)*WIDTH + col+1];
      ...
      pixel = Max(neighborhood, 9);   //dilate
   }
   else
   {
      pixel = in[row*WIDTH+col];
   }
   
   // Write into destination image.
   // Do not write into input image! You would destroy it and cause artifacts.
   out[row*WIDTH + col] = pixel ;
}


__kernel void Closing_second(const __global unsigned char *in,
                                __global unsigned char *out)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    unsigned char pixel,neighborhood[9];

   pixel = in[row*WIDTH+col];

   // Read from input image
   if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
   {
      neighborhood[0]= in[(row-1)*WIDTH + col-1];
      neighborhood[1]= in[(row-1)*WIDTH + col];
      neighborhood[2]= in[(row-1)*WIDTH + col+1];
      ...

      pixel = Min(neighborhood, 9);   //erode
   }
   else
   {
      pixel = in[row*WIDTH+col];
   }

   // Write into destination image.
   // Do not write into input image! You would destroy it and cause artifacts.
   out[row*WIDTH+col] = pixel;
}

From the API side you have to enqueue two kernels: one for Closing_first and then another for closing_second. Make sure that the output from Closing_first becomes the input of Closing_second.

Once you get that version working, you can try a better algorithm using local memory. The idea is to store the intermediate result in local memory instead of having two enqueue two kernels.

Yes, that’s exactly what I meant by “transfer datas to GPU 2 times” and I did exactly what you wrote, and it had worked well.

Once you get that version working, you can try a better algorithm using local memory. The idea is to store the intermediate result in local memory instead of having two enqueue two kernels.

For now, I try to do some example on global memory only. Moreover, I have a real question about this, generally:

  • , should I do lot of example on global memory first for understand how kernels work, then improve them by working on local memory?
  • or, should I work on local memory as soon as possible, and not waste time. ?

(I can ask the question on more appropriate thread ?)

should I do lot of example on global memory first for understand how kernels work, then improve them by working on local memory?

Yes, that’s what I would do. It’s easier to write a first version that works using global memory only. Also, it may be fast enough for what you need.

This is a general rule for writing software that has worked well in the past: always start with the smallest and simplest version of your program that you can imagine. After you get that one working you can add features and complexity later.

OK, thank you!

And I’ll remember (this kind of) trick you told me with local memory when I’ll try to improve my code.