Yes, you’re right. thanks. It works cause height and width have same size.
Furthermore, I think I can continue here, cause the code than I exposed was a part of the problem.
I’m working on dilate, erode, closing, opening filter. Dilatation and erosion works well.But closing operation give partial non-treated (or something else) pixels lines:
If I re-execute the same code, the new image have other lines with the problem :
writing to buffer (CL_MEM_READ_WRITE) is done after barrier:
What you are seeing is correct. In other words, the algorithm is doing exactly what you told it to do.
Notice that in the boundary between two work-group, your algorithm is not doing what you think it’s doing. After one work-group has finished executing, the pixels in that region of the image have already been altered. When another work-group reads from those pixels, the values are not what they were originally in the image.
In other words, as long as you try to do the image operation in place you will see artifacts. What you should do instead is to read from one image and write into a different image. Then the artifacts will disappear.
Notice that in the boundary between two work-group, your algorithm is not doing what you think it’s doing. After one work-group has finished executing, the pixels in that region of the image have already been altered. When another work-group reads from those pixels, the values are not what they were originally in the image.
Isn’t barrier supposed to synchronize work-item in work-group together, or it does it on work-group internally?
In other words, as long as you try to do the image operation in place you will see artifacts. What you should do instead is to read from one image and write into a different image. Then the artifacts will disappear.
I tried to do before and I tried now, with 3th argument in kernel like (brief code):
-> P.S.: Should I do dilate, then erode operation one after another in 2 step , I mean transferring datas to GPU 2 times, first for dilate, second for erode?
So, the way to do by send datas to GPU 2 times works. My goal was to do it in 1 time on the kernel. If this is the only way to do that , it’s OK for me. Else I’m open to suggestions.
Okay, since the explanation didn’t work I will show one simple way to make it work (there are better ways):
__kernel void Closing_first(const __global unsigned char *in,
__global unsigned char *out)
{
int row = get_global_id(1);
int col = get_global_id(0);
unsigned char pixel,neighborhood[9];
// Read from input image
if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
{
neighborhood[0]= in[(row-1)*WIDTH + col-1];
neighborhood[1]= in[(row-1)*WIDTH + col];
neighborhood[2]= in[(row-1)*WIDTH + col+1];
...
pixel = Max(neighborhood, 9); //dilate
}
else
{
pixel = in[row*WIDTH+col];
}
// Write into destination image.
// Do not write into input image! You would destroy it and cause artifacts.
out[row*WIDTH + col] = pixel ;
}
__kernel void Closing_second(const __global unsigned char *in,
__global unsigned char *out)
{
int row = get_global_id(1);
int col = get_global_id(0);
unsigned char pixel,neighborhood[9];
pixel = in[row*WIDTH+col];
// Read from input image
if(row > 0 && row < HEIGHT-1 && col > 0 && col < WIDTH-1)
{
neighborhood[0]= in[(row-1)*WIDTH + col-1];
neighborhood[1]= in[(row-1)*WIDTH + col];
neighborhood[2]= in[(row-1)*WIDTH + col+1];
...
pixel = Min(neighborhood, 9); //erode
}
else
{
pixel = in[row*WIDTH+col];
}
// Write into destination image.
// Do not write into input image! You would destroy it and cause artifacts.
out[row*WIDTH+col] = pixel;
}
From the API side you have to enqueue two kernels: one for Closing_first and then another for closing_second. Make sure that the output from Closing_first becomes the input of Closing_second.
Once you get that version working, you can try a better algorithm using local memory. The idea is to store the intermediate result in local memory instead of having two enqueue two kernels.
Yes, that’s exactly what I meant by “transfer datas to GPU 2 times” and I did exactly what you wrote, and it had worked well.
Once you get that version working, you can try a better algorithm using local memory. The idea is to store the intermediate result in local memory instead of having two enqueue two kernels.
For now, I try to do some example on global memory only. Moreover, I have a real question about this, generally:
, should I do lot of example on global memory first for understand how kernels work, then improve them by working on local memory?
or, should I work on local memory as soon as possible, and not waste time. ?
(I can ask the question on more appropriate thread ?)
should I do lot of example on global memory first for understand how kernels work, then improve them by working on local memory?
Yes, that’s what I would do. It’s easier to write a first version that works using global memory only. Also, it may be fast enough for what you need.
This is a general rule for writing software that has worked well in the past: always start with the smallest and simplest version of your program that you can imagine. After you get that one working you can add features and complexity later.