2nd Output Array gives garbage!

Hi,

I am a newbie to OpenCL and have been tasked with some image processing work.
Anyway, I am passing 2 sets of YUV data (left and right images of size 320 x 168) to a kernel function, which computes the gradient of each pixel using SAD (sum of absolute differences). For my first output, I only use 1 set of YUV data and it works nicely, with values identical to those computed by the CPU.

However, when I try to add a 2nd set of YUV data to the kernel function and compute the 2nd output array, nothing seems to work (on the 2nd output array). I even tried hard-coding all its members to 5, but the output array still shows values of its own.

Here is the kernel function implementation (output arrays are the first 2 parameters):
const char grad_l_h_cl[] =
"__kernel void grad_l_h\n"
"(\n"
"    __global unsigned char* img_grad_left_hor,\n"
"    __global unsigned char* img_grad_right_hor,\n"
"    __global unsigned char* p1_y,\n"
"    __global unsigned char* p1_u,\n"
"    __global unsigned char* p1_v,\n"
"    __global unsigned char* p2_y,\n"
"    __global unsigned char* p2_u,\n"
"    __global unsigned char* p2_v,\n"
"    int width,\n"
"    int height\n"
")\n"
"{\n"
"    const uint index = get_global_id(0);\n"
"    // note: these three are computed but never used below\n"
"    unsigned char diff_y = p2_y[index] - p2_y[index + 1], diff_u = 0, diff_v = 0;\n"
"\n"
"    if (index % width == width - 1) {\n"
"        // last pixel of a row: difference with the previous pixel\n"
"        img_grad_left_hor[index]  = abs_diff(p1_y[index-1], p1_y[index]) + abs_diff(p1_u[index-1], p1_u[index]) + abs_diff(p1_v[index-1], p1_v[index]);\n"
"        img_grad_right_hor[index] = abs_diff(p2_y[index-1], p2_y[index]) + abs_diff(p2_u[index-1], p2_u[index]) + abs_diff(p2_v[index-1], p2_v[index]);\n"
"    } else {\n"
"        // otherwise: difference with the next pixel\n"
"        img_grad_left_hor[index]  = abs_diff(p1_y[index], p1_y[index+1]) + abs_diff(p1_u[index], p1_u[index+1]) + abs_diff(p1_v[index], p1_v[index+1]);\n"
"        img_grad_right_hor[index] = abs_diff(p2_y[index], p2_y[index+1]) + abs_diff(p2_u[index], p2_u[index+1]) + abs_diff(p2_v[index], p2_v[index+1]);\n"
"    }\n"
"    if (img_grad_left_hor[index] == 0) {\n"
"        img_grad_left_hor[index] = 0;\n"
"    }\n"
"    if (img_grad_left_hor[index] > 255) {\n"
"        img_grad_left_hor[index] = 255;\n"
"    }\n"
"}\n";

Here is how I perform the operation (g_worksize = 8 x 320 x 168, l_worksize = 256):
error=clEnqueueNDRangeKernel(cq, k_cfg, 1, NULL, &g_worksize, &l_worksize, 0, NULL, NULL);

I have created input buffers as such (work = 320 x 168):
memp1_u=clCreateBuffer(context, CL_MEM_READ_ONLY, worksize, NULL, &error);

I suspect some memory settings are required to hold the data in the 2nd output array but I have no idea how to do so.

Please kindly help or advise.

*My system is:
GT220 - 6 multiprocessors, 48 CUDA cores, Compute Capability 1.2
GPU Computing SDK 3.2
WinXP Pro

Have you tried writing the kernel so that it only processes one image, and then calling clEnqueueNDRangeKernel() twice, once for the left image and once for the right? Would that be a problem?

Also, since you are not using local memory for anything, I would suggest passing NULL instead of &l_worksize when you call clEnqueueNDRangeKernel.

Could you explain what’s the reason for “g_worksize = 8 x 320 x 168”? If the images are of size 320x168, wouldn’t it make more sense to have the global work size equal to 320x168?

Finally, please show us how you create the buffer objects for all the kernel arguments (including img_grad_{left,right}_hor) and also the calls to clSetKernelArg(). There may be something there causing the problem you see.

Thanks for your reply, David.

I am currently using your proposed method, after giving up on making the posted code work.
I still can’t understand why the posted code fails to work. I have seen sample code with more than 1 output array and many more elements.

I use 256 for l_worksize because I read in a presentation that it’s a reasonable value for parallelism (and I have a matching number of cores). Anyway, I will pass NULL.

Basically, I thought the memory space required for the whole kernel to run would need a factor of 8 (2 x 3 sets of YUV data plus 2 outputs), even though I had already set up buffers for the inputs and outputs. I was just trying all sorts of methods to get it working. I really have no idea why the 2nd output array contains all “un-editable” values.

This is how I create the buffers and set the kernel args (worksize is 320 x 168):
cl_mem memp1_y, memp1_u, memp1_v, memp2_y, memp2_u, memp2_v;
cl_mem memgrad_l_h, memgrad_l_v, memgrad_r_h, memgrad_r_v;
memp1_y=clCreateBuffer(context, CL_MEM_READ_ONLY, worksize, NULL, &error);
memp1_u=clCreateBuffer(context, CL_MEM_READ_ONLY, worksize, NULL, &error);
memp1_v=clCreateBuffer(context, CL_MEM_READ_ONLY, worksize, NULL, &error);
memp2_y=clCreateBuffer(context, CL_MEM_READ_ONLY, worksize, NULL, &error);
memp2_u=clCreateBuffer(context, CL_MEM_READ_ONLY, worksize, NULL, &error);
memp2_v=clCreateBuffer(context, CL_MEM_READ_ONLY, worksize, NULL, &error);

memgrad_l_h=clCreateBuffer(context, CL_MEM_WRITE_ONLY, worksize, NULL, &error);
memgrad_r_h=clCreateBuffer(context, CL_MEM_WRITE_ONLY, worksize, NULL, &error);

// get a handle and map parameters for the kernel
cl_kernel k_cfg=clCreateKernel(prog, "cfg", &error);
clSetKernelArg(k_cfg, 0, sizeof(cl_mem), &memgrad_l_h);
clSetKernelArg(k_cfg, 1, sizeof(cl_mem), &memgrad_r_h);
clSetKernelArg(k_cfg, 2, sizeof(cl_mem), &memp1_y);
clSetKernelArg(k_cfg, 3, sizeof(cl_mem), &memp1_u);
clSetKernelArg(k_cfg, 4, sizeof(cl_mem), &memp1_v);
clSetKernelArg(k_cfg, 5, sizeof(cl_mem), &memp2_y);
clSetKernelArg(k_cfg, 6, sizeof(cl_mem), &memp2_u);
clSetKernelArg(k_cfg, 7, sizeof(cl_mem), &memp2_v);
clSetKernelArg(k_cfg, 8, sizeof(cl_int), &width);
clSetKernelArg(k_cfg, 9, sizeof(cl_int), &height);

I use 256 for l_worksize because I read in a presentation that it’s a reasonable value for parallelism (and I have a matching number of cores). Anyway, I will pass NULL.

The local work size has no relationship to the number of cores in your system. Whether it’s a good number or not depends on what your kernel is doing. Using a fixed number for the work size is bound to fail on some hardware for some kernels.

You should always check that the device can run a kernel at a particular work-size using clGetKernelWorkGroupInfo(…, CL_KERNEL_WORK_GROUP_SIZE, …) or simply passing NULL, which is what I suggested earlier.

Basically, I thought the memory space required for the whole kernel to run would need a factor of 8 (2 x 3 sets of YUV data plus 2 outputs)

The global work size has nothing to do with how much memory is required for an operation. The global work size represents how much computation needs to be performed. One way to think about it is: if you were running the code in a loop serially, how many loop iterations would you need? In your case you have 320x168 pixels, so the loop would run 320x168 times – once per pixel. That’s the global work size you need.

The code you showed in the last post looks correct, assuming that after every call you check that error is equal to CL_SUCCESS, and that you also check whether each clSetKernelArg() call returns CL_SUCCESS. Maybe the problem occurs when you read back the computed results?

Hi David,

Thank you so much for your advice.

I added the log to check the result of each call and everything is fine.
The magical thing is that just changing &l_worksize to NULL in clEnqueueNDRangeKernel and setting g_worksize to 320 x 168 made everything WORK.

Just 2 modification points (as you advised)… and of course, most importantly, understanding what I was doing.

As a rule of thumb, can I safely say that I should leave the local work size at its default (by passing NULL) and assume an optimal number of threads will be running to execute the kernel function? Then, what is the real purpose of this work size?

One more conceptual question: by setting the global work size to 320 x 168, am I launching 320 x 168 threads (or as many as the hardware supports) running in parallel, not serially?

As a rule of thumb, can I safely say that I should leave the local work size at its default (by passing NULL) and assume an optimal number of threads will be running to execute the kernel function? Then, what is the real purpose of this work size?

These are great questions. Typically, the situations where you want to control the work group size are those in which the kernel is using local memory. Why? Because local memory is small and it’s shared among all work-items in a work group so you will have to balance between group size and how much local memory you have.

If your kernel doesn’t use any local memory, then I recommend passing NULL. The OpenCL implementation will choose for you a reasonable value. Maybe it won’t be optimal, but it should be good enough.

One more conceptual question: by setting the global work size to 320 x 168, am I launching 320 x 168 threads (or as many as the hardware supports) running in parallel, not serially?

The global work size represents the amount of work that could be executed in parallel if your hardware was large enough. In reality what happens is that the hardware only has a limited number of “compute units”. Each compute unit runs one work-group at a time. In other words, all work-items in a work-group execute roughly at the same time. However, since the number of compute units in your hardware is limited, different work-groups will run earlier or later than other work-groups.

You can think of each compute unit as a cashier in a supermarket and each work-group as a customer: typically you have many more customers than cashiers and there will be queue(s) of customers.

One of the great things about OpenCL is that you simply describe how much computation you want to execute and OpenCL will distribute that work across all the compute units for you, so that you don’t need to rewrite different versions of your program for hardware that has more or less compute units.

Thanks, man!