Problem with frequency algorithm

JUHAN · May 27, 2011, 5:37am

Hey guys,

i have a problem with an algorithm.

I have point lists with x- and y-coordinates. I calculate a result for each combination of the points with a lot of operations in each opencl process.
Then I want to count how often the different results have been found (frequency of each result).
At the moment the calculation of the results is done in a kernel. The results are saved to list.
In the C+±part i count the frequency in an array…

Is it possible to make this in OpenCL? I tried it, but i got problems. I think it couldn’t work because it isn’t synchronised.

Can anybody help me? I need an information about this or an information to find informations about this.

Thanks!

david.garcia · May 27, 2011, 3:38pm

Basically what you want to do is to compute a histogram in parallel. I believe both NVidia and AMD include that in their list of examples.

JUHAN · May 28, 2011, 3:17pm

This is right. Thanks!

In the actual version i merge the results in C++. So i have a number of global threads of ~5000000 (~3000 points). My “histogramm”-memory has 200*200 dimensions.
Now i have to use local threads.
What would be the best distribution?

JUHAN · May 30, 2011, 1:32pm

I tried the following properties:

Kernel arguments:

// set the argument list for the kernel command
	clSetKernelArg(kernel_houghaccu, 0, sizeof(cl_mem), &inputX);
	clSetKernelArg(kernel_houghaccu, 1, sizeof(cl_mem), &inputY);
	clSetKernelArg(kernel_houghaccu, 2, sizeof(cl_mem), &inputZ);
	clSetKernelArg(kernel_houghaccu, 3, groupSize * 201*201 * sizeof(cl_int), NULL);
	clSetKernelArg(kernel_houghaccu, 4, sizeof(cl_mem), &completeArray);

Kernel startet with:


size_t globalThreads = 1024 * 1024;
size_t localThreads = 256;

clEnqueueNDRangeKernel(command_queue, kernel_houghaccu, 1, NULL, &globalThreads,
		&localThreads, 0, NULL, NULL);

For a test i tried following kernel_code:

__kernel void houghaccu
		(__global float *inputX, 
		 __global float *inputY,
		 __global float *inputZ, 
		 __local  int *sharedArray,
		 __global int *completeArray)
{
	size_t id = get_global_id(0);

	size_t localId = get_local_id(0);
	size_t groupSize = get_local_size(0);

	/* initial any field with 0 */
	for(int i = 0; i < 201*201; i++)
        sharedArray[localId * 201*201 + i] = 0;

	barrier(CLK_LOCAL_MEM_FENCE);
    
	int value = 1;
	
    /* calculate thread-histograms */
    for(int i = 0; i < 201*201; ++i)
    {
        sharedArray[localId * 201*201 + value]++;
    }
    
    barrier(CLK_LOCAL_MEM_FENCE); 

	/* merge all thread-histograms into block-histogram */
    for(int i = 0; i < 201*201; i++)
    {
        uint binCount = 0;
        for(int j = 0; j < groupSize; j++)
            binCount += sharedArray[j * 201*201 + i];
            
        completeArray[i] = binCount;
    }

}

I don’t understand which mistake i have done. compleArray[i] is always empty. But at one position it should be filled.

david.garcia · May 30, 2011, 3:46pm

Did you get any error codes from any of the API calls you made, such as clEnqueueNDRangeKernel()? Did you pass a pfn_notify function to clCreateContext()?

I doubt that this like of code will work:

clSetKernelArg(kernel_houghaccu, 3, groupSize * 201*201 * sizeof(cl_int), NULL);

It is requesting 40MB of local memory and your hardware probably cannot support so much.

JUHAN · May 31, 2011, 6:37am

I tried it now with a group size of 8.
Now the is used 1,25 MB.

This isn’t the “main” problem…

david.garcia · May 31, 2011, 3:01pm

Could you answer the other questions about error codes, etc?

As for the usage of local memory, 1.25MB is still a lot more than what your hardware probably supports. You can query the amount of available local memory with clGetDeviceInfo(…, CL_DEVICE_LOCAL_MEM_SIZE, …).

As long as clEnqueueNDRangeKernel() requires more local memory than is available on your system the program will not work.

JUHAN · June 1, 2011, 6:13am

Sorry, i mistake CL_DEVICE_MAX_MEM_ALLOC_SIZE for this value…

CL_DEVICE_LOCAL_MEM_SIZE only has 16kByte.

Now i have a problem.

In my old version (without groups) i merge the results in c++ slowly. Is it possible to do it in one “histogram” if it is synchronized?

JUHAN · June 1, 2011, 6:34am

Now i tried the counting of the results in one “histogram” without synchonizing.

~53000 (of ~5000000) results were not counted in this version…

david.garcia · June 1, 2011, 3:45pm

You can compute partial histograms within a work-group using barriers and finally add together the partial histograms using atomic operations like atom_add().

JUHAN · June 2, 2011, 4:11am

Yes, i see the option with the atomic operations.

Work groups won’t work because one instance of the memory needs ~160kB (201 * 201 * sizeof(int))… Do you see an other way with work groups?

I will try to increment the histogramm memory with atom_add(). But therefor i need OpenCl 1.1. I don’t have it at the moment.

david.garcia · June 2, 2011, 5:56am

Atomics are also available on OpenCL 1.0 through an extension. All you need to do is add this at the beginning of your kernel source code:

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

Although perhaps you are saying that your device doesn’t support that extension? In that case you can compute partial histograms, store them in global memory and then launch one more NDRange kernel with a single work-group to add together the partial histograms.

JUHAN · June 2, 2011, 2:24pm

Now i updated my system to OpenCL 1.1. I had to change some things. Now i get this error message when compiling the kernel:

Try to compile the program... Error: Failed to build program executable!
Error: Code selection failed to select: 0xf764e38: i32,ch = AtomicLoadAdd 0xf764
1e8, 0xf76ce78, 0xf764d98 <0xeff5eb8:0> <volatile> alignment=4

This happens when i use:

atom_inc(&result[position]);

I got following informations with the NVDIA OpenCL Device Query about the CL_DEVICE_EXTENSIONS:
CL_DEVICE_EXTENSIONS: cl_khr_byte_addressable_store
cl_khr_icd
cl_khr_gl_sharing
cl_nv_d3d9_sharing
cl_nv_d3d10_sharing
cl_khr_d3d10_sharing
cl_nv_d3d11_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_nv_pragma_unroll

So i can’t use atomic operations with my machine?!

The problem of the solution with partial histograms is that i need 201201sizeof(int) for each partial histogram. That can’t work because i the maximum possible to allocate are 16kB for local variables…

david.garcia · June 3, 2011, 8:04pm

The problem of the solution with partial histograms is that i need 201201sizeof(int) for each partial histogram.

You can do it in multiple phases. Instead of computing a histogram with 201x201
bins, you can first compute partial histograms with (for example) only 201 bins each and then refine each of them in another step. Think of it as a multiresolution approach. That’s how I would do it.

If I were you I would go to citeseer.ist.psu.edu and try to find if there are some papers on that topic. It must be a well-researched area.