PDA

View Full Version : Problem with frequency algorithm

JUHAN
05-27-2011, 05:37 AM
Hey guys,

I have a problem with an algorithm.

I have point lists with x- and y-coordinates. For each combination of points I calculate a result, with a lot of operations per OpenCL work-item.
Then I want to count how often the different results occur (the frequency of each result).
At the moment the calculation of the results is done in a kernel and the results are saved to a list.
In the C++ part I count the frequencies in an array...

Is it possible to do this directly in OpenCL? I tried it, but I ran into problems. I think it couldn't work because it isn't synchronised.

Thanks!

david.garcia
05-27-2011, 03:38 PM
Basically what you want to do is compute a histogram in parallel. I believe both NVIDIA and AMD include one in their SDK examples.

JUHAN
05-28-2011, 03:17 PM
This is right. Thanks!

In the current version I merge the results in C++. So I have ~5,000,000 global work-items (~3000 points). My "histogram" memory has 200*200 bins.
Now I have to use local work-items.
What would be the best distribution?

JUHAN
05-30-2011, 01:32 PM
I tried the following setup:

Kernel arguments:

// set the argument list for the kernel command
clSetKernelArg(kernel_houghaccu, 0, sizeof(cl_mem), &inputX);
clSetKernelArg(kernel_houghaccu, 1, sizeof(cl_mem), &inputY);
clSetKernelArg(kernel_houghaccu, 2, sizeof(cl_mem), &inputZ);
clSetKernelArg(kernel_houghaccu, 3, groupSize * 201*201 * sizeof(cl_int), NULL);
clSetKernelArg(kernel_houghaccu, 4, sizeof(cl_mem), &completeArray);

The kernel is started with:

size_t globalThreads = 1024 * 1024;

For a test I tried the following kernel code:

__kernel void houghaccu(__global float *inputX,
                        __global float *inputY,
                        __global float *inputZ,
                        __local int *sharedArray,
                        __global int *completeArray)
{
    size_t id = get_global_id(0);

    size_t localId = get_local_id(0);
    size_t groupSize = get_local_size(0);

    /* initialise every bin of this work-item's histogram with 0 */
    for (int i = 0; i < 201*201; i++)
        sharedArray[localId * 201*201 + i] = 0;

    barrier(CLK_LOCAL_MEM_FENCE);

    int value = 1;

    for (int i = 0; i < 201*201; ++i)
    {
        sharedArray[localId * 201*201 + value]++;
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    /* merge all per-work-item histograms into the group histogram */
    for (int i = 0; i < 201*201; i++)
    {
        uint binCount = 0;
        for (int j = 0; j < groupSize; j++)
            binCount += sharedArray[j * 201*201 + i];

        completeArray[i] = binCount;
    }
}

I don't understand what mistake I have made. completeArray[i] is always empty, but at one position it should be filled.

david.garcia
05-30-2011, 03:46 PM
Did you get any error codes from any of the API calls you made, such as clEnqueueNDRangeKernel()? Did you pass a pfn_notify function to clCreateContext()?

I doubt that this line of code will work:

clSetKernelArg(kernel_houghaccu, 3, groupSize * 201*201 * sizeof(cl_int), NULL);

It requests about 40MB of local memory, and your hardware almost certainly cannot provide that much.

JUHAN
05-31-2011, 06:37 AM
I tried it now with a group size of 8.
Now 1.25 MB is used.

This isn't the "main" problem...

david.garcia
05-31-2011, 03:01 PM

As for the usage of local memory, 1.25MB is still a lot more than what your hardware probably supports. You can query the amount of available local memory with clGetDeviceInfo(..., CL_DEVICE_LOCAL_MEM_SIZE, ...).

As long as clEnqueueNDRangeKernel() requires more local memory than is available on your system the program will not work.

JUHAN
06-01-2011, 06:13 AM
Sorry, I mistook CL_DEVICE_MAX_MEM_ALLOC_SIZE for that value...

CL_DEVICE_LOCAL_MEM_SIZE is only 16 kByte.

Now I have a problem.

In my old version (without groups) I merge the results slowly in C++. Is it possible to do it in one "histogram" if it is synchronised?

JUHAN
06-01-2011, 06:34 AM
Now I tried counting the results in one "histogram" without synchronising.

~53,000 (of ~5,000,000) results were not counted in this version...

david.garcia
06-01-2011, 03:45 PM
You can compute partial histograms within a work-group using barriers and finally add together the partial histograms using atomic operations like atom_add().

JUHAN
06-02-2011, 04:11 AM
Yes, I see the option with the atomic operations.

Work-groups won't work because one instance of the memory needs ~160 kB (201 * 201 * sizeof(int))... Do you see another way with work-groups?

I will try to increment the histogram memory with atom_add(). But for that I need OpenCL 1.1, which I don't have at the moment.

david.garcia
06-02-2011, 05:56 AM
Atomics are also available on OpenCL 1.0 through an extension. All you need to do is add this at the beginning of your kernel source code:

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

Although perhaps you are saying that your device doesn't support that extension? In that case you can compute partial histograms, store them in global memory and then launch one more NDRange kernel with a single work-group to add together the partial histograms.

JUHAN
06-02-2011, 02:24 PM
Now I have updated my system to OpenCL 1.1. I had to change some things. Now I get this error message when compiling the kernel:

Try to compile the program... Error: Failed to build program executable!
1e8, 0xf76ce78, 0xf764d98 <0xeff5eb8:0> <volatile> alignment=4

This happens when I use:

atom_inc(&result[position]);

I got the following information about CL_DEVICE_EXTENSIONS from the NVIDIA OpenCL Device Query:
cl_khr_icd
cl_khr_gl_sharing
cl_nv_d3d9_sharing
cl_nv_d3d10_sharing
cl_khr_d3d10_sharing
cl_nv_d3d11_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_nv_pragma_unroll

So I can't use atomic operations on my machine?!

The problem with the partial-histogram solution is that I need 201*201*sizeof(int) for each partial histogram. That can't work because the maximum I can allocate for local variables is 16 kB...

david.garcia
06-03-2011, 08:04 PM
The problem of the solution with partial histograms is that i need 201*201*sizeof(int) for each partial histogram.

You can do it in multiple phases. Instead of computing a histogram with 201x201 bins, you can first compute partial histograms with (for example) only 201 bins each and then refine each of them in another step. Think of it as a multiresolution approach. That's how I would do it.

If I were you I would go to citeseer.ist.psu.edu and try to find if there are some papers on that topic. It must be a well-researched area.