Some startup tips

Hi,

I was thinking of trying out OpenCL for running calculations on a Mandelbrot program and I have a few questions.

As far as I know you want to have as many workers as possible (well not as many as you can maybe but quite a few anyway). Mandelbrot is calculated by performing a number of iterations for each pixel on the screen.
Is it a good idea to have one worker for each pixel (this would limit the window to 512x512 on my computer it seems since local_work_size == global_work_size == 512) or what do you think?

If I have one worker per pixel how do I get the index of the pixel? I’ve tried get_global_id(0)*512 + get_local_id(0) but that didn’t seem to work at all.

Otherwise I could just calculate each row in one worker, but the problem is if I have more than 512 rows. How is this best solved?

Regards
Nicklas

As far as I know you want to have as many workers as possible (well not as many as you can maybe but quite a few anyway).

Yes, you are right.

Is it a good idea to have one worker for each pixel (this would limit the window to 512x512 on my computer it seems since local_work_size == global_work_size == 512) or what do you think?

I think that one work-item per pixel is a great place to start (*). I don’t think you will have to limit yourself to 512x512 since there’s no need for the local work size to be a particular number. Correct me if I’m wrong, but in naive Mandelbrot computations each pixel is independent of the rest. If that is the case, then the local size does not matter and you can make the picture as large as you want.

If I have one worker per pixel how do I get the index of the pixel? I’ve tried get_global_id(0)*512 + get_local_id(0) but that didn’t seem to work at all.

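What you want is: x = get_global_id(0); y = get_global_id(1); with a two-dimensional global work size (width x height). The local ID is only the position inside a work-group, so you don't need it to find the pixel.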
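A minimal sketch of that mapping, written as plain C so it runs without an OpenCL device. It assumes a two-dimensional global size of width x height and a row-major output buffer; the names pixel_index and fill_with_ids are made up for illustration. In a real kernel the loops disappear and x and y come from get_global_id(0) and get_global_id(1).

```c
#include <stddef.h>

/* Row-major offset of pixel (x, y) in a buffer that is `width` pixels wide.
   In an OpenCL kernel: x = get_global_id(0), y = get_global_id(1). */
static size_t pixel_index(size_t x, size_t y, size_t width)
{
    return y * width + x;
}

/* Plain-C stand-in for a 2D launch of width x height work-items.
   Each (x, y) pair plays the role of one work-item and writes exactly one
   element, tagged with 1000 * y + x so you can see which "work-item" wrote it. */
static void fill_with_ids(unsigned *out, size_t width, size_t height)
{
    for (size_t y = 0; y < height; ++y)
        for (size_t x = 0; x < width; ++x)
            out[pixel_index(x, y, width)] = (unsigned)(1000 * y + x);
}
```

Note that nothing here depends on the local size, which is why the image does not have to be 512x512.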

Otherwise I could just calculate each row in one worker, but the problem is if I have more than 512 rows. How is this best solved?

Computing one row in each work-item would produce too few work-items for the GPU to perform well.

(*) The only downside of that approach is that some pixels (those inside or close to the Mandelbrot set) take much longer to compute than others, and they will become the bottleneck of the algorithm. However, one work-item per pixel is definitely the right place to start; don’t worry about performance too much at this stage.
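To make the uneven cost concrete, here is a sketch of the usual escape-time loop that each work-item would run for its pixel, again in plain C. The function name mandel_iterations and the max_iter parameter are assumptions for illustration, not something from this thread.

```c
/* Escape-time iteration count for the point c = (cr, ci).
   Iterates z = z*z + c from z = 0 and stops once |z| > 2.
   Returns max_iter if the orbit never escapes within the limit. */
static int mandel_iterations(double cr, double ci, int max_iter)
{
    double zr = 0.0, zi = 0.0;
    int i = 0;
    while (i < max_iter && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;  /* real part of z*z + c */
        zi = 2.0 * zr * zi + ci;            /* imaginary part of z*z + c */
        zr = t;
        ++i;
    }
    return i;
}
```

A point inside the set such as c = 0 runs all max_iter iterations, while a point far outside such as c = 2 + 2i exits after the first one, so neighbouring work-items can have very different run times.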

Oh, I thought that the global size was the number of “threads”, that the local size was the number of items per thread, and that you couldn’t have more “threads” than cores, which is 512 in my case. Maybe this is incorrect?

Ahh, thanks!

Oh, I thought that the global size was the number of “threads”, that the local size was the number of items per thread, and that you couldn’t have more “threads” than cores, which is 512 in my case. Maybe this is incorrect?

The standard intentionally avoids the term “thread” since it means very different things to different people. The global size represents the total number of work-items you want to spawn. You can think of each work-item as a scalar processor.

A small collection of work-items forms a work-group. Work-items within the same work-group can communicate through local memory and synchronize using execution barriers. Local memory and some other shared resources are the reason why you can’t have very large work groups.
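The relationship between the IDs can be written down in a few lines of plain C. This is only a sketch: global_id and round_up are hypothetical helper names, and the formula assumes a global offset of zero.

```c
#include <stddef.h>

/* With a zero global offset, in each dimension d:
   get_global_id(d) == get_group_id(d) * get_local_size(d) + get_local_id(d) */
static size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    return group_id * local_size + local_id;
}

/* Round n up to the next multiple of `multiple` (multiple must be > 0).
   Useful when an explicit local size is passed and the global size has to
   be a multiple of it, as OpenCL 1.x requires. */
static size_t round_up(size_t n, size_t multiple)
{
    return (n + multiple - 1) / multiple * multiple;
}
```

If the global size is padded this way, the kernel needs a bounds check (x < width, y < height) so the extra work-items do not write outside the image.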

Thanks for a very detailed answer :)
I’ll continue to mess around with it and see how it goes.

BTW, what’s the best/easiest way to represent floating numbers with a large amount of decimals in opencl?

BTW, what’s the best/easiest way to represent floating numbers with a large amount of decimals in opencl?

If your device supports double-precision floats, that’s an easy way that may have enough range/precision for your needs. Query the device extension string for cl_khr_fp64 and start your kernels with this:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

If doubles are not good enough, there are other (more troublesome) ways to go about it.
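For the extension query mentioned above, the host program gets a space-separated string from clGetDeviceInfo with CL_DEVICE_EXTENSIONS. A small sketch of checking that string for a whole-token match follows; has_extension is a hypothetical helper, and the OpenCL call itself is left out so the snippet runs anywhere.

```c
#include <string.h>

/* Returns 1 if `name` appears as a whole space-separated token in
   `extensions`, 0 otherwise. A bare strstr() is not enough, because it
   would also match a longer name that merely contains `name`. */
static int has_extension(const char *extensions, const char *name)
{
    size_t len = strlen(name);
    const char *p = extensions;

    if (len == 0)
        return 0;
    while ((p = strstr(p, name)) != NULL) {
        int starts = (p == extensions) || (p[-1] == ' ');
        int ends = (p[len] == '\0') || (p[len] == ' ');
        if (starts && ends)
            return 1;
        p += len;  /* keep searching after this partial match */
    }
    return 0;
}
```

If has_extension(ext_string, "cl_khr_fp64") returns 0, the kernel would have to fall back to float.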