Some newbie questions about workitems and workgroup sizes.

Please go easy on me and help me understand some things. I have read a lot of documentation but am still confused about some parts, and I hope you can help break them down into simpler terms for me.

  1. Does the number of work-groups affect execution in any meaningful way? Or, are they simply there to provide an optional means of simplifying a problem for the developer?

  2. How does one queue an arbitrary number of workitems on a GPU? For example, say my algorithm requires me to execute 233 instances of a kernel in parallel, using the GPU. How is this typically done?

On my machine, 512 seems to be the minimum number of work-items. Would I queue up 512 instances of the kernel (workitems), and have the last 279 instances do nothing? Thanks ahead of time, and I appreciate any well-thought-out responses.

  1. Does the number of work-groups affect execution in any meaningful way? Or, are they simply there to provide an optional means of simplifying a problem for the developer?

This is how it works: each compute unit in your hardware can execute one work-group at a time. The number of work-groups you choose to execute depends on the amount of computation that your algorithm requires. If you have a lot of computation to do, you will typically need a lot of work-groups.

  1. How does one queue an arbitrary number of workitems on a GPU? For example, say my algorithm requires me to execute 233 instances of a kernel in parallel, using the GPU. How is this typically done?

The application chooses the number of work-items to execute when calling clEnqueueNDRangeKernel(). Do you see the global_work_size parameter? That’s how you choose how many work-items you want to run.
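For example, enqueueing exactly 233 work-items could look like the sketch below (not a complete program: `queue` and `kernel` are assumed to have been created earlier with clCreateCommandQueue() and clCreateKernel()):

```
/* Sketch only: assumes `queue` and `kernel` already exist. */
size_t global_work_size = 233;  /* total number of work-items to run */

cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    1,      /* work_dim: a 1-D range */
                                    NULL,   /* global_work_offset */
                                    &global_work_size,
                                    NULL,   /* local_work_size: let the
                                               implementation choose */
                                    0, NULL, NULL);
```

Passing NULL for local_work_size lets the OpenCL implementation pick a work-group size on its own; if you instead pass an explicit local size, the global size must be a multiple of it.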

On my machine, 512 seems to be the minimum number of work-items.

That’s probably the maximum number of work-items per work-group, not the minimum. The minimum is one on any hardware.
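You can check that limit yourself with clGetDeviceInfo(). A rough sketch (not a complete program: `device` is assumed to be a cl_device_id you already obtained from clGetDeviceIDs()):

```
/* Sketch: `device` is assumed to come from an earlier clGetDeviceIDs() call. */
size_t max_wg_size;
clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_wg_size), &max_wg_size, NULL);
printf("Max work-items per work-group: %zu\n", max_wg_size);
```

Note that the effective limit for a particular kernel can be lower than the device limit; you can query that with clGetKernelWorkGroupInfo() and CL_KERNEL_WORK_GROUP_SIZE.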

That’s probably the maximum number of work-items per work-group, not the minimum. The minimum is one on any hardware.

This is what is confusing me. I’m following along with the hello.c program listed here: http://developer.apple.com/library/mac/ … llo_c.html
On my machine, if I set global_work_size to point to a value of 512, 1024, or 2048 (etc.), it works fine. But any non-power-of-two value, or any value less than 512, produces errors.

That code is written to square 1024 floats. What if I only wanted to square 900 floats? If I simply change 1024 to 900 in that code, I get nothing but errors. Thanks for your patience, I appreciate it.

Actually, no. AMD and NVIDIA GPUs run several work-groups on a single compute unit at once.

That code is written to square 1024 floats. What if I only wanted to square 900 floats? If I simply change 1024 to 900 in that code, I get nothing but errors.

That’s because the code you linked to is explicitly specifying a work-group size when it calls clEnqueueNDRangeKernel(). Notice that global_work_size must always be a multiple of local_work_size.

There are three ways to solve that issue: pass a local work size that evenly divides the global work size you want, pass NULL as the local work size so the implementation chooses one for you, or round the global size up to a multiple of the local size and do something like this inside your kernel:


__kernel void foo(..., uint max_size)
{
    if(get_global_id(0) < max_size)
    {
        // Kernel code here.
    }
}