Maximum number of work-items

My GPU contains 18 compute units and each work-group supports a maximum of 256 work-items. When I execute my kernel with 16 * 256 items, OpenCL creates 16 work-groups and I get the right answer. But when I execute with 32 * 256 items, OpenCL creates 32 work-groups and I get the wrong answer.

Does the maximum # of items equal compute_units * max_work_group_size? Or is there a way to code kernels to support more work-items?

How do the extra work-groups access local memory if there are only 18 local memory blocks on the device? For example, my kernel uses barrier(CLK_LOCAL_MEM_FENCE) to synchronize local memory access. Is that causing the problem?

Does the maximum # of items equal compute_units * max_work_group_size? Or is there a way to code kernels to support more work-items?

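The number of work-items you can enqueue in a single NDRange is not limited by the number of compute units. The maximum work-group size (CL_DEVICE_MAX_WORK_GROUP_SIZE) only caps how many work-items go into each work-group, not how many work-groups you can launch. It doesn’t matter what your hardware looks like; your OpenCL implementation has to make it work.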

How do the extra work-groups access local memory if there are only 18 local memory blocks on the device? For example, my kernel uses barrier(CLK_LOCAL_MEM_FENCE) to synchronize local memory access. Is that causing the problem?

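Sequentially! Let’s say you have 10 physical cores, each of them capable of executing a whole work-group at a time, and you enqueue an NDRange that has 20 work-groups. Your hardware will execute 10 work-groups at a time, and the remaining 10 start as the earlier ones finish. Each work-group gets its own local memory for as long as it runs, and a barrier only synchronizes work-items within the same work-group, so barriers or local memory won’t be a problem.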
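To make the arithmetic concrete, here is a minimal sketch in plain C (host-side only, no OpenCL calls). The function names work_group_count and waves_needed are made up for illustration, and the "one work-group per compute unit at a time" model is a simplifying assumption; real devices may keep several work-groups resident per compute unit, and the actual scheduling is up to the implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups in a 1-D NDRange.
 * Assumes the global size is a multiple of the local size. */
size_t work_group_count(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Simplified model (an assumption, not how any particular driver works):
 * each compute unit runs exactly one work-group at a time, so the
 * work-groups are processed in ceil(num_groups / compute_units) rounds. */
size_t waves_needed(size_t num_groups, size_t compute_units)
{
    assert(compute_units > 0);
    return (num_groups + compute_units - 1) / compute_units;
}
```

With the numbers from the question (18 compute units, local size 256), 16 * 256 work-items give 16 work-groups that fit in one round, while 32 * 256 give 32 work-groups that need two rounds under this model. Either way every work-group runs to completion, so the larger launch is legal.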

It’s not possible to diagnose the problem you are seeing without having some more information. Have you checked that your buffers and images are large enough? Does the error always start happening after a certain NDRange size or is it random? Can you show us the source code of the kernel causing problems?
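As a rough illustration of the buffer-size check, the sketch below is plain C with a hypothetical helper, buffer_covers_ndrange. It assumes the kernel reads or writes exactly one element per work-item, indexed by get_global_id(0); your kernel may access memory differently, in which case the check has to be adapted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if a buffer of buffer_bytes bytes can hold
 * one element of elem_bytes bytes for every work-item in a 1-D NDRange,
 * and 0 otherwise. Assumes one element per work-item. */
int buffer_covers_ndrange(size_t buffer_bytes, size_t global_size,
                          size_t elem_bytes)
{
    assert(elem_bytes > 0);
    /* Divide instead of multiplying to avoid overflow in global_size * elem_bytes. */
    return global_size <= buffer_bytes / elem_bytes;
}
```

For example, a buffer allocated for 16 * 256 floats passes the check for a 16 * 256 launch but fails it for a 32 * 256 launch. In that situation the extra work-items would access memory past the end of the buffer, which could produce wrong results of the kind you describe.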