help with work items in work groups

gatodelsol · September 13, 2011, 1:41pm

I do not understand the basics of global_work_size and local_work_size. I have used CUDA and am new to OpenCL.

I have set my work_items to 64, 1, 1 and work groups to 512, 128, 1. If I do the following:

__kernel void testKernel(__global uint* output, uint x)
{
    uint4 gid = (uint4)(get_global_id(0), get_global_id(1), get_global_id(2), 1);
    uint width = get_global_size(0);
    uint height = get_global_size(1);
    uint index = gid.x + (gid.y * width) + (gid.z * width * height);
    output[index] = x;
}
things work. I was expecting the width to be get_local_size(0) * get_global_size(0).
I expected gid.x to go from 0 to (64 * 512), as in CUDA with 64 threads in each of 512 blocks.
What are the work items for, just shared memory grouping? Are there really (512 / 64), 128, 1 blocks containing 64 threads?

I am sure there is a simple answer to my problem. I have searched, and what I found leads me to conclude that my confusion is very basic. Please do not just direct me somewhere else, because I have been there and I still do not understand.

I really do not understand the groups, get_num_groups and get_group_id. I do not see what use the values they return are inside a kernel.

Thanks

david.garcia · September 13, 2011, 4:14pm

Have you tried reading section 3.2 of the OpenCL 1.1 specification? I think it answers your questions.

notzed · September 13, 2011, 5:06pm

Your query is a little confusing, so forgive me if I didn’t understand it correctly.

You state you set the ‘work groups’ to 512,128,1, but you can’t set the work-groups explicitly: you only set the global work size and the local work size, and the number of work-groups follows from those.

This is probably the confusion.
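For someone coming from CUDA, the difference can be sketched in plain C. CUDA launches are described as blocks per grid times threads per block, whereas the global_work_size passed to clEnqueueNDRangeKernel is already the total number of work-items. The helper names below are hypothetical and not part of OpenCL; this is only an illustration of the arithmetic for one dimension, assuming the local size divides the global size evenly (as OpenCL 1.x requires).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call): turn a CUDA-style launch,
   given as blocks per grid and threads per block, into the global work
   size OpenCL expects for one dimension. */
size_t global_size_from_cuda(size_t blocks, size_t threads_per_block)
{
    return blocks * threads_per_block;
}

/* Hypothetical helper: the number of work-groups OpenCL derives for one
   dimension. Assumes local_size divides global_size evenly. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size != 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

With the numbers from the question, a global size of 512 and a local size of 64 gives num_groups(512, 64) == 8 work-groups in dimension 0. To get the CUDA-like 512 groups of 64, the global size would have to be global_size_from_cuda(512, 64) == 32768.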

If you set the global work size to 512,128,1 then gid will go from (0,0,0) to (511,127,0) only, with a unique value in every ‘thread’.
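This is why the kernel in the question works with width = get_global_size(0). As a rough host-side check in plain C (a simulation, not OpenCL code; the function names are made up for this sketch), one can loop over every global ID in the range and confirm that the kernel's index formula writes each output slot exactly once:

```c
#include <assert.h>
#include <stdlib.h>

/* Same arithmetic as the kernel's index computation. */
size_t flat_index(size_t gx, size_t gy, size_t gz, size_t width, size_t height)
{
    return gx + gy * width + gz * width * height;
}

/* Host-side simulation: visit every global ID in a gw x gh x gd range.
   Returns 1 if every output slot is written exactly once, otherwise 0. */
int covers_each_slot_once(size_t gw, size_t gh, size_t gd)
{
    size_t total = gw * gh * gd;
    unsigned char *hits = calloc(total, 1);
    int ok = 1;

    if (hits == NULL)
        return 0;

    for (size_t z = 0; z < gd; ++z)
        for (size_t y = 0; y < gh; ++y)
            for (size_t x = 0; x < gw; ++x) {
                size_t i = flat_index(x, y, z, gw, gh);
                /* Out of range, or already written by another work-item. */
                if (i >= total || hits[i]++ != 0)
                    ok = 0;
            }

    free(hits);
    return ok;
}
```

For a global size of 512,128,1 the last work-item is (511,127,0), which maps to index 65535, the final element of a 512 * 128 buffer.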

And basically get_local_id(0) will return get_global_id(0) mod get_local_size(0) (64).

So yes, there will be (512/64),128,1 work-groups, and the grouping is for local memory, register allocation, threads, and so on. (I think …) In CUDA terms a work-group corresponds to a thread block, and on NVIDIA hardware a multiprocessor may run more than one work-group at a time if there are enough resources to fit them.

get_global_size() will be exactly the global sizes passed in, i.e. 512,128,1, and get_local_size() will be exactly the local sizes passed in, i.e. 64,1,1.
get_num_groups(x) will be get_global_size(x) / get_local_size(x)
get_group_id(x) will be get_global_id(x) / get_local_size(x)
etc.
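The relations above can be written out for one dimension as a small C sketch. It assumes no global work offset is used, and the struct and function names are hypothetical, chosen only for this illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Position of one work-item in a single dimension. */
typedef struct {
    size_t group_id; /* what get_group_id(x) would return */
    size_t local_id; /* what get_local_id(x) would return */
} WorkItemPos;

/* Split a global ID into its group ID and local ID.
   Assumes no global offset and local_size > 0. */
WorkItemPos locate(size_t global_id, size_t local_size)
{
    WorkItemPos p;
    p.group_id = global_id / local_size;
    p.local_id = global_id % local_size;
    return p;
}

/* The inverse: rebuild the global ID from group ID and local ID. */
size_t global_id_from(size_t group_id, size_t local_id, size_t local_size)
{
    return group_id * local_size + local_id;
}
```

With a local size of 64, global ID 130 sits in work-group 2 at local ID 2, and the last work-item in dimension 0 (global ID 511) sits in work-group 7 at local ID 63.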

See section 6.11.1 (Work Item Functions) in the spec.

gatodelsol · September 14, 2011, 9:12am

Thanks notzed for the reply. I get it now. I needed a simple explanation that was different from the spec and the other things I found on the net to get past my confusion about globals, locals and work groups.

Thanks