I do not understand the basics of global_work_size and local_work_size. I have used CUDA and am new to OpenCL.

I have set my work items (local_work_size) to 64, 1, 1 and my work groups (global_work_size) to 512, 128, 1. If I do the following:

__kernel void testKernel(__global uint* output, uint x)
{
    // Global position of this work-item in each dimension
    uint4 gid = (uint4)(get_global_id(0), get_global_id(1), get_global_id(2), 1);
    uint width = get_global_size(0);
    uint height = get_global_size(1);
    // Flatten the 3D global id into a linear index
    uint index = gid.x + (gid.y * width) + (gid.z * width * height);
    output[index] = x;
}
then things work. I was expecting the width to be get_local_size(0) * get_global_size(0).
I expected gid.x to go from 0 to (64 * 512), as in CUDA with 64 threads per block and 512 blocks.
What are the work items for, just shared memory grouping? Are there really (512 / 64), 128, 1 blocks containing 64 threads each?
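
To make the two readings I am comparing concrete, here is the arithmetic as plain host-side C. This is only a sketch: the helper names (cuda_total_threads, num_groups, global_id) are made up for illustration and are not part of the OpenCL API, and the last helper assumes no global offset is used.

```c
#include <assert.h>
#include <stddef.h>

/* Reading A (what I expected, CUDA style): the larger number counts
   blocks, so the total thread count is blocks * threads per block. */
size_t cuda_total_threads(size_t grid_size, size_t block_size)
{
    return grid_size * block_size;
}

/* Reading B (what the kernel above seems to show): the larger number
   already counts ALL work-items, so the number of groups is a quotient.
   Assumes local_size divides global_size evenly. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size != 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Under reading B, a global id in one dimension would be built from the
   group id and the local id like this (assuming a zero global offset). */
size_t global_id(size_t group_id, size_t local_size, size_t local_id)
{
    return group_id * local_size + local_id;
}
```

With my numbers, reading A gives cuda_total_threads(512, 64) = 32768 threads in x, while reading B gives num_groups(512, 64) = 8 groups in x and a largest gid.x of global_id(7, 64, 63) = 511.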

I am sure there is a simple answer to my problem. I have searched, which leads me to conclude that my confusion is very basic. Please do not just direct me somewhere else, because I have been there and I still do not understand.

I really do not understand the groups, get_num_groups and get_group_id. I do not see what the values they return inside the kernel are used for.