Hi!
I am looking at this kernel I found (and scratching my head):
__kernel void sqrt_sum(
    __global float * vec1,
    __global float * result)
{
    for (unsigned int stride = get_global_size(0)/2; stride > 0; stride /= 2)
    {
        if (get_global_id(0) < stride)
            vec1[get_global_id(0)] += vec1[get_global_id(0)+stride];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
    if (get_global_id(0) == 0)
        *result = sqrt(vec1[0]);
}
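To check my reading of the loop, here is a minimal serial sketch in plain C of what I believe the kernel computes when every work-item really does wait at the barrier, i.e. when all of them are in one work-group. The function name emulate_stride_sum is made up for illustration, the final sqrt is left out so that only standard headers are needed, and the power-of-two requirement is my own assumption about the halving stride:

```c
#include <assert.h>
#include <stddef.h>

/* Serial emulation of the stride loop for n work-items in ONE work-group.
   Each pass of the outer loop stands for the code between two barriers. */
static float emulate_stride_sum(float *vec1, size_t n)
{
    /* Assumption: n is a power of two; otherwise the halving stride
       skips some elements. */
    assert(n > 0 && (n & (n - 1)) == 0);

    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* Inner loop plays the role of the work-items with gid < stride. */
        for (size_t gid = 0; gid < stride; ++gid)
            vec1[gid] += vec1[gid + stride];
        /* The kernel's barrier would sit here. */
    }
    return vec1[0]; /* the kernel would store sqrt() of this in *result */
}
```

For vec1 = {1, 2, 3, 4, 5, 6, 7, 8} and n = 8 this leaves 36 in vec1[0]. The result is only right because no pass starts before the previous one has finished, which is exactly what the barrier has to guarantee.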
I understand the for-loop. The problem is the synchronization. The barrier is defined to work only within a single work-group: “All the work-items of a work-group must execute the barrier before any are allowed to continue execution beyond the barrier.”
Does this imply that, for this kernel to work, the work-group size must be equal to get_global_size(0), so that only one work-group is launched (running on a single compute unit)?
The maximum work-group size is limited by CL_KERNEL_WORK_GROUP_SIZE, typically to 512. This means that the product of the local sizes x * y * z within one work-group cannot exceed 512, but it is possible to have x = 512, y = 1, z = 1.
So does this kernel work only for get_global_size(0) of at most 512 (for AMD), on the condition that a single work-group of the same size as the global size is specified when clEnqueueNDRangeKernel is called?
get_global_size(0) == get_local_size(0) //??
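Written as a host-side check, the condition I have in mind would look roughly like the plain C sketch below. The helper name is_single_group_launch is hypothetical. I am assuming max_wg_size is the value returned by clGetKernelWorkGroupInfo for CL_KERNEL_WORK_GROUP_SIZE, and that global_size and local_size are the one-dimensional sizes passed to clEnqueueNDRangeKernel:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if a 1-D launch uses exactly one work-group
   that also fits within the kernel's work-group size limit. */
static bool is_single_group_launch(size_t global_size,
                                   size_t local_size,
                                   size_t max_wg_size)
{
    /* One work-group only: the local size covers the whole global range. */
    bool one_group = (global_size == local_size);
    /* The work-group must not exceed the reported limit. */
    bool fits = (local_size <= max_wg_size);
    return one_group && fits;
}
```

With a limit of 512, a launch with global size 512 and local size 512 would pass, while global size 1024 with local size 512 would not, because that launches two work-groups that the barrier cannot synchronize with each other.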
Thanks!
Atmapuri