Atmapuri

05-18-2011, 02:22 AM

Hi!

I am looking at this kernel I found (and scratching my head):

__kernel void sqrt_sum(
    __global float * vec1,
    __global float * result)
{
    for (unsigned int stride = get_global_size(0)/2; stride > 0; stride /= 2)
    {
        if (get_global_id(0) < stride)
            vec1[get_global_id(0)] += vec1[get_global_id(0)+stride];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
    if (get_global_id(0) == 0)
        *result = sqrt(vec1[0]);
}

I understand the for-loop. The problem is the synchronization. The barrier is defined to work only within a single work-group: "All the work-items of a work-group must execute the barrier before any are allowed to continue execution beyond the barrier."

This implies that, for this kernel to work, the work-group size must be equal to get_global_size(0), so that only one work-group is launched (running on a single compute unit)?

The maximum work-group size is limited by CL_KERNEL_WORK_GROUP_SIZE, typically to 512. This means that the product x * y * z of the local dimensions of a single work-group cannot exceed 512, but it is possible to have x = 512, y = 1, z = 1.

So this kernel works only for get_global_size(0) of at most 512 (for AMD), on the condition that a single work-group of the same size as the global size is specified when clEnqueueNDRangeKernel is called?

get_global_size(0) == get_local_size(0) //??
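The condition being asked about can be written down as a small check, sketched here in plain C. The helper fits_single_work_group is hypothetical and not part of the OpenCL API; max_wg_size stands for whatever value CL_KERNEL_WORK_GROUP_SIZE reports for the kernel on the device in question. The power-of-two requirement is an added assumption that follows from the halving stride in the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL call): returns true when a 1-D
   launch of sqrt_sum could run as one work-group, i.e. with
   local size == global size. */
static bool fits_single_work_group(size_t global_size, size_t max_wg_size)
{
    assert(max_wg_size > 0);  /* assumed: the device reports at least 1 */

    /* Assumption: the halving stride only covers every element when the
       global size is a power of two. */
    bool power_of_two = global_size != 0 &&
                        (global_size & (global_size - 1)) == 0;

    return power_of_two && global_size <= max_wg_size;
}
```

With a limit of 512, a global size of 512 passes the check, while 1024 (too large) and 500 (not a power of two) do not.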

Thanks!

Atmapuri
