get_global_id(0) & get_global_size problem and array indexing problem

Hello everyone,

I would like to ask for some help. First of all, I am by no means a good programmer. I am trying to implement phase unwrapping in OpenCL. I do not understand the difference between get_global_id(0) and get_global_size(0). Also, in the following code, why:
int stride = cols * get_global_size(0); // why it is multiplied by cols
int row_start = cols * get_global_id(0);// why it is multiplied by cols
and what is the exact meaning of the array indexing phase[stride + row_start + i]?
Would anyone be able to explain clearly how the array is indexed here, and also the concept of get_global_id(0) and get_global_size(0)? I have tried Google and also read a book, but failed to understand it clearly.
Thank you.

Code:

__kernel void unwrap_rows_new (__global Scalar *phase, __global char *mask, int cols)
// must be launched rows times (once for each row)
{
    int stride = cols * get_global_size(0);
    int row_start = cols * get_global_id(0);
    int i;
    Scalar Jump = 0;
    for (i = 1; i < cols; i++)
    // if (mask )
    {
        if (phase[row_start + i] - phase[row_start + i - 1] > M_PI) Jump -= 2 * M_PI;
        else if (phase[row_start + i] - phase[row_start + i - 1] < -M_PI) Jump += 2 * M_PI;
        phase[stride + row_start + i] = phase[row_start + i] + Jump;
    }
}

get_global_size(0) is the same for all work items (aka threads): it is the total number of work items in dimension 0. If you enqueued your kernel for 512 work items, it would be 512.
get_global_id(0) is unique for each work item. If you enqueued your kernel for 512 work items, it would be equal to 0 for one of them, 1 for another, 2 for another, etc., up to 511 for the last.
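
To make the difference concrete, here is a minimal sketch in plain C (not OpenCL) that imitates what a one-dimensional launch hands to each work item. The names work_item_info and simulate_launch are made up for illustration, and the sketch assumes the kernel was enqueued without a global work offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side stand-in for a 1-D kernel launch: the runtime
   conceptually runs the kernel body once per work item. */
typedef struct {
    size_t global_id;   /* what get_global_id(0) would return   */
    size_t global_size; /* what get_global_size(0) would return */
} work_item_info;

/* Fill `out` with the values each of the `n` work items would observe. */
void simulate_launch(work_item_info *out, size_t n)
{
    assert(out != NULL);
    for (size_t gid = 0; gid < n; ++gid) {
        out[gid].global_id = gid;   /* unique per work item: 0, 1, ..., n-1 */
        out[gid].global_size = n;   /* identical for every work item */
    }
}
```

Calling simulate_launch(info, 512) gives 512 entries whose global_id runs from 0 to 511, while global_size is 512 in every one of them.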

The code you show is calculating “row_start” to be a unique start index into the “phase” array for each work item. Each work item handles one row, and each row is “cols” elements long, which is why the ID is multiplied by “cols”. The loop then calculates indexes into the “phase” array using the unique start position (row_start) plus the column index “i”. It stores the result after the source data using the “stride” offset, which is “cols” times the number of work items, i.e. the total number of input elements when the kernel is launched once per row. So the first half of the “phase” array contains input data and the second half (starting at “stride”) contains output data.
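
The index arithmetic can be sketched in plain C as follows. This is only an illustration: input_index and output_index are hypothetical helper names, "row" stands in for get_global_id(0), and "rows" stands in for get_global_size(0), assuming the kernel is launched once per row.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the input element (row, i) in the first half of the buffer. */
size_t input_index(size_t row, size_t i, size_t cols)
{
    assert(i < cols);
    size_t row_start = cols * row;   /* row plays the role of get_global_id(0) */
    return row_start + i;
}

/* Index of the matching output element in the second half of the buffer. */
size_t output_index(size_t row, size_t i, size_t cols, size_t rows)
{
    assert(row < rows);
    size_t stride = cols * rows;     /* rows plays the role of get_global_size(0) */
    return stride + input_index(row, i, cols);
}
```

With 3 rows and 4 columns, the element at row 2, column 1 is read from index 9 and written to index 21, so the buffer must hold at least 2 * 3 * 4 = 24 elements. Note also that the loop in the posted kernel starts at i = 1, so it never writes column 0 of the output half.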

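Note that this kernel's memory access pattern is not optimal: at each loop step, neighbouring work items touch addresses that are “cols” elements apart rather than adjacent ones, so the accesses are not contiguous across work items. A contiguous pattern would be faster. Also, as written, it reads from “phase” up to five times per loop iteration, which is very wasteful. This can be reduced to a single read per iteration by keeping the previous value in a local variable.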
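
One possible way to get down to a single read per iteration is sketched below in plain C for one row. It is not the original poster's code: unwrap_row is a hypothetical name, double stands in for the Scalar type, PI stands in for M_PI, and copying the first sample to the output is an added assumption (the posted kernel leaves that element unwritten).

```c
#include <assert.h>
#include <stddef.h>

#define PI 3.14159265358979323846

/* Unwrap one row: `in` and `out` each point at `cols` values.
   Every input value is loaded from memory exactly once. */
void unwrap_row(const double *in, double *out, size_t cols)
{
    assert(in != NULL && out != NULL);
    if (cols == 0) return;

    double prev = in[0];   /* previous wrapped sample, kept in a local */
    double jump = 0.0;     /* accumulated multiple of 2*pi */
    out[0] = prev;         /* assumption: first sample is copied unchanged */

    for (size_t i = 1; i < cols; ++i) {
        double cur = in[i];            /* the only read in this iteration */
        double diff = cur - prev;
        if (diff > PI)       jump -= 2.0 * PI;
        else if (diff < -PI) jump += 2.0 * PI;
        out[i] = cur + jump;
        prev = cur;                    /* reuse next iteration, no re-read */
    }
}
```

In the kernel, the same idea would apply with in pointing at phase + row_start and out pointing at phase + stride + row_start. This only addresses the repeated reads; the non-contiguous access across work items would need a different data layout.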