First local memory work

Hi,

I’m taking my first steps with local memory, starting from this working global-memory code:
data: a = {0, 1, 2, 3, …}, b = {0, 3, 6, 9, …}



__kernel void localTest(__global const unsigned char *a,
                        __global const unsigned char *b,
                        __global unsigned char *c)
{
    int x = get_global_id(0);
    c[x] = a[x] + b[x];
}

This works fine (c = {0, 4, 8, 12, …}).

The following code generates wrong values:


#define TILE_DIM 16

__kernel void localTest(__global const unsigned char *a,
                        __global const unsigned char *b,
                        __global unsigned char *c)
{
    int x = get_global_id(0);

    __local unsigned char localTemp[2];

    localTemp[0] = a[x]; //(*)
    localTemp[1] = b[x]; //(**)

    c[x] = localTemp[0] + localTemp[1];
}

If I comment out only (**) and do c[x] = localTemp[0];, the values are correct (c = {0, 1, 2, 3, …}).
If I comment out only (*) and do c[x] = localTemp[1];, the values are also correct (c = {0, 3, 6, 9, …}).
But the code above gives c = {15, 18, 21, 24, …}.

Can you help me resolve this issue?
Thanks in advance.

Local memory is shared across all work-items in a work-group. With that code, all the work-items in a group are racing to read and write the same localTemp[0] and localTemp[1], so the results are unpredictable.

I suggest looking at some examples like convolution using local memory.

I’ll do that, but some of them are already quite complex for a beginner:
http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/OpenCL/website/samples.html

I “found” how to make my little code work:


#define TILE_DIM 16

__kernel void localTest(__global const unsigned char *a,
                        __global const unsigned char *b,
                        __global unsigned char *c)
{
    int x = get_global_id(0);
    int lx = get_local_id(0);

    __local unsigned char localTemp1[TILE_DIM];
    __local unsigned char localTemp2[TILE_DIM];
   
    localTemp1[lx] = a[x];
    localTemp2[lx] = b[x];

    c[x] = localTemp1[lx] + localTemp2[lx];
}

But I wonder if this is the right way to do it, I mean, creating two local arrays? And of course, just making this code work is not my actual goal.

If I may, let me ask some questions:
1 - Does working with local memory mean (always or not) working with the group size and group id? In the example above, the ‘a’ or ‘b’ array is much bigger than the local arrays (max local memory size); how do I manage that? Process the data through local memory “max group size” by “max group size”?
2 - Doesn’t a[x+TILE_DIM] try to write to the same location localTemp1[lx]? Yet it seems to work?

Thanks.

Right, I should have provided more info. The way you are doing it now looks good. Using two arrays makes sense, since the data comes from two different places; you could put all the data into a single array, but it would not help.

Does working with local memory mean (always or not) working with the group size and group id? In the example above, the ‘a’ or ‘b’ array is much bigger than the local arrays (max local memory size); how do I manage that? Process the data through local memory “max group size” by “max group size”?

The main idea with local memory is this: if you know that many work-items in the same work-group are going to read from the same locations in global memory, then put that data into local memory and read from there instead. Even though you pay the cost of copying data from global to local, it can be a big win if it saves you from having to read from global memory many times.

Of course, local memory is much smaller than global memory, so you have to divide the work into pieces. Each work-group reads a small piece of data (some KB) from global memory into local memory, does some operations using that local memory and finally writes the result out.

Do you remember the “local_work_size” parameter to clEnqueueNDRangeKernel()? It’s very useful when your kernel uses local memory. In your example, you want local_work_size to be the same as TILE_DIM, so that work-size and local memory variables always match.

Notice that in your example where you are adding two vectors together, local memory cannot improve performance because each work-item reads from two different places and writes into one. There’s no overlap between work-items, so there’s no benefit from local memory.

Doesn’t a[x+TILE_DIM] try to write to the same location localTemp1[lx]? Yet it seems to work?

Yes, you could do everything with a single array, but for what purpose? The code would look like this:


#define TILE_DIM 16

__kernel void localTest(__global const unsigned char *a,
                        __global const unsigned char *b,
                        __global unsigned char *c)
{
    int x = get_global_id(0);
    int lx = get_local_id(0);

    __local unsigned char myBigLocalVariable[TILE_DIM + TILE_DIM];
   
    myBigLocalVariable[lx] = a[x];
    myBigLocalVariable[TILE_DIM+lx] = b[x];

    c[x] = myBigLocalVariable[lx] + myBigLocalVariable[TILE_DIM+lx];
}

Thank you very much for taking time to answer.

I’ll meditate on it, and take time to practice.

I agree with you (I have no choice, you’re right :) ), and I agree that there is no point in actually using local memory in this case.

For now, my questions are:
Theoretical:
-> How do I handle local memory across work-groups?
-> How do I handle local memory size versus group size?
Technical:
-> Is each work-item assigned a local id equal to its global id modulo the work-group size, or modulo the global work size?

Here is another example, written after your explanation, that uses local memory:
Description:
‘a’ buffer (size = 36):
0 1 2 3 4 5
6 7 8 9 10 11
12 13 14 15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35

Let’s try to compute: a[x] = a[x-1] + a[x] + a[x+1], except at the extremities.
For example: a[5] = a[4] + a[5] + a[6] = 4 + 5 + 6 = 15.
Expected result:
‘c’ buffer:
0 3 6 9 12 15
18 21 24 27 30 33
36 39 42 45 48 51
54 57 60 63 66 69
72 75 78 81 84 87
90 93 96 99 102 35

code :



#define LOCAL_MEM_SIZE 4  

__kernel void localTest(__global const short *a,
                        __global short *c)
{
      unsigned int x = get_global_id(0);
    int lx = get_local_id(0);


    __local short myBigLocalVariable[LOCAL_MEM_SIZE * LOCAL_MEM_SIZE];
//    barrier(CLK_LOCAL_MEM_FENCE);   
    unsigned int i;


    if(lx >0 && lx < LOCAL_MEM_SIZE * LOCAL_MEM_SIZE-1)
    {
        c[x] = myBigLocalVariable[lx] + myBigLocalVariable[lx-1] + myBigLocalVariable[lx+ 1];
    }
    else
    {
        c[x] = myBigLocalVariable[lx] ;
    }
}

This code returns : c =
0 3 6 9 12 15
18 21 24 27 30 33
36 39 42
15 16 17
18 19 20 21 22 23
24 25 26 27 28 29
30 31 32 33 34 35

Of course, it works well if LOCAL_MEM_SIZE is 6, but I wanted to illustrate the situation where the whole local array is already used and values would need to be reassigned to local memory.

Hope I managed to explain myself.

How to handle local memory in work-groups?

Sorry. I don’t understand this question.

How to handle local memory size versus group size?

As a first step, try to use all your local memory and try to make your work-groups larger. Once you start fine tuning you will have to measure the performance you get with different work-group sizes. Not sure this answers your question.

Is each work-item assigned a local id equal to its global id modulo the work-group size, or modulo the global work size?

That is not technically necessary. However, for typical algorithms that’s what you want to do. That way, each work-item is responsible for loading/storing one “piece” of local memory.

Of course, it works well if LOCAL_MEM_SIZE is 6, but I wanted to illustrate the situation where the whole local array is already used and values would need to be reassigned to local memory.

I think there’s something missing from the code you posted, since myBigLocalVariable is never initialized. I don’t see any difficulty making this work with local memory. Look at this:


#define LOCAL_MEM_SIZE 4  

__kernel void localTest(__global const short *a,
                        __global short *c)
{
    int x  = get_global_id(0);
    int lx = get_local_id(0);


    __local short localblock[LOCAL_MEM_SIZE];
    
    // Copy a portion of global memory into local memory.
    // Notice that each work-group will copy a different piece of local memory
    localblock[lx] = a[x];
    
    // Synchronize all work-items so that they all
    // see the latest state of "localblock"
    barrier(CLK_LOCAL_MEM_FENCE);   

    if(lx >0 && lx < LOCAL_MEM_SIZE -1)
    {
        // Fast case: all the desired data is already in local memory :)
        c[x] = localblock[lx] + localblock[lx-1] + localblock[lx+ 1];
    }
    else
    {
        // Slow case: one of the samples is not in local memory.
        // ...so we pick the data directly from global memory.
        
        // Handle extreme cases
        if(x > 0 && x < get_global_size(0)-1)
        {
            c[x] = a[x-1] + a[x] + a[x+1];
        }
        else
        {
            c[x] = a[x];
        }
    }
}

Yes, I had comments to delete and I deleted the initializing line too, sorry:


  ...
  unsigned int i;
  myBigLocalVariable[lx] = a[x];

  if(lx > 0 && lx < LOCAL_MEM_SIZE * LOCAL_MEM_SIZE - 1)
  {}
  .....

You wrote a good example for my question(s): when I see the fast and slow cases in your code, I understand that, within one work-group (tell me if I’m right), you compute as much as possible in local memory (that’s the goal), and the rest in global memory.
-> Does each work-group load only 4 values into its local memory: for example, if the work-group size is 512, are 4 values computed from local memory and 508 from global memory?
-> Is this how we should do it? Compute the data that is in local memory, and the rest from global memory?
-> Can’t/shouldn’t we reload/reassign the remaining data into local memory to work with it?
-> Or should we assign it beforehand, once, intelligently?

And maybe the real question is:
-> Am I boring you? :expressionless:

Does each work-group load only 4 values into its local memory: for example, if the work-group size is 512, are 4 values computed from local memory and 508 from global memory?

I’m not sure I understand the question. Generally, what you do is reduce the size of your work-groups to the point where you have enough local memory to do all your computations with it, so that you (almost) never need to access global memory. Of course, if this means your work-groups become very small, then it’s time to rethink the algorithm (small work-groups mean less performance).

In my opinion, the best way to learn about this is to look at the examples in NVIDIA’s or AMD’s SDK. Start with the simplest examples you can find and you will see how they use local memory in clever ways.

Don’t worry, you ask very good questions :slight_smile:

clEnqueueNDRangeKernel :


The work-group size to be used for kernel can also be specified in the program source using the __attribute__((reqd_work_group_size(X, Y, Z))) qualifier. In this case the size of work group specified by local_work_size must match the value specified by the reqd_work_group_size attribute qualifier.

-> I guess that is what I should use to define the work-group size? Is it the only way, or the right way?
-> If I set the local work size, is the work-group size derived from it?
For example, in this :


	const size_t global_work_size[1] = {1024};
	const size_t local_work_size[1] = {256};

-> How are work-groups (or work-group sizes) assigned?

I guess that is what I should use to define the work-group size? Is it the only way, or the right way?

__attribute__((reqd_work_group_size(X, Y, Z))) is basically a hint for the compiler. It doesn’t hurt to have it, but at the end of the day, what determines the work-group size is the value you pass to clEnqueueNDRangeKernel().

Notice that if you use the attribute above, clEnqueueNDRangeKernel will return an error if the group size you pass does not match the value in the attribute. In that sense it’s more than a hint.

If I set the local work size, is the work-group size derived from it?
For example, in this

Sorry, I didn’t understand the question.

How are work-groups (or work-group sizes) assigned?

If the question is “how do I choose the right work-group size?”, the answer is trial and error. There’s no easy rule that will give you what is the best size for performance.

It depends. The attribute may be used if you require a specific work size (e.g. you have hardcoded array sizes, or your kernel only operates correctly with a specific local work size), but this shouldn’t be the case for normal kernels.
Your work-group size is either the one you passed to clEnqueueNDRangeKernel as local_work_size, or some value “guessed” by the driver if you didn’t specify one.
Note that if you use the reqd_work_group_size attribute, it’s an error to specify a different size for the kernel.

Yes, the parameter you specified as local_work_size is the work-group size.

Something you may also want to look at is __local arguments to kernels. From clSetKernelArg (arg_size):

For arguments declared with the __local qualifier, the size specified will be the size in bytes of the buffer that must be allocated for the __local argument.

e.g. if your buffer requires 1 float per work-item, you’d set arg_size in clSetKernelArg to 256 * sizeof(cl_float) if your local size is 256, and use “__local float *” as the argument type in your kernel.

And finally, for copying data to/from global memory you may want to have a look at async_work_group_copy/async_work_group_strided_copy and wait_group_events, as they ease that copying a bit.