Regarding async_work_group_copy (global to local)

Hi folks,

I have a kernel where every thread reads a particular element (of a data structure) from global memory.
In other words, all the threads executing the kernel use the data at the same address in global memory.

I am trying to use async_work_group_copy to get the data into shared memory as the first thing in the kernel. Also, as per the OpenCL specification, async_work_group_copy is executed by all threads.

Is the following possible?

One thread executes the async copy function and brings the data into shared memory, and the rest of the threads of the work-group use the data brought in by that one thread.

Or is it better to allow the cache to handle the data accesses in this case?

– Bharath

All work-items from the same work-group share the same local memory. async_work_group_copy() is a function that loads data from global memory into local memory and it is executed by all work-items in a work-group. In other words, all work-items in the work-group must call async_work_group_copy() with the same arguments.

After async_work_group_copy() has finished performing the memory transfer, all work-items in the work-group can read from local memory to access the data that was transferred.
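
Here is a minimal sketch of that pattern (the kernel name, NUM, and the access pattern are all illustrative, not taken from your code):

#define NUM 64

__kernel void copy_example(__global const float *src, __global float *dst)
{
    __local float buf[NUM];   /* allocated once per work-group */

    /* Every work-item calls this with the same arguments. */
    event_t ev = async_work_group_copy(buf, src, (size_t)NUM, 0);

    /* No work-item may read buf until the transfer has completed. */
    wait_group_events(1, &ev);

    /* Now any work-item can read any element from local memory. */
    dst[get_global_id(0)] = buf[get_local_id(0) % NUM];
}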

I am not sure if my understanding of the local memory and async copy is correct. If I may ask a few questions…

I would like to know why the requirement of “same arguments” comes in.

Eg: The kernel has the following lines…

__local char temp;
event_t ev = async_work_group_copy(&temp, (__global char *)globalvar, 1, (event_t)0);
wait_group_events(1, &ev);

Assuming a work-group has 100 threads, how many variables are present in local memory due to the declaration “__local char temp”? Put another way, if I were able to print the value of &temp, would it be the same for all threads?

– Bharath

I would like to know why the requirement of “same arguments” comes in.

Short answer: because the OpenCL specification requires it.

Long answer: because all work-items in the work-group will perform the copy together. It’s not a single thread doing the work. All threads collaborate.

Assuming a work-group has 100 threads, how many variables are present in local memory due to the declaration “__local char temp”?

Only one variable (one byte).

Put another way, if I were able to print the value of &temp, would it be the same for all threads?

Yes, it will be exactly the same.
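
If you want to verify this yourself, here is a quick hypothetical check (it assumes your implementation supports pointer-to-integer casts and that a ulong is wide enough to hold an address):

/* Sketch: record the address of a __local variable as seen by each
 * work-item. All values written by one work-group should be equal,
 * because a __local variable exists once per work-group. */
__kernel void addr_check(__global ulong *out)
{
    __local char temp;
    out[get_global_id(0)] = (ulong)&temp;
}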

I get the point. But when several threads try to access global memory, wouldn’t there be clashes, further increasing the time taken to complete the copy?

Also,

Assuming the number of threads in a work-group to be 512, with 32 (e.g., a warp/wavefront) being scheduled at a time, it would be sufficient for the first 32 (actually, only 1, IMO) to perform the global-to-local copy. Am I right in thinking so?

But when several threads try to access global memory, wouldn’t there be clashes, further increasing the time taken to complete the copy?

There should be no issue.

Assuming the number of threads in a work-group to be 512, with 32 (e.g., a warp/wavefront) being scheduled at a time, it would be sufficient for the first 32 (actually, only 1, IMO) to perform the global-to-local copy. Am I right in thinking so?

Are you asking whether the copy is performed by a single warp? That doesn’t have a single answer. For instance, I would expect some hardware to use a DMA engine for this while other designs would not.

Are you asking whether the copy is performed by a single warp? That doesn’t have a single answer. For instance, I would expect some hardware to use a DMA engine for this while other designs would not.

You got my question right, but I don’t think I understand the explanation. If the first warp that was scheduled has already brought the required data into local memory, why would the later ones be required to do the same, since the required data is already present?

I was hoping for something close to prefetch, but into shared memory.

Off the topic, I am guessing that the global cache (L2 cache?) to which prefetch brings the data is slower than local (shared) memory. Is this right?

– Bharath

If the first warp that was scheduled has already brought the required data into local memory, why would the later ones be required to do the same, since the required data is already present?

Because each warp will only do part of the copy. Again, this will be done differently in different hardware.

Off the topic, I am guessing that the global cache (L2 cache?) to which prefetch brings the data is slower than local (shared) memory. Is this right?

I suggest referring to your hardware vendor’s documentation. Some hardware doesn’t even have a global memory cache.

Because each warp will only do part of the copy. Again, this will be done differently in different hardware.

So, there is no point in having all the threads execute the async copy unless they fetch different data, is there? What about the cases where the number of elements to be fetched is at most the size of a warp? Worse, where the number of elements is just one, as in my case.

– Bharath

So, there is no point in having all the threads execute the async copy unless they fetch different data, is there? What about the cases where the number of elements to be fetched is at most the size of a warp? Worse, where the number of elements is just one, as in my case.

If you are only copying something like a single int, then it’s not worth putting that piece of data in local memory. And you are right, in that case a single warp would do all the work and the rest would be idle… assuming that your hardware doesn’t use a DMA engine for global->local copies.

If you are only copying something like a single int, then it’s not worth putting that piece of data in local memory. And you are right, in that case a single warp would do all the work and the rest would be idle… assuming that your hardware doesn’t use a DMA engine for global->local copies.

How about this case?

  • Each thread needs, let’s say, 1000 elements to complete its work
  • Number of threads in one work-group = 1024

Even in this case, the first thread, or the first warp, would have brought in all of these 1000 elements.

Somehow it does not make sense to me that all the threads from the other warps also execute the async copy when the data is already there in shared memory.

– Bharath

How about this case?

  • Each thread needs, let’s say, 1000 elements to complete its work
  • Number of threads in one work-group = 1024

Even in this case, the first thread, or the first warp, would have brought in all of these 1000 elements.

In most implementations, that’s not true. If you have to copy 1000 elements and your work-group size is 1024, the first 1000 work-items will copy one element each and the last 24 work-items will not do any work.

Again, this is somewhat hardware-dependent.
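
To make that concrete, here is a hypothetical sketch of what an implementation without a DMA engine might generate for the copy (purely illustrative; the names and the float element type are assumptions):

/* Hypothetical expansion of async_work_group_copy() for 'num' elements
 * (illustrative only; real implementations may use a DMA engine). */
void naive_group_copy(__local float *dst, const __global float *src, size_t num)
{
    /* Each work-item copies a strided subset: with 1024 work-items and
     * num == 1000, work-items 0..999 copy one element each and the
     * remaining 24 copy nothing. */
    for (size_t k = get_local_id(0); k < num; k += get_local_size(0))
        dst[k] = src[k];
}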

Somehow it does not make sense to me that all the threads from the other warps also execute the async copy when the data is already there in shared memory.

Each work-item only does a small part of the copy. When you put together all the pieces copied by all the work items you get the full copy. I don’t know how to explain this any better. At the end of the day you will have to trust that the people who implemented async_work_group_copy() knew what they were doing.

Each work-item only does a small part of the copy. When you put together all the pieces copied by all the work items you get the full copy. I don’t know how to explain this any better. At the end of the day you will have to trust that the people who implemented async_work_group_copy() knew what they were doing.

I quite get the point regarding how the contents are brought from global to local memory by separate threads. But I would still like to stick to the point that, when every thread depends on the complete set of data being fetched, the thread with local ID 1 will be stalled until the thread with local ID 1000 has (at least) executed the async copy function, which might happen much later.

But yes, I also understand that OpenCL was not tailored for my application. :slight_smile:

– Bharath

But I would still like to stick to the point that, when every thread depends on the complete set of data being fetched, the thread with local ID 1 will be stalled until the thread with local ID 1000 has (at least) executed the async copy function, which might happen much later.

I honestly don’t understand where the problem is. When you put some data in local memory, it’s because you want all work-items in the work-group to access all of that data. In that case, the cost of copying the data from global to local memory is usually negligible compared to the alternative of fetching from global memory over and over. If each work-item is only going to access a small piece, then local memory is not needed.

Perhaps it would be a good idea to share with us what your algorithm looks like so that we can give advice on how to adapt it to OpenCL.

Perhaps it would be a good idea to share with us what your algorithm looks like so that we can give advice on how to adapt it to OpenCL.

Actually, I am solving a knapsack problem.

We’d have N items with values V(0)…V(N-1) and weights W(0)…W(N-1), and a bag of capacity C. I am currently using a dynamic programming technique, and the kernel would look like:

for i = 0 : N-1
    for j = 1 : C
        // some code - trivial arithmetic using V[i] and W[i]
    end for
end for

What goes into the OpenCL kernel is “//some code”; I launch C threads at a time, and the kernel is enqueued N times (corresponding to the outer loop).

During the ith call to the kernel, the code uses the ith element of the V array and the W array.
I am currently getting some speedup using OpenCL (global memory) for good values of N and C, but I am wondering if I could use the shared memory to improve the performance significantly.

– Bharath

Ah, I see. Interesting :slight_smile:

If all you need in each kernel execution is the value V[i] and W[i] then why not pass them directly to the kernel? The following is easy to implement and puts v_i and w_i in private memory, which is almost synonymous with “in a register”.


__kernel void knapsack(..., float v_i, float w_i)
{
    // ...
}
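
On the host side, the per-enqueue update would look roughly like this (a sketch with assumed argument indices and variable names; kernel, queue, V, W, global_size, and the loop over i are taken to exist in your setup code):

/* Update the scalar arguments before the i-th enqueue.
 * Error checking omitted for brevity. */
cl_float v_i = (cl_float)V[i], w_i = (cl_float)W[i];
clSetKernelArg(kernel, 3, sizeof(cl_float), &v_i);  /* assumed index */
clSetKernelArg(kernel, 4, sizeof(cl_float), &w_i);  /* assumed index */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                       0, NULL, NULL);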

That said, I would recommend reading about parallel solutions to the knapsack problem. I know nothing about the topic, but Google shows quite a few hits.

A bit late on this; I was held up debugging a similar implementation.

If all you need in each kernel execution is the value V[i] and W[i] then why not pass them directly to the kernel? The following is easy to implement and puts v_i and w_i in private memory, which is almost synonymous with “in a register”.

This helped a bit. :slight_smile: But I will have to move back to using shared memory when the number of elements required by one thread is “many”.

I did manage to understand and make use of the async copy in another similar context, although I did not quite get the speed-up I initially expected. I realized there was another bottleneck.

That said, I would recommend reading about parallel solutions to the knapsack problem. I know nothing about the topic, but Google shows quite a few hits.

Thanks for this suggestion. It will take a while before I digest these. :smiley:

– Bharath

Well, I am back with a few more questions.

Previously, all the threads in a work-group used the single values val[i] and wgt[i].

Currently, I am working on a variant of the knapsack problem, called the multiple-choice knapsack problem. For this, each thread would need access to the complete val and wgt arrays. I thought it would be appropriate to use the shared memory for this. So, I fetch the whole of the val and wgt arrays into shared memory. Something like…


__kernel void ... (__global value_t *val, ...)
{
    __local value_t localvals[NUM_ITEMS];

    /* fetch using async_work_group_copy (global -> local) */
    event_t ev = async_work_group_copy(localvals, val, NUM_ITEMS, 0);
    wait_group_events(1, &ev);

    for (int k = 0; k < number_of_values_fetched; ++k) {
        /* work using the values fetched; use localvals instead of val */
    }
}

I see a good decrease in the number of global load requests (OK), but the GPU time increases compared to the global memory implementation (val and wgt in global memory).

IMO, the performance of the shared memory implementation should improve as number_of_values_fetched increases. Am I right in thinking so?

I can also see the output from the profiler (NVIDIA’s), but I cannot work out from it where I am losing the time I gained from the shared memory accesses.

I guess I am being a bit vague, but are there any suggestions on what numbers I could look into to understand what is happening?

– Bharath

To add to what I have said, I see that the number of branches and the number of divergent branches have increased in the shared memory implementation. Do async_work_group_copy or wait_group_events contribute to the branches in any way? In the rest of the kernel, the conditional branches remain the same for both the shared and global memory implementations.

– Bharath

I thought it would be appropriate to use the shared memory for this. So, I fetch the whole of the val and wgt arrays into shared memory.

If val and wgt fit in local memory they almost certainly fit in constant memory as well. Have you tried that? The only difference for you is that instead of declaring them as __global you declare them as __constant.
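
A minimal sketch of what that could look like (the kernel name and body are illustrative; value_t and NUM_ITEMS are assumed from your code, and this assumes the arrays fit within CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE):

#define NUM_ITEMS 256       /* assumed */
typedef float value_t;      /* assumed element type */

/* Same data, but passed as __constant: no async copy and no
 * wait_group_events; reads are served by the constant cache. */
__kernel void mckp(__constant value_t *val,
                   __constant value_t *wgt,
                   __global value_t *out)
{
    value_t acc = 0;
    for (int k = 0; k < NUM_ITEMS; ++k)
        acc += val[k] * wgt[k];   /* illustrative work */
    out[get_global_id(0)] = acc;
}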

IMO, the performance of the shared memory implementation should improve as number_of_values_fetched increases.

What is number_of_values_fetched? Is it the same as NUM_ITEMS? If data is read only once from global memory then there will be no benefit in using local memory.

Where local memory is a win is when the kernel would otherwise fetch the same global memory over and over.

Do async_work_group_copy or wait_group_events contribute to the branches, in any way?

Sure they can. It’s implementation-dependent.