Runtime differences between __global, __constant, and __local

Imagine I have some sort of filter algorithm.

  • in_array has the input data (vectorized, for faster access)
  • out_array gets the result of the filtering
  • filter is the filter itself.

The kernel would look something like this:
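__kernel void vec_iii_1d(__global float4 *filter, __global float4 *in_array, __global float4 *out_array)
{
    // Assumed for illustration: one work-item per element of in_array.
    size_t tid = get_global_id(0);
    // Placeholder: how the filter index is chosen is not specified here.
    size_t fid = 0;

    out_array[tid] = in_array[tid] * filter[fid];
}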

Questions:

  1. If I change “__global float4 *filter” to “__constant float4 *filter”, would the data then be automatically cached in the constant cache and kept there for all subsequent kernel calls (the kernel is called several times)?

  2. If I change “__global float4 *filter” to “__local float4 *filter” - what will happen then?
    2a) Is the data in global memory first, and then copied automatically to local memory when the kernel is executed?

> If I change “__global float4 *filter” to “__constant float4 *filter”, would the data then be automatically cached in the constant cache and kept there for all subsequent kernel calls (the kernel is called several times)?

The short answer is “probably yes”.

It’s up to the OpenCL driver to decide whether the data can stay there between kernel calls. In other words, it depends on the implementation and on runtime conditions (other kernels running, etc.).
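For reference, a minimal sketch of what the __constant variant of the kernel might look like. The extra filter_len argument and the way tid and fid are computed are assumptions made for illustration; the original post does not define them. Nothing in the kernel source can request that the data stay cached between launches, which is why the answer depends on the driver.

```c
// Sketch only: the filter is read through the __constant address space.
// filter_len is a hypothetical extra argument, not part of the original kernel.
__kernel void vec_iii_1d(__constant float4 *filter,
                         __global const float4 *in_array,
                         __global float4 *out_array,
                         uint filter_len)
{
    size_t tid = get_global_id(0);   // assumed: one work-item per element
    size_t fid = tid % filter_len;   // assumed indexing, for illustration only

    out_array[tid] = in_array[tid] * filter[fid];
}
```

On the host side the filter is still an ordinary buffer object. The amount of data that fits is bounded by the device's CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, so a large filter may not qualify.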

> If I change “__global float4 *filter” to “__local float4 *filter” - what will happen then?
>
> 2a) Is the data in global memory first, and then copied automatically to local memory when the kernel is executed?

These are very good questions.

I’ll start with 2a. No, data is not copied automatically to local memory. The contents of local memory do not persist from one work-group to the next, so each work-group is responsible for filling local memory with useful data before reading from it. In your case, each work-group would copy the filter data from global (or constant) memory to local memory, and only then could the filtering operation start.
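A minimal sketch of that per-work-group copy, assuming a filter whose length is known at compile time. FILTER_LEN, the local_filter name, and the fid computation are hypothetical and only there to make the example complete.

```c
#define FILTER_LEN 16   // hypothetical filter length (in float4 elements)

__kernel void vec_iii_1d(__global const float4 *filter,
                         __global const float4 *in_array,
                         __global float4 *out_array)
{
    // Local storage shared by all work-items of this work-group.
    __local float4 local_filter[FILTER_LEN];

    size_t tid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    // Each work-item copies a strided share of the filter into local memory.
    for (size_t i = lid; i < FILTER_LEN; i += lsz)
        local_filter[i] = filter[i];

    // Make sure the whole filter is in place before any work-item reads it.
    barrier(CLK_LOCAL_MEM_FENCE);

    size_t fid = tid % FILTER_LEN;   // assumed indexing, for illustration only
    out_array[tid] = in_array[tid] * local_filter[fid];
}
```

Every work-group repeats the copy, so this only pays off when each filter element is read several times per work-group.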

Now back to question 2. I would expect __constant to be faster than __local. Please let me know if you find the opposite is true on some platform :)

Thanks David.

I am surprised that it is possible at all to use __local variables in a kernel function interface, since I’ve never seen that so far in sample code.

What they usually do is manually copy data from __global memory to __local memory + use something like barrier(CLK_LOCAL_MEM_FENCE) before they actually use that data.

Can you tell me why they never use “my” approach?

cheers,
F.

> I am surprised that it is possible at all to use __local variables in a kernel function interface, since I’ve never seen that so far in sample code.

This feature is defined in sections 5.7.2 (page 127) and 6.5.2 (page 185) of the OpenCL specification.

> What they usually do is manually copy data from __global memory to __local memory + use something like barrier(CLK_LOCAL_MEM_FENCE) before they actually use that data.

Yes, that is still necessary even if you pass the __local variable as a kernel argument.

> Can you tell me why they never use “my” approach?

What is your approach?

> Yes, that is still necessary even if you pass the __local variable as a kernel argument.

Uhh! So would the first function call in my kernel then be a barrier call? Like this:

__kernel void vec_iii_1d(__local float4 *filter, __global float4 *in_array, __global float4 *out_array)
{
    barrier(CLK_LOCAL_MEM_FENCE);

}

No, the first instruction would not be a barrier. A local barrier is needed after the global-to-local copy is done.
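To make the ordering concrete, here is a sketch with the local buffer passed as a kernel argument. The argument names, filter_len, and the fid computation are assumptions for illustration. On the host, a __local argument is set with clSetKernelArg by giving the buffer size in bytes and a NULL value pointer, so the host only sizes the buffer and never fills it.

```c
// Sketch only: local_filter is allocated by the runtime, not filled by it.
// Hypothetical host call for argument 1:
//   clSetKernelArg(kernel, 1, filter_len * sizeof(cl_float4), NULL);
__kernel void vec_iii_1d(__global const float4 *filter,
                         __local float4 *local_filter,
                         __global const float4 *in_array,
                         __global float4 *out_array,
                         uint filter_len)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    // 1. Global-to-local copy, shared among the work-items of the group.
    for (size_t i = lid; i < filter_len; i += lsz)
        local_filter[i] = filter[i];

    // 2. The barrier goes here, after the copy and before the first read.
    barrier(CLK_LOCAL_MEM_FENCE);

    // 3. Only now is it safe to read local_filter.
    size_t tid = get_global_id(0);
    size_t fid = tid % filter_len;   // assumed indexing, for illustration only
    out_array[tid] = in_array[tid] * local_filter[fid];
}
```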

I was replying affirmatively to this:

> What they usually do is manually copy data from __global memory to __local memory + use something like barrier(CLK_LOCAL_MEM_FENCE) before they actually use that data.

By the way, OpenCL has built-in functions, such as async_work_group_copy(), to help with this sort of use case.
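A sketch of the same copy written with async_work_group_copy() and wait_group_events(). As before, the argument names, filter_len, and the fid computation are assumptions for illustration. As far as I understand the specification, waiting on the event takes the place of the hand-written copy loop and the explicit barrier here.

```c
// Sketch only: the work-group copy is expressed with the async copy built-ins.
__kernel void vec_iii_1d(__global const float4 *filter,
                         __local float4 *local_filter,
                         __global const float4 *in_array,
                         __global float4 *out_array,
                         uint filter_len)
{
    // Every work-item in the group must reach this call with the same arguments.
    event_t ev = async_work_group_copy(local_filter, filter, filter_len, 0);

    // Block until the copy has finished for the whole work-group.
    wait_group_events(1, &ev);

    size_t tid = get_global_id(0);
    size_t fid = tid % filter_len;   // assumed indexing, for illustration only
    out_array[tid] = in_array[tid] * local_filter[fid];
}
```

Whether this is faster than the manual loop is implementation-dependent, so it is worth measuring both.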

David, w.r.t. the following comment posted by you on 7 Jan 2011, 1:21pm

“Now back to question 2. I would expect __constant to be faster than __local. Please let me know if you find the opposite is true on some platform.”

__local should almost always be faster than __constant if __local really is dedicated local memory (typically some form of SRAM) vs. just mapped as __global memory. I assume you meant “Now back to question 2. I would expect __constant to be faster than __global…” which should be the case for most platforms.

Without going into details, let’s say that the performance of the different address spaces will be implementation-dependent and that Frizz should try both :)