Precomputing an array of 16 elements

Hi,

Before I start, I just want to say that (almost) everything works fine, so this is just a question about what you think on this particular subject.
Let’s say that I have a 1D buffer of size N that I “cover” with 1D thread blocks of 128 threads (local size). Each thread divides the angle range [0, 2*pi] into 16 sectors. In each thread, I do something like this:


const float sector_size = 2.f * M_PI_F / 16;
for (int i=0; i<16; ++i) {
    float sin_i = sin(i*sector_size);
    float cos_i = cos(i*sector_size);
    (...)
}

As you can see, it’s pretty straightforward. No need to add more details.
Then I thought it was pretty stupid to compute sin and cos many times. I can just pre-compute them and put them into a local array that I fill in parallel using the first 16 threads of my group:


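__local float sin_array[16];
__local float cos_array[16];
const int thread_id = get_local_id(0);  // index of this thread within its group
if (thread_id<16) {
    // the first 16 threads each fill one entry
    const float sector_size = 2.f * M_PI_F / 16;
    sin_array[thread_id] = sin(thread_id * sector_size);
    cos_array[thread_id] = cos(thread_id * sector_size);
}
barrier(CLK_LOCAL_MEM_FENCE);  // make the tables visible to the whole group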

(...)

for (int i=0; i<16; ++i) {
    float sin_i = sin_array[i];
    float cos_i = cos_array[i];
    ...
}

This works fine, but it doesn’t speed anything up. I’m used to being surprised in GPU coding. I guess the barrier cancels the benefit of precomputing the array values here.
Then I thought “why don’t I initialize the array by hand?”. So I tried the following approach:


__local float sin_array[16] = { 0.000000f,  0.382683f,  0.707107f,  0.923880f,
                                1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                0.000000f, -0.382683f, -0.707107f, -0.923880f,
                               -1.000000f, -0.923880f, -0.707107f, -0.382683f};

__local float cos_array[16] = { 1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                0.000000f, -0.382683f, -0.707107f, -0.923880f,
                               -1.000000f, -0.923880f, -0.707107f, -0.382683f,
                               -0.000000f,  0.382683f,  0.707107f,  0.923880f};

(...)

for (int i=0; i<16; ++i) {
    float sin_i = sin_array[i];
    float cos_i = cos_array[i];
    ...
}

We don’t have a barrier here, so it should be faster, no? Problem: sin_array and cos_array are not filled correctly. So this is my main question: why?

The second question, if this one is solved, is: is it better to leave these arrays in __local memory, or should I put them in __constant memory (for instance, declared before the function definition)?

Many thanks,

Vincent

As stated in the OpenCL specification, variables allocated in the __local address space inside a kernel function cannot be initialized.

So your code should not even compile (it compiles with neither the NVIDIA nor the Intel OpenCL compiler on my computer).

However, this does indeed seem to be the perfect case for using __constant memory.
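
A minimal sketch of what that could look like, assuming the tables are moved to program scope. The names sin_table, cos_table, project_on_sector and project_all are made up for illustration (they are not from the original post), and the loop body is only a placeholder since the original one is elided. The #ifndef block is just a shim so the same file also builds as plain C on the host, which makes it easy to sanity-check the table values.

```c
/* Host-only shim (an assumption, for testing): when this file is not being
   compiled as OpenCL C, map __constant to plain const so the tables and the
   helper below also build with an ordinary C compiler. */
#ifndef __OPENCL_VERSION__
#include <assert.h>
#define __constant const
#endif

/* Program-scope tables: sin(i * 2*pi/16) and cos(i * 2*pi/16), i = 0..15.
   Unlike __local variables, __constant variables are initialized at
   declaration with compile-time constants. */
__constant float sin_table[16] = { 0.000000f,  0.382683f,  0.707107f,  0.923880f,
                                   1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                   0.000000f, -0.382683f, -0.707107f, -0.923880f,
                                  -1.000000f, -0.923880f, -0.707107f, -0.382683f};

__constant float cos_table[16] = { 1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                   0.000000f, -0.382683f, -0.707107f, -0.923880f,
                                  -1.000000f, -0.923880f, -0.707107f, -0.382683f,
                                   0.000000f,  0.382683f,  0.707107f,  0.923880f};

/* Hypothetical helper: projection of (x, y) onto the direction of sector i. */
float project_on_sector(float x, float y, int i)
{
    return x * cos_table[i] + y * sin_table[i];
}

#ifdef __OPENCL_VERSION__
/* Hypothetical kernel: the tables are read-only, so no barrier is needed. */
__kernel void project_all(__global const float2 *in, __global float *out)
{
    const size_t gid = get_global_id(0);
    float best = 0.0f;
    for (int i = 0; i < 16; ++i)
        best = fmax(best, project_on_sector(in[gid].x, in[gid].y, i));
    out[gid] = best;
}
#endif
```

Since every thread reads the same entry at the same loop iteration, this access pattern is the kind that __constant memory is usually meant for, but whether it is actually faster depends on the device and compiler.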

Thank you for your answer. The code did compile, though. Strange.
Anyway, I tried the “__constant” version and it doesn’t speed anything up.
It’s like the more I code in OpenCL the less I understand…

The compiler generally unrolls the loop, so it detects that sin() and cos() are now being called on constant values, and it probably stores the results in… a constant array.

That’s my guess too. Compilers are too smart nowadays :)