## Precomputing array of 16 elements

Hi,

Before I start, I just want to say that (almost) everything works fine, so this is just a question about what you think on this particular subject.
Let's say that I have a 1D buffer of size N that I "cover" with 1D thread blocks of 128 threads (local size). Each thread divides the angle range [0, 2*pi] into 16 sectors. For each thread, I do something like this:
Code :
```
const float sector_size = 2.f * M_PI_F / 16;
for (int i = 0; i < 16; ++i) {
    float sin_i = sin(i * sector_size);
    float cos_i = cos(i * sector_size);
    (...)
}
```
As you can see, it's pretty straightforward. No need to add more details.
Then I thought it was pretty stupid to compute sin and cos many times. I can just pre-compute them and put them into a local array that I fill in parallel using the first 16 threads of my group:
Code :
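```
__local float sin_array[16];
__local float cos_array[16];
const float sector_size = 2.f * M_PI_F / 16;
// The first 16 work-items of the group each fill one entry.
const int lid = get_local_id(0);
if (lid < 16) {
    sin_array[lid] = sin(lid * sector_size);
    cos_array[lid] = cos(lid * sector_size);
}
// Every work-item waits here until both tables are complete.
barrier(CLK_LOCAL_MEM_FENCE);

(...)

for (int i = 0; i < 16; ++i) {
    float sin_i = sin_array[i];
    float cos_i = cos_array[i];
    ...
}
```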
This works fine, but it doesn't speed anything up. I'm used to being surprised in GPU coding. I guess the barrier here cancels the benefit of precomputing the array values.
Then I thought, "why don't I initialize the array by hand?" So I tried the following approach:

Code :
```
__local float sin_array[16] = { 0.000000f,  0.382683f,  0.707107f,  0.923880f,
                                1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                0.000000f, -0.382683f, -0.707107f, -0.923880f,
                               -1.000000f, -0.923880f, -0.707107f, -0.382683f};

__local float cos_array[16] = { 1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                0.000000f, -0.382683f, -0.707107f, -0.923880f,
                               -1.000000f, -0.923880f, -0.707107f, -0.382683f,
                               -0.000000f,  0.382683f,  0.707107f,  0.923880f};

(...)

for (int i = 0; i < 16; ++i) {
    float sin_i = sin_array[i];
    float cos_i = cos_array[i];
    ...
}
```
We don't have a barrier here, so it should be faster, no? Problem: sin_array and cos_array are not filled correctly. So this is my main question: why?

The second question, if this one is solved: is it better to leave these arrays in __local memory, or should I put them in __constant memory (for instance, declared before the function definition)?
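
To make the second question concrete, here is a rough sketch of what the __constant version could look like. The kernel name `process_sectors` and its arguments are placeholders rather than my actual code, and the loop body is only a stand-in for the real per-sector work. Since the tables are never written at run time, this variant needs no barrier.

```
// Tables at program scope in __constant memory, declared before the kernel.
__constant float sin_array[16] = { 0.000000f,  0.382683f,  0.707107f,  0.923880f,
                                   1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                   0.000000f, -0.382683f, -0.707107f, -0.923880f,
                                  -1.000000f, -0.923880f, -0.707107f, -0.382683f};

__constant float cos_array[16] = { 1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                   0.000000f, -0.382683f, -0.707107f, -0.923880f,
                                  -1.000000f, -0.923880f, -0.707107f, -0.382683f,
                                   0.000000f,  0.382683f,  0.707107f,  0.923880f};

// Hypothetical kernel: the name, arguments and arithmetic are placeholders.
__kernel void process_sectors(__global const float *in,
                              __global float *out,
                              const int n)
{
    const int gid = get_global_id(0);
    if (gid >= n)
        return;

    float acc = 0.f;
    for (int i = 0; i < 16; ++i) {
        const float sin_i = sin_array[i];
        const float cos_i = cos_array[i];
        // Stand-in for the real per-sector computation.
        acc += in[gid] * (sin_i + cos_i);
    }
    out[gid] = acc;
}
```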

Many thanks,

Vincent