Hi,
Before I start, I just want to say that (almost) everything works fine, so this is just a question about what you think of this particular subject.
Let’s say that I have a 1D buffer of size N that I “cover” with 1D thread blocks of 128 threads (local size). Each thread divides the angle range [0, 2*pi] into 16 sectors. For each thread, I do something like this:
const float sector_size = 2.f * M_PI_F / 16;
for (int i=0; i<16; ++i) {
    float sin_i = sin(i*sector_size);
    float cos_i = cos(i*sector_size);
    (...)
}
As you can see, it’s pretty straightforward. No need to add more details.
Then I thought it was pretty stupid to compute sin and cos so many times. I could just precompute them and put them into local arrays that I fill in parallel using the first 16 threads of my group:
__local float sin_array[16];
__local float cos_array[16];
if (thread_id<16) {
    const float sector_size = 2.f * M_PI_F / 16;
    sin_array[thread_id] = sin(thread_id * sector_size);
    cos_array[thread_id] = cos(thread_id * sector_size);
}
barrier(CLK_LOCAL_MEM_FENCE);
(...)
for (int i=0; i<16; ++i) {
    float sin_i = sin_array[i];
    float cos_i = cos_array[i];
    ...
}
This works fine but it doesn’t speed anything up. I’m used to being surprised in GPU coding. I guess the barrier here cancels the benefit of precomputing the array values.
Then I thought, “why don’t I initialize the arrays by hand?” So I tried the following approach:
__local float sin_array[16] = {  0.000000f,  0.382683f,  0.707107f,  0.923880f,
                                 1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                 0.000000f, -0.382683f, -0.707107f, -0.923880f,
                                -1.000000f, -0.923880f, -0.707107f, -0.382683f};
__local float cos_array[16] = {  1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                 0.000000f, -0.382683f, -0.707107f, -0.923880f,
                                -1.000000f, -0.923880f, -0.707107f, -0.382683f,
                                -0.000000f,  0.382683f,  0.707107f,  0.923880f};
(...)
for (int i=0; i<16; ++i) {
    float sin_i = sin_array[i];
    float cos_i = cos_array[i];
    ...
}
We don’t have a barrier here, so it should be faster, no? The problem: sin_array and cos_array are not filled correctly. So this is my main question: why?
The second question, if this one is solved: is it better to leave these arrays in __local memory, or should I put them in __constant memory (for instance, declared before the function definition)?
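To make the second question concrete, this is roughly the __constant variant I have in mind. It is only a sketch: the kernel name sector_kernel, its arguments and the accumulation in the loop body are made up for illustration; only the table values come from the code above.

```c
// Program-scope lookup tables in __constant memory, declared before the kernel.
// Entries are sin(i * 2*pi/16) and cos(i * 2*pi/16) for i = 0..15.
__constant float sin_table[16] = {  0.000000f,  0.382683f,  0.707107f,  0.923880f,
                                    1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                    0.000000f, -0.382683f, -0.707107f, -0.923880f,
                                   -1.000000f, -0.923880f, -0.707107f, -0.382683f };
__constant float cos_table[16] = {  1.000000f,  0.923880f,  0.707107f,  0.382683f,
                                    0.000000f, -0.382683f, -0.707107f, -0.923880f,
                                   -1.000000f, -0.923880f, -0.707107f, -0.382683f,
                                    0.000000f,  0.382683f,  0.707107f,  0.923880f };

// Hypothetical kernel: the name, the arguments and the loop body are placeholders.
__kernel void sector_kernel(__global const float *in,
                            __global float *out,
                            const int n)
{
    const int gid = get_global_id(0);
    if (gid >= n) return;               // guard for a global size rounded up to 128

    float acc = 0.f;
    for (int i = 0; i < 16; ++i) {
        const float sin_i = sin_table[i];   // read-only lookup, same index for every thread
        const float cos_i = cos_table[i];
        acc += in[gid] * (sin_i + cos_i);   // stand-in for the real per-sector work
    }
    out[gid] = acc;
}
```

In this version no thread writes to the tables, so there is no barrier at all.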
Many thanks,
Vincent