I know this sounds ridiculous, and I suspect I did something wrong, because I can’t find anything related on the internet. Still, I can’t re-use a shared (local) memory array on an ATI 5870, while the same program runs fine on an nVidia GPU.
Say I have a kernel that looks like this (my real code is not this simple, but the relevant details are the same):
__local float tmp1[16];
__local float tmp2[16];
uint localIdX = get_local_id(0);
float a,b;
// I first define tmp1 and use it for a
tmp1[localIdX]=1;
a=tmp1[localIdX];
barrier(CLK_LOCAL_MEM_FENCE);
// If I then re-use tmp1 for a later calculation, the result is wrong on the ATI 5870, while nVidia’s result is correct.
// If I use tmp2 instead, the ATI result is correct too.
// The two variants are shown below.
// Variant 1: re-use tmp1
tmp1[localIdX]=1; // goes wrong on ATI, while nVidia is fine
b=tmp1[localIdX];
// Variant 2: use a new array tmp2
tmp2[localIdX]=2; // then ATI is fine too
b=tmp2[localIdX];
I make sure there is synchronization before re-using the shared memory. The re-use problem only happens on the ATI 5870; the nVidia GTX260 handles re-use of shared memory fine with the same code.
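To make the pattern concrete, here is a minimal self-contained sketch of what I mean by synchronizing before re-use. It is not my real kernel: the kernel name `reuse_test` and the output buffer `out` are made up for this post, and it assumes a 1D work-group of at most 16 work-items. With this sketch I would expect every element of `out` to be 3.

```c
// Sketch only: reuse_test and out are hypothetical names, not my real code.
// Assumes a 1D work-group with at most 16 work-items.
__kernel void reuse_test(__global float *out)
{
    __local float tmp1[16];
    uint localIdX = get_local_id(0);
    float a, b;

    // First use of tmp1.
    tmp1[localIdX] = 1.0f;
    barrier(CLK_LOCAL_MEM_FENCE);   // make the writes visible to the whole work-group
    a = tmp1[localIdX];
    barrier(CLK_LOCAL_MEM_FENCE);   // all reads finish before tmp1 is overwritten

    // Re-use of the same array.
    tmp1[localIdX] = 2.0f;
    barrier(CLK_LOCAL_MEM_FENCE);
    b = tmp1[localIdX];

    // Each work-item stores a + b (1 + 2).
    out[get_global_id(0)] = a + b;
}
```

Every work-item reaches every barrier here, since none of them sits inside a branch.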
I think there may be a problem with how I build the program, or something related to my card, but I really have no clue at this point.
Any thoughts would be appreciated! Thanks.