a weird Local memory error in transpose

Hi, all:

I have a transpose program using Local memory as follows. Generally, the program runs smoothly, but with a weird bug. I transpose my matrix twice, and compare with the raw matrix. I find that most of the position is correct, but several position’s value is different from raw matrix. Here is an example. My matrix is 4096490664 and I will transpose every 4096*4096 twice. In this big matrix, four positions’ value is wrong.
They are

framenum 0, x 1230, y 3448, value 22 ,input 64, trans 86
framenum 0, x 1230, y 3449, value 29 ,input 67, trans 96
framenum 0, x 1231, y 3448, value 1 ,input 67, trans 68
framenum 0, x 1231, y 3449, value 9 ,input 60, trans 69

You may see that the four position are together. More weirdly, if I re-run the program, sometimes the error is disappear, sometimes the error is at other position, sometimes, 8 or 12 or 16 positions have errors. I mean, the error happens randomly when running the program.

If I only use global memory to do transpose ( the simplest way), no error happens. And I if use local memory to do simple copy, that is (1), copy data from global memory to local memory, (2) copy data from local memory to global memory, No error either!

So I am every confusing, why this happens? I try to use different block size, try with or without volatile , try BLOCK_DIM*(BLOCK_DIM+1) or BLOCK_DIM*(BLOCK_DIM). The error always here.

My card is K40. By the way, I tried two K40 cards. And both of them have this phenomenon.

Is anyone have any suggestion? Many thanks for your reply in advance.

Here is the program: transpose_local_memory.

__kernel void transpose_local_uchar(__global unsigned char* input, __global volatile unsigned char* output, int width, int height, int BLOCK_DIM, __local volatile unsigned char* block)
{
// read the matrix tile into shared memory

for(int frm=0;frm<16;frm++){
int x2=get_global_id(2)/4;
int y2=get_global_id(2)%4;

unsigned int xIndex = get_global_id(0)+get_global_size(0)*x2;
unsigned int yIndex = get_global_id(1)+get_global_size(1)*y2;

if (xIndex  &gt;= width || yIndex &gt;= height)
{
	return;
}
else
{
	unsigned int index_in = yIndex * width + xIndex+width*height*frm;
	//block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = input[index_in];
	block[get_local_id(1)*(BLOCK_DIM)+get_local_id(0)] = input[index_in];
}

	barrier(CLK_LOCAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE );

// write the transposed matrix tile to global memory
xIndex = get_group_id(1) * BLOCK_DIM + get_local_id(0)+get_global_size(1)*y2;
yIndex = get_group_id(0) * BLOCK_DIM + get_local_id(1)+get_global_size(0)*x2;
if((xIndex &lt; height) && (yIndex  &lt; width))
{
	unsigned int index_out = yIndex * height + xIndex+width*height*frm;
	//output[index_out] = block[get_local_id(0)*(BLOCK_DIM+1)+get_local_id(1)];
	output[index_out] = block[get_local_id(0)*(BLOCK_DIM)+get_local_id(1)];
}

}
}

if (xIndex >= width || yIndex >= height)
{
return;
}

this part makes some threads of a group quit early but

barrier(CLK_LOCAL_MEM_FENCE|CLK_GLOBAL_MEM_FENCE );

needs to be hit by all threads of a group. So undefined behavior.

Your “If I only use global memory to do transpose ( the simplest way), no error happens.” part shows that.

Also if a work area is treated different on borders, it “could” make it both faster and easier to do borders in a separate kernel(easier without border checking but working on border itself).