Results 1 to 2 of 2

Thread: a weird Local memory error in transpose

  1. #1
    Join Date
    Sep 2017

    a weird Local memory error in transpose

    Hi, all:

    I have a transpose program using Local memory as follows. Generally, the program runs smoothly, but with a weird bug. I transpose my matrix twice, and compare with the raw matrix. I find that most of the position is correct, but several position's value is different from raw matrix. Here is an example. My matrix is 4096*4906*64 and I will transpose every 4096*4096 twice. In this big matrix, four positions' value is wrong.
    They are

    framenum 0, x 1230, y 3448, value 22 ,input 64, trans 86
    framenum 0, x 1230, y 3449, value 29 ,input 67, trans 96
    framenum 0, x 1231, y 3448, value 1 ,input 67, trans 68
    framenum 0, x 1231, y 3449, value 9 ,input 60, trans 69

    You may see that the four position are together. More weirdly, if I re-run the program, sometimes the error is disappear, sometimes the error is at other position, sometimes, 8 or 12 or 16 positions have errors. I mean, the error happens randomly when running the program.

    If I only use global memory to do transpose ( the simplest way), no error happens. And I if use local memory to do simple copy, that is (1), copy data from global memory to local memory, (2) copy data from local memory to global memory, No error either!

    So I am every confusing, why this happens? I try to use different block size, try with or without volatile , try BLOCK_DIM*(BLOCK_DIM+1) or BLOCK_DIM*(BLOCK_DIM). The error always here.

    My card is K40. By the way, I tried two K40 cards. And both of them have this phenomenon.

    Is anyone have any suggestion? Many thanks for your reply in advance.

    Here is the program: transpose_local_memory.
    __kernel void transpose_local_uchar(__global unsigned char* input, __global volatile unsigned char* output, int width, int height, int BLOCK_DIM, __local volatile unsigned char* block)
    // read the matrix tile into shared memory

    for(int frm=0;frm<16;frm++){
    int x2=get_global_id(2)/4;
    int y2=get_global_id(2)%4;

    unsigned int xIndex = get_global_id(0)+get_global_size(0)*x2;
    unsigned int yIndex = get_global_id(1)+get_global_size(1)*y2;

    if (xIndex >= width || yIndex >= height)
    unsigned int index_in = yIndex * width + xIndex+width*height*frm;
    //block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = input[index_in];
    block[get_local_id(1)*(BLOCK_DIM)+get_local_id(0)] = input[index_in];


    // write the transposed matrix tile to global memory
    xIndex = get_group_id(1) * BLOCK_DIM + get_local_id(0)+get_global_size(1)*y2;
    yIndex = get_group_id(0) * BLOCK_DIM + get_local_id(1)+get_global_size(0)*x2;
    if((xIndex < height) && (yIndex < width))
    unsigned int index_out = yIndex * height + xIndex+width*height*frm;
    //output[index_out] = block[get_local_id(0)*(BLOCK_DIM+1)+get_local_id(1)];
    output[index_out] = block[get_local_id(0)*(BLOCK_DIM)+get_local_id(1)];


  2. #2
    Junior Member
    Join Date
    Sep 2012
    if (xIndex >= width || yIndex >= height)

    this part makes some threads of a group quit early but


    needs to be hit by all threads of a group. So undefined behavior.

    Your "If I only use global memory to do transpose ( the simplest way), no error happens." part shows that.

    Also if a work area is treated different on borders, it "could" make it both faster and easier to do borders in a separate kernel(easier without border checking but working on border itself).
    Last edited by Tugrul; 09-09-2017 at 11:36 AM.

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
Proudly hosted by Digital Ocean