How to optimally transfer a character array to a kernel

Hi, I am working on a kernel that processes a large array of unsigned chars. I create a memory object with clCreateBuffer,
then copy a chunk of unsigned chars into it with clEnqueueWriteBuffer. After that I call
the kernel in a loop; it reads from this memory object, does some logic, and writes the new data back to the same place (I don't call clEnqueueWriteBuffer or clEnqueueReadBuffer inside the loop). Here is the kernel code:


__kernel void test(__global unsigned char *in, unsigned int offset) {
    int grId = get_group_id(0);
    unsigned char msg[1024];
    offset *= grId;
    
    // Copy from global to private memory
    size_t i;
    for (i = 0; i < 1024; i++)
        msg[i] = in[ offset + i ];

    // Make some computation here, not complicated logic    

    // Copy from private to global memory
    for (i = 0; i < 1024; i++)
        in[ offset + i ] = msg[i];
}

When the loop is done (it runs about 1000 times), I read the result from the memory object with clEnqueueReadBuffer.

Is it possible to optimize this code?

Are you sure the code you have posted works as intended? This part looks strange:


// Copy from private to global memory
    for (i = 0; i < 1024; i++)
        in[ offset + i ] = msg[i];

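For one thing, check the memory spaces: msg is declared inside the kernel, so it is private, and in is the __global pointer, so this loop does copy from private to global memory, as the comment says. The bigger issue is the indexing: offset depends only on get_group_id(0), so every work-item in a work-group reads the same 1024 bytes of in into its own copy of msg and then writes back to the same 1024 bytes. Unless the work-group size is 1, that is redundant work at best and a data race at worst, because one work-item can write its result back while another is still reading the original data.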
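One way to stop every work-item in a group from touching the same bytes is to let the work-items of a group split the chunk between them, each one stepping through it with a stride equal to the work-group size. The sketch below is only an illustration under a few assumptions: the computation is element-wise (the `v + 1` line is a placeholder for the real logic), the kernel name `update_chunks` and the parameter name `chunk_size` are made up, and the `emu_*` variables plus `run_on_host` are hypothetical host-side shims that exist only so the same source can be compiled and checked as plain C. They are not part of OpenCL.

```c
#ifndef __OPENCL_VERSION__
/* Host-only shims (illustrative, not part of OpenCL) so the kernel below
   also compiles as ordinary C and can be checked without a device. */
#include <assert.h>
#include <stddef.h>
#define __kernel
#define __global
static size_t emu_group_id, emu_local_id, emu_local_size;
static size_t get_group_id(unsigned int dim)   { (void)dim; return emu_group_id; }
static size_t get_local_id(unsigned int dim)   { (void)dim; return emu_local_id; }
static size_t get_local_size(unsigned int dim) { (void)dim; return emu_local_size; }
#endif

/* Each work-group owns one chunk of chunk_size bytes. The work-items of the
   group share that chunk: work-item lid handles bytes lid, lid + lsize, ... */
__kernel void update_chunks(__global unsigned char *in, unsigned int chunk_size)
{
    size_t base  = get_group_id(0) * (size_t)chunk_size;
    size_t lid   = get_local_id(0);
    size_t lsize = get_local_size(0);
    size_t i;

    for (i = lid; i < chunk_size; i += lsize) {
        unsigned char v = in[base + i];  /* one global read per byte */
        v = (unsigned char)(v + 1);      /* placeholder for the real computation */
        in[base + i] = v;                /* one global write per byte */
    }
}

#ifndef __OPENCL_VERSION__
/* Runs the kernel sequentially on the host, one emulated work-item at a time. */
static void run_on_host(unsigned char *buf, size_t groups, size_t local_size,
                        unsigned int chunk_size)
{
    assert(local_size > 0);
    emu_local_size = local_size;
    for (emu_group_id = 0; emu_group_id < groups; emu_group_id++)
        for (emu_local_id = 0; emu_local_id < local_size; emu_local_id++)
            update_chunks(buf, chunk_size);
}
#endif
```

With this layout each byte is read and written by exactly one work-item, and neighbouring work-items access neighbouring bytes. If the real computation needs the whole chunk at once, the same strided loop could instead stage the chunk into __local memory, with a barrier before the computation starts.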