Hi, I'm working on a kernel that operates on a large array of unsigned chars. I create a memory object with clCreateBuffer,
then copy a chunk of unsigned chars into it with clEnqueueWriteBuffer. After that I call the kernel in a loop;
each call reads from this memory object, does some logic, and writes the new data back to the same place (I don't call clEnqueueWriteBuffer or clEnqueueReadBuffer inside the loop). Here is the kernel code:
__kernel void test(__global unsigned char *in, unsigned int offset) {
    int grId = get_group_id(0);
    unsigned char msg[1024];
    offset *= grId;
    // Copy from global to private memory
    size_t i;
    for (i = 0; i < 1024; i++)
        msg[i] = in[offset + i];
    // Make some computation here, not complicated logic
    // Copy from private to global memory
    for (i = 0; i < 1024; i++)
        in[offset + i] = msg[i];
}
When the loop is done (it runs about 1000 times), I read the result back from the memory object with clEnqueueReadBuffer.
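To make the data flow concrete, here is a minimal host-only model in plain C with no OpenCL calls. It is only a sketch under assumptions, not my real program: the names CHUNK, NUM_GROUPS, NUM_ITERATIONS, test_group and run_model are made up for illustration, the computation is a placeholder (add 1 to every byte), and offset is assumed to equal the chunk size of 1024.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK 1024           /* bytes handled per work-group, as in the kernel */
#define NUM_GROUPS 4         /* assumed number of work-groups */
#define NUM_ITERATIONS 1000  /* "about 1000" passes; exact value assumed */

/* Models one work-group of the kernel: copy in, compute, copy out. */
static void test_group(unsigned char *in, unsigned int offset, size_t gr_id)
{
    unsigned char msg[CHUNK];
    size_t base = (size_t)offset * gr_id;  /* same as offset *= grId */
    size_t i;

    /* Copy from "global" to "private" memory */
    for (i = 0; i < CHUNK; i++)
        msg[i] = in[base + i];

    /* Placeholder computation (assumption): add 1 to every byte */
    for (i = 0; i < CHUNK; i++)
        msg[i] = (unsigned char)(msg[i] + 1);

    /* Copy from "private" back to "global" memory */
    for (i = 0; i < CHUNK; i++)
        in[base + i] = msg[i];
}

/* Models the host flow on a plain array. Returns 0 on success, -1 on
 * allocation failure. size must cover NUM_GROUPS chunks. */
static int run_model(const unsigned char *host_in, unsigned char *host_out,
                     size_t size)
{
    unsigned char *dev;
    int iter;
    size_t g;

    assert(size >= (size_t)NUM_GROUPS * CHUNK);

    dev = malloc(size);                 /* stands in for clCreateBuffer */
    if (dev == NULL)
        return -1;

    memcpy(dev, host_in, size);         /* stands in for clEnqueueWriteBuffer */

    /* Stands in for calling the kernel in a loop; no host transfers inside */
    for (iter = 0; iter < NUM_ITERATIONS; iter++)
        for (g = 0; g < NUM_GROUPS; g++)
            test_group(dev, CHUNK, g);

    memcpy(host_out, dev, size);        /* stands in for clEnqueueReadBuffer */
    free(dev);
    return 0;
}
```

The point of the model is that the data is uploaded once, modified in place on every pass, and downloaded once at the end. With the placeholder computation, a byte that starts at 0 ends up as 232 after 1000 passes (1000 mod 256).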
Is it possible to optimize this code?