I am trying to automatically subdivide my global work size (GWS) into smaller pieces using the global work offset.
Here is an example:
Code :
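size_t szGWS[3] = {1024, 1024, 1};
size_t szLWS[3] = {256, 1, 1};
size_t szGWO[3] = {0, 0, 0};
int sub = 1; // number of sub-kernels; 1 means no subdivision
if (1024 * 1024 * uiWIComplexity > device.AvailaleFlops) // test if we need to subdivide the problem
    sub = 3;
for (int i = 0; i < sub; i++) {
    szGWO[1] = 1024 * i / sub;                  // first row of this piece
    szGWS[1] = 1024 * (i + 1) / sub - szGWO[1]; // rows in this piece; also covers the remainder, since 1024 % 3 != 0
    clEnqueueNDRangeKernel(..., szGWO, szGWS, szLWS, ...);
}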
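The offset and size arithmetic can be checked on the host without any OpenCL calls. Below is a minimal sketch; `chunk_range` is a hypothetical helper name, not an OpenCL function, and it assumes `total * sub` does not overflow `size_t`. It uses floor-based boundaries so that consecutive pieces tile the whole range even when the total is not divisible by the number of pieces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL): computes the offset and size of
 * piece i when `total` work-items along one dimension are split into `sub`
 * pieces. Piece boundaries are floor(total * i / sub), so piece i ends
 * exactly where piece i + 1 starts and nothing is skipped. */
static void chunk_range(size_t total, size_t sub, size_t i,
                        size_t *offset, size_t *size)
{
    assert(sub > 0 && i < sub);
    *offset = total * i / sub;
    size_t end = total * (i + 1) / sub; /* start of the next piece */
    *size = end - *offset;
}
```

For 1024 rows and 3 pieces this gives sizes 341, 341 and 342, whereas a fixed size of 1024/3 = 341 per piece would only reach row 1022 and leave the last row unprocessed.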

I think the indexing inside my kernel works properly, but synchronization fails.
I have a synchronized (in-order) queue, which means all enqueued kernels should synchronize by themselves, correct?

But if I do the following:
(1) copy values from buffer A to B in multiple subkernels
(2) edit values of A in multiple subkernels
(3) edit values of B in multiple subkernels

my data seems corrupted.
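To pin down what "corrupted" means here, a CPU reference of the three steps may help. This is only a sketch: the real kernels are not shown, so `edit_a` and `edit_b` are placeholder edits, and `reference_pipeline` is a hypothetical name. It runs every piece of one step before any piece of the next, which is the ordering the host code above relies on.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder edits, purely illustrative: the real kernels are not shown. */
static float edit_a(float v) { return v + 1.0f; }
static float edit_b(float v) { return v * 2.0f; }

/* CPU reference: runs steps (1), (2), (3) one after another over n elements,
 * each step split into `sub` pieces. All pieces of a step finish before the
 * next step starts. */
static void reference_pipeline(float *a, float *b, size_t n, size_t sub)
{
    assert(sub > 0);
    for (int step = 1; step <= 3; step++) {
        for (size_t i = 0; i < sub; i++) {
            size_t begin = n * i / sub;     /* first element of piece i */
            size_t end = n * (i + 1) / sub; /* one past its last element */
            for (size_t k = begin; k < end; k++) {
                if (step == 1)
                    b[k] = a[k];            /* (1) copy A to B */
                else if (step == 2)
                    a[k] = edit_a(a[k]);    /* (2) edit A */
                else
                    b[k] = edit_b(b[k]);    /* (3) edit B */
            }
        }
    }
}
```

Reading A and B back from the device and comparing them element by element with this reference shows which elements differ, for example whether only the rows at a piece boundary are affected.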
Does OpenCL wait for the whole of task (1) to complete before starting (2) and (3), or does it start the first part of (2) or (3) as soon as the first part of (1) is done?

Thanks in advance,