There is still a dependency for your timing: i.e. that both kernels are complete.
And what about this "transfer from queue[0] to queue[1]" you mentioned but never clarified? Does that not add a dependency?
The code without clFinish and clFlush is as follows:
queue[0] = clCreateCommandQueue(context, device[0], QUEUE_OPTS , &result);
queue[1] = clCreateCommandQueue(context, device[1], QUEUE_OPTS , &result);
// Kernel creation, argument passing not included
// kernel[0] and kernel[1] "are assigned to" queue[0] and queue[1] respectively by assigning arguments properly
// variables with index 0 were created in queue[0]
kernel[0] = clCreateKernel(OpenCLProgram, kernel, &err);
err = clSetKernelArg(kernel[0], 0, sizeof(cl_mem), (void*)&(coefx[0]));
err = clSetKernelArg(kernel[0], 1, sizeof(cl_mem), (void*)&(coefy[0]));
err = clSetKernelArg(kernel[0], 2, sizeof(cl_mem), (void*)&(res[0]));
// variables with index 1 were created in queue[1]
kernel[1] = clCreateKernel(OpenCLProgram, kernel, &err);
err = clSetKernelArg(kernel[1], 0, sizeof(cl_mem), (void*)&(coefx[1]));
err = clSetKernelArg(kernel[1], 1, sizeof(cl_mem), (void*)&(coefy[1]));
err = clSetKernelArg(kernel[1], 2, sizeof(cl_mem), (void*)&(res[1]));
size = Ntotal * sizeof(float);
offset = size;
err = clEnqueueNDRangeKernel(queue[0], kernel[0], 1, NULL, WSize, GSize, 0, NULL, &event0);
err = clEnqueueNDRangeKernel(queue[1], kernel[1], 1, NULL, WSize, GSize, 0, NULL, &event1);
// res[0] and res[1] were created in queue[0] and queue[1] respectively
err = clEnqueueCopyBuffer(queue[1], res[1], res[0], 0, offset, size, 1, &event1, &eventCP);
The main idea is that the work is partitioned across 2 GPUs, and each kernel generates half of the res array. After kernel[1] finishes executing in queue[1], I need to transfer its half of res into res[0], the buffer used in queue[0].
I need to measure the total execution time of this code to verify that the total time is indeed approximately KernelTime + TransferTime. That's why I'm using wall-clock time.
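As a rough cross-check for the wall-clock figure, the expected total can be estimated from per-command durations. The sketch below is only an illustration under assumptions: it supposes QUEUE_OPTS includes CL_QUEUE_PROFILING_ENABLE, and that the start/end values for event0, event1 and eventCP were read with clGetEventProfilingInfo (CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END). The cmd_span struct and the expected_total_ns helper are hypothetical names, not part of OpenCL. Only durations are used, since timestamps from two different devices may not share a time base.

```c
#include <assert.h>
#include <stdint.h>

/* Start/end of one command in nanoseconds, as reported by event profiling.
   Hypothetical helper type, not an OpenCL type. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} cmd_span;

/* Duration of a single command. */
static uint64_t span_ns(cmd_span s)
{
    assert(s.end_ns >= s.start_ns);
    return s.end_ns - s.start_ns;
}

/* Expected total, assuming both kernels start together on separate devices
   and the copy starts only once kernel[1] has finished:
   total = max(kernel0, kernel1 + copy). */
static uint64_t expected_total_ns(cmd_span k0, cmd_span k1, cmd_span cp)
{
    uint64_t chain = span_ns(k1) + span_ns(cp); /* kernel[1] then copy */
    uint64_t other = span_ns(k0);               /* kernel[0] runs alongside */
    return chain > other ? chain : other;
}
```

If both kernels take about the same time, this reduces to KernelTime + TransferTime. A wall-clock measurement noticeably larger than this estimate would suggest the two kernels are not actually overlapping.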
Thanks for your help.