What is the general technique for having several devices (in the same context) run the same kernel on the same memory? How do you split the workload? For example, I want the first device to compute the first half of the job and the second device to compute the second half, both in the same memory.

The global_work_offset parameter of clEnqueueNDRangeKernel would be really handy for this, but it currently isn't supported (it has to be NULL).
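One possible workaround, sketched here only as an illustration: compute each device's slice on the host, enqueue the kernel on each device's queue with a global size equal to the slice length, and pass the slice start as an extra kernel argument that the kernel adds to get_global_id(0). The `work_slice` type and the `split_work` helper below are hypothetical names made up for this sketch, and the OpenCL calls appear only in comments, so the arithmetic can be checked without an OpenCL installation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one contiguous slice of the 1-D global range. */
typedef struct {
    size_t offset; /* first global index handled by this device */
    size_t count;  /* number of work-items given to this device */
} work_slice;

/* Split `total` work-items into `num_devices` contiguous slices and return
 * the slice for `device_index`. When the division is not exact, the first
 * (total % num_devices) slices each receive one extra work-item. */
static work_slice split_work(size_t total, size_t num_devices,
                             size_t device_index)
{
    assert(num_devices > 0 && device_index < num_devices);

    size_t base  = total / num_devices;
    size_t extra = total % num_devices;

    work_slice s;
    s.count  = base + (device_index < extra ? 1 : 0);
    s.offset = device_index * base
             + (device_index < extra ? device_index : extra);
    return s;
}

/* Assumed host-side use (not compiled here; queue[i], kernel and the
 * argument index 1 are placeholders):
 *
 *   work_slice s = split_work(total, num_devices, i);
 *   cl_uint start = (cl_uint)s.offset;
 *   clSetKernelArg(kernel, 1, sizeof(cl_uint), &start);
 *   clEnqueueNDRangeKernel(queue[i], kernel, 1, NULL,
 *                          &s.count, NULL, 0, NULL, NULL);
 *
 * Assumed kernel side:
 *
 *   __kernel void job(__global float *data, uint start) {
 *       size_t i = get_global_id(0) + start;
 *       ... work on data[i] ...
 *   }
 */
```

With two devices and 10 work-items this gives the first device indices 0 to 4 and the second device indices 5 to 9. If a local work size is specified, each slice length would also need to be a multiple of it, which this sketch does not handle.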

