I have a kernel that performs two tasks, A followed by B. Task A is highly parallel, but task B cannot be parallelized.
Task A is performed by all work items, and task B is only performed by the first work item in the work group.
I am wasting a lot of GPU resources while the remaining work items wait for task B to complete. What can I do to optimize this?
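To put a rough number on that waste, here is a back-of-the-envelope model in plain C. It is not my real kernel, only a sketch: it assumes every work item spends the same time `t_a` on task A, and that work item 0 then spends `t_b` on task B while the rest of the work group sits idle. The function name and parameters are made up for illustration, and real hardware scheduling is more complicated than this.

```c
#include <stddef.h>

/* Fraction of the work group's item-time that is spent idle.
 * Assumed model (hypothetical, for illustration only):
 *   - all `local_size` work items run task A for `t_a`
 *   - then only work item 0 runs task B for `t_b`
 *   - work items 1..local_size-1 wait during task B
 */
double idle_fraction(size_t local_size, double t_a, double t_b)
{
    if (local_size == 0 || t_a + t_b <= 0.0)
        return 0.0; /* nothing runs, so nothing is wasted */

    /* Item-time available to the whole work group. */
    double total = (double)local_size * (t_a + t_b);

    /* Item-time lost while everyone except item 0 waits for task B. */
    double idle = (double)(local_size - 1) * t_b;

    return idle / total;
}
```

Under these assumptions, a work group of 64 items where A and B take equal time gives `idle_fraction(64, 1.0, 1.0)` = 63/128, so roughly half of the available item-time is spent waiting.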
I thought of device partitioning, so that multiple instances of the kernel could run simultaneously, thereby improving parallelization.
Does this sound like the right approach? Any other ideas?
Thanks!