How to optimize a kernel with a mixture of parallel and serial code?

I have a kernel that performs two tasks, A followed by B. Task A is highly parallel, while task B cannot be parallelized.

Task A is performed by all work items, and task B is only performed by the first work item in the work group.
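
As a rough illustration of this structure, the following plain-C model shows what one work group does. It is only a sketch: the loop over `lid` stands in for work items that the device would run in parallel, and the squaring in task A, the running sum in task B, and the names `run_work_group` and `scratch` are placeholders assumed for illustration, not the real tasks.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C model of ONE work group. The loop over lid stands in for the
 * work items that a real device would run in parallel. */
void run_work_group(const float *in, float *scratch, float *out,
                    size_t group, size_t local_size)
{
    assert(local_size > 0);

    /* Task A (parallel): each work item squares its own element and
     * stores the result in scratch, a stand-in for local memory. */
    for (size_t lid = 0; lid < local_size; ++lid) {
        size_t gid = group * local_size + lid;
        scratch[lid] = in[gid] * in[gid];
    }

    /* In an OpenCL kernel, barrier(CLK_LOCAL_MEM_FENCE) would sit here
     * so that task B only starts once every work item has finished A. */

    /* Task B (serial): only work item 0 runs this. Each step depends on
     * the previous one, so the other work items have nothing to do. */
    float acc = 0.0f;
    for (size_t i = 0; i < local_size; ++i)
        acc = 0.5f * acc + scratch[i];
    out[group] = acc;
}
```

In the kernel itself, task B would be guarded by a check such as `get_local_id(0) == 0`, which is what leaves the rest of the group waiting.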

While task B runs, the remaining work items in the group sit idle, so a lot of GPU resources are wasted waiting for it to complete. What can I do to optimize this?
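
To put a rough number on the waste, here is a back-of-the-envelope model in C. It assumes every work item spends the same time `t_a` on task A and that work item 0 alone then spends `t_b` on task B while the others wait. The function name `idle_fraction` and any figures fed into it are hypothetical, not measurements, and the model ignores how the hardware actually schedules work groups.

```c
#include <assert.h>

/* Fraction of work-item time spent idle in one work group, assuming
 * every item takes t_a for task A and item 0 alone takes t_b for task B. */
double idle_fraction(unsigned n_items, double t_a, double t_b)
{
    assert(n_items > 0 && t_a >= 0.0 && t_b >= 0.0 && t_a + t_b > 0.0);

    double total = n_items * (t_a + t_b);  /* item-time the group occupies */
    double idle  = (n_items - 1) * t_b;    /* items 1..n-1 wait during B   */
    return idle / total;
}
```

For example, with 64 work items and equal times for A and B, `idle_fraction(64, 1.0, 1.0)` evaluates to 63/128, so roughly 49% of the group's time would be spent waiting under these assumptions.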

I thought of using device partitioning so that multiple instances of the kernel could run simultaneously on separate sub-devices, which should keep more of the hardware busy.

Does this sound like the right approach? Any other ideas?

Thanks!