Best way to enqueue a kernel for a triangular matrix?

I’m solving a 2D problem where I only need to process work-items with i>j. I could process all of them, but the result for item [i,j] is guaranteed to equal the result for [j,i], so computing both would waste resources.

What would be the most efficient way of doing it?
Just make work-items return without doing anything when j>=i?
Or perhaps enqueue the kernel several times with a different global work size each time (one invocation per row, so that each row only defines work-items with i>j)?
Or doing this: http://stackoverflow.com/questions/24021305/opencl-efficient-way-to-group-a-lower-triangular-matrix ?
(By the way, that answer may have been written by the same Dithermaster who posts in this forum.)
The approach in that link scares me a bit. Wouldn’t my first idea above (just making work-items return when j>=i) already be efficient? I tend to believe that when a work-item returns, the OpenCL runtime starts a new work-item as soon as a compute unit is able to take one, so maybe the early return is efficient too?

Thanks!

I’ve come to the conclusion that the best way is to turn the 2D problem into a 1D one. It’s cheap to compute the (i,j) indices in the triangular matrix from the 1D global ID of each work-item. This way you only create work-items for the triangular half of the matrix that you’re actually interested in, and I believe cache use will be quite coherent as well.
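
For anyone finding this later, here is a minimal sketch of one way to do that 1D-to-(i,j) mapping. It is not necessarily the exact code I ended up with. It assumes the elements strictly below the diagonal are numbered row by row, so k = i*(i-1)/2 + j with i>j. The names tri_count and tri_index are made up for this example. It is written as plain C-style host code so it can be checked on the CPU, but the arithmetic should carry over to a kernel more or less as-is.

```cpp
#include <cmath>
#include <cstddef>

/* Number of elements strictly below the diagonal of an n x n matrix,
   i.e. how many work-items the 1D range needs. */
static std::size_t tri_count(std::size_t n)
{
    return n * (n - 1) / 2;
}

/* Map a 1D index k (0 <= k < tri_count(n)) to (i, j) with i > j,
   assuming row-by-row numbering: k = i*(i-1)/2 + j. */
static void tri_index(std::size_t k, std::size_t *i, std::size_t *j)
{
    /* Closed-form estimate of the row from the inverse of i*(i-1)/2 <= k. */
    std::size_t row =
        (std::size_t)((1.0 + std::sqrt(1.0 + 8.0 * (double)k)) / 2.0);

    /* Guard against floating-point rounding: nudge the estimate until
       row*(row-1)/2 <= k < row*(row+1)/2 holds. */
    while (row * (row - 1) / 2 > k) --row;
    while (row * (row + 1) / 2 <= k) ++row;

    *i = row;
    *j = k - row * (row - 1) / 2;
}
```

In a kernel you would enqueue a 1D range of tri_count(n) work-items and run the same arithmetic on get_global_id(0). If you round the global size up to a multiple of the work-group size, the extra work-items need a bounds check against tri_count(n). The two correction loops normally run zero times; they are only there in case the square root comes back slightly off, which seems more likely if you use single-precision sqrt on the device.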