quote: can you please clarify what do you mean by “indirect dispatch”?
Let’s take the example of collision detection: we have two kernels. The first one detects pairs of bounding boxes that overlap; the second calculates exact contact points for each pair.
With OpenCL 1.x we need a readback to determine the number of detected pairs:
EnqueueKernel (DetectPairs(), boxes, numBoxes, &pairCountGPU);
int pairCountCPU; ReadBuffer (pairCountGPU, &pairCountCPU);
EnqueueKernel (CalcExactCollision(), boxes, numBoxes, pairCountCPU);
So the GPU has to wait while pairCount is downloaded to the CPU, and then again until the next enqueued command has been uploaded.
With VK, indirect dispatch means we do not need that download, because the work size can come directly from GPU memory.
Also, because the command buffer can be recorded just once and reused many times, we do not wait on the upload of enqueue commands:
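For illustration, here is roughly what that looks like with the real OpenCL 1.x host API (a minimal sketch; kernel, buffer and queue creation as well as error checking are omitted, and all names are illustrative):

#include <CL/cl.h>

void runCollisionStep(cl_command_queue queue,
                      cl_kernel detectPairs, cl_kernel calcExactCollision,
                      cl_mem boxes, cl_mem pairCountGPU, cl_uint numBoxes)
{
    size_t global = numBoxes;

    // Pass 1: broad phase, writes the number of found pairs into pairCountGPU.
    clSetKernelArg(detectPairs, 0, sizeof(cl_mem), &boxes);
    clSetKernelArg(detectPairs, 1, sizeof(cl_uint), &numBoxes);
    clSetKernelArg(detectPairs, 2, sizeof(cl_mem), &pairCountGPU);
    clEnqueueNDRangeKernel(queue, detectPairs, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);

    // Blocking read: everything stalls here until the counter has been copied back.
    cl_uint pairCountCPU = 0;
    clEnqueueReadBuffer(queue, pairCountGPU, CL_TRUE, 0, sizeof(cl_uint),
                        &pairCountCPU, 0, nullptr, nullptr);

    // Pass 2: narrow phase, launched with a work size only known after the readback.
    if (pairCountCPU > 0)
    {
        size_t globalPairs = pairCountCPU;
        clSetKernelArg(calcExactCollision, 0, sizeof(cl_mem), &boxes);
        clSetKernelArg(calcExactCollision, 1, sizeof(cl_uint), &numBoxes);
        clSetKernelArg(calcExactCollision, 2, sizeof(cl_uint), &pairCountCPU);
        clEnqueueNDRangeKernel(queue, calcExactCollision, 1, nullptr, &globalPairs, nullptr, 0, nullptr, nullptr);
    }
}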
Setup:
commandBuffer.RecordDispatch (DetectPairs(), boxes, numBoxes, &pairCountGPUbuffer);
commandBuffer.MemoryBarrier (pairCountGPUbuffer);
commandBuffer.RecordIndirectDispatch (CalcExactCollision(), boxDataGPUbuffer, pairCountGPUbuffer);
Runtime:
Enqueue(commandBuffer);
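Spelled out with the actual Vulkan API, the setup part would look roughly like this (a minimal sketch assuming the pipelines, layout, descriptor set and buffers already exist; names are illustrative and error handling is omitted):

#include <vulkan/vulkan.h>

void recordCollisionCommands(VkCommandBuffer cmd,
                             VkPipeline detectPairsPipe, VkPipeline calcExactPipe,
                             VkPipelineLayout layout, VkDescriptorSet descSet,
                             VkBuffer pairCountGPUbuffer, // usage: STORAGE | INDIRECT
                             uint32_t numBoxGroups)
{
    VkCommandBufferBeginInfo begin = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
    vkBeginCommandBuffer(cmd, &begin);

    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layout, 0, 1, &descSet, 0, nullptr);

    // Pass 1: broad phase writes a VkDispatchIndirectCommand {x,y,z} into pairCountGPUbuffer.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, detectPairsPipe);
    vkCmdDispatch(cmd, numBoxGroups, 1, 1);

    // Make the shader-written workgroup count visible to the indirect dispatch.
    VkBufferMemoryBarrier barrier = { VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER };
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer = pairCountGPUbuffer;
    barrier.offset = 0;
    barrier.size = VK_WHOLE_SIZE;
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
                         0, 0, nullptr, 1, &barrier, 0, nullptr);

    // Pass 2: narrow phase, workgroup count taken straight from GPU memory.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, calcExactPipe);
    vkCmdDispatchIndirect(cmd, pairCountGPUbuffer, 0);

    vkEndCommandBuffer(cmd);
    // Per frame the CPU then only submits the prerecorded command buffer via vkQueueSubmit.
}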
So we can implement a whole physics engine (or graphics engine) in one ‘draw call’, if we are able to do it with a constant program flow.
(But we don’t know if a GPU really can do this without some CPU<->GPU interaction under the hood).
NV already goes beyond that with device generated command buffers.
In my current work I have only one small download per frame, which I do with a blocking read. No other downloads or uploads are included in my profiling.
None of my enqueues do any waiting.
So the CL slowdown I see comes mainly from the need to enqueue everything each frame.
quote: Does it mean that VK can somehow execute patch of 50 requests in parallel?
No. I don’t use async compute yet. The need to use multiple queues and command buffers produces enough overhead to destroy the benefit for me; see https://community.amd.com/thread/209805
In theory the driver could figure out a dependency graph from a single command buffer and do async under the hood, but I’m pretty sure that does not happen.
quote: Or does it, again, mean that CL kernel code is compiled to 10% less efficient binary representation compared to VK shader code?
Yes. When I start working on a new kernel, CL is usually faster at first. But after optimizing it (still focusing on CL using CodeXL), VK ends up 10% faster on average (though there are also rare cases where CL wins).
Observed on AMD. A difference of only 10% means the vendor has good compilers. I remember OpenCL being two times faster than OpenGL compute shaders on Nvidia some years ago; I would have expected it the other way around.
However, the AMD OpenCL compiler can do really crazy bad things. VK seems better, but I can’t tell for sure because CodeXL does not support it yet.
quote: …and then look in CodeXL timeline, I see that there is no redundant waits – all kernels and R/W-s launch after each other immediately.
How do you do that?
I’m not aware of a timeline where I could see the idle time between enqueues. I can only see the time each kernel needs to run, but no start/end timestamps.
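If the tool does not expose them, such timestamps could at least be read manually from OpenCL events. A minimal sketch (assuming the queue was created with CL_QUEUE_PROFILING_ENABLE; names are illustrative):

#include <CL/cl.h>
#include <cstdio>

void profileKernel(cl_command_queue queue, cl_kernel kernel, size_t globalSize)
{
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &globalSize, nullptr, 0, nullptr, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong queued = 0, start = 0, end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, nullptr);

    // Comparing 'end' of one kernel with 'start' of the next reveals idle gaps
    // between enqueues; (start - queued) shows how long the command sat in the queue.
    printf("queued->start: %llu ns, execution: %llu ns\n",
           (unsigned long long)(start - queued), (unsigned long long)(end - start));

    clReleaseEvent(evt);
}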