quote: can you please clarify what do you mean by “indirect dispatch”?
Let’s take the example of collision detection: we have two kernels. The first one detects pairs of bounding boxes that overlap; the second calculates exact contact points for each pair.
With OpenCL 1.x we need a readback to determine the number of detected pairs:
EnqueueKernel (DetectPairs(), boxes, numBoxes, &pairCountGPU);
int pairCountCPU; ReadBuffer (pairCountGPU, &pairCountCPU);
EnqueueKernel (CalcExactCollision(), boxes, numBoxes, pairCountCPU);
So the GPU has to wait while pairCount is downloaded to the CPU, and then again until the next enqueued command has been uploaded.
With VK, indirect dispatch means we do not need that download, because the work size can come directly from GPU memory.
Also, because the command buffer can be recorded just once and reused many times, we do not wait on the upload of enqueue commands:
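For illustration, here is roughly what that looks like with the real OpenCL 1.x host API (a minimal sketch; kernel, buffer and queue creation as well as error checking are omitted, and all names are illustrative):

#include <CL/cl.h>

void runCollisionStep(cl_command_queue queue,
                      cl_kernel detectPairs, cl_kernel calcExactCollision,
                      cl_mem boxes, cl_mem pairCountGPU, cl_uint numBoxes)
{
    size_t global = numBoxes;

    // Pass 1: broad phase, writes the number of found pairs into pairCountGPU.
    clSetKernelArg(detectPairs, 0, sizeof(cl_mem), &boxes);
    clSetKernelArg(detectPairs, 1, sizeof(cl_uint), &numBoxes);
    clSetKernelArg(detectPairs, 2, sizeof(cl_mem), &pairCountGPU);
    clEnqueueNDRangeKernel(queue, detectPairs, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);

    // Blocking read: everything stalls here until the counter has been copied back.
    cl_uint pairCountCPU = 0;
    clEnqueueReadBuffer(queue, pairCountGPU, CL_TRUE, 0, sizeof(cl_uint),
                        &pairCountCPU, 0, nullptr, nullptr);

    // Pass 2: narrow phase, launched with a work size only known after the readback.
    if (pairCountCPU > 0)
    {
        size_t globalPairs = pairCountCPU;
        clSetKernelArg(calcExactCollision, 0, sizeof(cl_mem), &boxes);
        clSetKernelArg(calcExactCollision, 1, sizeof(cl_uint), &numBoxes);
        clSetKernelArg(calcExactCollision, 2, sizeof(cl_uint), &pairCountCPU);
        clEnqueueNDRangeKernel(queue, calcExactCollision, 1, nullptr, &globalPairs, nullptr, 0, nullptr, nullptr);
    }
}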
Setup:
commandBuffer.RecordDispatch (DetectPairs(), boxes, numBoxes, &pairCountGPUbuffer);
commandBuffer.MemoryBarrier (pairCountGPUbuffer);
commandBuffer.RecordIndirectDispatch (CalcExactCollision(), boxDataGPUbuffer, pairCountGPUbuffer);
Runtime:
Enqueue(commandBuffer);
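Spelled out with the actual Vulkan API, the setup part would look roughly like this (a minimal sketch assuming the pipelines, layout, descriptor set and buffers already exist; names are illustrative and error handling is omitted):

#include <vulkan/vulkan.h>

void recordCollisionCommands(VkCommandBuffer cmd,
                             VkPipeline detectPairsPipe, VkPipeline calcExactPipe,
                             VkPipelineLayout layout, VkDescriptorSet descSet,
                             VkBuffer pairCountGPUbuffer, // usage: STORAGE | INDIRECT
                             uint32_t numBoxGroups)
{
    VkCommandBufferBeginInfo begin = { VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO };
    vkBeginCommandBuffer(cmd, &begin);

    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layout, 0, 1, &descSet, 0, nullptr);

    // Pass 1: broad phase writes a VkDispatchIndirectCommand {x,y,z} into pairCountGPUbuffer.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, detectPairsPipe);
    vkCmdDispatch(cmd, numBoxGroups, 1, 1);

    // Make the shader-written workgroup count visible to the indirect dispatch.
    VkBufferMemoryBarrier barrier = { VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER };
    barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_INDIRECT_COMMAND_READ_BIT;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer = pairCountGPUbuffer;
    barrier.offset = 0;
    barrier.size = VK_WHOLE_SIZE;
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_DRAW_INDIRECT_BIT,
                         0, 0, nullptr, 1, &barrier, 0, nullptr);

    // Pass 2: narrow phase, workgroup count taken straight from GPU memory.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, calcExactPipe);
    vkCmdDispatchIndirect(cmd, pairCountGPUbuffer, 0);

    vkEndCommandBuffer(cmd);
    // Per frame the CPU then only submits the prerecorded command buffer via vkQueueSubmit.
}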
So we can implement a whole physics engine (or graphics engine) in one ‘draw call’, if we are able to do it with a constant program flow.
(But we don’t know if a GPU really can do this without some CPU<->GPU interaction under the hood).
NV already goes beyond that with device generated command buffers.
In my current work I have only one small download per frame, which I do with a blocking read. No other downloads or uploads are included in my profiling.
None of my enqueues do any waiting.
So the CL slowdown I see comes mainly from the need to enqueue everything each frame.
quote: Does it mean that VK can somehow execute patch of 50 requests in parallel?
No. I don’t use async compute yet. The need to use multiple queues and command buffers produces enough overhead to destroy the benefit for me; see https://community.amd.com/thread/209805
In theory the driver could figure out a dependency graph from a single command buffer and do async under the hood, but I’m pretty sure that does not happen.
quote: Or does it, again, mean that CL kernel code is compiled to 10% less efficient binary representation compared to VK shader code?
Yes. When I start working on a new kernel, CL is usually faster at first. But after optimizing it (still focusing on CL using CodeXL), VK ends up 10% faster on average (though there are also rare cases where CL wins).
Observed on AMD. A difference of only 10% means the vendor has good compilers. I remember OpenCL being two times faster than OpenGL compute shaders on Nvidia some years ago; I would have expected it the other way around.
However, the AMD OpenCL compiler can do really crazy bad things. VK seems better, but I can’t tell for sure because CodeXL does not support it yet.
quote: …and then look in CodeXL timeline, I see that there is no redundant waits – all kernels and R/W-s launch after each other immediately.
How do you do that?
I’m not aware of a timeline where I could see the idle time between enqueues. I can only see the time each kernel needs to run, but no start/end timestamps.
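If the tool does not expose them, such timestamps could at least be read manually from OpenCL events. A minimal sketch (assuming the queue was created with CL_QUEUE_PROFILING_ENABLE; names are illustrative):

#include <CL/cl.h>
#include <cstdio>

void profileKernel(cl_command_queue queue, cl_kernel kernel, size_t globalSize)
{
    cl_event evt;
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &globalSize, nullptr, 0, nullptr, &evt);
    clWaitForEvents(1, &evt);

    cl_ulong queued = 0, start = 0, end = 0;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, nullptr);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, nullptr);

    // Comparing 'end' of one kernel with 'start' of the next reveals idle gaps
    // between enqueues; (start - queued) shows how long the command sat in the queue.
    printf("queued->start: %llu ns, execution: %llu ns\n",
           (unsigned long long)(start - queued), (unsigned long long)(end - start));

    clReleaseEvent(evt);
}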