Which GPUs support Queue Out of Order, which GPUs support global thread count not...?

So I have been testing my OpenCL code on two different machines; one is newer and has an MSI Radeon R9 390X, and the other has a GeForce GT 610.

I am planning on upgrading the machine with the old GeForce GT 610 to a newer and more powerful graphics card. My questions, while browsing for graphics cards, are:
1) which cards support CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, and how do I find out whether a given card supports it?
and
2) which cards support a total global thread count that is not divisible by the local group size, and how do I find out whether a given card supports that?

I ask because my Radeon R9 lets me run kernel jobs where the global thread count is not divisible by the local group size. That is really nice, because with the OpenCL computations I am doing I have no control over whether the total number of work-items comes out divisible by a number greater than or equal to 32 (and things run slower on my Radeon if the local group size is smaller than that). However, it doesn’t support CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, which would be nice to have.

Meanwhile the GeForce GT 610 is the opposite: it supports out-of-order queues but requires that the global thread count be divisible by the local group size.

Are there any cards that support both?

AFAIK, current Radeon GPUs still do not support out-of-order command queues. What they do support is having multiple in-order command queues in flight. IIRC most AMD GPUs have two hardware queues under the hood, so using two in-order queues should let you cover some of the use cases of out-of-order queues, such as overlapping I/O and compute. You may want to play with this and see if it fits your needs.
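
To answer the "how do I find out" part: you can check this at runtime rather than from spec sheets, by querying the device's queue properties. Here is a minimal sketch (the names `device` and `context` are assumptions, and error handling is omitted):

```c
#include <CL/cl.h>

/* Sketch: query whether a device supports out-of-order queues, and fall
 * back to two in-order queues if it does not. `device` and `context`
 * are assumed to have been obtained elsewhere. */
void setup_queues(cl_device_id device, cl_context context)
{
    cl_command_queue_properties props = 0;
    clGetDeviceInfo(device, CL_DEVICE_QUEUE_PROPERTIES,
                    sizeof(props), &props, NULL);

    cl_int err;
    if (props & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) {
        /* The device supports out-of-order execution. */
        cl_command_queue q = clCreateCommandQueue(
            context, device, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
        /* ... enqueue independent work on q, synchronise with events ... */
        clReleaseCommandQueue(q);
    } else {
        /* Fallback: two independent in-order queues, e.g. one for
         * transfers and one for kernels, to overlap I/O and compute. */
        cl_command_queue io_q   = clCreateCommandQueue(context, device, 0, &err);
        cl_command_queue comp_q = clCreateCommandQueue(context, device, 0, &err);
        /* ... enqueue writes on io_q and kernels on comp_q,
         * linking them with events where needed ... */
        clReleaseCommandQueue(io_q);
        clReleaseCommandQueue(comp_q);
    }
}
```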

As for the total amount of global threads not being divisible by the local group size, this is not supported by OpenCL (see page 172 of the OpenCL 1.2 spec), and clEnqueueNDRangeKernel should fail with CL_INVALID_WORK_GROUP_SIZE when you do that (per page 173 of the spec). Even if you can find some non-conforming GPUs and drivers which allow this today, you should not expect this to continue to be true in the future: a simple driver update is allowed to break this behaviour while still conforming to the spec. So I do not think you should rely on this behaviour in your code.

A number of workarounds exist. The simplest one is to pad your inputs to a multiple of the local work size. Another is to compute a boolean mask at the beginning of your kernels, which is set to false if the global index is out of bounds and used to prevent any operation with side effects from occurring. Beware that this latter approach interacts subtly with memory barriers, as only NVidia’s OpenCL implementation will be forgiving of situations where not all threads in a work group reach a memory barrier. Testing on Intel’s OpenCL implementations, which crash or hang on improper memory barrier use, could help you catch invalid barrier usage.
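
For concreteness, here is a rough sketch of the padding approach combined with an in-kernel bounds check; the kernel name and the host-side variables are made up for illustration:

```c
/* ---- kernel.cl (hypothetical example kernel) ----
 * Every work-item checks whether it is one of the padding items
 * added by the host, and bails out before doing any real work. */
__kernel void scale(__global float *data, uint n)
{
    size_t gid = get_global_id(0);
    if (gid >= n)
        return;                 /* padding work-item: no-op */
    data[gid] *= 2.0f;
}

/* ---- host side ----
 * Round the global size up to the next multiple of the local size and
 * pass the real element count `n` as a kernel argument.
 * `queue`, `kernel`, `buf` and `n` are assumed to exist already. */
size_t local_size  = 64;        /* one full AMD wavefront */
size_t global_size = ((n + local_size - 1) / local_size) * local_size;

clSetKernelArg(kernel, 0, sizeof(cl_mem),  &buf);
clSetKernelArg(kernel, 1, sizeof(cl_uint), &n);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);
```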

By the way, AMD GPUs use warps/wavefronts/subgroups of 64 threads, so for best performance on this architecture you may want to use a local work size of at least 64.
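
If you would rather not hard-code 64, you can ask the runtime for the preferred multiple per kernel and per device. A small sketch, assuming `kernel` and `device` already exist:

```c
/* Query the preferred work-group size multiple for a compiled kernel.
 * On current AMD GPUs this typically reports 64 (one wavefront),
 * on NVIDIA 32 (one warp). */
size_t multiple = 0;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(multiple), &multiple, NULL);
printf("Preferred work-group size multiple: %zu\n", multiple);
```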

AFAIK, current Radeon GPUs still do not support out-of-order command queues.

I could swear they did at some point. They actually had (or still have) an optimization that allowed them to run independent kernels simultaneously even on in-order queues. It may be that they dropped the feature on purpose.

Even if you can find some non-conforming GPUs and drivers which allow this today,

A global work size that is not divisible by the local group size is legal in OpenCL 2.0 (not available on NVIDIA, though); you have to turn the feature off manually if you want to save a bit of performance.
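
For reference, this is controlled by the program build options in OpenCL 2.0. A small sketch, assuming `program` and `device` already exist:

```c
/* Build for OpenCL 2.0: non-uniform work-group sizes are allowed, so
 * the global size no longer has to be a multiple of the local size. */
clBuildProgram(program, 1, &device, "-cl-std=CL2.0", NULL, NULL);

/* Alternatively, opt back out: -cl-uniform-work-group-size reinstates
 * the old divisibility requirement, which can save a little performance. */
clBuildProgram(program, 1, &device,
               "-cl-std=CL2.0 -cl-uniform-work-group-size", NULL, NULL);
```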

Ah yes. I spent too long doing OpenCL 1.2, and started forgetting about the ways NVidia loves to make my life miserable.

I see, thanks for the help! I may come back later with more questions.

Hey guys, I have two more questions:

  1. What new high-end graphics cards have the best Linux support? Currently the machine is running CentOS 6. If it helps, we are looking to spend $600-$800 on a card that meets the following criteria: GDDR5X memory, 8 GB, and a core clock speed greater than 1.5 GHz.

  2. What does having two or more graphics cards connected via SLI or Crossfire mean for OpenCL code? Will they be logically treated as one device, basically now just able to run twice as many kernels at a time? Or could I give one card a different program to run when I want to?

GDDR5X memory, 8 GB, and a core clock speed greater than 1.5 GHz.

That’s a very specific set of requirements that only the GTX 1080 meets. :smiley: In that case you’re better off with CUDA, since NVIDIA’s OpenCL support is quite poor (their OpenGL support is almost flawless, though). On the AMD side there is the Fury X (a ton of compute and roughly 25% faster memory, but only 4 GB of it) and the RX 480, which you could buy two of for your money (GDDR5 and only half the cores of a Fury X, but it has hardware prefetch, which should help with low-occupancy kernels; I haven’t managed to benchmark this properly). AMD’s OpenGL performance is horribad, however. What is your use case? A Fury X will prevail in something like ray tracing, two RX 480s are better when your task scales across two GPUs, Pascal has dedicated half-precision hardware, and so on.

  2. What does having two or more graphics cards connected via SLI or Crossfire mean for OpenCL code? Will they be logically treated as one device, basically now just able to run twice as many kernels at a time? Or could I give one card a different program to run when I want to?

SLI/Crossfire does not merge the cards into a single OpenCL device; each GPU still shows up as its own cl_device_id. You should create a context with the two devices and a queue for each, and supply each device with work separately.
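
A rough sketch of that setup, assuming the two GPU device IDs have already been obtained from clGetDeviceIDs (error handling omitted):

```c
#include <CL/cl.h>

/* Sketch: one context spanning both GPUs, one in-order queue per GPU.
 * `devices[0]` and `devices[1]` are assumed to come from clGetDeviceIDs. */
void setup_two_gpus(cl_device_id devices[2])
{
    cl_int err;
    cl_context ctx = clCreateContext(NULL, 2, devices, NULL, NULL, &err);

    cl_command_queue q0 = clCreateCommandQueue(ctx, devices[0], 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, devices[1], 0, &err);

    /* Split the work yourself: enqueue one half of the problem on q0
     * and the other half on q1; the runtime will not do it for you. */

    clReleaseCommandQueue(q1);
    clReleaseCommandQueue(q0);
    clReleaseContext(ctx);
}
```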

That’s a very specific set of requirements that only the GTX 1080 meets. […] You should create a context with the two devices and a queue for each, and supply each device with work separately.

So it looks like we don’t have enough room for two graphics cards. In any case we would be using it for a wide range of computations, nothing very similar to ray tracing; more like the N-body problem (that is not what we are doing, but in terms of computation it is similar).

So I have another question, since you pointed out that I may have OpenCL problems: if I were to compare OpenCL running on a Radeon R9 390X to CUDA running on a GeForce GTX 1080, would the 1080 be significantly faster, or would they be about the same despite the 1080’s higher clock speed and memory bandwidth?

The 1080 can run circles around the 390X in any case: bigger core count, bigger cache, bigger register file. NVIDIA does not have problems with OpenCL per se, but should you choose OpenCL you will be flying blind: the only way to profile your kernels is to hack around with a command-line CUDA tool that, I believe, was deprecated a year ago, so it probably does not support Pascal at all.