OpenCL Dual Copy Engines

Before I waste more time trying to figure this out, I had a quick question.

In OpenCL, is it possible to pass input data, execute a kernel, and read output data back out at the same time?
There is some start-up time required (the first 2 input transfers plus the first kernel execution) before all three operations can run at the same time.
I’m currently trying to implement this using 3 queues.

I’m using a Quadro K5100M.

If you can give me any insight on whether this is possible or not, I would greatly appreciate it.
It would also be nice to know whether this is possible in CUDA.

Thank you,

BHa

Sure, this is one of the intended use cases, but NVIDIA used to serialize commands submitted from different command queues. I don’t know whether that has been fixed by now.

So I went ahead and tried it, and I was able to get it to work.
I’m running into some other issues regarding mapping/unmapping buffers but I’m sure I’ll get those resolved.

The method that I went with is inputting, running, and outputting one data set on one queue and just copying that process over multiple queues.

A simple text-based description (one row per queue, stages shifted in time):

Queue 0: [Input][Kernel][Output][Input][Kernel][Output]
Queue 1:        [Input][Kernel][Output][Input][Kernel][Output]
Queue 2:               [Input][Kernel][Output][Input][Kernel][Output]

The other method I attempted, where input is on queue 0, kernels are on queue 1, and output is on queue 2, has yet to work.

You can use queue 1 for both input and output, by the way. You only have one PCI-E bus, don’t you?

Yes, I only have 1 PCI-E bus.

But I want to overlap input/compute/output.
Using 1 queue for both input and output wouldn’t allow me to overlap the input and output data transfers, due to the way in-order queues work (AFAIK).

Also an interesting note, when the input and output are transferring at the same time, there is a slight slowdown in transfer speed.
Generally, I get 11.5 Gbps when only one transfer is occurring, but when two are occurring at the same time, I get speeds of roughly 10.5 Gbps (input) and 9.5 Gbps (output).
Overall, a speedup still exists but not quite 2X.

I have an open-source project you can try. It offers driver-controlled pipelining and event-controlled pipelining for separable kernels (upload, download, and compute can all overlap at the same time for all stages, per device), as well as device-to-device pipelining for non-separable kernels (that mode only overlaps host transfers with device computes; the computes are serialized with the PCI-E movements. I’ll upgrade it later so it overlaps everything, including PCI-E).

The driver-controlled one uses 16 queues, so you can have at least 16 blobs overlapped across the different stages (read, write, compute).

The event-controlled one uses 6 queues (a read queue, a write queue, and a compute queue, each duplicated).

The device-to-device pipeline uses only 1 queue per device, but overlaps array copies between devices (through RAM, for compatibility) with computation on all devices.

It’s written in C#, though. I was getting speedups with an HD 7870 but not with an R7 240; low-end cards aren’t given the ability to overlap computes with transfers.