Streaming data | transfer rate?

Semirke · July 5, 2014, 4:38am

Hello,

Im in the need to create a software that is able to move a continuous stream of data into the GPU for processing then fetch back the data (or what’s left of it). So it forwards the stream after processing.
Is there a way to do these transfers (never mind the computation now) without seriously increasing data latency? I need this to be done under 5 ms. (1ms preferred).

What transfer rates can be achived between host>GPU?
Which transfer method should I choose for this kind of transferes?

(Selected platfrom is AMD, however, NVidia is an option. Currently I can only test on nvidia.)

I found a bandwidth testing app (Nvidia OpenCL examples), that is configured to use pinned memory with mapped access and is blazing fast. However, Im still dont understand when the actual transfer happens:

        // MAPPED: mapped pointers to device buffer for conventional pointer access
        void* dm_idata = clEnqueueMapBuffer(cqCommandQueue, cmDevData, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, &ciErrNum);
        oclCheckError(ciErrNum, CL_SUCCESS);
        for(unsigned int i = 0; i < MEMCOPY_ITERATIONS; i++)
        {
            memcpy(h_data, dm_idata, memSize);
        }
        ciErrNum = clEnqueueUnmapMemObject(cqCommandQueue, cmDevData, dm_idata, 0, NULL, NULL);
        oclCheckError(ciErrNum, CL_SUCCESS);

This uses MEMCOPY_ITERATIONS to test the BW.
Does memcpy actually moving data between the host and GPU? (So it really does MEMCOPY_ITERATIONS transfer of h_data to GPU and it could be processed there?)
Or does the transfer happen when we unmap the mem object?

I can keep the queues opened and the memory allocated|mapped and keep reusing it, right?

Sorry for the noob questions, Im in the dark yet.

I really appreciate your help
Thank you in advance!

Bests,
Semirke

Dithermaster · July 5, 2014, 7:24am

OpenCL 1.x doesn’t have a continuous data streaming mode so you’ll need to chop up your data into blocks and upload them one by one, process them, and download results. On modern hardware you’ll be able to overlap these operations so the the upload for block N can happen at the same time as the compute for N-1 at the same time as the download for N-2. Unfortunately the technique for this is slightly different for each vendor. NVIDIA, for example, requires your host buffers to be in pinned memory (see their oclCopyComputeOverlap example). AMD can do it with clEnqueueMapBuffer / clEnqueueUnmapMemObject. On their APU chip this is a “zero-copy” operation (takes no time), and the same for Intel’s integrated GPU (e.g., HD Graphics 5000). For AMD and NVIDIA’s discrete GPUs there is also a technique where the buffer operations don’t take time but instead the data gets copied “on demand” as your kernel needs it. This works well if you are accessing in-order but not randomly.

Since you’re just getting started with OpenCL I’d suggest getting your code working first using a single command queue, using clEnqueueWriteBuffer / kernel / clEnqueueReadBuffer and then once that is working you can split this into 3 command queues, use events to add interdependencies, and pinned host buffers to enable async write/reads on NVIDIA, and see how much faster you can get it.

Watch Tom True’s “Topics in GPU-Based Video Processing” session from GTC; you’ll find lots of insight. Search | NVIDIA On-Demand

AMD has similar techniques and training, I just don’t have links to them right now.

Semirke · July 5, 2014, 7:59am

Hi,

thank you very much for your answer
I’ve come along with overlapped already on nvidia. I just was not sure if that is what I need.

My highest transfer rate is around 1.2GB/sec so it is a way under the bandwidth achiveable. So Im not even sure if I need overlapped IO. However, I still dont see if host<>GPU can achive 1k transfer / sec? (Cause I wont have more data then 1.2MB/ms any time. That is what I want to transfer asap. )
Ofc, if the GPU is able to do 1M transfer, I wont even need overlapped IO.

Thanks!

Semirke

OferRosenberg · July 7, 2014, 12:33am

[QUOTE=Semirke;30518]Hi,

thank you very much for your answer
I’ve come along with overlapped already on nvidia. I just was not sure if that is what I need.

My highest transfer rate is around 1.2GB/sec so it is a way under the bandwidth achiveable. So Im not even sure if I need overlapped IO. However, I still dont see if host<>GPU can achive 1k transfer / sec? (Cause I wont have more data then 1.2MB/ms any time. That is what I want to transfer asap. )
Ofc, if the GPU is able to do 1M transfer, I wont even need overlapped IO.

Thanks!

Semirke[/QUOTE]

On PCIe 3.0, using pinned memry, it is possible to get to 12GB/Sec transfer rate to the GPU (NVIDIA) using pinned memory. Regular memory reduces it to around 5GB/Sec.
Can you provide more details regarding your system configuration ? Integrated GPU (Intel/AMD) or Discrete GPU ? Which bus type is your GPU connected to ?

Semirke · July 7, 2014, 12:48am

Hi,
Thank you very much.
My question is not bandwidth. That is clearly high.
We are going to use Amd or Nvidia, discrete cards.

I’ll have a bunch of small transfers and I want to get a feeling of memory transfer overheads.

So I try to rephrase my questions: how many small transfers can I anticipate on this platform? I understand that it may be up to several GB/s, but if the overhead of filling then sending the buffers is high (for eg. 2-5 ms) then it will never reach the theoretical bandwidth with 8kbyte packets.

Thank you!
Bests,
Semirke