clEnqueueMapBuffer in discrete systems

Hi to everybody!
I’m running some benchmarks to compare discrete systems (GPU separate from the CPU) with APUs under various conditions/algorithms.
At the beginning, I thought that “APUs are cool” because:

  1. The latency/bandwidth of cross-domain (CPU<->GPU) accesses and data transfers are not architecturally limited by the PCI-e bus

  2. Data can be passed between the CPU and the GPU without any copy (if properly allocated in OpenCL)

Anyway, someone recently told me that mapped cross-domain accesses can also be performed on discrete systems over the PCI-e bus, avoiding data copies.
So now I’m not sure the second reason why “APUs are cool” is entirely right.
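To make the comparison concrete, here is a minimal sketch of what I mean by “properly allocated”: the runtime allocates host-visible memory with `CL_MEM_ALLOC_HOST_PTR` and the host writes through a mapped pointer instead of calling `clEnqueueWriteBuffer`. This assumes a context and queue already exist, and error handling is trimmed for brevity:

```c
/* Sketch: zero-copy-style allocation with CL_MEM_ALLOC_HOST_PTR.
   Assumes ctx and q were created elsewhere; error checks omitted. */
#include <CL/cl.h>
#include <string.h>

void fill_buffer_without_copy(cl_context ctx, cl_command_queue q,
                              const float *src, size_t n)
{
    cl_int err;
    /* Let the runtime allocate host-accessible memory it can share with
       the device (on an APU this is typically the same physical RAM). */
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                n * sizeof(float), NULL, &err);

    /* Map the buffer into the host address space and write through the
       returned pointer instead of enqueueing an explicit write. */
    float *p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                           0, n * sizeof(float),
                                           0, NULL, NULL, &err);
    memcpy(p, src, n * sizeof(float));

    /* Unmapping hands the data back to the device; on an APU no copy is
       needed, while on a discrete GPU the driver decides what happens
       (copy to VRAM, cache, or DMA over PCI-e at kernel runtime). */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
    clFinish(q);
    clReleaseMemObject(buf);
}
```

My question is essentially about what that final unmap/kernel-access step costs on a discrete card versus an APU.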

Can somebody help me to clarify this aspect?

Nobody? :frowning:

The behavior of buffer mapping on discrete GPUs is implementation-dependent. The specification only says that an implementation *may* cache buffers in GPU memory, so how much is cached, whether anything is cached at all, and how much is accessed at kernel runtime via DMA over the PCI-e bus is left unspecified.

I’ve been looking into this myself, but the actual behavior is not very clear. For example, the AMD profiler doesn’t report when or how memory is transferred to the device, so I can’t get timings for the host/GPU transfer speed. Kernels do tend to run slower on mapped (not copied) buffers, but since the memory-access method is unknown and not profiled, it’s impossible to tell whether mapping pays off for one-shot processing.
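One way around the profiler’s silence is to time the map command yourself with OpenCL events. This is a sketch under the assumption that the queue was created with `CL_QUEUE_PROFILING_ENABLE`; error handling is omitted:

```c
/* Sketch: timing a map operation with event profiling, since the
   vendor profiler doesn't report these transfers. Assumes the queue
   was created with CL_QUEUE_PROFILING_ENABLE; error checks omitted. */
#include <CL/cl.h>

double map_time_ms(cl_command_queue q, cl_mem buf, size_t bytes)
{
    cl_event ev;
    cl_int err;
    void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_READ,
                                 0, bytes, 0, NULL, &ev, &err);

    cl_ulong start, end;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(start), &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(end), &end, NULL);

    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
    clReleaseEvent(ev);
    return (end - start) * 1e-6; /* nanoseconds -> milliseconds */
}
```

The caveat is that on a true zero-copy path the map itself can be nearly free, with the real cost showing up as slower device-side accesses over PCI-e during the kernel, so kernel event times need to be compared as well before drawing conclusions.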