Cross-device bandwidth for discrete GPU (HD 5870)

Hi,
I’m testing a system equipped with a Fusion A8-3850 APU and an HD 5870 GPU. I’m planning to measure the memory access bandwidth in the following cases:

  1. The discrete GPU (HD 5870) reads from a buffer allocated in the host memory (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY)
  2. The integrated GPU (6550D) reads from a buffer allocated in the host memory (CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY)

Reads are performed linearly (each work-item reads a fixed-size memory range starting from its own global index); a sketch of such a kernel is shown below.
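
For concreteness, here is a minimal sketch of the kind of kernel I mean. It illustrates the access pattern only and is not the exact contents of memory_test.cl; the argument names are illustrative:

    __kernel void read_linear(__global const float4* src,
                              __global float4* dst,
                              int reads_per_thread)
    {
        int gid = get_global_id(0);
        int base = gid * reads_per_thread;   // each work-item owns one contiguous range
        float4 acc = (float4)(0.0f);
        for (int i = 0; i < reads_per_thread; i++)
            acc += src[base + i];            // linear reads within the range
        dst[gid] = acc;                      // store a result so the reads aren't optimized away
    }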

I assumed that the result of the first test (discrete GPU) could never exceed the PCI Express bandwidth (approx. 8 GB/s), but I’m getting a bandwidth of around 40 GB/s.
I’m checking the bandwidth with both the GlobalMemoryTest sample shipped with the AMD SDK and a program I wrote myself; the results are very similar.
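
For reference, both measurements boil down to timing the kernel with OpenCL profiling events on a queue created with CL_QUEUE_PROFILING_ENABLE. A minimal sketch of such a helper (my own illustration, not the SDK sample's code):

    #include <CL/cl.h>

    // Returns MB/s for one kernel run; bytesRead is the total number of
    // bytes the kernel reads. Assumes the queue has profiling enabled.
    double measure_bandwidth(cl_command_queue queue, cl_kernel kernel,
                             size_t globalSize, size_t bytesRead)
    {
        cl_event evt;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL,
                               0, NULL, &evt);
        clWaitForEvents(1, &evt);

        cl_ulong start, end;  // device timestamps in nanoseconds
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(evt);

        double seconds = (end - start) * 1e-9;
        return (bytesRead / (1024.0 * 1024.0)) / seconds;
    }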

Can you explain whether (and why) it is possible to get a cross-domain (GPU->CPU) read bandwidth higher than the PCIe bandwidth from a discrete GPU?

Thank you very much!

Can you explain whether (and why) it is possible to get a cross-domain (GPU->CPU) read bandwidth higher than the PCIe bandwidth from a discrete GPU?

CL_MEM_ALLOC_HOST_PTR doesn’t guarantee that the memory is allocated in any particular place. All it guarantees is that calls to clEnqueueMapBuffer() and clEnqueueMapImage() will not return CL_MAP_FAILURE.
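
In other words, the usage pattern the spec actually guarantees looks like the sketch below; where the allocation physically lives, and whether the runtime later migrates it, is up to the implementation (names here are illustrative):

    #include <CL/cl.h>
    #include <string.h>

    // Create a kernel-read-only buffer with CL_MEM_ALLOC_HOST_PTR, then map
    // it so the host can fill it. The map is guaranteed to succeed; the
    // physical placement of the allocation is not specified.
    cl_mem make_filled_buffer(cl_context ctx, cl_command_queue queue,
                              const void* data, size_t size)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                                    size, NULL, &err);
        void* ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, size, 0, NULL, NULL, &err);
        memcpy(ptr, data, size);              // host writes through the mapping
        clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
        // After unmap, the runtime may move the buffer into device memory
        // before a kernel uses it, so kernel reads can run at on-device speed.
        return buf;
    }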

Oh :oops:
Over the last hour I set up a small, self-contained test for the problem I encountered.

Here is the link to the source code:
Host code: http://www.gabrielecocco.it/fusion/SimpleMemoryTest.cpp
Kernel: http://www.gabrielecocco.it/fusion/memory_test.cl

And here is the output of the test (150 GB/s for the 5870, 42 GB/s for the 6550D, 14 GB/s for the CPU):
C:\Users\gabriele\Desktop\CpuGpuTesting\Release>SimpleMemoryTest.exe

  • Tested devices listed below
    Cypress[GPU]
    BeaverCreek[GPU]
    AMD A8-3800 APU with Radeon™ HD Graphics[CPU]

  • Creating opencl environment for each tested device…
    Getting platform id… DONE!
    Searching device (Cypress)… DONE!
    Creating context… DONE!
    Creating command queue… DONE!
    Loading kernel file… DONE!
    Creating program with source… DONE!
    Building program… DONE!
    Creating kernel read_linear DONE!

    Getting platform id… DONE!
    Searching device (BeaverCreek)… DONE!
    Creating context… DONE!
    Creating command queue… DONE!
    Loading kernel file… DONE!
    Creating program with source… DONE!
    Building program… DONE!
    Creating kernel read_linear DONE!

    Getting platform id… DONE!
    Searching device (AMD A8-3800 APU with Radeon™ HD Graphics)… DONE!
    Creating context… DONE!
    Creating command queue… DONE!
    Loading kernel file… DONE!
    Creating program with source… DONE!
    Building program… DONE!
    Creating kernel read_linear DONE!

  • Testing Cypress [GPU] (16777216 bytes buffer, 32 reads per thread)
    Estimated bandwidth: 151460.05 MB/s (success = 1)

  • Testing BeaverCreek [GPU] (16777216 bytes buffer, 32 reads per thread)
    Estimated bandwidth: 42080.92 MB/s (success = 1)

  • Testing AMD A8-3800 APU with Radeon™ HD Graphics [CPU] (16777216 bytes buffer, 32 reads per thread)
    Estimated bandwidth: 14809.57 MB/s (success = 1)

  • Test ended. Press a key to exit…


So, should I conclude that the buffer is placed in GPU memory even though I specify the flags CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY?
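
If it helps, one way to cross-check would be to time an explicit device-to-host read of the same buffer and compare it with the kernel's apparent bandwidth; a transfer that really crosses PCIe should land near the bus limit. A sketch (illustrative names, assuming a queue created with CL_QUEUE_PROFILING_ENABLE):

    #include <CL/cl.h>
    #include <stdlib.h>

    // Times one blocking device-to-host read and returns GB/s.
    double transfer_gbps(cl_command_queue queue, cl_mem buf, size_t size)
    {
        void* host = malloc(size);
        cl_event evt;
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host,
                            0, NULL, &evt);

        cl_ulong start, end;  // nanoseconds
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        clReleaseEvent(evt);
        free(host);
        return (double)size / (double)(end - start);  // bytes per ns == GB/s
    }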

Thank you for your help!!!