I work on an array of 16,777,216 floats, stored in a device (non-host) memory buffer.
- Each thread writes one float, so each warp (32 threads) writes 128 bytes (that’s the best case for NVIDIA GPUs of compute capability 1.3).
I use the OpenCL profiler.
=> My kernel executes in 0.955 ms, so the bandwidth is 65.462 GB/s.
I don’t know if we can say the GTX 275 really has a 118 GB/s bandwidth… :?
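For reference, the arithmetic behind these numbers can be sketched as below. This is only a rough check: it assumes the kernel moves exactly one 4-byte float per element and nothing else, and the function name `effective_bandwidth` is made up for illustration. With the rounded 0.955 ms timing it gives about 70.3 GB/s in decimal units and about 65.4 GiB/s in binary units, so the figure quoted above seems to be in binary gigabytes. The 118 GB/s peak is simply the value quoted in this post, not a verified specification.

```python
def effective_bandwidth(num_elements, bytes_per_element, seconds):
    """Return (GB/s, GiB/s) for a kernel that moves each element exactly once."""
    total_bytes = num_elements * bytes_per_element
    bytes_per_second = total_bytes / seconds
    # Decimal gigabytes (1e9 bytes) and binary gibibytes (2**30 bytes).
    return bytes_per_second / 1e9, bytes_per_second / 2**30


# Numbers from the post: 16,777,216 floats written in 0.955 ms.
gb_s, gib_s = effective_bandwidth(16_777_216, 4, 0.955e-3)

# Peak figure as quoted in the post (assumed to be in the same binary units).
quoted_peak = 118.0
fraction_of_peak = gib_s / quoted_peak

print(f"{gb_s:.2f} GB/s, {gib_s:.2f} GiB/s, {fraction_of_peak:.0%} of quoted peak")
```

If the kernel also read its input from global memory, the bytes moved (and so the effective bandwidth) would be correspondingly higher.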
Perhaps it’s a limitation of the OpenCL implementation? Has anyone tested the same sample with CUDA?
Run your kernel once to warm up the card, then average your results over a few dozen/hundred runs. You can easily get very strange results with both the first kernel execution and any individual execution.
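A minimal sketch of that warm-up-then-average pattern, in host-side Python. The `run_kernel` callable is hypothetical: it stands in for whatever launches the kernel and blocks until it has finished (for example by calling `clFinish` on the queue). If you time with profiling events instead, you would collect the event durations in place of the wall-clock samples here.

```python
import statistics
import time


def benchmark(run_kernel, warmup_runs=1, timed_runs=100):
    """Run warm-up launches, then return (mean, stdev) of timed runs in seconds."""
    if timed_runs < 2:
        raise ValueError("need at least two timed runs to compute a spread")

    # Warm-up: results are discarded so one-off start-up costs do not skew the mean.
    for _ in range(warmup_runs):
        run_kernel()

    samples = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        run_kernel()  # assumed to block until the kernel has completed
        samples.append(time.perf_counter() - start)

    return statistics.mean(samples), statistics.stdev(samples)
```

A large standard deviation relative to the mean is a hint that individual runs are noisy and that a single measurement should not be trusted.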
Use the vload functions (for example vload4) to load a larger chunk of data per thread at once. On some architectures this can make a big difference; on others it may not.
Also, the NVIDIA OpenCL drivers are apparently not as mature as the CUDA ones (not surprising, given that they are far newer), so you probably won’t get the same performance in some areas.