Slow JavaCL CLBuffer map/unmap operations

First, a note: I posted this in the JavaCL forum a while ago, but haven’t received a reply yet.

My problem is as follows. I am working on a multi-GPU application that is a very simple benchmark. One of the tests measures the combined host-to-device memory bandwidth using different transfer methods. If I use the normal CLBuffer write method with blocking writes, corresponding to clEnqueueWriteBuffer, I can achieve a maximum of 5 GB/s. If I use the map and unmap methods of CLBuffer instead, I cannot get past 1.9 GB/s. Where am I going wrong?

My test method: I launch a separate thread for each GPU. Each thread repeatedly maps a buffer, copies data to it, and then unmaps the buffer. This is repeated 1000 times. System.currentTimeMillis() is used to record the start and end times in milliseconds. All buffers are allocated outside of this loop.
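For reference, each per-GPU thread looks roughly like this (a minimal sketch against the BridJ-based JavaCL API; the surrounding context/queue setup, buffer sizes, and usage flags are my assumptions, and exact method signatures may differ between JavaCL versions):

```java
import org.bridj.Pointer;
import com.nativelibs4java.opencl.*;

public class MapUnmapBench implements Runnable {
    private final CLQueue queue;
    private final CLBuffer<Byte> deviceBuf; // allocated once, outside the loop
    private final Pointer<Byte> hostData;   // payload to upload each iteration
    private static final int ITERATIONS = 1000;

    MapUnmapBench(CLQueue queue, CLBuffer<Byte> deviceBuf, Pointer<Byte> hostData) {
        this.queue = queue;
        this.deviceBuf = deviceBuf;
        this.hostData = hostData;
    }

    @Override
    public void run() {
        long start = System.currentTimeMillis();
        for (int i = 0; i < ITERATIONS; i++) {
            // Map the device buffer into host address space (blocking)...
            Pointer<Byte> mapped = deviceBuf.map(queue, CLMem.MapFlags.Write);
            // ...copy the payload into the mapped region...
            hostData.copyTo(mapped);
            // ...and hand the region back to the driver.
            deviceBuf.unmap(queue, mapped);
        }
        queue.finish(); // ensure all queued work is done before stopping the clock
        long elapsedMs = System.currentTimeMillis() - start;
        double gbPerSec = (double) deviceBuf.getByteCount() * ITERATIONS
                / (elapsedMs / 1000.0) / 1e9;
        System.out.printf("%.2f GB/s%n", gbPerSec);
    }
}
```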

My computer: Nvidia GeForce GTX 560 Ti and GTX 260, Intel Core i7 2600k, Asus P8P67 Pro motherboard.

Does anyone have some suggestions? Thanks very much for the help.

BTW, Nvidia’s oclBandwidthTest reports 14GB/s to both cards, and 7GB/s to either card individually.

These could quite possibly be reasonable values, although it depends entirely on the implementation and the hardware capabilities.

map/unmap may very well resort to copying data if, for example, you’ve created the buffer with USE_HOST_PTR, or for other resource-, operating-system-, or hardware-related reasons. You would then have to add the copying overhead plus the MMU/page-table overheads, and so on. A buffer mapped for read-write access might also need data copied into it first (which would not happen with writeBuffer()), etc.

oclBandwidthTest will probably be trying to hide the latencies of these map/unmap and queue overheads, so its numbers will by definition always be better than a measurement that includes them.

The vendor documentation should list the conditions under which mapping works most efficiently; alternatively, look at the various benchmark samples to see how they do it.
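For what it’s worth, one pattern the vendor bandwidth samples commonly use is to map a staging buffer once, to obtain a (potentially pinned) host pointer, and then reuse that pointer as the source of ordinary blocking writes, rather than mapping and unmapping on every iteration. A hedged JavaCL sketch of that idea (method names assumed from the BridJ-based API; whether the mapped region is actually pinned depends on the driver):

```java
import org.bridj.Pointer;
import com.nativelibs4java.opencl.*;

public class PinnedStyleUpload {
    // Sketch only: context/queue creation, payload filling, and error handling omitted.
    static void upload(CLContext context, CLQueue queue, long bytes, int iterations) {
        // Staging buffer: mapped once and kept mapped for the whole run.
        CLBuffer<Byte> staging = context.createByteBuffer(CLMem.Usage.Input, bytes);
        Pointer<Byte> hostPtr = staging.map(queue, CLMem.MapFlags.Write);
        // ... fill hostPtr with the payload here ...

        CLBuffer<Byte> deviceBuf = context.createByteBuffer(CLMem.Usage.Input, bytes);
        long start = System.currentTimeMillis();
        for (int i = 0; i < iterations; i++) {
            // Blocking write straight from the long-lived mapped region.
            deviceBuf.write(queue, hostPtr, true);
        }
        long elapsedMs = System.currentTimeMillis() - start;
        System.out.printf("%.2f GB/s%n",
                (double) bytes * iterations / (elapsedMs / 1000.0) / 1e9);
        staging.unmap(queue, hostPtr);
    }
}
```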