Performance of clEnqueueReadBuffer on different HW systems

This is a very generic question, so pardon me for that. The intention is to find out whether anyone else has experienced an issue like this.

The event timing for clEnqueueReadBuffer in an OpenCL app running on 64-bit Windows 7 differs between two hardware systems. On one system we see spikes where copying data from the GPU to the host takes a long time, while on the other the performance is good and fairly consistent.

The slow system is a Supermicro chassis with an X9DRG-QF motherboard.

Is there a better way to benchmark or troubleshoot this? We modified the AMD APP SDK to timestamp the clEnqueueReadBuffer event and use that as our tool, but we would prefer an existing third-party tool to validate and troubleshoot.
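For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of one way to timestamp reads and summarize the spikes. It is not the code we actually run. The names `time_read_ns`, `summarize_reads`, `read_stats`, the `USE_OPENCL` build switch and the spike factor are all made up for illustration. The OpenCL part assumes the queue was created with CL_QUEUE_PROFILING_ENABLE and is only compiled when `USE_OPENCL` is defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#ifdef USE_OPENCL   /* hypothetical switch: define when building against an OpenCL SDK */
#include <CL/cl.h>

/* Time one blocking read with event profiling. Returns the START-to-END
 * interval in nanoseconds, or 0 on error. */
uint64_t time_read_ns(cl_command_queue q, cl_mem buf, void *dst, size_t bytes)
{
    cl_event ev;
    cl_ulong t0 = 0, t1 = 0;
    cl_int err;

    if (clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, dst, 0, NULL, &ev) != CL_SUCCESS)
        return 0;
    err = clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
    if (err == CL_SUCCESS)
        err = clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof t1, &t1, NULL);
    clReleaseEvent(ev);
    return err == CL_SUCCESS ? (uint64_t)(t1 - t0) : 0;
}
#endif

/* Summary of a series of read durations (illustrative type). */
typedef struct {
    uint64_t min_ns, median_ns, max_ns;
    size_t spikes;              /* samples above spike_factor * median */
} read_stats;

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Sorts ns[] in place. For an even sample count the upper median is used. */
read_stats summarize_reads(uint64_t *ns, size_t n, double spike_factor)
{
    read_stats s;
    size_t i;

    assert(n > 0);
    qsort(ns, n, sizeof ns[0], cmp_u64);
    s.min_ns = ns[0];
    s.median_ns = ns[n / 2];
    s.max_ns = ns[n - 1];
    s.spikes = 0;
    for (i = 0; i < n; ++i)
        if ((double)ns[i] > spike_factor * (double)s.median_ns)
            ++s.spikes;
    return s;
}
```

Collecting a few thousand samples per system and comparing the median against the maximum and the spike count shows whether the slow system is slow on every read or only on occasional ones. Querying CL_PROFILING_COMMAND_QUEUED and CL_PROFILING_COMMAND_SUBMIT as well may help separate queueing delay from the transfer itself.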

Note that we are not interested in kernel execution time, as that is the same on both of our hardware configurations.

Thanks for reading the post.

You don’t mention the configuration of the system that is fast. The first thing that comes to mind is to ask what other software is running in the background on the Supermicro system.

The other possibility, though I don’t have the money to have hands-on experience with this, is that the slow copies have to move data from GPU1 to CPU1 (assuming that GPU1 connects to CPU1) and then on to CPU2, because the thread that requested the data from GPU1 is running there. I believe this is called NUMA (non-uniform memory access). A thread running on CPU1 and reading from GPU1 would see decent bandwidth, while a thread running on CPU2 but accessing GPU1 would see worse bandwidth because of the extra hop from CPU1 to CPU2.
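One way to test the NUMA theory would be to pin the thread that issues the reads to one node, run the benchmark, then pin it to the other node and compare. The following is only a sketch that I have not run on a dual-socket machine. `usable_mask` and `pin_current_thread_to_node` are names I made up. The Win32 calls are real, but the code assumes the machine has at most 64 logical processors (a single processor group), and which node number sits next to which GPU is something you would have to find out by experiment.

```c
#include <assert.h>
#include <stdint.h>

#ifdef _WIN32
#include <windows.h>
#endif

/* A thread affinity mask must be a subset of the process affinity mask,
 * so keep only the node's CPUs that the process is allowed to use. */
uint64_t usable_mask(uint64_t node_mask, uint64_t process_mask)
{
    return node_mask & process_mask;
}

/* Pin the calling thread to the CPUs of NUMA node `node`.
 * Returns 0 on success, -1 on failure or on a non-Windows build. */
int pin_current_thread_to_node(unsigned node)
{
    assert(node <= 0xFF);   /* the Win32 API takes the node number as a UCHAR */
#ifdef _WIN32
    {
        ULONGLONG node_mask = 0;
        DWORD_PTR proc_mask = 0, sys_mask = 0;
        uint64_t mask;

        if (!GetNumaNodeProcessorMask((UCHAR)node, &node_mask))
            return -1;
        if (!GetProcessAffinityMask(GetCurrentProcess(), &proc_mask, &sys_mask))
            return -1;
        mask = usable_mask((uint64_t)node_mask, (uint64_t)proc_mask);
        if (mask == 0)
            return -1;      /* no CPU of that node is available to this process */
        return SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)mask) ? 0 : -1;
    }
#else
    (void)node;
    return -1;
#endif
}
```

Call `pin_current_thread_to_node(0)` at the start of the thread that calls clEnqueueReadBuffer, collect the timings, then repeat with node 1. If the spikes follow the node, the extra hop is a likely suspect. If they don't, background software becomes the more plausible cause.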