This is a fairly generic question, so pardon me for that. The intention is to find out whether anyone else has experienced an issue like this.

The event timing for clEnqueueReadBuffer in an OpenCL app running on 64-bit Windows 7 differs between two hardware systems. On one system we see intermittent spikes in the time taken to copy data from the GPU to the host, while on the other system the performance is consistently good.

The slow system is a Supermicro chassis with an X9DRG-QF motherboard.

Is there a better way to benchmark or troubleshoot this? We modified the AMD APP SDK to timestamp the clEnqueueReadBuffer event and use that as our tool, but we would prefer an existing third-party tool to validate and troubleshoot.
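For reference, here is a rough sketch of the kind of post-processing we have in mind for the timestamps. It assumes the four values per read come from clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, SUBMIT, START and END, in nanoseconds) on a queue created with CL_QUEUE_PROFILING_ENABLE. The struct and function names are made up for illustration, and treating anything above some multiple of the median as a "spike" is just an assumption, not an established rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for one read's profiling timestamps (nanoseconds). */
typedef struct {
    uint64_t queued, submit, start, end;
} read_timestamps;

/* Per-phase durations in microseconds. */
typedef struct {
    double queue_wait_us;   /* QUEUED -> SUBMIT: time spent in the host-side queue */
    double submit_wait_us;  /* SUBMIT -> START: time waiting on the device */
    double transfer_us;     /* START  -> END: the actual copy */
} read_breakdown;

/* Split one read into its phases so a spike can be attributed to a phase. */
read_breakdown split_read_time(read_timestamps t) {
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);
    read_breakdown b;
    b.queue_wait_us  = (double)(t.submit - t.queued) / 1000.0;
    b.submit_wait_us = (double)(t.start  - t.submit) / 1000.0;
    b.transfer_us    = (double)(t.end    - t.start)  / 1000.0;
    return b;
}

/* Comparison callback for qsort on doubles. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; works on a sorted copy so the input is left untouched. */
double median_us(const double *samples, size_t n) {
    assert(n > 0);
    double *copy = malloc(n * sizeof *copy);
    assert(copy != NULL);
    memcpy(copy, samples, n * sizeof *copy);
    qsort(copy, n, sizeof *copy, cmp_double);
    double m = (n % 2) ? copy[n / 2]
                       : 0.5 * (copy[n / 2 - 1] + copy[n / 2]);
    free(copy);
    return m;
}

/* Count samples exceeding factor * median (assumed spike criterion). */
size_t count_spikes(const double *samples, size_t n, double factor) {
    double threshold = factor * median_us(samples, n);
    size_t spikes = 0;
    for (size_t i = 0; i < n; ++i) {
        if (samples[i] > threshold) {
            ++spikes;
        }
    }
    return spikes;
}
```

Running the same read size many times on both machines and comparing the three phases separately should at least show whether the spikes sit in the START to END interval (the copy itself) or in the waiting time before it.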

Note that we are not interested in kernel execution time, as it is the same on both of our hardware configurations.

Thanks for reading the post.