Memory test: CPU timers vs. GPU timers

Hi,
I’m performing some memory tests on a PC (CPU + discrete GPU) and on an APU.
In particular, my test consists of writing Y bytes X times to measure the completion time and the average bandwidth. I do this for every possible allocation strategy for the source and destination buffers.

I tried to compare the results obtained with CPU timers (on Windows, QueryPerformanceCounter) to those obtained with GPU timers, so far only on the PC with the discrete GPU.
The difference between the two measurements is so huge that I’m sure I made some mistake.
Here is an example:


Testing transfer of 1024 bytes 16 times...

Unpinned -> Unpinned
CPU timer: 6259.02 Mbytes/s (total time: 0.02 ms)
GPU timer: 6368.19 Mbytes/s (total time: 0.02 ms)
Unpinned -> Device
CPU timer: 4.10 Mbytes/s (total time: 4.65 ms)
GPU timer: 13611.35 Mbytes/s (total time: 0.03 ms)
Pinned   -> Unpinned
CPU timer: 3.94 Mbytes/s (total time: 4.33 ms)
GPU timer: 7492.48 Mbytes/s (total time: 0.03 ms)
Pinned   -> Pinned
CPU timer: 3.73 Mbytes/s (total time: 5.23 ms)
GPU timer: 9359.60 Mbytes/s (total time: 0.04 ms)
Pinned   -> Device
CPU timer: 3.30 Mbytes/s (total time: 5.64 ms)
GPU timer: 12743.39 Mbytes/s (total time: 0.03 ms)
Device   -> Unpinned
CPU timer: 4.70 Mbytes/s (total time: 3.69 ms)
GPU timer: 11231.09 Mbytes/s (total time: 0.03 ms)
Device   -> Pinned
CPU timer: 4.37 Mbytes/s (total time: 3.79 ms)
GPU timer: 8819.22 Mbytes/s (total time: 0.04 ms)
Device   -> Device
CPU timer: 7.78 Mbytes/s (total time: 2.15 ms)
GPU timer: 8876.45 Mbytes/s (total time: 0.04 ms)

I really need some help finding the mistake, or an explanation of why I get such different results.

Below is the piece of code where I compute the completion time for a transfer from a pinned source buffer to a destination buffer allocated on the device. The other cases are very similar.

Some hints regarding the code:

  1. DATATYPE is a macro currently set to “int”
  2. The Timer struct comes from a utility library. I include its code at the end of the post, just in case the mistake is in there

//profile with gpu timer
if(!gpu_timer) {
	timer.start();
	src_pointer = (DATATYPE*)clEnqueueMapBuffer(queue, src, CL_FALSE, CL_MAP_READ, 0, size * sizeof(DATATYPE), 0, NULL, NULL, NULL);
	for(int i = 0; i < NUM_TRANSF; i++) 
		clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, NULL);
	clFinish(queue);
	time = timer.get();
}		
//profile with cpu timer					
else {
	src_pointer = (DATATYPE*)clEnqueueMapBuffer(queue, src, CL_FALSE, CL_MAP_READ, 0, size * sizeof(DATATYPE), 0, NULL, &transfer_event, NULL);
	clWaitForEvents(1, &transfer_event);
	clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &start, 0);					
	for(int i = 0; i < NUM_TRANSF; i++)
		clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, &transfer_event);
	clFinish(queue);	
	clWaitForEvents(1, &transfer_event);
	clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, 0);
	time = (double)1.0e-9 * (end - start);
}


double bandwidth = ((double)(NUM_TRANSF * size * sizeof(DATATYPE)) / (double)time) * 1000.0 / 1000000.0;
result.total_time = time + alloc_time;
result.bandwidth = bandwidth;

The code of Timer:


typedef struct Timer {
	LARGE_INTEGER frequency;
	LARGE_INTEGER start_time;
	void start() {
		QueryPerformanceFrequency(&frequency);	
		QueryPerformanceCounter(&start_time);
	}
	double get() {
		LARGE_INTEGER end;
		QueryPerformanceCounter(&end);
		double elapsedTime = (end.QuadPart - start_time.QuadPart) * 1000.0 / frequency.QuadPart;
		return elapsedTime;
	}
} Timer;

Thank you very much!

Additional info…

For 16 transfers of 16 Mbytes each, I get:


Pinned   -> Device
CPU timer: 3317.04 Mbytes/s (total time: 81.06 ms)
GPU timer: 3837185.08 Mbytes/s (total time: 0.10 ms)

i.e. more than 3 Tbytes/s of bandwidth, which is practically impossible, especially for a host->device transfer, which should be limited by the PCIe bandwidth.

I would like to know the cause of this as well. Performance measurements are always important and this makes me question some of my own measurements.

I noticed that you copy from a memory-mapped buffer into another buffer. Are both of these buffers on the same device? What happens if you replace the mapped src_pointer with a normal, malloc’ed host pointer?
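
Roughly something like this (just a sketch; I am reusing your queue, dst, size, DATATYPE, NUM_TRANSF, timer and time, and host_ptr is a name I made up):

/* needs #include <stdlib.h> and <string.h> */
DATATYPE *host_ptr = (DATATYPE*)malloc(size * sizeof(DATATYPE));
memset(host_ptr, 0, size * sizeof(DATATYPE));	/* touch the pages before timing */

timer.start();
for(int i = 0; i < NUM_TRANSF; i++)
	clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), host_ptr, 0, NULL, NULL);
clFinish(queue);	/* wait for all transfers before reading the timer */
time = timer.get();

free(host_ptr);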

What happens if you wrap the entire if(!gpu_timer)-else statement with some other timing mechanism, such as GetSystemTimeAsFileTime() or gettimeofday()?
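
I mean something along these lines (a sketch; filetime_now is a helper name of my own, and the system time is fairly coarse, so treat it only as a sanity check of the millisecond-scale numbers):

/* needs #include <windows.h>; returns the system time in 100-ns ticks */
static ULONGLONG filetime_now(void) {
	FILETIME ft;
	ULARGE_INTEGER u;
	GetSystemTimeAsFileTime(&ft);
	u.LowPart  = ft.dwLowDateTime;
	u.HighPart = ft.dwHighDateTime;
	return u.QuadPart;
}

/* inside the measurement code, around the whole if/else: */
ULONGLONG t0 = filetime_now();
/* ...the entire if(!gpu_timer) / else statement from your post... */
ULONGLONG t1 = filetime_now();
double wall_ms = (double)(t1 - t0) / 10000.0;	/* 100-ns ticks -> milliseconds */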

In the else part, where you use OpenCL events, you overwrite the same event multiple times in the for-loop. Is this safe?
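
If it were my code, I would keep one event per transfer and release them explicitly, something like the sketch below. It reuses your queue, dst, size, DATATYPE, NUM_TRANSF and src_pointer, assumes the queue was created with CL_QUEUE_PROFILING_ENABLE, and omits error checking; profiling from the first write’s COMMAND_QUEUED to the last write’s COMMAND_END (rather than from the map event) is my own choice:

cl_event first_ev = NULL, last_ev = NULL;
for(int i = 0; i < NUM_TRANSF; i++) {
	cl_event ev;
	clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, &ev);
	if(i == 0)
		first_ev = ev;			/* keep the first event for COMMAND_QUEUED */
	if(i == NUM_TRANSF - 1)
		last_ev = ev;			/* keep the last event for COMMAND_END */
	if(i != 0 && i != NUM_TRANSF - 1)
		clReleaseEvent(ev);		/* release intermediate events right away */
}
clFinish(queue);

cl_ulong queued, end;
clGetEventProfilingInfo(first_ev, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued, NULL);
clGetEventProfilingInfo(last_ev, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
double gpu_ms = (double)(end - queued) * 1.0e-6;	/* profiling values are in nanoseconds */

clReleaseEvent(first_ev);
if(last_ev != first_ev)
	clReleaseEvent(last_ev);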

As a side note, I find the ‘profile with gpu timer’ and ‘profile with cpu timer’ comments confusing: the if(!gpu_timer) branch actually uses the QueryPerformanceCounter-based Timer, while the else branch uses OpenCL event profiling.

I found the problem. It was simply due to an erroneous conversion between nanoseconds and milliseconds. So sorry, but it was very late :slight_smile:
Now the GPU and CPU timers give me very similar results (the GPU timer measurements are about 10% lower than those given by the CPU timer).
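
For anyone who runs into the same thing: the event profiling timestamps are in nanoseconds while Timer::get() returns milliseconds, so the corrected conversion in the event branch is essentially:

/* cl_ulong profiling timestamps are nanoseconds; Timer::get() returns
   milliseconds, so the factor has to be 1.0e-6 (1.0e-9 gives seconds and
   inflates the computed bandwidth by a factor of 1000) */
time = (double)(end - start) * 1.0e-6;	/* ns -> ms */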