Memory test: CPU timers vs. GPU timers

Hi,
I’m performing some memory tests on a PC (CPU + discrete GPU) and on an APU.
In particular, my test consists of writing Y bytes X times to measure the completion time and the average bandwidth. I do this for every possible allocation strategy for the source and destination buffers.

I tried to compare the results obtained with CPU timers (on Windows, QueryPerformanceCounter) to those obtained with GPU timers, so far only on the PC with the discrete GPU.
The difference between the two measurements is so huge that I’m sure I made some mistake.
Here is an example:


Testing transfer of 1024 bytes 16 times...

Unpinned -> Unpinned
CPU timer: 6259.02 Mbytes/s (total time: 0.02 ms)
GPU timer: 6368.19 Mbytes/s (total time: 0.02 ms)
Unpinned -> Device
CPU timer: 4.10 Mbytes/s (total time: 4.65 ms)
GPU timer: 13611.35 Mbytes/s (total time: 0.03 ms)
Pinned   -> Unpinned
CPU timer: 3.94 Mbytes/s (total time: 4.33 ms)
GPU timer: 7492.48 Mbytes/s (total time: 0.03 ms)
Pinned   -> Pinned
CPU timer: 3.73 Mbytes/s (total time: 5.23 ms)
GPU timer: 9359.60 Mbytes/s (total time: 0.04 ms)
Pinned   -> Device
CPU timer: 3.30 Mbytes/s (total time: 5.64 ms)
GPU timer: 12743.39 Mbytes/s (total time: 0.03 ms)
Device   -> Unpinned
CPU timer: 4.70 Mbytes/s (total time: 3.69 ms)
GPU timer: 11231.09 Mbytes/s (total time: 0.03 ms)
Device   -> Pinned
CPU timer: 4.37 Mbytes/s (total time: 3.79 ms)
GPU timer: 8819.22 Mbytes/s (total time: 0.04 ms)
Device   -> Device
CPU timer: 7.78 Mbytes/s (total time: 2.15 ms)
GPU timer: 8876.45 Mbytes/s (total time: 0.04 ms)

I really need some help finding the mistake, or an explanation of why I get such different results.

Below is the piece of code where I compute the completion time for a transfer from a pinned source buffer to a destination buffer allocated on the device. The other cases are very similar.

Some hints regarding the code:

  1. DATATYPE is a macro currently set to “int”
  2. The Timer struct comes from a utility library. I include its code at the end of the post, just in case the mistake is in there

//profile with gpu timer
if(!gpu_timer) {
	timer.start();
	src_pointer = (DATATYPE*)clEnqueueMapBuffer(queue, src, CL_FALSE, CL_MAP_READ, 0, size * sizeof(DATATYPE), 0, NULL, NULL, NULL);
	for(int i = 0; i < NUM_TRANSF; i++) 
		clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, NULL);
	clFinish(queue);
	time = timer.get();
}		
//profile with cpu timer					
else {
	src_pointer = (DATATYPE*)clEnqueueMapBuffer(queue, src, CL_FALSE, CL_MAP_READ, 0, size * sizeof(DATATYPE), 0, NULL, &transfer_event, NULL);
	clWaitForEvents(1, &transfer_event);
	clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &start, 0);					
	for(int i = 0; i < NUM_TRANSF; i++)
		clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, &transfer_event);
	clFinish(queue);	
	clWaitForEvents(1, &transfer_event);
	clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, 0);
	time = (double)1.0e-9 * (end - start);
}


double bandwidth = ((double)(NUM_TRANSF * size * sizeof(DATATYPE)) / (double)time) * 1000.0 / 1000000.0;
result.total_time = time + alloc_time;
result.bandwidth = bandwidth;

The code of Timer:


typedef struct Timer {
	LARGE_INTEGER frequency;
	LARGE_INTEGER start_time;
	void start() {
		QueryPerformanceFrequency(&frequency);	
		QueryPerformanceCounter(&start_time);
	}
	double get() {
		LARGE_INTEGER end;
		QueryPerformanceCounter(&end);
		double elapsedTime = (end.QuadPart - start_time.QuadPart) * 1000.0 / frequency.QuadPart;
		return elapsedTime;
	}
} Timer;

Thank you very much!

Additional info…

For 16 transfers of 16 Mbytes each, I get:


Pinned   -> Device
CPU timer: 3317.04 Mbytes/s (total time: 81.06 ms)
GPU timer: 3837185.08 Mbytes/s (total time: 0.10 ms)

i.e. more than 3 Tbytes/s of bandwidth, which is practically impossible, especially for a host->device transfer, which should be limited by the PCIe bandwidth.

I would like to know the cause of this as well. Performance measurements are always important and this makes me question some of my own measurements.

I noticed that you copy from a memory-mapped buffer into another buffer. Are both of these buffers on the same device? What happens if you replace the mapped src_pointer with a normal, malloc’ed host pointer?
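
Roughly something like this (just a sketch; I am reusing your queue, dst, size, DATATYPE, NUM_TRANSF, timer and time, and host_ptr is a name I made up):

/* needs #include <stdlib.h> and <string.h> */
DATATYPE *host_ptr = (DATATYPE*)malloc(size * sizeof(DATATYPE));
memset(host_ptr, 0, size * sizeof(DATATYPE));	/* touch the pages before timing */

timer.start();
for(int i = 0; i < NUM_TRANSF; i++)
	clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), host_ptr, 0, NULL, NULL);
clFinish(queue);	/* wait for all transfers before reading the timer */
time = timer.get();

free(host_ptr);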

What happens if you wrap the entire if(!gpu_timer)-else statement with some other timing mechanism, such as GetSystemTimeAsFileTime() or gettimeofday()?
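
I mean something along these lines (a sketch; filetime_now is a helper name of my own, and the system time is fairly coarse, so treat it only as a sanity check of the millisecond-scale numbers):

/* needs #include <windows.h>; returns the system time in 100-ns ticks */
static ULONGLONG filetime_now(void) {
	FILETIME ft;
	ULARGE_INTEGER u;
	GetSystemTimeAsFileTime(&ft);
	u.LowPart  = ft.dwLowDateTime;
	u.HighPart = ft.dwHighDateTime;
	return u.QuadPart;
}

/* inside the measurement code, around the whole if/else: */
ULONGLONG t0 = filetime_now();
/* ...the entire if(!gpu_timer) / else statement from your post... */
ULONGLONG t1 = filetime_now();
double wall_ms = (double)(t1 - t0) / 10000.0;	/* 100-ns ticks -> milliseconds */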

In the else part, where you use OpenCL events, you overwrite the same event multiple times in the for-loop. Is this safe?
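
If it were my code, I would keep one event per transfer and release them explicitly, something like the sketch below. It reuses your queue, dst, size, DATATYPE, NUM_TRANSF and src_pointer, assumes the queue was created with CL_QUEUE_PROFILING_ENABLE, and omits error checking; profiling from the first write’s COMMAND_QUEUED to the last write’s COMMAND_END (rather than from the map event) is my own choice:

cl_event first_ev = NULL, last_ev = NULL;
for(int i = 0; i < NUM_TRANSF; i++) {
	cl_event ev;
	clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, &ev);
	if(i == 0)
		first_ev = ev;			/* keep the first event for COMMAND_QUEUED */
	if(i == NUM_TRANSF - 1)
		last_ev = ev;			/* keep the last event for COMMAND_END */
	if(i != 0 && i != NUM_TRANSF - 1)
		clReleaseEvent(ev);		/* release intermediate events right away */
}
clFinish(queue);

cl_ulong queued, end;
clGetEventProfilingInfo(first_ev, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued, NULL);
clGetEventProfilingInfo(last_ev, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
double gpu_ms = (double)(end - queued) * 1.0e-6;	/* profiling values are in nanoseconds */

clReleaseEvent(first_ev);
if(last_ev != first_ev)
	clReleaseEvent(last_ev);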

As a side note, I find the ‘profile with gpu timer’ and ‘profile with cpu timer’ comments confusing: the if(!gpu_timer) branch actually uses the QueryPerformanceCounter-based Timer, while the else branch uses OpenCL event profiling.

I found the problem. It was simply due to an erroneous conversion between nanoseconds and milliseconds. So sorry, but it was very late :slight_smile:
Now the GPU and CPU timers give me very similar results (the GPU timer measurements are about 10% lower than those given by the CPU timer).
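
For anyone who runs into the same thing: the event profiling timestamps are in nanoseconds while Timer::get() returns milliseconds, so the corrected conversion in the event branch is essentially:

/* cl_ulong profiling timestamps are nanoseconds; Timer::get() returns
   milliseconds, so the factor has to be 1.0e-6 (1.0e-9 gives seconds and
   inflates the computed bandwidth by a factor of 1000) */
time = (double)(end - start) * 1.0e-6;	/* ns -> ms */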