Hi,
I’m performing some memory tests on a PC (CPU + discrete GPU) and on an APU.
In particular, my test consists of writing Y bytes X times to measure the completion time and the average bandwidth. I do this for every possible allocation strategy for the source and destination buffers.
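Just to be clear about what I mean by “Unpinned”, “Pinned” and “Device” in the results below, this is roughly how I think of the three allocation strategies (an illustrative sketch, not my exact allocation code; context is my cl_context):

    cl_int err;
    /* "Unpinned": ordinary pageable host memory */
    DATATYPE *unpinned = (DATATYPE*)malloc(size * sizeof(DATATYPE));
    /* "Pinned": a buffer the runtime allocates in host-accessible (typically page-locked) memory */
    cl_mem pinned = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, size * sizeof(DATATYPE), NULL, &err);
    /* "Device": a plain buffer in device memory */
    cl_mem device_buf = clCreateBuffer(context, CL_MEM_READ_WRITE, size * sizeof(DATATYPE), NULL, &err);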
I tried to compare the results obtained with CPU timers (on Windows, QueryPerformanceCounter) to those obtained with GPU timers, so far only on the CPU + discrete GPU.
The difference between those two measurements is so huge that I’m sure I made a mistake somewhere.
Here is an example:
Testing transfer of 1024 bytes 16 times...
Unpinned -> Unpinned
CPU timer: 6259.02 Mbytes/s (total time: 0.02 ms)
GPU timer: 6368.19 Mbytes/s (total time: 0.02 ms)
Unpinned -> Device
CPU timer: 4.10 Mbytes/s (total time: 4.65 ms)
GPU timer: 13611.35 Mbytes/s (total time: 0.03 ms)
Pinned -> Unpinned
CPU timer: 3.94 Mbytes/s (total time: 4.33 ms)
GPU timer: 7492.48 Mbytes/s (total time: 0.03 ms)
Pinned -> Pinned
CPU timer: 3.73 Mbytes/s (total time: 5.23 ms)
GPU timer: 9359.60 Mbytes/s (total time: 0.04 ms)
Pinned -> Device
CPU timer: 3.30 Mbytes/s (total time: 5.64 ms)
GPU timer: 12743.39 Mbytes/s (total time: 0.03 ms)
Device -> Unpinned
CPU timer: 4.70 Mbytes/s (total time: 3.69 ms)
GPU timer: 11231.09 Mbytes/s (total time: 0.03 ms)
Device -> Pinned
CPU timer: 4.37 Mbytes/s (total time: 3.79 ms)
GPU timer: 8819.22 Mbytes/s (total time: 0.04 ms)
Device -> Device
CPU timer: 7.78 Mbytes/s (total time: 2.15 ms)
GPU timer: 8876.45 Mbytes/s (total time: 0.04 ms)
I really need some help finding the mistake, or an explanation of why I get such different results.
Below is the piece of code where I compute the completion time for a transfer between a pinned source buffer and a destination buffer allocated on the device. The other cases are very similar.
Some hints regarding the code:
- DATATYPE is a macro currently set to “int”
- The struct Timer comes from a utility library of mine. I include its code at the end of the post, just in case the mistake is in there
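- The command queue is created with profiling enabled, otherwise clGetEventProfilingInfo would not return valid timestamps. The creation is not in the snippet below, but it is essentially this (device_id is my cl_device_id):

    queue = clCreateCommandQueue(context, device_id, CL_QUEUE_PROFILING_ENABLE, &err);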
//profile with cpu timer (QueryPerformanceCounter)
if(!gpu_timer) {
    timer.start();
    src_pointer = (DATATYPE*)clEnqueueMapBuffer(queue, src, CL_FALSE, CL_MAP_READ, 0, size * sizeof(DATATYPE), 0, NULL, NULL, NULL);
    for(int i = 0; i < NUM_TRANSF; i++)
        clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, NULL);
    clFinish(queue);
    time = timer.get(); //elapsed host time, in milliseconds (see Timer below)
}
//profile with gpu timer (event profiling)
else {
    src_pointer = (DATATYPE*)clEnqueueMapBuffer(queue, src, CL_FALSE, CL_MAP_READ, 0, size * sizeof(DATATYPE), 0, NULL, &transfer_event, NULL);
    clWaitForEvents(1, &transfer_event);
    clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &start, 0); //when the map command was queued, in ns
    for(int i = 0; i < NUM_TRANSF; i++)
        clEnqueueWriteBuffer(queue, dst, CL_FALSE, 0, size * sizeof(DATATYPE), src_pointer, 0, NULL, &transfer_event); //transfer_event is overwritten each iteration
    clFinish(queue);
    clWaitForEvents(1, &transfer_event);
    clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, 0); //when the last write finished, in ns
    time = (double)1.0e-9 * (end - start); //elapsed device time, in seconds
}
double bandwidth = ((double)(NUM_TRANSF * size * sizeof(DATATYPE)) / (double)time) * 1000.0 / 1000000.0;
result.total_time = time + alloc_time;
result.bandwidth = bandwidth;
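For reference, my understanding of the event profiling counters used in the GPU timer branch is the following (a standalone sketch, not part of my test; evt is just any event, and all values are device timestamps in nanoseconds):

    cl_ulong queued, submitted, started, ended;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &queued, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &submitted, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &started, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ended, NULL);
    double exec_time_ms = (double)(ended - started) * 1.0e-6; /* pure execution time, in ms */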
The code of Timer:
typedef struct Timer {
    LARGE_INTEGER frequency;
    LARGE_INTEGER start_time;

    void start() {
        QueryPerformanceFrequency(&frequency);
        QueryPerformanceCounter(&start_time);
    }

    double get() {
        LARGE_INTEGER end;
        QueryPerformanceCounter(&end);
        double elapsedTime = (end.QuadPart - start_time.QuadPart) * 1000.0 / frequency.QuadPart;
        return elapsedTime;
    }
} Timer;
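And this is how I use it (so get() returns the elapsed time in milliseconds):

    Timer timer;
    timer.start();
    /* ... work to measure ... */
    double elapsed_ms = timer.get(); /* milliseconds */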
Thank you very much!