Hi,
I am using the CPU SDK on a machine with 16 GB of RAM, running OpenSUSE. I did some profiling on the command queue, and either I messed up something huge, or this is less than acceptable performance. In summary, I am measuring how long it takes to read a buffer back to the host after the kernel executes, for int buf[16] (just 16 ints!!!). Here's what I got:
//HOST SIDE (nthreads = 16)
d_calc2_res = clCreateBuffer(context,
                             CL_MEM_READ_WRITE,
                             nthreads*sizeof(int),
                             NULL,
                             &err);
checkResult((err == CL_SUCCESS), "clCreateBuffer failed\n");
//Pass the pointer to the kernel
err = clSetKernelArg(calc2_kernel, 10, sizeof(cl_mem), static_cast<void *>(&d_calc2_res));
checkResult((err == CL_SUCCESS), "clSetKernelArg failed
");
.....
//Fill it up with values in kernel (verified correct kernel execution)
.....
//Read the result back
err = clEnqueueReadBuffer(cmdQueue, d_calc2_res, CL_TRUE,
                          0, nthreads*sizeof(int),
                          static_cast<void *>(calc2_res),
                          0, NULL, &eventh);
checkResult((err == CL_SUCCESS), "clEnqueueReadBuffer failed\n");
clWaitForEvents(1, &eventh);
//Read profiling info (values are in nanoseconds)
clGetEventProfilingInfo(eventh, CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &tstart, &ret_size);
clGetEventProfilingInfo(eventh, CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &tend, &ret_size);
printf("\nRead buffer time for submit (1 pass): %f msec\n", (tend-tstart)/1000000.0);
clGetEventProfilingInfo(eventh, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &tstart, &ret_size);
clGetEventProfilingInfo(eventh, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &tend, &ret_size);
printf("\nRead buffer time for execute (1 pass): %f msec\n", (tend-tstart)/1000000.0);
clReleaseEvent(eventh);
The event profiling timer has nanosecond resolution, and it agrees closely with the host-side timer I was using before OpenCL (sketch after the numbers below); both report about the same figures:
Read buffer time for submit (1 pass): 0.007054 msec
Read buffer time for execute (1 pass): 0.298222 msec
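For reference, the host-side cross-check is just clock_gettime wrapped around the blocking read, roughly like this (a sketch; the variable names here are made up and it needs <time.h>):
//Host-side cross-check: wall-clock time around the blocking read
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);
err = clEnqueueReadBuffer(cmdQueue, d_calc2_res, CL_TRUE,
                          0, nthreads*sizeof(int),
                          static_cast<void *>(calc2_res),
                          0, NULL, NULL);
clock_gettime(CLOCK_MONOTONIC, &t1);
double msec = (t1.tv_sec - t0.tv_sec)*1000.0 + (t1.tv_nsec - t0.tv_nsec)/1000000.0;
printf("Host-measured read time: %f msec\n", msec);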
So, 0.3 msec to copy 16 ints??? I tried both the blocking and non-blocking options, same thing. Is this to be expected, and if so, what workarounds do we have to get decent performance?
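In case it helps frame the question: one thing I was going to try next is allocating the buffer with CL_MEM_ALLOC_HOST_PTR and mapping it instead of reading it back, since on a CPU device that should avoid an actual copy. Very rough sketch of what I mean (untested, so treat the details as assumptions):
//Untested sketch: map the buffer instead of copying it back
d_calc2_res = clCreateBuffer(context,
                             CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                             nthreads*sizeof(int),
                             NULL,
                             &err);
.....
//Blocking map for read access; on a CPU device this should just hand back a pointer
int *mapped = static_cast<int *>(
    clEnqueueMapBuffer(cmdQueue, d_calc2_res, CL_TRUE, CL_MAP_READ,
                       0, nthreads*sizeof(int),
                       0, NULL, NULL, &err));
//... use mapped[0..nthreads-1] directly ...
clEnqueueUnmapMemObject(cmdQueue, d_calc2_res, mapped, 0, NULL, NULL);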