I am using the CPU SDK on a machine with 16GB of RAM, running openSUSE. I did some profiling on the command queue, and either I messed up something huge or this is less-than-acceptable performance. In short, I am measuring how long it takes to read a buffer back to the host after the kernel executes, for int buf[16] (just 16 ints!!!). Here's what I got:

Code :
        // HOST SIDE (nthreads = 16)
        // (The flags/size arguments got lost in the paste; the buffer is
        //  written by the kernel and read back by the host, so roughly:)
        d_calc2_res = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                     nthreads * sizeof(int), NULL, &err);
        checkResult((err == CL_SUCCESS), "clCreateBuffer failed\n");

        // Pass the buffer to the kernel (argument index 10)
        err = clSetKernelArg(calc2_kernel, 10, sizeof(cl_mem),
                             static_cast<void *>(&d_calc2_res));
        checkResult((err == CL_SUCCESS), "clSetKernelArg failed\n");

        // Fill it up with values in the kernel (verified correct kernel execution)

        // Read the result back (blocking read)
        err = clEnqueueReadBuffer(cmdQueue, d_calc2_res, CL_TRUE,
                                  0, nthreads * sizeof(int),
                                  static_cast<void *>(calc2_res),
                                  0, NULL, &eventh);
        checkResult((err == CL_SUCCESS), "clEnqueueReadBuffer failed\n");
        clWaitForEvents(1, &eventh);

        // Read profiling info from the event
        // (queue was created with CL_QUEUE_PROFILING_ENABLE)
        cl_ulong tqueued, tsubmit, tstart, tend;
        clGetEventProfilingInfo(eventh, CL_PROFILING_COMMAND_QUEUED,
                                sizeof(cl_ulong), &tqueued, NULL);
        clGetEventProfilingInfo(eventh, CL_PROFILING_COMMAND_SUBMIT,
                                sizeof(cl_ulong), &tsubmit, NULL);
        clGetEventProfilingInfo(eventh, CL_PROFILING_COMMAND_START,
                                sizeof(cl_ulong), &tstart, NULL);
        clGetEventProfilingInfo(eventh, CL_PROFILING_COMMAND_END,
                                sizeof(cl_ulong), &tend, NULL);
        printf("\n\tRead buffer time for submit (1 pass):\t%f msec\n\n",
               (tsubmit - tqueued) / 1000000.0);
        printf("\n\tRead buffer time for execute (1 pass):\t%f msec\n\n",
               (tend - tstart) / 1000000.0);

The event timer has nanosecond resolution, and it agrees closely with the accurate timer I used before OpenCL; both report about the same numbers:

Read buffer time for submit (1 pass): 0.007054 msec
Read buffer time for execute (1 pass): 0.298222 msec

So, 0.3 msec to copy 16 ints?!? I tried both the blocking and non-blocking options; same result. Is this to be expected, and if so, what workarounds do we have to get decent performance?