OpenCL tradeoffs with driver

Dear all,

I would like to share some findings from my investigation over the past few days.
In short, I am finding it hard to achieve a speedup once the tradeoffs are taken into account.

In both OpenCL implementations, I have so far achieved only functional correctness: each produces the same values as the original code for the 4 output arrays, each of size 53760.

Here is the profiling information I obtained before porting to OpenCL.

Here is the profiling information after I modified 1 function to use OpenCL with only 1 kernel function.

Finally, here is the profiling information after I modified the same function to use 2 kernel functions.

I know there is a good chance that I have coded them poorly in OpenCL, but on closer inspection you will see that the driver also plays a part, e.g. cllcdGetPlatformIDskHR (from amdocl.dll) and calddiGetVersion (from aticaldd.dll).

I have also found that clGetPlatformIDs and clBuildProgram (times 2 when running 2 kernel functions) have some poor utilization of CPU time.

This means the OpenCL code has to be fast enough to make up for the time I lose in the driver.

If anyone can offer a glimmer of hope, please kindly do so…

I have also found that clGetPlatformIDs and clBuildProgram (times 2 when running 2 kernel functions) have some poor utilization of CPU time.

Does it mean you are calling clGetPlatformIDs more than once in your application?

Hi David,

I believe the answer is yes.
I have embedded the OpenCL implementation into the application, so the implementation would be called each time an image frame is being processed.

Is it better to only make the call once?

And for clBuildProgram, will there be a difference between using precompiled binary and compilation during runtime?

Thanks!

I have embedded the OpenCL implementation into the application, so the implementation would be called each time an image frame is being processed.

Is it better to only make the call once?

It would be a lot better if you did the setup calls only once, then kept the objects around for the next image. By setup calls I mean APIs like these:

- clGetPlatformIDs
- clGetDeviceIDs
- clCreateContext
- clCreateCommandQueue
- clCreateBuffer
- clCreateImage2D
- clCreateProgramWithSource
- clCreateProgramWithBinary
- clBuildProgram
- clCreateKernel

The only calls that you should be doing from one image to the next would be clSetKernelArg, clEnqueueXXX, and the like.
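The split between one-time setup and per-frame work can be sketched roughly as below. This is only a structural sketch, not real host code: `FrameProcessor`, `processor_setup`, `processor_run_frame` and `processor_teardown` are hypothetical names, and plain ints stand in for the OpenCL handles so that it compiles without an SDK. The comments mark where the real calls would go.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for everything the one-time setup creates.
   In a real host program the fields would be cl_context, cl_command_queue,
   cl_program, cl_kernel and cl_mem handles; plain ints stand in for them
   here so the sketch compiles without an OpenCL SDK. */
typedef struct {
    int ready;        /* non-zero once setup has completed */
    int setup_runs;   /* how many times the expensive path actually ran */
    int frames_done;  /* frames processed since setup */
} FrameProcessor;

/* Expensive path, meant to run once at application start.
   Real code would call, in order: clGetPlatformIDs, clGetDeviceIDs,
   clCreateContext, clCreateCommandQueue, clCreateBuffer,
   clCreateProgramWithSource, clBuildProgram, clCreateKernel. */
static void processor_setup(FrameProcessor *p)
{
    if (p->ready)
        return;            /* already initialised: skip the costly calls */
    p->setup_runs++;
    p->ready = 1;
}

/* Cheap path, called for every image frame.
   Real code would only call clSetKernelArg and the clEnqueue* functions
   here; the loop below is a placeholder for the kernel's work. */
static void processor_run_frame(FrameProcessor *p, const float *in,
                                float *out, size_t n)
{
    size_t i;
    assert(p->ready);      /* setup must have happened before any frame */
    for (i = 0; i < n; i++)
        out[i] = in[i] * 2.0f;
    p->frames_done++;
}

/* Runs once at shutdown; real code would call the clRelease* functions. */
static void processor_teardown(FrameProcessor *p)
{
    p->ready = 0;
}
```

With this shape, the function that is currently called for each image frame would only call `processor_run_frame`, while `processor_setup` and `processor_teardown` would move to wherever the application starts up and shuts down.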

And for clBuildProgram, will there be a difference between using precompiled binary and compilation during runtime?

Loading precompiled binaries saves some time. However, look into avoiding performing setup calls over and over (see above) before thinking of binary programs.
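For reference, clCreateProgramWithBinary expects the binary as a byte buffer plus its length, so the host-side part of using a precompiled binary is mostly file handling. Below is a minimal sketch of that part only, assuming the binary was written to disk earlier (for example after a successful clBuildProgram, by querying clGetProgramInfo with CL_PROGRAM_BINARY_SIZES and CL_PROGRAM_BINARIES). The helper name `read_binary_file` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads a whole file into a malloc'd buffer.
   Returns NULL on failure; on success *size_out holds the byte count
   and the caller must free() the buffer. */
static unsigned char *read_binary_file(const char *path, size_t *size_out)
{
    FILE *f;
    unsigned char *buf = NULL;
    long len;

    assert(path != NULL && size_out != NULL);

    f = fopen(path, "rb");
    if (!f)
        return NULL;

    /* Find the file length, then rewind to the start. */
    if (fseek(f, 0, SEEK_END) != 0 || (len = ftell(f)) < 0 ||
        fseek(f, 0, SEEK_SET) != 0) {
        fclose(f);
        return NULL;
    }

    buf = malloc(len > 0 ? (size_t)len : 1);
    if (buf && fread(buf, 1, (size_t)len, f) != (size_t)len) {
        free(buf);         /* short read: treat as failure */
        buf = NULL;
    }
    fclose(f);

    if (buf)
        *size_out = (size_t)len;
    return buf;
}
```

The returned buffer and size would then be passed to clCreateProgramWithBinary, followed by clBuildProgram as usual. Because program binaries are specific to the device and driver, it is safer to fall back to building from source whenever the load or the build from binary fails.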

Thanks for the advice.

I will try to modify the code based on that, probably by moving the setup 1 or 2 levels up the function call chain.

Will update if I have better profiling info.