OpenCL performance

Hi all,
I am trying to use OpenCL to improve the performance of my program, but I have run into two problems.
First, initializing OpenCL (getting the device, creating the program, building the program, …) takes more than 300 ms. For me, 300 ms is far too long. How can I shorten the initialization time?
Second, my program does image processing, so transferring large images between the GPU and the host is a real headache. Does anyone have good ideas for handling the memory copies?
Thanks for your answers, and sorry for my English.

1. The OpenCL initialization is something you will have to live with. It will take some amount of time, especially if you are using complex programs. If you are developing for a known system, you can possibly skip querying the hardware (at your own risk) and hardcode the device to be used. I find that saves me a bit of time.

2. For me, the most time-consuming part of initialization is compiling the OpenCL programs. Eventually, I hope to be able to use pre-built binaries, but I'm not going to be able to use those in time for my thesis. I have one program file in my code that contains about two dozen kernels (about 3800 lines of code), and the compile time for my NVIDIA GeForce GTX 460 is about 7 or 8 seconds, although with their latest drivers it is down to about 6 seconds. However, that is a price I am willing to pay. The CPU-only version of my code can take upwards of 45 minutes to arrive at a solution. The GPU-accelerated version arrives at an answer in about 45 seconds, including this startup time. It's a small price to pay for me, even though it ends up being about 20% of my total run time.

3. As for the memory transfer, how fast your image can be transferred really depends on your hardware. However, in my experience, it is better to do one big transfer than many small ones. Unfortunately, I don't deal with images currently, so I don't know any tricks. Perhaps someone else does.
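
For the pre-built binaries mentioned above, the disk-cache half can be sketched roughly as follows. This is only a minimal sketch, not a tested recipe for any particular driver. The helper names `save_program_binary` and `load_program_binary` are made up for illustration, and the OpenCL calls themselves are left out so the snippet compiles without an OpenCL SDK. It assumes the bytes being saved come from `clGetProgramInfo` with `CL_PROGRAM_BINARIES`.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write `size` bytes of a device binary to `path`.
 * Returns 0 on success, -1 on any I/O failure. */
int save_program_binary(const char *path, const unsigned char *data, size_t size)
{
    assert(path != NULL);

    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;

    size_t written = fwrite(data, 1, size, f);
    int close_failed = fclose(f);
    return (written == size && close_failed == 0) ? 0 : -1;
}

/* Hypothetical helper: read a cached binary back into a malloc'd buffer.
 * On success returns the buffer and stores its length in *size_out;
 * the caller frees it. Returns NULL when the file is missing, empty,
 * or unreadable, which the caller should treat as a cache miss. */
unsigned char *load_program_binary(const char *path, size_t *size_out)
{
    assert(path != NULL && size_out != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    /* Find the file length by seeking to the end. */
    if (fseek(f, 0, SEEK_END) != 0) {
        fclose(f);
        return NULL;
    }
    long len = ftell(f);
    if (len <= 0 || fseek(f, 0, SEEK_SET) != 0) {
        fclose(f);
        return NULL;
    }

    unsigned char *buf = malloc((size_t)len);
    if (buf == NULL) {
        fclose(f);
        return NULL;
    }

    size_t got = fread(buf, 1, (size_t)len, f);
    fclose(f);
    if (got != (size_t)len) {
        free(buf);
        return NULL;
    }

    *size_out = got;
    return buf;
}
```

On a cache miss you would build from source as usual, fetch the binary with `clGetProgramInfo` (`CL_PROGRAM_BINARY_SIZES`, then `CL_PROGRAM_BINARIES`) and save it. On a hit you would pass the loaded bytes to `clCreateProgramWithBinary` and still call `clBuildProgram` afterwards. A cached binary is tied to the device and driver that produced it, so it is probably wise to fall back to the source build whenever loading or building from the binary fails.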

Thank you very much! This is very helpful to me!