Manually optimizing OpenCL/CUDA intermediate code

Gopal_HC · September 4, 2013, 12:01am

Hi,

I am interested in optimizing OpenCL code. I went through an OpenCL optimization guide, which says you should consider the following when optimizing your code:

Device utilization and occupancy: launch enough blocks to keep the device fully occupied, so that occupancy is high and memory latency is hidden.
Maximize memory bandwidth: minimize data transfers and overlap them with device computation.
Shared memory: use shared memory when data is accessed more than once, either by the same thread or by different threads within a block.
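
To make the occupancy point concrete for myself, here is a rough sketch of the arithmetic as I understand it: occupancy is the number of resident warps on a multiprocessor divided by the maximum it can hold. The function name and all the device limits below (64 warps, 16 blocks, 48 KiB of shared memory per multiprocessor) are hypothetical values I picked for illustration, not the figures of any particular GPU, and register pressure is ignored entirely.

```python
import math

def theoretical_occupancy(threads_per_block, shared_bytes_per_block,
                          warp_size=32, max_warps_per_sm=64,
                          max_blocks_per_sm=16,
                          shared_bytes_per_sm=48 * 1024):
    """Estimate occupancy of one multiprocessor (hypothetical limits).

    Assumes threads_per_block > 0. Register usage is NOT modelled.
    """
    # A block always occupies a whole number of warps.
    warps_per_block = math.ceil(threads_per_block / warp_size)

    # How many blocks fit under each resource limit.
    limit_by_warps = max_warps_per_sm // warps_per_block
    if shared_bytes_per_block > 0:
        limit_by_shared = shared_bytes_per_sm // shared_bytes_per_block
    else:
        limit_by_shared = max_blocks_per_sm

    # The tightest limit decides how many blocks are resident at once.
    resident_blocks = min(limit_by_warps, max_blocks_per_sm, limit_by_shared)
    return resident_blocks * warps_per_block / max_warps_per_sm

# Compare a few launch configurations under the assumed limits.
for threads, shared in [(256, 0), (64, 0), (256, 16 * 1024)]:
    print(threads, shared, theoretical_occupancy(threads, shared))
```

With these assumed limits, small blocks run into the blocks-per-multiprocessor cap, and heavy shared memory use per block reduces how many blocks can be resident, so the first and third points in the list can pull against each other.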

There may be a few more things to consider while optimizing.

My questions are:

What other ways are there to optimize OpenCL/CUDA code?
Is there any way to manually optimize the IR code generated by the OpenCL/CUDA compiler? If yes, what is the procedure?
Regarding CUDA terminology, why do we have the concepts of warps, blocks, and grids?
OpenCL guarantees that its programs are portable, but it does not guarantee optimal performance across different vendors' devices. If I want optimal performance across devices from different vendors, how should I approach this?
Can I modify the LLVM IR generated by the OpenCL compiler to optimize my code?

Thanks!