LLVM: Manually optimizing OpenCL/CUDA intermediate code
I am interested in optimizing OpenCL code. To that end, I went through an OpenCL optimization guide, which says that the following things should be considered when optimizing code:
1. Device utilization and occupancy: launch enough blocks to keep the device fully occupied and to hide memory latency.
2. Maximize memory bandwidth: minimize data transfers and overlap them with device computation.
3. Shared memory: use shared memory when the same data is accessed more than once, either within one thread or by different threads within a block.
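To make item 1 concrete, here is how I currently understand the occupancy arithmetic, as a rough host-side sketch. It is a simplified model that assumes occupancy is limited only by resident warps, resident blocks, and shared memory per multiprocessor; it ignores register usage and allocation granularity. The `DeviceLimits` struct and the function names `resident_blocks` and `occupancy` are hypothetical, and the limits have to be supplied by the caller, since real values differ per device.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-multiprocessor limits (an assumption for illustration).
// Real values must be queried from the actual device.
struct DeviceLimits {
    int warp_size;          // threads per warp
    int max_warps_per_sm;   // maximum resident warps per multiprocessor
    int max_blocks_per_sm;  // maximum resident blocks per multiprocessor
    int shared_mem_per_sm;  // bytes of shared memory per multiprocessor
};

// Warps needed by one block, rounding a partial warp up to a full one.
int warps_per_block(const DeviceLimits& d, int threads_per_block) {
    assert(threads_per_block > 0);
    return (threads_per_block + d.warp_size - 1) / d.warp_size;
}

// Blocks that can be resident on one multiprocessor at the same time.
// The tightest of the three limits wins.
int resident_blocks(const DeviceLimits& d, int threads_per_block,
                    int shared_mem_per_block) {
    assert(shared_mem_per_block >= 0);
    int by_warps = d.max_warps_per_sm / warps_per_block(d, threads_per_block);
    int by_smem = shared_mem_per_block > 0
                      ? d.shared_mem_per_sm / shared_mem_per_block
                      : d.max_blocks_per_sm;  // no shared memory: no limit from it
    return std::min({by_warps, by_smem, d.max_blocks_per_sm});
}

// Occupancy = resident warps / maximum resident warps.
double occupancy(const DeviceLimits& d, int threads_per_block,
                 int shared_mem_per_block) {
    int active_warps = resident_blocks(d, threads_per_block, shared_mem_per_block) *
                       warps_per_block(d, threads_per_block);
    return static_cast<double>(active_warps) / d.max_warps_per_sm;
}
```

With placeholder limits of 32 threads per warp, 64 warps, 32 blocks, and 65536 bytes of shared memory per multiprocessor, a 256-thread block that uses no shared memory gives 8 resident blocks and an occupancy of 1.0 in this model. If each block requests 16384 bytes of shared memory, only 4 blocks fit and the occupancy drops to 0.5, which shows how items 1 and 3 can pull against each other.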
There may be a few more things to consider when optimizing.
My questions are:
1. What other ways are there to optimize OpenCL/CUDA code?
2. Is there any way to manually optimize the IR code generated by the OpenCL/CUDA compiler? If so, what is the procedure?
3. Regarding CUDA terminology: why do we have the concepts of warps, blocks, and grids?
4. OpenCL guarantees that its programs are portable, but it does not guarantee optimal performance across different vendors' devices. If I want optimal performance on devices from different vendors, how should I approach this?
5. Can I modify the LLVM IR generated from my OpenCL code in order to optimize it?