Hi,
I am interested in optimizing OpenCL code. To that end, I went through an OpenCL optimization guide, which says that the following things should be considered when optimizing code:
- Device utilization and occupancy: launch enough blocks (work-groups in OpenCL terms) to keep every compute unit busy, so that occupancy is high and memory latency is hidden.
- Maximize memory bandwidth: minimize data transfers between host and device, and overlap the remaining transfers with device computation.
- Shared memory: use shared memory (local memory in OpenCL terms) when data is accessed more than once, either by the same thread or by different threads within a block.
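As a rough illustration of the occupancy point, here is a minimal C sketch of the arithmetic involved. The struct `DeviceLimits`, its fields, and the helper names are hypothetical and made up for this example. The model is deliberately simplified: it ignores register pressure and warp granularity, so it only gives an upper bound on theoretical occupancy. In OpenCL, real limits would have to be queried from the device (for example with `clGetDeviceInfo`).

```c
/* Hypothetical per-compute-unit limits (assumed values, not from any real device). */
typedef struct {
    int max_threads_per_cu;  /* max resident threads per compute unit (SM) */
    int max_blocks_per_cu;   /* max resident blocks (work-groups) per compute unit */
    int shared_mem_per_cu;   /* bytes of shared (local) memory per compute unit */
} DeviceLimits;

/* Blocks needed to cover n elements with block_size threads each (ceiling division). */
static int blocks_needed(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}

/* Blocks that can be resident on one compute unit: the tightest of the three limits. */
static int resident_blocks(const DeviceLimits *d, int block_size, int shared_per_block) {
    int by_threads = d->max_threads_per_cu / block_size;
    int by_shared = shared_per_block > 0
                        ? d->shared_mem_per_cu / shared_per_block
                        : d->max_blocks_per_cu;  /* no shared memory used: not a limit */
    int blocks = d->max_blocks_per_cu;
    if (by_threads < blocks) blocks = by_threads;
    if (by_shared < blocks) blocks = by_shared;
    return blocks;
}

/* Simplified theoretical occupancy: resident threads / max resident threads. */
static double occupancy(const DeviceLimits *d, int block_size, int shared_per_block) {
    int threads = resident_blocks(d, block_size, shared_per_block) * block_size;
    return (double)threads / d->max_threads_per_cu;
}
```

With assumed limits of 2048 threads, 32 blocks, and 65536 bytes of shared memory per compute unit, a block size of 256 that uses no shared memory gives an occupancy of 1.0 in this model, while the same block size using 16384 bytes of shared memory per block drops to 0.5. This shows how the third point (shared memory) can work against the first one.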
There may be a few more things to consider while optimizing.
My questions are:
- What other ways are there to optimize OpenCL/CUDA code?
- Is there any way to manually optimize the IR code generated by the OpenCL/CUDA compiler? If so, what is the procedure?
- One more thing I want to know about CUDA terminology: why do we have the concepts of warps, blocks, and grids?
- OpenCL guarantees that its programs are portable, but it does not guarantee optimal performance across devices from different vendors. If I want optimal performance on each vendor's device, how should I approach this?
- Can we modify the LLVM IR generated by the OpenCL compiler to optimize the code?
Thanks!