For loops inside kernels

I’m using the book “OpenCL in Action” to learn OpenCL programming. The author claims that for loops inside kernel functions are a bad idea because comparison statements are time-consuming on GPUs, which I understand given general GPU architecture. However, in his matrix examples, and in matrix examples from other sources, for loops are used quite extensively inside kernels. Isn’t this against the whole idea of using OpenCL? If I need many for loops, why dispatch kernels at all instead of just writing normal C/C++ code?

Could someone please shed some light on these issues for me? It would be of great help before I start implementing my own algorithms.

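For loops in kernels are fine if all work items in the workgroup loop the same number of times. If each work item loops a different number of times, you get divergence, and that’s what can really slow things down: work items that the hardware runs together in lockstep have to wait until the slowest one finishes its loop. A matrix-multiply kernel is the fine case, since every work item loops over the same shared dimension.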
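To put rough numbers on the divergence point, here is a minimal sketch in plain C (host code, not an OpenCL kernel). It is a toy model, not a measurement: it assumes the work items of a group step through their loops in lockstep, so the group needs as many steps as its slowest item. The function names `lockstep_steps`, `useful_iterations` and `utilization_percent` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (an assumption, not real hardware timing): work items in a
   group run in lockstep, so the group keeps stepping until the item
   with the most loop iterations has finished. */

/* Loop steps the whole group spends: the largest trip count. */
unsigned lockstep_steps(const unsigned *trips, size_t items)
{
    assert(items > 0);
    unsigned max = trips[0];
    for (size_t i = 1; i < items; ++i)
        if (trips[i] > max)
            max = trips[i];
    return max;
}

/* Iterations that do useful work: the sum of all trip counts. */
unsigned useful_iterations(const unsigned *trips, size_t items)
{
    unsigned sum = 0;
    for (size_t i = 0; i < items; ++i)
        sum += trips[i];
    return sum;
}

/* Share of (steps x items) slots that did useful work, in percent,
   rounded down. */
unsigned utilization_percent(const unsigned *trips, size_t items)
{
    unsigned slots = lockstep_steps(trips, items) * (unsigned)items;
    return slots == 0 ? 100 : 100 * useful_iterations(trips, items) / slots;
}
```

Under this model, four work items that each loop 8 times give `utilization_percent` of 100. With trip counts 1, 2, 3 and 8 the group still takes 8 steps, but only 14 of the 32 slots do useful work, so the result drops to 43. Real devices differ in group width and scheduling, so treat this only as intuition for why uniform loops are cheap and divergent ones are not.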