Kernel breakdown

I need confirmation or correction of my understanding of how a kernel is executed. The OpenCL version in question is 1.1.

The kernel is executed line by line in all work items, even when part of the code is wrapped inside an if() statement. Work items that fail the condition simply ignore the code inside the if() and advance along with the other work items. The same happens with loops: if one work item continues the loop, the others do too. This makes the kernel's execution time equal to its longest execution path.
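For example, with a toy kernel like this (just something I made up to illustrate the question; the names mean nothing):

[CODE]
/* Toy 1D kernel: only even work items take the if() branch.
   My assumption is that the odd ones effectively wait for them. */
__kernel void toy(__global float *data)
{
    size_t gid = get_global_id(0);

    if (gid % 2 == 0)
    {
        /* extra work done only by even work items */
        for (int i = 0; i < 100; ++i)
            data[gid] += 0.5f * i;
    }

    data[gid] *= 2.0f;   /* every work item does this */
}
[/CODE]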

If that is true then great for me, but here comes my next point of confusion.

When you use the clEnqueueNDRangeKernel() function you specify the number of dimensions in use as well as the number of work items each dimension uses, which cannot exceed the device's maximum work-group item count (in my case my 560 Ti only supports 1024 work items per work group).
So inside the kernel you must pass the dimension index in order to grab the work item's ID, but how exactly do you use this? As far as I can tell from the Work-Item Built-In Functions, there is no way to identify which dimension the current work item belongs to.
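To illustrate what I mean, here is a rough sketch (the sizes are made up, and `queue` and `kernel` stand for an already-created command queue and kernel):

[CODE]
/* Host side: dispatch a 2D NDRange of 512 x 512 work items,
   split into 16 x 16 work groups (16*16 = 256 <= 1024 on my 560 Ti). */
size_t global_size[2] = { 512, 512 };
size_t local_size[2]  = { 16, 16 };

cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    2,           /* work_dim         */
                                    NULL,        /* global offset    */
                                    global_size, /* global work size */
                                    local_size,  /* local work size  */
                                    0, NULL, NULL);

/* Kernel side: the argument to get_global_id() is the dimension index
   I choose to query, not something the work item tells me about itself. */
size_t x = get_global_id(0);
size_t y = get_global_id(1);
[/CODE]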

Would I need to create some sort of loop that handles the assignments along multiple dimensions, or what?

And finally, a bonus question: why call them work item dimensions? That sounds a bit misleading, since I would assume the work item count in my case with the 560 Ti is 1024 * 1024 * 64, when it is actually 1024 + 1024 + 64. Why not call them lanes or work item partitions? Dimensions is such a confusing description in my opinion.

[QUOTE=EmJayJay;31369]The kernel is executed line by line in all work items, even when part of the code is wrapped inside an if() statement. Work items that fail the condition simply ignore the code inside the if() and advance along with the other work items. The same happens with loops: if one work item continues the loop, the others do too. This makes the kernel's execution time equal to its longest execution path.[/QUOTE]Sort of. There's no "line by line", since a single CL-C statement might compile to multiple HW ISA instructions. Most importantly, the performance penalty from branching depends on how the WIs branch.

Let me use the AMD GCN 1.x architecture as an example. It executes "wavefronts" made of 16x4 = 64 WIs.
Let’s assume we dispatch work in a 1D index space so get_global_id(0) uniquely identifies each WI.

Example 1: even WIs (get_global_id(0) % 2 == 0) take the true path, odd WIs always take the false path. Result: you pay the full price.
Example 2: WIs with get_global_id(0) < 32 take the true path, the others the false path. Result: you still pay the full price, since the first wavefront (WIs 0..63) splits 32/32.
Example 3: let's consider 128 WIs, where get_global_id(0) < 64 take the true path and the others the false path. Result: no divergent branching; each wavefront is uniform, so WIs >= 64 just evaluate the if condition and skip the whole thing. In theory. In practice the compiler might "linearize" simple expressions.
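In kernel code the three cases look roughly like this (sketches only, assuming the 64-WI wavefronts described above and a linear mapping of global IDs to wavefronts):

[CODE]
/* Example 1: every wavefront mixes odd and even WIs -> divergence everywhere. */
if (get_global_id(0) % 2 == 0) { /* heavy work */ }

/* Example 2: the first wavefront (WIs 0..63) splits 32/32 -> divergence there. */
if (get_global_id(0) < 32)     { /* heavy work */ }

/* Example 3, with 128 WIs total: WIs 0..63 all take the branch, WIs 64..127
   all skip it -> each wavefront is uniform, so no divergence penalty. */
if (get_global_id(0) < 64)     { /* heavy work */ }
[/CODE]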

[QUOTE=EmJayJay;31369]So inside the kernel you must pass the dimension index in order to grab the work item's ID, but how exactly do you use this? As far as I can tell from the Work-Item Built-In Functions, there is no way to identify which dimension the current work item belongs to.

Would I need to create some sort of loop that handles the assignments along multiple dimensions, or what?[/QUOTE]There's no way. In practice the number of dimensions is a design decision, so you know it when writing the kernel; at worst, you set some dimensions to have just 1 WI.
Anyway, CL 2.0 has size_t get_global_linear_id(); you can implement it yourself if you want. It's not much of a big deal as far as I am concerned.
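For CL 1.1 a hand-rolled replacement could look something like this (a sketch, helper name arbitrary and not tested; it flattens the index the way the CL 2.0 built-in is specified, dimension 0 varying fastest):

[CODE]
/* Flattens up to 3 global dimensions into a single linear index.
   get_global_offset() is available from OpenCL C 1.1 onwards. */
size_t my_global_linear_id(void)
{
    size_t id = get_global_id(0) - get_global_offset(0);

    if (get_work_dim() > 1)
        id += (get_global_id(1) - get_global_offset(1)) * get_global_size(0);

    if (get_work_dim() > 2)
        id += (get_global_id(2) - get_global_offset(2)) *
              get_global_size(1) * get_global_size(0);

    return id;
}
[/CODE]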

[QUOTE=EmJayJay;31369]And finally, a bonus question: why call them work item dimensions? That sounds a bit misleading, since I would assume the work item count in my case with the 560 Ti is 1024 * 1024 * 64, when it is actually 1024 + 1024 + 64. Why not call them lanes or work item partitions? Dimensions is such a confusing description in my opinion.[/QUOTE]:confused: I cannot make any sense of this.
If you dispatch an NDRange of 1k x 1k x 64 then sure, you'll be executing 64M work items.
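For what it's worth, those 1024/1024/64 figures are what CL_DEVICE_MAX_WORK_ITEM_SIZES reports: per-dimension caps on the local work-group size, with the product further capped by CL_DEVICE_MAX_WORK_GROUP_SIZE. A quick host-side sketch to print them (the wrapper function is just for the example; `device` is assumed to have been obtained via clGetDeviceIDs()):

[CODE]
#include <stdio.h>
#include <CL/cl.h>

void print_limits(cl_device_id device)
{
    /* Assumes CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS == 3, which is typical. */
    size_t max_item_sizes[3];   /* per-dimension limit on the local size   */
    size_t max_group_size;      /* limit on the product of the local sizes */

    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(max_item_sizes), max_item_sizes, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(max_group_size), &max_group_size, NULL);

    /* A 560 Ti typically reports 1024 / 1024 / 64 and 1024 here. */
    printf("max work-item sizes: %zu x %zu x %zu, max work-group size: %zu\n",
           max_item_sizes[0], max_item_sizes[1], max_item_sizes[2],
           max_group_size);
}
[/CODE]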