I have a question about one of the basic concepts in OpenCL: work-items. Suppose you have an array of 1,000,000 elements and you want to run some code on each element, where each element is processed completely independently of the others. I can see two ways to do this:
1- You can have one work-item per element, which adds up to 1,000,000 work-items. As the GPU is unlikely to have that many PEs, one for each work-item, I think some work-items will have to wait until others have completed. Am I correct? How are work-items mapped to PEs at runtime?
2- Now suppose you want to unroll the parallel algorithm so that each work-item deals with more than one element. For example, if the total number of PEs is 100, then each work-item is responsible for processing 10,000 elements. How can I achieve this, assuming that I don't know the number of PEs on the GPU?
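
To make scenario 2 concrete, here is a rough sketch of the pattern I have in mind, written as plain C rather than as a real kernel so that it can be run on the host. The names process_element, work_item and run_all are hypothetical, and the doubling operation is only a placeholder for the real per-element work. In an actual OpenCL kernel I assume gid and global_size would come from get_global_id(0) and get_global_size(0), and the runtime would schedule the work-items instead of the loop in run_all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element operation; stands in for the real work. */
static float process_element(float x) { return 2.0f * x; }

/* Body of one work-item, written as plain C.
 * In an OpenCL kernel, gid and global_size would be obtained from
 * get_global_id(0) and get_global_size(0) instead of being passed in. */
static void work_item(float *data, size_t n, size_t gid, size_t global_size)
{
    assert(global_size > 0);
    /* Strided loop: work-item gid handles gid, gid + global_size, ...
     * so the n elements are covered whatever the number of work-items. */
    for (size_t i = gid; i < n; i += global_size)
        data[i] = process_element(data[i]);
}

/* Host-side stand-in for the runtime: runs the work-items one after
 * another. On a real device they would be scheduled onto the PEs. */
static void run_all(float *data, size_t n, size_t global_size)
{
    for (size_t gid = 0; gid < global_size; ++gid)
        work_item(data, n, gid, global_size);
}
```

With n = 1,000,000 and global_size = 100, each work-item in this sketch would touch 10,000 elements. Setting global_size = n reduces it to scenario 1, where each work-item touches exactly one element.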

I would really appreciate any suggestions!