From what I understand, OpenCL offers two styles for expressing data parallelism:
  • The first one is implicit: we write a kernel as if it were scalar, but many work-items actually execute it across the index space.
  • The second one is explicit: we use vector data types in the kernel.
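
To make the two styles concrete, here is a rough sketch of what I mean. It is plain C that only emulates the idea on the host, not real OpenCL: `float4_t` stands in for the built-in `float4` type, the `gid` parameter stands in for `get_global_id(0)`, and all function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's built-in float4 vector type. */
typedef struct { float x, y, z, w; } float4_t;

/* Style 1 (implicit): the kernel body is scalar and handles ONE element.
 * gid plays the role of get_global_id(0). */
static void add_scalar_item(const float *a, const float *b, float *out, size_t gid)
{
    out[gid] = a[gid] + b[gid];
}

/* Style 2 (explicit): each work-item handles one 4-wide vector. */
static void add_vec4_item(const float4_t *a, const float4_t *b, float4_t *out, size_t gid)
{
    out[gid].x = a[gid].x + b[gid].x;
    out[gid].y = a[gid].y + b[gid].y;
    out[gid].z = a[gid].z + b[gid].z;
    out[gid].w = a[gid].w + b[gid].w;
}

/* Emulated 1-D index space: n work-items, one float each. */
void run_implicit(const float *a, const float *b, float *out, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        add_scalar_item(a, b, out, gid);
}

/* Emulated 1-D index space: n/4 work-items, four floats each. */
void run_explicit(const float4_t *a, const float4_t *b, float4_t *out, size_t n)
{
    assert(n % 4 == 0); /* the vector style needs a multiple of the vector width */
    for (size_t gid = 0; gid < n / 4; ++gid)
        add_vec4_item(a, b, out, gid);
}

/* Returns 1 if both styles compute the same sums on a small input. */
int styles_agree(void)
{
    float a[8], b[8], out_s[8];
    float4_t va[2], vb[2], out_v[2];

    for (size_t i = 0; i < 8; ++i) {
        a[i] = (float)i;
        b[i] = 10.0f * (float)i;
    }
    for (size_t k = 0; k < 2; ++k) {
        va[k] = (float4_t){ a[4*k], a[4*k + 1], a[4*k + 2], a[4*k + 3] };
        vb[k] = (float4_t){ b[4*k], b[4*k + 1], b[4*k + 2], b[4*k + 3] };
    }

    run_implicit(a, b, out_s, 8);
    run_explicit(va, vb, out_v, 8);

    for (size_t k = 0; k < 2; ++k) {
        if (out_v[k].x != out_s[4*k]     || out_v[k].y != out_s[4*k + 1] ||
            out_v[k].z != out_s[4*k + 2] || out_v[k].w != out_s[4*k + 3])
            return 0;
    }
    return 1;
}
```

Both versions compute the same result; the only difference is whether the width of the data-parallel operation is left to the runtime (eight work-items of one float) or written into the kernel (two work-items of four floats).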

My question is: how do we know which style best fits a given piece of hardware?
I suppose the traditional answer is that the first style is for GPUs (as in C for CUDA) and the second one is for SSE, right?

But why couldn't a compiler optimize for SSE from the first style? And if compilers already do this, why does the second style exist at all?