2 data-parallel models

Hi,

From what I understand, OpenCL offers two styles for expressing data-parallelism:
[ul]
[li]The first one is implicit, we write a kernel as if it were scalar but many work-items actually executes it across the index space. [/:m:29bg0qqw][/li][li]The second one is explicit, we use vector data types in the kernel.[/:m:29bg0qqw][/ul][/li]
My question is: how do we know which style best fits a given piece of hardware?
I suppose the traditional answer is that the first style is for GPUs (as in C for CUDA) and the second one for SSE, right?

But why couldn't a compiler optimize for SSE with the first style? And if that is already the case, why does the second style exist at all?

I suppose the traditional answer is that the first style is for GPUs (as in C for CUDA) and the second one for SSE, right?

GPUs might very well be vector-based as well, given that they often deal with colours and vertices that have multiple components (RGBA / XYZW). I believe IBM's Cell processor has multiple vector-based cores, and I know of other GPUs/processors which are as well, although not the recent NVIDIA ones from what I gather. I've even seen designs that have multiple clusters of multiple vector cores on a single chip.

My question is: how do we know which style best fits a given piece of hardware?

Difficult, because every architecture is different (as above), so the balance of cores vs vector width can be tricky to judge.

I tend to split things depending on the type of problem. For example, I'm working on some particle system code. Each work-item is a single particle, but the particle's position, velocity and colour are all vector types within the code. That makes a lot of the code vector, but it's a natural fit for the problem, and actually makes the code easier to understand.

Granted, you can write code that processes multiple scalar values using vectors, but I’d only do that if I was sure I had enough work to keep a good number of work-items. I also wouldn’t go above 128-bit vectors (e.g. float4), as I don’t know of any architecture that has words any bigger.

But why couldn't a compiler optimize for SSE with the first style? And if that is already the case, why does the second style exist at all?

From what I’ve seen, one instance of a kernel = one thread / work-item, so a vector kernel wouldn’t get run across multiple scalar cores, nor would a scalar kernel be run in parallel within a vector core. That’s not to say people couldn’t do this, just that they don’t seem to at the moment.

Compiling vector code to a scalar processor is much easier than compiling scalar code to a vector processor and getting full parallelism out of it. After all, we’ve had SSE and friends for a long time, but not much success with exploiting them unless you hand-code routines. Writing the code in vector form, if the problem is suitable for it, makes the compiler’s job easier when targeting a vector processor and won’t really harm a scalar processor. All a scalar compiler will do is repeat the code / loop n more times.

Part of the problem is there’s a lot of information out there based on CUDA, but not much from anyone else, so it’s difficult to say what’s good in general. We just don’t know what maps well onto ATI GPUs, Larrabee, Cells, etc.

Well… I don’t.

That got a bit waffly, but hope it answers some things.

Well, thanks for your reply, it does answer some things.

Paul is basically right. Different architectures are optimized for different things. AMD’s GPUs are 4-way SIMD, so float4 is the optimal data type. Depending on the CPU, different SSE versions support different vector sizes. Nvidia’s GPUs are scalar, so vector ops are broken up.

The best way to start is to use the vector extensions in the manner that best maps to your algorithm. If you are using RGBA data or XYZW, then it is natural to use float4s. The backend will then map this in the best way to your target architecture. Once you’ve got that working, you can investigate the performance gains of manually vectorizing non-vector code for various platforms, but that is a second step.