3 Element Vectors in Memory

I’m converting some particle system code, which currently uses OpenGL shaders to update particle positions, over to OpenCL. The code stores the positions and velocities of particles as 3-element floating-point vectors, and uses vec3s to do the calculation.

Looking at the OpenCL spec I notice that vectors can only be sized in powers of 2. This isn’t a great hardship in the actual calculation, as I understand most hardware would have 128-bit vectors anyway, so I shouldn’t get a penalty. I would be specifying a third more work on a scalar architecture, but I’d hope that could be pruned if the compiler sees I’m not using the result.

The problem is reading the data into the kernel. Is there a nice way of getting the data in/out a single access, or am I resigned to doing something like:

__kernel void particle (__global float * positions, __global float * velocities) {
    int idx = get_global_id(0) * 3;
    float4 position = (float4) (positions[idx + 0], positions[idx + 1], positions[idx + 2], 1.0f);
    float4 velocity = (float4) (velocities[idx + 0], velocities[idx + 1], velocities[idx + 2], 0.0f);

    .....

    positions[idx + 0] = position.x;
    positions[idx + 1] = position.y;
    positions[idx + 2] = position.z;
    velocities[idx + 0] = velocity.x;
    velocities[idx + 1] = velocity.y;
    velocities[idx + 2] = velocity.z;
}

I’d like to keep the 3 element structure in memory to avoid a 33% increase in memory for no gain.

Paul

This may be more suitable in one of the other forums, but they don’t seem to be active. If the moderators want to move the thread that’s fine by me.

CL 1.0 doesn’t support 3-element vectors. You have several options:

1. Take the 33% hit up front and transfer 4-element vectors to the GPU.
2. Transfer the data as 3-element vectors and write your kernels to access it manually as such.
3. Transfer the data as 3-element vectors and run a kernel that unpacks it on the GPU.
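For options 1 and 3, the repacking step is just a stride change from 3 floats per particle to 4, with a pad value in the fourth component. A minimal host-side sketch (the helper name widen3to4 is hypothetical, not from the thread):

```c
#include <stddef.h>

/* Widen tightly packed xyz triples into xyzw quadruples so a kernel can
 * load each particle as a single float4. 'pad' fills the fourth
 * component: 1.0f for positions, 0.0f for velocities. */
void widen3to4(const float *src, float *dst, size_t count, float pad)
{
    for (size_t i = 0; i < count; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = pad;
    }
}
```

For option 3, the loop body (without the loop) would become the body of an unpack kernel running one work-item per particle, with i replaced by get_global_id(0).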

The optimal solution will depend on the amount of data you need to transfer and the speedup from using float4 types vs. float types on a given architecture. (E.g., on AMD you will see a big win from using float4, so options 1 or 3 are probably worth it; on Nvidia the gain will be smaller, but unaligned memory accesses may not be readily coalesced, so it’s not clear-cut.)