GPGPU on uint64, hints please

Hello,

I am new to OpenCL programming, so I would appreciate
any hints on the questions stated below. Assume there
is a large array (~10^7 elements or more) of unsigned
64-bit values. For each element, I want the GPU to test whether it is less
than a common 64-bit value C and to return the vector of the resulting
boolean answers. What is the best way to do that? Since the GPU
is not very strong at bit handling, I assume that inside a kernel
I should bundle a set of such answers into a reasonably-sized
scalar type, say ushort. So:

  1. Should I implement that as a simple compare-and-set loop
    which computes and sets the i-th bit of the result?

  2. Or maybe use the vector types, e.g. ulong16? This would
    compute a vector of 16 results the SIMD way, which sounds
    good, but I don’t see any vector-to-scalar reduce functions
    in OpenCL, such as “extract the sign bits” from the x86 SSE
    instruction set. Does OpenCL support any integer horizontal
    operations on vector types, such as add etc.?

  3. On top of that, it seems that the mid-range GPUs have
    some performance issues with 64-bit types, including integers.
    Should I use ulong in performance-critical code or manually
    decompose my input data into 32-bit chunks and then merge
    the respective partial results using e.g. a local memory look-up
    table?

    Best regards, Piotr

> I assume that inside a kernel I should bundle a set of such answers into a reasonably-sized scalar type, say ushort.
Yes, you should do that. Bit set/clear across work items would not be efficient.

A simple implementation would read 8 ulong values per work item, use the results of the 8 comparisons to set bits in a private uchar variable, and then write that result out. Each work item returns 8 bit-field flags. Simple and easy to write.
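For what it’s worth, a minimal sketch of that simple variant might look roughly like this (the kernel name, argument names and the 8-values-per-work-item layout are my assumptions, not code from this thread):

    __kernel void lt_flags_simple(__global const ulong* input,
                                  __global uchar* output,
                                  const ulong threshold)
    {
        const size_t gid  = get_global_id(0);
        const size_t base = gid * 8;          // 8 consecutive ulongs per work item

        uchar flags = 0;
        for (uint i = 0; i < 8; ++i)
            flags |= (uchar)(input[base + i] < threshold) << i;   // bit i = (value[i] < C)

        output[gid] = flags;
    }

Note that each work item then reads 8 consecutive ulongs, so the reads are not ideally coalesced; see the remark about coalescing below.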

If you wanted to get fancier, you could read a ulong16 in each work item, do a vector-against-scalar “<” compare with your ulong constant, and then use a bit hack to collect the 16 per-lane results into a ushort. I don’t think it would be any faster, though, because you’re going to be bandwidth-limited anyway (all the compute is going to be hidden by the I/O). Ideal read coalescing is when each work item reads 32 bits; here you’d be reading 64 to 1024 bits per work item.
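Just to illustrate that fancier variant (again only a sketch with my own names; the per-lane compare result is a long16 holding -1 for true, and the “horizontal” OR has to be spelled out by hand because OpenCL has no built-in integer reduce):

    __kernel void lt_mask_vec(__global const ulong16* input,
                              __global ushort* output,
                              const ulong threshold)
    {
        const size_t gid = get_global_id(0);
        const ulong16 v = input[gid];

        // Component-wise "<": each lane of cmp is -1 (all bits set) if true, 0 otherwise.
        const long16 cmp = v < (ulong16)(threshold);

        // Keep one bit per lane and move it to that lane's position in the mask.
        const ushort16 bits = convert_ushort16(cmp & (long16)(1))
                            << (ushort16)(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);

        // No built-in horizontal reduce, so OR the halves together by hand.
        const ushort8 r8 = bits.lo | bits.hi;
        const ushort4 r4 = r8.lo  | r8.hi;
        const ushort2 r2 = r4.lo  | r4.hi;
        output[gid] = r2.lo | r2.hi;
    }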

You are right; my experiments show that the memory access pattern does matter. I can easily slow the calculation down by a factor of 6 if I don’t care about coalescing. I have designed the code below assuming
that there should be no wasted global memory transfers (in 32-bit quantities) and that all the local memory
addresses computed in the loops should form permutations across the work group (i.e. no bank conflicts), which makes the
transposition phase somewhat fancy. :slight_smile:

When tested on a 512 MiB block of input data, it achieves 75.5% of the peak global
memory bandwidth of my not very advanced Radeon 6670, which looks very good. I’m curious
what results a Radeon R9 295x2 would provide (or a bunch of them…); I have requested one. I can
live perfectly well with 32-bit-only hardware, so I probably won’t need the Teslas.
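
Since the kernel operates on 32-bit halves, its combine step uses the standard way of building a 64-bit comparison out of word comparisons. As a plain scalar illustration (this helper is only an illustration; the kernel itself inlines the equivalent logic across lane pairs):

    // v < C   iff   v_hi < C_hi, or (v_hi == C_hi and v_lo < C_lo)
    // v == C  iff   v_hi == C_hi and v_lo == C_lo
    inline uint lt64_from_words(uint v_hi, uint v_lo, uint c_hi, uint c_lo)
    {
        const uint lt_hi = (v_hi <  c_hi);
        const uint eq_hi = (v_hi == c_hi);
        const uint lt_lo = (v_lo <  c_lo);
        return lt_hi | (eq_hi & lt_lo);
    }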

And here’s the kernel:

    // Each work item reads one 32-bit half of 32 different 64-bit inputs (even
    // and odd lanes handle opposite halves); the work group then combines and
    // transposes the per-half results into packed 32-bit output words.
    __kernel void selector(__global const uint* input,
                           __global uint* output,
                           __constant uint* comparand) {
        
        enum { group_size = 256 };   // compile-time constant (OpenCL C does not allow variable-length local arrays)
        const size_t grid = get_group_id(0);
        const size_t lid  = get_local_id(0);

        const uint c = comparand[lid];   // this lane's 32-bit half of the comparand C
        uint partial_eq = 0;
        uint partial_lt = 0;

        // Each lane scans 32 half-words; consecutive lanes read consecutive
        // 32-bit words, so every load in the loop is fully coalesced.
        for(size_t i = 0; i != 32; ++i) {

            const uint v = input[((grid * 32) + i) * group_size + lid];
            partial_eq |= ((uint)(v == c) << i);
            partial_lt |= ((uint)(v <  c) << i);
        }
        
        // Combine the per-word partial results of each even/odd lane pair into
        // full 64-bit answers (the lt/eq word-combine shown above).
        // CAUTION: the order of the updates below is crucial!

        __local uint partials[group_size];

        if ((lid % 2) == 0) {

            partials[lid] = partial_eq;
            partials[lid + 1] = partial_lt;
        }
      
        barrier(CLK_LOCAL_MEM_FENCE);

        if ((lid % 2) != 0) {

            partial_lt = partial_lt | (partial_eq & partials[lid]);
            partial_eq = partial_eq & partials[lid - 1];
            partials[lid - 1] = partial_eq;
            partials[lid] = partial_lt;
        }

        // Transpose the results

        barrier(CLK_LOCAL_MEM_FENCE);                  
        uint transposed = 0;

        {   // Each lane gathers one bit (src_bit_index) from 32 of the partial
            // words; the lane-dependent start offset makes each iteration's
            // local addresses a permutation, so there are no bank conflicts.
            const size_t src_bit_index = (lid / 2) % 32;
            const uint src_mask = 1u << src_bit_index;
            const size_t base_index = lid & ~0x3e;
            size_t loop_offset = (lid & 0x3e) / 2;

            for(uint i = 0; i != 32; ++i) {

                transposed |= (uint)((partials[base_index + loop_offset * 2] & src_mask) != 0) << loop_offset;
                loop_offset = (loop_offset + 1) % 32;
            }                
        }

        output[grid * group_size + lid] = transposed;
    }