I’m new to OpenCL. I am trying to demonstrate the benefit of float16, but without success.
I wrote a program that adds two arrays of 1024*1024*16 floats. On the GPU, the computation takes 0.03 seconds with float16 and 0.006 seconds with 16 separate floats. On the CPU, it takes 2 seconds. Why is the float16 version slower than the 16 * float version?
Thanks for your help.
Part of my code:
File Main.cpp:
// Define an index space (global work size) of threads for execution.
// A workgroup size (local work size) is not required, but can be used.
size_t globalWorkSize[1];
size_t localWorkSize[1];
// There are nbKernel threads
globalWorkSize[0] = nbKernel/16;
localWorkSize[0] = 512;
// Execute the kernel.
// 'globalWorkSize' is the 1D dimension of the work-items
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, globalWorkSize,
localWorkSize, 0, NULL, NULL);
clFinish(cmdQueue);
File .cl:
__kernel void vecadd(__global float16 const * const A, __global float16 const * const B, __global float16 * const C)
{
unsigned int const i = get_global_id(0);
C[i] = A[i] + B[i];
}
There are all sorts of reasons the float16 case might run slower:
A) The biggest problem is that the memory accesses will not be coalesced. The float16 load will be serialised into a number of smaller loads (float4?), which are spread across the address space. With a plain float type, every thread accesses a single float in a packed range, which can be serviced by a single memory load for each wavefront/warp.
B) You get less parallelisation of the ALU, i.e. each processor needs to do more than one operation to implement the arithmetic. (Whether this is faster or slower depends on the problem; extra parallelism isn’t always a win.)
C) You will have 1/16th as many threads running, and that might not be enough to hide the memory latencies, depending on the problem size.
D) Probably not a problem in this case, but if you did something more complex, you would be using many more registers, which limits how many threads can run concurrently on a given multiprocessor.
E) Maybe: if the compiler does a full load, then a full ALU op, then a full store, you don’t get such good interleaving of memory and ALU ops, which may prevent the memory latency from being hidden.
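Point A can be made concrete with a small model. The sketch below is plain C, not OpenCL, and rests on assumptions that are not confirmed for any particular GPU: a warp of 32 threads, 128-byte memory segments, a base address aligned to a segment, and a float16 load being split into four float4 (16-byte) loads. The function name segments_touched is hypothetical. It counts how many distinct segments one load instruction touches across the warp.

```c
#include <stddef.h>

/* Hypothetical helper: count the distinct memory segments touched when
 * `threads` work-items each read `access_bytes` bytes, with consecutive
 * work-items starting `stride_bytes` apart (base address assumed aligned).
 * Assumes access_bytes >= 1 and segment_bytes >= 1. */
size_t segments_touched(size_t threads, size_t stride_bytes,
                        size_t access_bytes, size_t segment_bytes)
{
    size_t count = 0;
    size_t next = 0; /* lowest segment index not yet counted */

    for (size_t t = 0; t < threads; ++t) {
        size_t start = t * stride_bytes;
        size_t first = start / segment_bytes;
        size_t last = (start + access_bytes - 1) / segment_bytes;

        /* Skip segments already counted for earlier work-items. */
        if (first < next)
            first = next;
        if (last + 1 > first) {
            count += last + 1 - first;
            next = last + 1;
        }
    }
    return count;
}
```

Under these assumptions, segments_touched(32, 4, 4, 128) returns 1 (the float kernel: 32 adjacent 4-byte reads fit in one segment), while segments_touched(32, 64, 16, 128) returns 16 (one float4 piece of a float16 read: the 32 threads are 64 bytes apart). Real hardware has caches and other details this model ignores, so treat the numbers as an illustration of the access pattern rather than a prediction of run time.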
I’ve never used float16, and I can’t imagine where it would be particularly useful for GPU or CPU performance. It might be worth it on CELL though, where instruction-level data pipelining is critical to performance and there are lots of registers.
GPUs seem to be optimised for float4 (not surprisingly: RGBA, XYZW, etc.).
The value of nbKernel:
nbKernel = 1024*1024*16, the size of the array, in the case of 16 floats
nbKernel = 1024*1024, in the case of float16
The declaration of the function:
__kernel void vecadd(__global float const * const A, __global float const * const B, __global float * const C) in the case of 16 floats
__kernel void vecadd(__global float16 const * const A, __global float16 const * const B, __global float16 * const C) in the case of float16
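To put numbers on the two launch configurations, here is a minimal sketch in plain C. The helper names work_items and work_groups are hypothetical, and the local size of 512 is taken from the host code posted earlier. It only does the arithmetic; it does not call OpenCL.

```c
#include <stddef.h>

/* Hypothetical helper: number of work-items needed when each one handles
 * `floats_per_item` floats. Assumes total_floats divides exactly. */
size_t work_items(size_t total_floats, size_t floats_per_item)
{
    return total_floats / floats_per_item;
}

/* Hypothetical helper: number of work-groups for a given local size.
 * Assumes exact division, which OpenCL 1.x requires of the global size. */
size_t work_groups(size_t global_size, size_t local_size)
{
    return global_size / local_size;
}
```

With 1024*1024*16 floats in total, the float kernel needs 16777216 work-items (32768 work-groups of 512), whereas the float16 kernel needs 1048576 work-items (2048 work-groups of 512). That is the factor of 16 fewer threads available to hide memory latency.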
Anyway, you’ve demonstrated that using float16 brings no benefit here and actually hurts performance. That is a result in itself.
The reasons I listed are possible ones; some of the actual hardware details are proprietary, so some of them are only guesses.
I suspect the main one here is that the memory reads aren’t coalesced properly. See the NVIDIA or AMD documentation (the ‘programming guide’ ones); they cover this pretty well with nice diagrams.