
Thread: disappointing half-precision performance - any advice?

  1. #1
    Junior Member
    Join Date
    Aug 2015
    Posts
    21

    disappointing half-precision performance - any advice?

    I bought a Vega 64 recently. From the specs, it has 23 TFLOPS fp16 throughput compared to 12 TFLOPS fp32, so I converted a portion of my Monte Carlo code to half, expecting a noticeable speedup. Disappointingly, instead of gaining speed, I got a 5% slowdown.

    The changes were made to a core function, which I believe is the bottleneck of the code (it accounts for maybe 1/4 of the run-time); see the key changes here:

    https://github.com/fangq/mcxcl/commi...6591ee8862R311

    For comparison, here is the float counterpart:

    https://github.com/fangq/mcxcl/blob/...e.cl#L252-L288

    My kernel is compute-bound.
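    To illustrate the kind of change I made, here is a stripped-down sketch (not the actual kernel, which is in the links above), assuming the cl_khr_fp16 extension is enabled:

    Code :
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable
    /* original fp32 path */
    __kernel void scale_f32(__global float *v, const float s){
        int i = get_global_id(0);
        v[i] = v[i] * s;
    }
    /* naive fp16 rewrite: scalar half math plus float<->half conversions */
    __kernel void scale_f16(__global half *v, const float s){
        int i = get_global_id(0);
        v[i] = v[i] * (half)s;
    }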

    I don't know in which scenarios converting to half typically brings a speedup. In my case, were the conversions or the extra registers responsible for the drop? Any dos and don'ts when using half?

    thanks


    PS: the code can be tested by

    Code :
    git clone https://github.com/fangq/mcxcl.git
    cd mcxcl
    git checkout nvidiaomp
    cd src
    make clean all
    cd ../example/benchmark
    ./run_benchmark1.sh -G 1 -J "-DUSE_HALF"

    Removing the -J "-DUSE_HALF" option runs the original fp32 code.

  2. #2
    Senior Member
    Join Date
    Apr 2015
    Posts
    310
    Vega improves half-precision performance by introducing packed arithmetic, similar to SSE for x86.
    https://gpuopen.com/amd-vega-instruc...documentation/
    This means that code like
    Code :
    half2 a = array[0];
    which used to translate to
    Code :
    load reg1 array
    load reg2 array + 2
    can be replaced with a single load into a single register. And things like
    Code :
    half2 a, b;
    half2 c = a + b;
    will require 3 registers and one operation instead of 6 and 2 as before.
    But casting vector datatypes to pointers to scalars and setting the vector's values on a per-component basis probably confuses the compiler and prevents this optimization.
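    For example (a minimal sketch, not your actual code), the compiler is much more likely to emit packed instructions for the vector form than for the per-component form:

    Code :
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable
    /* per-component form: the vector is reinterpreted as scalars and each lane
       is touched separately; the compiler may fall back to scalar fp16 ops */
    __kernel void add_scalar(__global half2 *a, __global half2 *b){
        int i = get_global_id(0);
        __global half *ap = (__global half *)&a[i];
        __global half *bp = (__global half *)&b[i];
        ap[0] = ap[0] + bp[0];
        ap[1] = ap[1] + bp[1];
    }
    /* vector form: one half2 load per operand and one vector add; this is the
       shape that can map to a single packed instruction */
    __kernel void add_vector(__global half2 *a, __global half2 *b){
        int i = get_global_id(0);
        a[i] = a[i] + b[i];
    }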

  3. #3
    Junior Member
    Join Date
    Aug 2015
    Posts
    21
    I want to make a correction to the commands for testing this issue: the git checkout nvidiaomp line needs to be removed. The corrected commands are

    Code :
    git clone https://github.com/fangq/mcxcl.git
    cd mcxcl/src
    make clean all
    cd ../example/benchmark
    ./run_benchmark1.sh -G 1 -J "-DUSE_HALF"

  4. #4
    Junior Member
    Join Date
    Aug 2015
    Posts
    21
    Quote Originally Posted by Salabar View Post
    Vega improves half-precision performance by introducing packed arithmetic, similar to SSE for x86.
    https://gpuopen.com/amd-vega-instruc...documentation/
    ...
    But casting vector datatypes to pointers to scalars and setting the vector's values on a per-component basis probably confuses the compiler and prevents this optimization.

    Thanks for the reply. I am wondering whether I need to add any special extension or compilation flag to enable the Rapid Packed Math (RPM) instructions. I did enable the cl_khr_fp16 extension; is that sufficient, given that I installed the amdgpu-pro Linux drivers?

    PS: I found someone who had difficulty with this on Arch Linux: https://github.com/plaidml/plaidml/issues/29
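    For reference, here is a minimal host-side sketch (standard OpenCL API; the function name is just a placeholder, not from mcxcl) that I can use to confirm the device actually reports cl_khr_fp16:

    Code :
    #include <string.h>
    #include <CL/cl.h>
    /* returns non-zero if the device extension string contains cl_khr_fp16 */
    int device_has_fp16(cl_device_id dev){
        char ext[8192];
        clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL);
        return strstr(ext, "cl_khr_fp16") != NULL;
    }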

  5. #5
    Senior Member
    Join Date
    Apr 2015
    Posts
    310
    It shouldn't need any special extension, though indeed it may turn out that their compiler does not support the feature yet. Use the CodeXL analyzer on this simple kernel to find out whether your compiler generates any instructions with a "pk_*" prefix:
    Code :
    #pragma OPENCL EXTENSION cl_khr_fp16 : enable
    __kernel void test(__global half2* a, __global half2* b){
        int i = get_global_id(0);
        a[i] = a[i] + b[i];
    }
