My colleague and I were recently puzzled by the following finding: there is an if() condition inside my OpenCL kernel, and we know that for a particular run this condition is always false. However, if we leave this unused if block in the .cl file and run the simulation, the run time is almost twice as long as when we completely remove the block from the source code (or disable it with #ifdef/#endif). Yet both runs produce the same output.
My question is: is this kind of behavior common in OpenCL's JIT compilation? Is there anything we can do to ensure that such overhead due to compilation optimization is minimized? A 2-fold difference seems significant in my application.
It is a mystery to us too, since we can't see the code. One possibility is that the compiler doesn't know the condition is always false (perhaps it is passed in as a kernel argument) and your condition has a lot of code in it, using lots of registers. This could lower occupancy on the GPU and therefore cut the speed, even though the code is never executed. On some architectures (like older AMD), there is a "fast path" that some kernels can take if they avoid doing certain things. Perhaps you do one of those things in your condition, so when you remove it you get onto the fast path (and were not on it before). Just two possibilities; I'm sure there are more.
in fact, my code is available online; if you are interested, just check out this git repository.
after git clone, switch to the mcxlite branch with 'git checkout mcxlite', then go to src and run make. It should produce a binary called mcxcl. Then go to example/quicktest and run 'run_qtest.sh' to do a benchmark.
the block that I found sensitive to the performance is this one:
when the "MCXCL_DO_REFLECT" macro is not defined, this block is not compiled by the JIT, and the speed of the simulation is 19600 photons/ms. If this macro is enabled when running the following command in the quicktest folder:
then the speed drops to 12000 photons/ms. The output results are exactly the same. The test was done on an NVIDIA card (980Ti), but a similar result was also observed on AMD cards.
surely it is not always false, otherwise there would be no need to have that block in the first place.
the block is enabled by an input parameter, gcfg->isreflect, which is located in constant memory. This flag is fixed for each kernel execution.
if I set gcfg->isreflect=false, I thought the JIT would know this when building the program and automatically remove the unneeded blocks?
I did notice that my clBuildProgram was called (line #376) before I passed the gcfg constants (line #434).
I don't think I can move line 434 before line 376, because mcxkernel has not been created until lines 388/398.
curious about this “fast path” technique, any links?
Funnily enough, the reason I thought was to blame doesn't explain the slowdown.
Your kernel is too big to run well:
Your code uses a lot of registers, which means only 512 threads can run simultaneously on AMD hardware. That is not enough to hide the high memory latency. The no-reflection variant uses only slightly fewer registers, which should not actually affect performance much (perhaps 108 vs 87 does make a difference on NVIDIA), but I can't tell without runtime data. Split your kernel into a sequence of smaller operations so the compiler can breathe more easily. It should improve general performance as well; just don't be overzealous about it.
Regarding compile-time code removal: you can add the “-D MCXCL_DO_REFLECT” compilation flag when calling clBuildProgram to give the compiler the right idea. Turning particular code paths off using language operators is a valid technique, but it should only be used when you would otherwise have to compile too many variants of the same kernel.
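To make this concrete, here is a minimal host-side sketch of generating the build-options string from the run-time setting, so the dead branch is removed at JIT-compile time. The helper name and the flag argument are hypothetical; only the MCXCL_DO_REFLECT macro name comes from the thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Compose the clBuildProgram options string from a run-time flag, so the
 * JIT sees MCXCL_DO_REFLECT as a compile-time #define and can eliminate
 * the reflection block entirely when it is off. */
static void build_options(char *buf, size_t len, int isreflect)
{
    snprintf(buf, len, "%s", isreflect ? "-D MCXCL_DO_REFLECT" : "");
}
```

On the host one would then rebuild the program with these options whenever the flag changes, e.g. `clBuildProgram(program, 1, &device, opt, NULL, NULL);` (the variable names here are illustrative).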
[QUOTE=Salabar;39750]Funnily enough, the reason I thought was to blame doesn't explain the slowdown. Your kernel is too big to run well…[/QUOTE]
@Salabar, thanks for looking into this, and for the helpful comments. Yes, this is a heavy kernel using a lot of registers. We have been optimizing the CUDA version of this software (mcx, GitHub: fangq/mcx), and nvvp also pointed out the high register usage. Despite this, memory latency only accounts for 3% of the total latency in the CUDA implementation (benchmarked on a 980Ti). The kernel seems to be compute-bound. Things could be different for the CL kernel.
for the CUDA version, we did some optimization to move registers (about 15-24) into shared memory. The performance actually went down. We were quite puzzled by this, and were not sure if that was the right direction to go. Perhaps we did not reduce registers enough to reach the critical point.
That's part of what I want to know here. I am glad you confirmed that this is a valid approach, although it is somewhat unexpected from what I have read. I thought the whole point of using JIT compilation in CL was to do run-time optimization: when all parameters are provided, the JIT compiler could efficiently 'recompile' for better performance. But it looks like it is not yet intelligent enough to recognize these settings.
yes, we have been doing profiling with CodeXL, but as I mentioned in the other thread (“Line-by-line time profiling for an OpenCL kernel” on the Khronos OpenCL forums), the CodeXL output was too coarse to provide specific guidance. For the CUDA version, the nvvp coming with CUDA 7.5 can already do line-by-line profiling on Maxwell. I wish I could find a similar tool for OpenCL. That would make optimization much more focused.
What are KernelOccupancy and VALUUtilization values in CodeXL?
[QUOTE=fangq]For the CUDA version, the nvvp coming with cuda 7.5 can already do line-by-line profiling on Maxwell. I wish I can find a similar tool for OpenCL. This will make optimization much more focused.[/QUOTE]
That's the second reason to split your kernel into a couple of smaller ones. It's just wiser design-wise, because it allows you to debug, test and profile operations of manageable size.
I am glad you asked; we actually have a pretty weird finding regarding these numbers on an AMD GPU.
a few days ago, my student submitted a patch to fix a speed regression issue; see this tracker.
the code changes only involved moving two floating-point accumulations (energyloss and energylaunched) outside a local function (launchnewphoton); see the diff here:
the two versions are essentially the same computation-wise; however, the new code runs 4x faster than the old one (3000 photons/ms vs 800 photons/ms) on a Radeon 7970. My student also looked into the profiling outputs of CodeXL, and sent me the following table:
I placed a "<<-" marker next to the items that were significantly different. It seems that, by simply moving those two additions outside this local function, vector operations suddenly became possible (is that true? I am not exactly sure how to interpret these numbers).
What made me even more puzzled was that, since energyloss/energylaunched were no longer needed inside launchnewphoton(), I asked my student to remove them from launchnewphoton's parameter list. Surprisingly, he found the speed went down again! The only way to get the higher speed was to keep those two parameters, and pass energyabsorbed in place of energylaunched (as shown in his patch, PR #9 on fangq/mcxcl: “replace energylaunched with energyabsorbed”).
I guess many tricky things can happen when running OpenCL (at least on the AMD card; on the NVIDIA card, the difference was not significant). That's why I'd like to do line-by-line profiling and find out all these hidden inefficiencies.
I agree, but it is very difficult to restructure a particle random-walk kernel into smaller ones. Each kernel run has to contain the entire life span of a particle (and repeat); otherwise, you have to save a lot of state into memory, which is expected to kill the speed.
of course, if you happen to know any other Monte Carlo code has successfully done so, I am happy to learn.
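To illustrate what would have to be spilled between split kernels, here is a hypothetical sketch of per-photon state; the field names are illustrative, not taken from mcxcl:

```c
#include <assert.h>

/* Hypothetical per-photon state a split kernel would have to write to
 * global memory at the end of one launch and re-read at the start of
 * the next.  Every field must survive the split, including RNG state. */
typedef struct {
    float pos[3];             /* current position */
    float dir[3];             /* propagation direction */
    float weight;             /* remaining packet weight */
    float tof;                /* accumulated time of flight */
    unsigned int rngstate[4]; /* per-thread RNG state */
    unsigned int mediaid;     /* id of the current voxel/medium */
} PhotonState;
```

At 52 bytes per photon, with millions of photons in flight, the extra global-memory round trip per kernel hop is exactly the traffic that is expected to kill the speed.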
Metric                              old code        new code
Thread ID                           4594            4409
Kernel Name                         mcx_main_loop   mcx_main_loop
Device Name                         Tahiti          Tahiti
Number of compute units             32              32
Max. number of wavefronts per CU    40              40
Max. number of work-groups per CU   40              40
Max. number of VGPR                 256             256
Max. number of SGPR                 102             102
Max. amount of LDS                  65536           65536
Number of VGPR used                 107             253       <<-
Number of SGPR used                 94              99
Amount of LDS used                  1               1
Size of wavefront                   64              64
Work-group size                     64              64
Wavefronts per work-group           1               1
Max work-group size                 256             256
Max wavefronts per work-group       4               4
Global work size                    16384           16384
Maximum global work size            16777216        16777216
Nbr VGPR-limited waves              8               4         <<-
Nbr SGPR-limited waves              20              16
Nbr LDS-limited waves               40              40
Nbr of WG-limited waves             40              40
Kernel occupancy                    20              10        <<-
the kernel occupancy actually dropped from 20% to 10%, despite the 4x speed improvement.
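The occupancy figures in the table are consistent with a simple VGPR-limit calculation for Tahiti (a 256-entry VGPR budget per SIMD, 4 SIMDs per CU, at most 40 resident wavefronts per CU); a sketch:

```c
#include <assert.h>

/* Wavefronts per CU when limited only by vector registers on GCN/Tahiti:
 * each of the 4 SIMDs can hold floor(256 / VGPRs-per-thread) wavefronts. */
static int vgpr_limited_waves(int vgpr_used)
{
    const int vgpr_per_simd = 256, simds_per_cu = 4;
    return (vgpr_per_simd / vgpr_used) * simds_per_cu;
}

/* Occupancy as a percentage of the 40-wavefront-per-CU hardware maximum. */
static int occupancy_pct(int waves_per_cu)
{
    const int max_waves_per_cu = 40;
    return 100 * waves_per_cu / max_waves_per_cu;
}
```

With 107 VGPRs this gives 2 wavefronts per SIMD, i.e. 8 per CU and 20% occupancy; with 253 VGPRs only 1 per SIMD, i.e. 4 per CU and 10%, matching the CodeXL numbers above.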
Once the kernel code only has the divergence the algorithm intrinsically requires, the real problems should show up. By the way, how different is the CUDA code from this? What are the CUDA profiling results?
[QUOTE=fangq]of course, if you happen to know any other Monte Carlo code has successfully done so, I am happy to learn.[/QUOTE]
If there is a way to calculate an upper bound on the random numbers required by each work-item, that could be a start.
No clue how the little modification changes this so dramatically.
I think I figured it out. The compiler is stuck between two options that are equally bad according to its heuristics. That little change doesn't do much in particular, but it shook something up at random and made the compiler believe that adding a lot of scratch registers isn't as bad anymore. It did pay off, but it seems like a coincidence to me.
as I mentioned earlier, this was a regression, introduced in an earlier commit.
before this change, the utilization rate was closer to the ~60% seen with the corrected code.
I agree that it has tons of branches, and generally speaking, the less divergence the better. But to be honest, I am not convinced that these branches are the bottleneck of the code. That's why I want to find a profiler that gives me more direct evidence.
Part of my doubt came from the profiling results of the CUDA version. The CUDA version shares almost the same structure/complexity as the OpenCL version (but recently implemented more accurate algorithms, thus it is slower). However, the nvvp profiler output did not seem to suggest major issues with divergence or branching. Below is the latency contribution report generated by nvvp.
From the line-by-line latency analysis in nvvp, we did find a hotspot in a device function, but I wasn't sure whether any of these metrics identified branching or divergence as the main cause of the latency. Curious if you have any thoughts on this?
mind explaining this in more detail? I am particularly interested in understanding how you arrived at this conclusion. Perhaps those metrics are more telling than I thought.
It's only a guess. In the first variant, the optimizing compiler didn't want to allocate scratch registers; it managed exactly zero. To achieve this, it had to handle branching very sub-optimally, which made performance degrade. In the second one, some heuristic (X instructions in a loop, or whatever) triggered. It snapped in the compiler's head that the amount of spilled registers no longer mattered as much, and optimizing whatever it intended to optimize coincidentally improved VALU utilization. It doesn't happen on NVIDIA because they use different heuristics, but it is likely possible to make their compiler do something weird as well.
As for your CUDA profiling: it takes 20% of the time simply to fetch instructions. I found that the compiled kernel is 170 Kbytes, while Radeon's instruction cache is only 32 Kbytes. On the other hand, CodeXL showed a great cache hit ratio, but I don't know if it accounts for code fetches. What is this metric for simple kernels like a reduction? If it is much smaller, you may try restructuring your bigger loops by splitting them into a few consecutive ones. This should allow the GPU to use its instruction cache more effectively (it makes the kernel flow more linear, so to say, and the GPU won't jump all over the code anymore). Another measure to make code more linear is to keep branching very short. Instead of
if (whatever){
    a = compute_x();
} else {
    a = compute_y();
}

use

x = compute_x();
y = compute_y();
a = (whatever) ? x : y;
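A compilable toy version of this transformation, with compute_x/compute_y replaced by trivial stand-ins, might look like this:

```c
#include <assert.h>

/* Branchy version: under divergence the GPU may execute both sides
 * serially, and the jumps scatter the instruction stream. */
static int pick_branchy(int whatever, int x_in, int y_in)
{
    int a;
    if (whatever)
        a = x_in * 2;   /* stand-in for compute_x() */
    else
        a = y_in + 1;   /* stand-in for compute_y() */
    return a;
}

/* Predicated version: both values are computed unconditionally, then a
 * select picks one; this typically compiles to a conditional move with
 * no control flow.  Only worthwhile when both sides are cheap. */
static int pick_select(int whatever, int x_in, int y_in)
{
    int x = x_in * 2;
    int y = y_in + 1;
    return whatever ? x : y;
}
```

Both functions compute the same result; the trade-off is that the predicated form always pays for both sides, so it only wins when compute_x/compute_y are short.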
It appears to me, though, that it still comes down to the sheer size of the kernel. How to mitigate that is probably beyond my expertise in GPGPU.