barrier() crashes on Intel HD Graphics 630 + Apple OpenCL 1.2

Simply adding the line “barrier(CLK_GLOBAL_MEM_FENCE);” to a function causes an error when run on Intel HD Graphics 630.

  • Without a printf the function takes a very long time to execute (~10 seconds) and does not appear to execute subsequent statements.
  • With a printf included the program crashes when flush() is called.

However, on the same system the function runs quickly and correctly on both the CPU and the discrete GPU.

Setup:
macOS Sierra 10.12.6
MacBookPro (15-inch, 2017)
3.1 GHz Intel Core i7
Intel HD Graphics 630
Radeon Pro 560

The same problem is encountered when calling barrier(CLK_LOCAL_MEM_FENCE).
However, when calling an atomic (e.g. atomic_inc) the behavior is correct.

Is there some special way I should be handling barrier() calls?

First rule is: All work items in the work group must hit the barrier() call.
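
As a minimal sketch of that rule, here is a kernel that keeps the barrier in uniform control flow, embedded as a C string the way host code would typically hand it to clCreateProgramWithSource(). The kernel name, its arguments, and the helper function are hypothetical and are not taken from the original poster's code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical kernel source, kept as a C string for
 * clCreateProgramWithSource(). Out-of-range work items do not return
 * early: they skip the global load and store but still reach the barrier,
 * so every work item in the work group hits the same barrier() call. */
static const char *const kUniformBarrierKernel =
    "__kernel void sum_with_neighbour(__global float *data,\n"
    "                                 const uint n,\n"
    "                                 __local float *scratch)\n"
    "{\n"
    "    const size_t gid = get_global_id(0);\n"
    "    const size_t lid = get_local_id(0);\n"
    "\n"
    "    /* Guard the load, not the barrier. */\n"
    "    scratch[lid] = (gid < n) ? data[gid] : 0.0f;\n"
    "\n"
    "    /* Reached by all work items: nothing above exits early. */\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "\n"
    "    if (gid < n) {\n"
    "        const size_t next = (lid + 1) % get_local_size(0);\n"
    "        data[gid] = scratch[lid] + scratch[next];\n"
    "    }\n"
    "}\n";

/* Length in bytes of the source string, suitable for the `lengths`
 * argument of clCreateProgramWithSource(). */
size_t uniform_barrier_kernel_length(void)
{
    return strlen(kUniformBarrierKernel);
}
```

The pattern to avoid is an `if (gid >= n) return;` placed before the barrier, because the work items that return never arrive at it.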

@Dithermaster: all work items reach the barrier. Also, the same function with the same inputs runs to completion on the Intel CPU and on the AMD GPU. Only the integrated Intel GPU fails.

I’m facing the same error after adding the line barrier(CLK_GLOBAL_MEM_FENCE); have you been able to figure out how to work around it?

@garryjoshi I’m glad to hear that I’m not alone in hitting this issue.

Unfortunately, I never found a resolution to the problem - I simply excluded the integrated GPU from the context.
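
For anyone wanting to do the same, a rough sketch of the filtering step follows. The `device_summary` struct and both function names are hypothetical; in real host code the fields would be filled from clGetDeviceIDs() and clGetDeviceInfo(), and matching on the substring "Intel" is an assumption that should be checked against the vendor string your platform actually reports.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical summary of one OpenCL device.
 *   index  : position in the array returned by clGetDeviceIDs()
 *   vendor : string queried with CL_DEVICE_VENDOR
 *   is_gpu : nonzero if CL_DEVICE_TYPE has the CL_DEVICE_TYPE_GPU bit set */
typedef struct {
    size_t index;
    const char *vendor;
    int is_gpu;
} device_summary;

/* Returns 1 for a GPU whose vendor string mentions Intel, 0 otherwise.
 * An Intel CPU device is kept because is_gpu is 0 for it. */
static int is_intel_gpu(const device_summary *dev)
{
    return dev->is_gpu && strstr(dev->vendor, "Intel") != NULL;
}

/* Compacts `devices` in place, dropping Intel GPUs, and returns how many
 * entries remain. The surviving `index` values identify the cl_device_id
 * entries to pass on to clCreateContext(). */
size_t exclude_intel_gpus(device_summary *devices, size_t count)
{
    assert(devices != NULL || count == 0);

    size_t kept = 0;
    for (size_t i = 0; i < count; ++i) {
        if (!is_intel_gpu(&devices[i])) {
            devices[kept++] = devices[i];
        }
    }
    return kept;
}
```

On a machine like the one described above, this would keep the CPU and the Radeon and leave out the HD Graphics 630, provided the vendor strings match the assumption.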

Hi Gabriel, have you tried contacting the Khronos support team? I’ve contacted them, but it has been almost two weeks since I sent my email and I haven’t had any response from them yet.

Khronos won’t help you with Intel drivers. Intel support or the Intel developer forum is a better bet, though I’m not sure by how much.

[QUOTE=garryjoshi;43029]Hi Gabriel, have you tried contacting the Khronos support team? I’ve contacted them, but it has been almost two weeks since I sent my email and I haven’t had any response from them yet.[/QUOTE]

Who did you contact at Khronos? Feel free to email me at webmaster at khronos.org.

[QUOTE=AngelGabriel;42827]Simply adding the line “barrier(CLK_GLOBAL_MEM_FENCE);” to a function causes an error when run on Intel HD Graphics 630.[/QUOTE]

If you feel this is a bug in OpenCL, you are welcome to post an issue on our issue tracker on GitHub. This may very well be an issue with the Intel implementation and, as Salabar says, the Intel forums might be a better place to post.

I realize this is not a very timely response, but it just came to me. If your kernel is very big, try to split it into multiple smaller kernels. On Radeon GPUs, if your kernel is too big and the compiler has to resort to register spilling, I can bloody guarantee you that your GPU will not do whatever you expect it to do. It won’t crash, but the cause can be the same. Big kernels are bad for performance, so vendors assume you aren’t writing them; as a result no one tests this path properly, and this part of the compiler is a bug-ridden toxic wasteland.

When I have a clean example to share I will post on the Intel forum and link here for reference. (No time yet to make that example)

Would you please clarify how I could identify a “very big” kernel?
Whether or not this is the cause of the behavior I encountered, I imagine that at some point (probably the worst possible one) I will hit this issue.

AMD has an analyzer that can show, e.g., the number of spilled registers. In their case, any number other than zero basically means undefined behavior. Intel or Apple should have a similar tool, though I don’t know whether it is free.
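
Without a vendor analyzer, the standard OpenCL queries can still give a rough hint. The sketch below is only a heuristic under stated assumptions: neither query reports spilled registers directly, the function name is hypothetical, and `private_limit` is a budget you would have to pick yourself. The inputs are meant to come from clGetKernelWorkGroupInfo() (CL_KERNEL_WORK_GROUP_SIZE, CL_KERNEL_PRIVATE_MEM_SIZE) and clGetDeviceInfo() (CL_DEVICE_MAX_WORK_GROUP_SIZE).

```c
#include <stddef.h>

/* Heuristic only: returns 1 if a kernel looks resource-heavy on a device.
 *
 *   kernel_max_wg : CL_KERNEL_WORK_GROUP_SIZE for this kernel and device
 *   device_max_wg : CL_DEVICE_MAX_WORK_GROUP_SIZE for the device
 *   private_bytes : CL_KERNEL_PRIVATE_MEM_SIZE (a cl_ulong in the API)
 *   private_limit : hypothetical per-work-item budget chosen by the caller
 *
 * The implementation derives the kernel's maximum work-group size from its
 * resource requirements, so a value below the device maximum suggests the
 * kernel is demanding. A large private memory size points the same way. */
int kernel_looks_heavy(size_t kernel_max_wg,
                       size_t device_max_wg,
                       unsigned long long private_bytes,
                       unsigned long long private_limit)
{
    const int wg_reduced = kernel_max_wg < device_max_wg;
    const int private_high = private_bytes > private_limit;
    return wg_reduced || private_high;
}
```

A positive result is a reason to look closer or to try splitting the kernel, not proof that registers are being spilled.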