OS X Lion 10.7.4 update broke my kernel

– new drivers, apparently –

(This is on AMD only (5870); it works fine with nVidia and Intel i7 and Xeon)

clBuildProgram fails with “Error getting function data from server”, but the good news is that there’s much more info in the console log, including:

5/10/12 2:29:21.792 PM com.apple.cvmsCompAgent_x86_64: Both operands to a binary operator are not of the same type!
5/10/12 2:29:21.792 PM com.apple.cvmsCompAgent_x86_64: %34 = fadd <4 x float> %33, i32 %32
5/10/12 2:29:21.792 PM com.apple.cvmsCompAgent_x86_64: Instruction does not dominate all uses!
5/10/12 2:29:21.792 PM com.apple.cvmsCompAgent_x86_64: %34 = fadd <4 x float> %33, i32 %32
5/10/12 2:29:21.792 PM com.apple.cvmsCompAgent_x86_64: store <4 x float> %34, <4 x float>* %Y, align 16
5/10/12 2:29:21.792 PM com.apple.cvmsCompAgent_x86_64: Broken module found, compilation aborted!

… however, that’s at a lower level than what I have available to me before I send off the OpenCL-C to the compiler!

Any hints on how to track these down? Is it the case that I have an IL representation that compiled under the old version but now fails? OR, is the IL itself the product of a new bug? How can I tell?..

Thanks for any ideas!

… plus, for what it’s worth, if I go back to unvectorized code (which worked before the update), clBuildProgram hangs my whole Mac Pro, requiring a hard power-off…

** I posted a very similar thread on the AMD forum, but it may well be of interest here, and I may have more luck here too…

Please file a bug to Apple with a test case.

I know I should, but:

  1. I don’t know if I can file a bug with Apple without paying the $100 registration fee, and I hear the forums there are pretty slow, so I don’t know if it’s worth it.

  2. I don’t think I have the time to whittle down my kernel into a test case for them … and the whole thing is 1800 lines of code; they don’t want to see that, and I’m not letting it out the door.

  3. The Pro is basically a production machine, and I mostly need this to work, so I may just go back to an earlier OS.

… but …

  1. I will spend part of this weekend trying to isolate the parts of my code that make it fail, and if I can do so without too much trouble then maybe I can make a test case. If not, I’ll just back off the update for a while.

Thanks for your reply!

… p.s. … as for the IL lines I quoted from the log, I don’t know that I can even get the IL for the whole thing; I ran across some Apple documentation that says 10.6 supports only device-specific executable binaries, not an intermediate representation, so unless that has changed, I can’t even see it…

Regarding the $100 and whether it’s worth it, I’ll just share my experience. I’ve submitted three OpenCL bugs to Apple via the Apple Bug Reporter since January. Only one elicited any follow-up inquiry from Apple. None of them have been resolved.
Now, I didn’t pay the $100 because students can get a free Apple developer ID. I’m not sure if that puts me at a lower priority though.

Is it the case that I have an IL representation that compiled under the old version but now fails? OR, is the IL itself the product of a new bug? How can I tell?..

Are you trying to use clCreateProgramWithBinary? Or doing a fresh clCreateProgramWithSource after the 10.7.4 update?

Cheers,

Noah

  1. Yes, Noah; the slowness of Apple OpenCL driver updates is another reason I wondered whether it was worth reporting to them. (Plus, it might not be a ‘bug’.)

  2. It’s a fresh build from source. I’d been changing the kernel so much that I hadn’t implemented caching the binary yet, plus I wasn’t sure I was getting a good binary back (it turns out, I think, that Apple doesn’t support an “IL”-style binary, just the device-specific final executable).

… but …

It turns out the error I quoted was generated by almost the first executable line in my kernel, where I was clearly adding an int32 to a float4; presumably it was auto-converted before because the numbers used to be right. Adding a (float4) cast to the int32 fixed that.
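For anyone hitting the same error, here is a minimal sketch of that kind of fix, written as a kernel embedded in a host-side C string. The kernel name `add_offset`, the argument `n`, and the way `Y` is indexed are all assumptions for illustration (only the name `Y` shows up in the log above); it is not the actual kernel from this thread.

```c
/* Hypothetical kernel source, embedded as a C string the way host code
 * usually hands it to clCreateProgramWithSource. All names are made up. */
static const char *kernel_src =
    "__kernel void add_offset(__global float4 *Y, int n)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    /* Before (relied on implicit int -> float4 conversion):\n"
    "     *     Y[gid] = Y[gid] + n;\n"
    "     */\n"
    "    /* After: convert the int to float, then cast the scalar to a\n"
    "     * float4, so both operands of the add are float4. */\n"
    "    Y[gid] = Y[gid] + (float4)((float)n);\n"
    "}\n";
```

The two-step cast is deliberately explicit so that no conversion is left to the compiler; whether a shorter form would also get past the 10.7.4 AMD compiler is not something this sketch can tell you.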

BUT, now I’m embroiled in more mysterious bugs without any cool diagnostic info; it just tells me what pass it fails on, and my only method to search it down is what I call the “méthode tédieuse”, commenting out huge chunks of my kernel to see when it compiles, then cutting the chunks down, then trying again. I’ve identified a problem with a tweening function I’ve had trouble with before, and now need to find another solution for. Also, some kind of problem with stepping through a character buffer.
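The “méthode tédieuse” is essentially a bisection, and it can be sketched in host-side C as below. Everything here is hypothetical: the `build_fn` callback stands in for "assemble the source with these chunks enabled and call clBuildProgram", and `demo_builds` is only a fake predicate so the sketch can be run without a GPU. It also assumes exactly one chunk breaks the build on its own, which may not hold if the failure depends on how chunks interact.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true if the kernel builds when only the
 * chunks flagged in enabled[0..n) are compiled in (the rest stubbed out,
 * e.g. with #if guards). A real version would call clBuildProgram. */
typedef bool (*build_fn)(const bool *enabled, size_t n, void *ctx);

/* Returns the index of the chunk that breaks the build, or n if the full
 * kernel builds. Assumes a single culprit that fails regardless of which
 * other chunks are enabled. */
size_t find_failing_chunk(build_fn builds, bool *enabled, size_t n, void *ctx)
{
    for (size_t i = 0; i < n; i++)
        enabled[i] = true;
    if (builds(enabled, n, ctx))
        return n;                      /* nothing to hunt for */

    size_t lo = 0, hi = n;             /* culprit lies in [lo, hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        /* Keep the lower half of the suspect range, comment out the upper. */
        for (size_t i = lo; i < hi; i++)
            enabled[i] = (i < mid);
        if (builds(enabled, n, ctx))
            lo = mid;                  /* lower half is clean */
        else
            hi = mid;                  /* culprit is in the lower half */
    }
    return lo;
}

/* Fake build predicate for trying the sketch: fails whenever the chunk
 * whose index is stored in *ctx is enabled. */
bool demo_builds(const bool *enabled, size_t n, void *ctx)
{
    size_t culprit = *(const size_t *)ctx;
    return !(culprit < n && enabled[culprit]);
}
```

With n chunks this needs roughly log2(n) builds rather than one build per chunk, which matters when each failed build costs a long wait or a reboot.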

It now fails on the ‘AMD IL Swizzle Encoder Pass’, which is at least two passes later than it used to fail, but that’s all it’ll give me…

If I find more of a solution, I’ll post it here…

Cheers!
Dave

I upgraded to 10.7.4 and now offline compilation seems to not work anymore.
The Offline Compilation sample from Apple builds successfully, but the program fails with CL_BUILD_PROGRAM_FAILURE on ATI Radeon HD5850. It works fine on GeForce GTX 470, however, installed in the same machine.

Why did you ask about clCreateProgramWithBinary? Are there any known problems with it?

Update: build error log contains ‘Invalid ALLOCA record’. Does it ring a bell?

It rings a little one for me. But I think I read it while doing internet searches based on the names of the various OpenCL C compiler passes where failure occurred; I don’t think I’ve seen it on my machine, but I’m not certain…

I just re-partitioned the main drive in the MP so I can boot 10.7.4 to work on this problem but boot 10.7.3 when I want it to work!..

(… oh by the way affie I haven’t given up on providing Apple with a test case … I did isolate the problem to subtracting two text pointers in my strpos replacement function, buuuut, when I isolate that code it works. Something is happening earlier which makes that innocuous operation fail.)