CL/GL Interop, OSX -- ever shared a Renderbuffer or Texture?

Problem: Attempting to share a Renderbuffer (or a Texture) fails when clSetKernelArg() is called.

I’ve spent days on this and am finally asking for help.

My program generates frames for a video projector that runs at 60fps (16.7ms frames).

My kernel runs in (typically) 24ms, but it’s taking 50ms between frames. I assume some of the extra cost is because I’m using the GPU to calculate the pixels, then enqueuing a readbuffer to pull the data off the GPU, then using glDrawPixels to put it back onto the GPU for display. A perfect situation to try OpenGL/OpenCL interoperation, right? It would avoid the two extra copy operations.
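
To be concrete, the path I’m trying to get rid of is essentially this (a rough sketch; the names are illustrative, not my exact code, and it assumes <OpenCL/opencl.h> and <OpenGL/gl.h>):

static void show_frame_noninterop( cl_command_queue q, cl_mem clOutput,
                                   void *hostPixels, int wid, int hgt )
{
    /* pull the finished RGBA frame off the GPU ... */
    clEnqueueReadBuffer( q, clOutput, CL_TRUE, 0, (size_t)wid * hgt * 4, hostPixels, 0, NULL, NULL );
    /* ... then push it straight back up to the GPU for display */
    glDrawPixels( wid, hgt, GL_RGBA, GL_UNSIGNED_BYTE, hostPixels );
}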

There are many examples, and I have succeeded in sharing a VBO with OpenCL, and can write to it, but that doesn’t help me. I don’t want to write vertex data, just the 2-D image that’s been calculated.

There are examples of two different ways to do this, and they both involve Framebuffer objects.

You can attach a Renderbuffer to a Framebuffer, or you can attach a Texture to a Framebuffer.

Then you should be able to write to that buffer in OpenCL and display it with OpenGL, with no extra copies.

I have found a few examples of this in code, and I think I’m doing everything exactly the way the examples say to, but maybe it is broken in OSX? … because it doesn’t work. The FBO is “Complete”, and there are no errors along the way until I try the clSetKernelArg. That call returns error -38, CL_INVALID_MEM_OBJECT.

*note: I would rather use a Renderbuffer than a Texture, since all I’m doing is making a 2-D RGB image that I want to display. But I tried a Texture out of desperation. Still no help.

I do these steps, in this order, with some other stuff in between:

// grab the current GL context and its sharegroup (used later for the CL context)
kCGLContext = CGLGetCurrentContext();
kCGLShareGroup = CGLGetShareGroup( kCGLContext );

// create and bind the framebuffer object
glGenFramebuffers( 1, &fboid );
glBindFramebuffer( GL_FRAMEBUFFER, fboid );

// create the renderbuffer that CL is supposed to write into
glGenRenderbuffers( 1, &rboid );
glBindRenderbuffer( GL_RENDERBUFFER, rboid );
glRenderbufferStorage( GL_RENDERBUFFER, GL_RGBA, rb_wid, rb_hgt );

glboid = rboid;

// attach it as the FBO's color attachment
glFramebufferRenderbuffer( GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, rboid );

then:

// CL context tied to the GL sharegroup via Apple's CGL sharegroup property
cl_context_properties ourprops[] = { CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE, (cl_context_properties)kCGLShareGroup, 0 };

contextZ = clCreateContext( ourprops, 1, &dev_idZ[0], clLogMessagesToStdoutAPPLE, NULL, &err );

// wrap the GL renderbuffer as a CL mem object
clbo = clCreateFromGLRenderbuffer( contextZ, CL_MEM_WRITE_ONLY, glboid, &err );

then later:

err = clSetKernelArg( kernelZ, 1, sizeof(cl_mem), &clbo );

… which fails with error -38, CL_INVALID_MEM_OBJECT.


(I’ve tried adding a second Renderbuffer and attaching it to a Depth Attachment Point, in case that was needed. No help.)

(I’ve tried the same thing with a Texture bound to the FBO, and I get the same error in the same place.)
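
(For the texture case the create call is presumably just the 2-D texture analogue, something like the line below, with texid being whatever was attached via glFramebufferTexture2D; again illustrative, not my exact code:)

clbo = clCreateFromGLTexture2D( contextZ, CL_MEM_WRITE_ONLY, GL_TEXTURE_2D, 0, texid, &err );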

… does anybody have any ideas at all?

Anything output to the console or when you run with CL_LOG_ERRORS=stderr?

Nope! (Though thanks for teaching me about CL_LOG_ERRORS.)

my output to the log along the way:

glCheckFrameBufferStatus is GL_FRAMEBUFFER_COMPLETE

create context err= 0

clCreateFromGLRenderbuffer err= 0

clCreateCommandQueue err= 0

clSetKernelArg &clbo returned -38

I have tried so many variations … and when I look at examples of code telling me “this is how you do this”, I don’t see anything wrong. I think that if there were an error, something along the way would be clobbered – but all my gl and cl objects check out. It’s only when clSetKernelArg is called that it finally gives an error. Pretty frustrating!

Did you forget the enqueueAcquireGLObject() and the glFinish()?

That lost me some hair one day when I was looking at this …

Nope; those are in there, but thanks for asking! I was just cherry-picking the lines of code that I thought might have an obvious bad or mismatched parameter. Those two are called, then a clEnqueueWriteBuffer for the kernel’s input variables (delivered in a structure), then a clSetKernelArg for that input structure (which works), then the clSetKernelArg for the output buffer, which was created from the GL buffer object with no problems reported; that last clSetKernelArg is what fails. clbo is a static cl_mem, just like the buffer object that’s used when interop is off. The difference is that it was created with clCreateFromGLRenderbuffer instead of clCreateBuffer, and it lives in a context created in association with the GL sharegroup.
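
To be concrete, the per-frame sequence looks roughly like this (a sketch only; the queue and param names and the input arg index are made up, the output arg index is as in the snippet above, and it assumes <OpenCL/opencl.h> and <OpenGL/gl.h>):

static cl_int run_frame( cl_command_queue queueZ, cl_kernel kernelZ,
                         cl_mem clparams, cl_mem clbo,
                         const void *params, size_t params_size )
{
    cl_int err;

    glFinish();                                                    // let GL finish with the shared renderbuffer
    clEnqueueAcquireGLObjects( queueZ, 1, &clbo, 0, NULL, NULL );  // hand it over to CL

    // input structure for the kernel
    clEnqueueWriteBuffer( queueZ, clparams, CL_TRUE, 0, params_size, params, 0, NULL, NULL );
    err = clSetKernelArg( kernelZ, 0, sizeof(cl_mem), &clparams ); // this one succeeds
    if( err == CL_SUCCESS )
        err = clSetKernelArg( kernelZ, 1, sizeof(cl_mem), &clbo ); // this is the call that returns -38

    // clEnqueueNDRangeKernel, clEnqueueReleaseGLObjects and clFinish would follow here
    return err;
}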

I’m a bit puzzled that it fails at that point; I didn’t think anything was checked in the passed-in address of the cl_mem object. If anything were wrong with that I was expecting an error later, while the kernel was executing; not when I first hand the pointer up to it. I mean, the kernel hasn’t looked at it yet; there can’t be a size check because clSetKernelArg doesn’t know anything about size; that’s in the kernel, which hasn’t had the opportunity to raise its head yet in this scenario. It should just be an empty output buffer as far as cl is concerned; all it is is an empty place to store a stream of bytes…

One more detail that I’d meant to add, possibly only marginally relevant: I’ve used the same mechanics to successfully attach a GL buffer object (a VBO), written my image data into it (which is nonsensical for a VBO), and called the display stuff, which is useless but doesn’t crash, and it runs continuously until I interrupt it. So, again, I’m kinda puzzled as to why this fails, particularly at that point.

There are downloadable example projects for Xcode (e.g. grass / oceanwave), but those all use vertex buffers. All I wanna do is paint the plain old, vanilla pixels that are already calculated onto the screen. Perhaps the Universe is playing with me?.. O_O

clSetKernelArg lists a few possible error return codes, some of which require checking the arguments. The argument itself is just a pointer, but the implementation has the kernel’s calling conventions available, so it could do plenty of checking if it wanted to.

Only thing I can think of is an incompatible image format/renderbuffer setup - but clCreateFrom … should catch that.

I suppose try posting a complete example and see if anyone can help …

Thanks for the pointers, folks.

I did manage to get the cl compiler stuff redirected to the system console, ajs2, but at first that just gave me the text equivalent of error -38, “CL_INVALID_MEM_OBJECT”, so I didn’t think I was much further than before.

But you were right, notzed: clSetKernelArg does do some checking of the pointer against the kernel arg type. I resisted rewriting the kernel to use an image2d_t for the output buffer arg for a long time, partly because I was so insistent on just writing bytes through a pointer without using the image-access functions, and partly because it meant having two .cl files (one for interop, one for non), two cached binaries, some doubled variables, etc., and I wouldn’t know whether that work was even worthwhile until it was done. But ultimately I did it.

And once that was done, there was more detailed information in the system log. Not just the invalid mem obj message, but a line before that like “Kernel argument 2 should not be write-only, but object &0xnnnnnnnn is write-only”. This told me that it needed to be write_only in the kernel of course, which was easy enough, but more importantly that it was worth doing that work and that it did check the kernel args, but just didn’t have a super useful error message to offer earlier.

Now there still isn’t anything on the screen(!), which is frustrating, but the kernel now does a write_imageui() for each pixel, then there’s a gl framebuffer blit that seems to execute, and the inter-frame interval is much better than non-interop with its two extra pci bus transfers. Next, just to get it to actually put the pixels where they can be seen … maybe worth another post later, or add to this one.
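
(The blit I mean is just the usual read-FBO-to-window copy, roughly like this; the sizes and the GL_LINEAR filter are whatever the demo needs, not my exact code:)

static void blit_to_window( GLuint fboid, int src_wid, int src_hgt, int dst_wid, int dst_hgt )
{
    glBindFramebuffer( GL_READ_FRAMEBUFFER, fboid );  // the FBO the kernel wrote into
    glBindFramebuffer( GL_DRAW_FRAMEBUFFER, 0 );      // the default (window) framebuffer
    glBlitFramebuffer( 0, 0, src_wid, src_hgt, 0, 0, dst_wid, dst_hgt,
                       GL_COLOR_BUFFER_BIT, GL_LINEAR );
}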

This is a bit OT but you might find it useful …

Actually images are very good for image data - primarily as they provide automatic conversion on reads/writes and interpolation on reads, and work well with a 2D access pattern.

And unless you have a specific algorithmic requirement for integers, you’ll probably find that doing everything in floats is easier and runs faster; that is what the hardware has been optimised for. The same code will then also work with different storage formats (normalised unsigned 8-bit, 16-bit, or float).
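
A tiny illustration of that point (not your code, just the idea): with a UNORM_INT8 image, read_imagef hands you normalised floats back and the sampler can do the filtering for you:

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_LINEAR;

__kernel void copy_filtered( __read_only image2d_t src, __write_only image2d_t dst )
{
    int2 p = (int2)( get_global_id(0), get_global_id(1) );
    // hardware conversion to normalised float plus linear interpolation at the sample point
    float4 v = read_imagef( src, smp, (float2)( p.x + 0.5f, p.y + 0.5f ) );
    write_imagef( dst, p, v );
}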

BTW if you’re using UNORM_INT8 you need to use write_imagef, not write_imageui.

Ding! Yes, I did stumble across that fact via extensive searching just a few hours before you sent your message, and then at last there was “something on the screen”! From that point on it was much less frustrating, as the effects of any changes were visible, and now it’s working beautifully. That was the opposite of my first reading of the functions, however; I’d thought you’d use “ui” to write unsigned integers, “f” to write floats, etc. SO, a critically useful answer that I just happened to stumble upon earlier, and I’m as grateful as if I’d heard it from you first.
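
For anyone else who lands here: the output side of the kernel ends up looking roughly like this (a minimal sketch, assuming the RGBA / UNORM_INT8 renderbuffer above; the real per-pixel calculation goes where the placeholder colour is):

__kernel void render_frame( __write_only image2d_t out )
{
    int2 p = (int2)( get_global_id(0), get_global_id(1) );
    float4 rgba = (float4)( 0.0f, 0.0f, 0.0f, 1.0f );  // placeholder for the real per-pixel calculation
    write_imagef( out, p, rgba );                      // UNORM_INT8 target wants write_imagef, values in 0..1
}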

On the rest, yes it’s float4 all the way up until the very end. I really like to put every bit just where I want it, and am not super comfortable not being able to explicitly build each byte for display, but it works! Also, I had no plans to do any scaling, conversion etc. – but, a blit from a smaller size to full screen happens to be good for demos on the laptop (with some GL_LINEAR interp), though not for showtime on the big machine, where every pixel shall be explicitly calculated. Very cool! Works with either a Texture or a Renderbuffer.

Now on to the next problem, which I’ll probably post on the AMD board. What’s just been described works on nVidia on the MBP as well as the (weak) nVidia on the Mac Pro. However, on the big-dog five-hundred-dollar AMD 5870, it fails on create context, waaay before any of this fiddly stuff – “cannot find device 0xnnnn in context 0xnnnn”. If there’s no obvious answer to that, I’ll just shell out for an nVidia 570 or 680; they supposedly have better oCL throughput anyway.

Cheers!

On my first attempt at images - back when the drivers barely supported them - I did the same thing and wondered what was going on. It left such a bad taste I didn’t touch images again for months.

And that whole ‘nothing @##$@ works’ thing is a massive barrier to getting started - very frustrating.

Now on to the next problem, which I’ll probably post on the AMD board. What’s just been described works on nVidia on the MBP as well as the (weak) nVidia on the Mac Pro. However, on the big-dog five-hundred-dollar AMD 5870, it fails on create context, waaay before any of this fiddly stuff – “cannot find device 0xnnnn in context 0xnnnn”. If there’s no obvious answer to that, I’ll just shell out for an nVidia 570 or 680; they supposedly have better oCL throughput anyway.

Well, hopefully the amd forum can help - it’s something that does work, so it might be an install or driver issue, or a bug.

I’m not sure where you heard such a thing, but from every benchmark I’ve seen the 680 is pretty poor for OpenCL (sometimes very poor). It looks like NV are now targeting a different market - i.e. games, with better power efficiency - since ‘gpgpu’ hasn’t really taken off as a selling point for mass-market cards. And with those architectural changes the CUDA/OpenCL performance dropped off significantly; it’s less than some of their older cards and miles behind the GCN stuff except on very specific workloads. That’s not even counting the fact that OpenCL always seemed to be a dirty word around nvidia.

And that just seems to be getting worse …
http://www.streamcomputing.eu/blog/2012 … or-opencl/

Which is a bit of a pity.

Well, without posting at amd yet, it seems some people have problems in general if they do all the GL init first (create context, create FB, create & attach textures/renderbuffers, then create the CL context) – which is what this program does. It would mess up the compartmentalization a bit, but it may be worth trying, before asking over there, to create the GL context, then the CL context, then the GL buffers…

570/680 – I’ve read so many discussions my head spins, mostly on the MacRumors board threads, so I’m probably confused. It’s clear that the 680 is crippled in double-precision float performance, but hasn’t become clear for me on single-precision. But it does kinda look like the 570 is the way to go with nVidia. Yes it is a pity that nV seem to be playing the pouty child with their CUDA vs “Open” CL.

I was very excited about the GCN cards, but don’t know when Apple will provide support. The GTX 5xx/6xx seem to work out of the box with 10.7.5 drivers. But, if the interop works on the 5870, it should be enough to do 33ms frames at 1024x768, which would be fine for now.

(Far OT now, but that should be okay; the question has after all been answered and the thread is winding down with other info that may be of interest to a curious reader…:))

Hmm, I created the gl context first then the cl context.

I had to do some weird stuff to get the gl context out of the glut-like stuff provided by JOCL though, and/or run the cl init in the glut.init() callback.

570/680 – I’ve read so many discussions my head spins, mostly on the MacRumors board threads, so I’m probably confused. It’s clear that the 680 is crippled in double-precision float performance, but hasn’t become clear for me on single-precision. But it does kinda look like the 570 is the way to go with nVidia. Yes it is a pity that nV seem to be playing the pouty child with their CUDA vs “Open” CL.

Well nvidia seem to be ‘crippled’ on double for consumer cards in general. The 680 has more problems than that though for compute. I’m not sure a site called ‘macrumours’ should be your canonical source of information.

I was very excited about the GCN cards, but don’t know when Apple will provide support. The GTX 5xx/6xx seem to work out of the box with 10.7.5 drivers. But, if the interop works on the 5870, it should be enough to do 33ms frames at 1024x768, which would be fine for now.

(Far OT now, but that should be okay; the question has after all been answered and the thread is winding down with other info that may be of interest to a curious reader…:))

Ahh vendor lockin - well that’s your choice!

Hmm, I created the gl context first then the cl context.
Ok, but did you create the GL buffer objects before creating the CL context, or after? I do the GL context, then all the GL buffer objects, then the CL context (due to the original compartmentalization; all the kernel-ish stuff is in one source file). But that could be a problem on AMD. From their board:

“To use shared resources, the OpenGL® application must first create an OpenGL® context and then an OpenCL™ context. All resources created after the OpenCL™ context has been created can be shared between OpenGL® and OpenCL™. If resources are allocated before the OpenCL™ context is created, they cannot be shared between OpenGL® and OpenCL™.” – (user genaganna).

… though I’m a bit surprised that it fails at clCreateContext, before anything attempts to be shared. (Plus, it works fine on nV.)
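
So, per that note, the creation order presumably has to become something like this (just a sketch, reusing the names from earlier in the thread):

// 1. GL context first (CGL / NSOpenGLContext set up elsewhere)
kCGLContext = CGLGetCurrentContext();
kCGLShareGroup = CGLGetShareGroup( kCGLContext );

// 2. CL context from the sharegroup, before any shared GL objects exist
contextZ = clCreateContext( ourprops, 1, &dev_idZ[0], clLogMessagesToStdoutAPPLE, NULL, &err );

// 3. only then create the FBO / renderbuffer (or texture) and wrap it with clCreateFromGLRenderbuffer
glGenFramebuffers( 1, &fboid );
glGenRenderbuffers( 1, &rboid );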

I’m not sure a site called ‘macrumours’ should be your canonical source of information.
I am sure you’re right about that! … but, it was attractive due to several threads (macos subpart) describing how to get the Fermi / Kepler cards to work on osx, with or without flashing. But, apparently they work in 10.7.5+ right out of the box.

Ahh vendor lockin - well that’s your choice!
… thinking … oh, okay; you mean the fruit company…
This project began over 20 years ago on mac os 7 or 8. Whereas I learn quickly and have written code for many architectures, it seemed the quickest path from old CodeWarrior was to stay on mac and force-learn Xcode, Cocoa, Objective-C and OpenCL; at least the system services might be familiar. But, it had been a very long time (it was still in CodeWarrior under SheepShaver up until 2011), and the apis had been superseded several times, so it may not have slowed the process down much to throw a new OS (linux?) onto the pile too. But also, I want to be able to spec a system to someone halfway around the world that will run this, so sticking to a major vendor and being able to name an OTC box might still have been a good idea. If I ever generate enough revenue to justify doing this full-time I may branch out; it might be wise to have more than one roost…

Yeah, I created the cl context first thing in the init callback. I only did some experiments and wasn’t retro-fitting it to an existing application, which is a different kettle of fish. I probably started with an example which had already gone through working out the various interop issues.

If those are the requirements then I guess you have no option - at least it failing to create a context straight away tells you that you can’t do that. It’s probably some internal optimisation so that GL contexts and objects don’t need to worry about OpenCL stuff unless you’re trying to interoperate (or maybe it’s just an internal hack to make it work, who knows?). Well, at least you know how to tackle it.

(I read the other comments but it’s veering off topic a bit - but 20 years, well at least you’ll be used to it by now :wink: )

I can’t see any mention of this opengl requirement in the amd programming guide (appendix g), apart from the fact that the devices in the context must all be opengl-sharing compatible.