Suggestions for next release of OpenCL

Refer OpenCL Spec 1.2

Section 6.2.3
Explicit conversions may be performed using the
convert_destType(sourceType)
suite of functions. These provide a full set of type conversions between supported types (see
sections 6.1.1, 6.1.2 and 6.1.3) except for the following types: bool, half, size_t,
ptrdiff_t, intptr_t, uintptr_t, and void

Section 6.2.3.1
Conversions are available for the following scalar types: char, uchar, short, ushort,
int, uint, long, ulong, float, and built-in vector types derived therefrom.

There are datatypes like
double,
image2d_t
image3d_t
image2d_array_t
image1d_t
image1d_buffer_t
image1d_array_t
sampler_t
event_t
Which are covered in section 6.2.3, but not in 6.2.3.1. What is expected of these datatypes?

double is handled like float.

The other data types (image, sampler, event) are neither scalar types nor vector types, so section 6.2.3.1 does not apply to them.
They can’t be cast to another type. They are simply considered opaque types.

The relational select() function is very handy for vectorization and mimics the ternary ()?: op. But, the supported types in scalar and vector modes for doubles (and halfs) is inconsistent with the relational comparison functions such as isgreater().

Scalar prototypes:


int isgreater (double a, double b);
double select (double a, double b, long cmp);

Vector prototypes:


longn isgreater (doublen a, doublen b);
doublen select (doublen a, doublen b, longn cmp);

The scalar isgreater() (and similar) functions match the c99 math.h prototypes and return int for all datatypes. But, select() only accepts long for double (and short for half). This requires an explicit cast in most (all?) implementations and makes for some headaches when building type-independent code. That is, we can’t cleanly write

T select (T a, T b, isgreater(a,b));

and expect it to work with double and doublen. This same issues occurs with halfs. I have to wrap an #ifdef statement to distinguish scalar and vector types.


#if (__VectorSize == 1)
   // ()?: version
   // double result = (isgreater(a,b)) ? b : a;
   double result = select (a, b, (long) isgreater(a,b));
#else
   double2 result = select (a, b, isgreater(a,b));
#endif

I propose that select() accept the datatype output of the relational functions in both scalar and vector modes. That is, accept (int) for all datatypes in scalar modes and accept the equivalent bit-masks in vector mode.

Hello List,
I would like to be able to load-balance my algorithm onto both intel cpu and amd gpu
at the same time.

Now, Intel SDK supports intel hardware, and AMD SDK supports AMD hardware.

How can I develop a solution that targets both platforms concurrently?

Make SPIR version number the same as the OpenCL version it belongs too.
Reduces potential confusion.

Releasing temporary buffers in the middle of a chain of kernels executing asynchronously is currently cumbersome. It requires either a synchronization with the device to guarantee that all pending operations using the buffer have finished, or a clumsy event callback on a marker with wait list (or even worse through a native kernel if the device supports it).
The drawback of the first is that it introduces needless synchronization just to release memory, and the disadvantage of the second besides the horrible syntax is the fact that there is no guarantee as to when the callback will be invoked.

I think it would be useful to have a function such as clEnqueueReleaseMemObject, which can be pushed onto a queue with the traditional wait list and attached event. It would do exactly the same as clReleaseMemObject with the added advantage that it can be woven into a complex task graph to release the memory as soon as it is not needed.

Proposed function:

cl_int clEnqueueReleaseMemObject ( cl_command_queue command_queue,
                                   cl_mem memobj,
                                   cl_uint num_events_in_wait_list,
                                   const cl_event *event_wait_list,
                                   cl_event *event )

Has this been already discussed?

It’s not clear to me what the problem is. There is no requirement that all pending operations using a buffer complete before you can release it - the buffer will only be destroyed when the reference count is 0 and all commands that use it have completed.

Can you give an example of the sequence of operations that you are trying to perform, and where you would like to release the buffers?

Right, my bad, I missed that part. I based my assumption of the note of clSetKernelArg (5.7.2):

A kernel object does not update the reference count for objects such as memory, sampler objects specified as argument values by clSetKernelArg, Users may not rely on a kernel object to retain objects specified as argument values to the kernel.
and the definition of reference counting from the spec (2):
After the reference count reaches zero, the object’s resources are deallocated by OpenCL.

So I thought that temporary buffers could only be safely released after synchronization. But the doc of clReleaseMemObject indeed says that the object stays alive event with a ref count of zero as long as it is used by an object in the command queue:

After the memobj reference count becomes zero and commands queued for execution on a command-queue(s) that use memobj have finished, the memory object is deleted.

Thanks for pointing that out.

We’re restructuring and cleaning up our forums. This will be the official thread for everyone to post their suggestions for the next version of OpenCL. We have moved the most recent suggestions into this thread already. We look forward to seeing more suggestions.

Hello, I originally wrote this in AMD CL support forums but as noted it’s really not a vendor issue so I registered.

I propose a function to wait on a single event of a set. This could be in various forms:
[ul]
[li]cl_int clWaitForAnyEvent(cl_uint num_events, const cl_event *event_list), basically as now or[/li][li]cl_int clWaitEvent(cl_bool all, cl_uint num_events, const cl_event *event_list), in an attempt to save an entry point by putting the other one in a deprecated status[/li][li]cl_uint clWaitEvent(cl_bool all, cl_uint num_events, const cl_event *event_list, cl_int *error), to allow return of a triggering event index, [/li][/ul]

The behavior of select(…) is to wake up when at least one watched descriptor is “ready”.
Pthreads takes it easy with a single condition variable to pthread_cond_wait.
Windows has WaitForMultipleObjects(…) which allows to sleep pretty much on everything. It will wake up when at least one event is triggered but it is possible to require all events to be triggered.

clWaitForEvents returns CL_SUCCESS if the execution status of all events in event_list is CL_COMPLETE.

To wait on the first event only, it is necessary to put a callback system in place. Leaving aside this has to be done with some care it seems to me that assembling the wait for all events operation from multiple wait for an event operations to be simpler than the opposite.

Leaving aside select(…) also updates lists, which does not seem like anything reasonable to me.

Maybe not pertinent to this specific thread, I would like to know the rationale behind the decision of wait for all.

I haven’t currently read the CL2.1 spec. A quick search suggests this function is not there.

On my todo list: migrate my application to a one-work-queue-one-thread architecture.
How is this relevant to OpenCL?
I’ve stumbled on a CUDA presentation lately and apparently they make a point of having one thread driving all the possible GPUs.
Considering I have less than 10% load on a 800Mhz K10 core I still think resolving to multiple threads is a bad habit. They should be for CPU-performance only as far as I am concerned albeit I could understand a one-thread-per-context approach.

Therefore, I am renewing my request for clWaitAnyEvent.

Can we pretty please have constant variables that can be set from the host code?

OpenGL has these in the form of uniform variables, CUDA has them in the form of cudaMemcpyToSymbol().

Everyone I talked to agreed that CL should support these, and they would certainly make my life infinitely easier.

Can we pretty please have constant variables that can be set from the host code?

You can use CL_MEM_READ_ONLY buffers just for that. Architectures with fast constant memory may take advantage of it, others use global memory. In both cases kernel compiler can use read_only modifier.

Indeed (that is essentially what I am doing now), but the point is to do it without having to add extra parameters to the kernel. Adding a bunch of uniform variables of heterogeneous types that the kernel can then read from is much much easier than having to add a huge number of extra kernel parameters. Especially when some of the kernel is auto-generated or pieced together from a bunch of templates – e.g. a users piece of code/function is taken and then substituted into some wrapper function that is then actually invoked as kernel. Think e.g. of having some accessor-function “int c = get_some_constant_thing()” in GLSL, that now has to become “mykernel(…, some_constant_thing){int c = get_some_constant_thing(some_constant_thing);}”. Having uniform variables available would produce much prettier and more readable code than having to put gargantuan function signatures all over the place.

Talking about auto-generating code, can we get the “#line” pragma from GLSL? That would also be very handy.

Adding (not removing, rather) lots of stuff that is redundant, yet handy, is what made OpenGL a mess it was before 4.0 and still kinda is. If you CAN do something on your own, it means you should do it on your own. It can be a bit of a pain, but this is the only way to keep the API clean. If you don’t like to add arguments on your own, write a small run-time \ build-time preprocessor which will modify OpenCL C code and a wrapper over clCreateKernel which will call clSetKernelArg right after creation. And, preferrably, you should release your code as an open-source library so we all could benefit from having uniforms in OpenCL. Any additional feature increases the chance it will be bugged on some platform.

That’s a pretty vague blanket statement. I could likewise say that by that argument no new OpenCL version should ever be released and no new feature should ever be added. Things that are useful should be added. Otherwise the API is dead and will soon become out-competed by something else.

Adding uniforms would objectively make life a lot easier for a lot of projects, and not having uniforms is considered to be a blocking factor for porting to CL for some projects I know. I’ll also point out that this is a feature that every other GPU API has, including recent ones and recent iterations of old ones: CUDA, OpenGL, Direct3D, Mantle, Metal.

“write a small run-time \ build-time preprocessor” - that’s what I’ve done, but having to do this is a really ugly workaround. Unless you’re willing to include a full parser dependency into your application, you will not be able to fully reliably parse kernel parameters (think e.g. nested //, /* */ comments in combination with commented-out kernels)

False equivalence. There is a difference between adding features to expose functionality that doesn’t exist, and adding features that are redundant with other features or are otherwise mere syntactic sugar. You have a way to furnish an OpenCL kernel with user-provided values. Thus, adding another mechanism is redundant.

Remember: OpenGL got uniforms before uniform blocks/buffers. OpenGL would never have added non-block uniforms if they’d started with uniform blocks to begin with.

Making your life easier is not the purpose of OpenCL. Its purpose is to provide a low-level, cross-platform environment for accessing various computational hardware on a particular device. Convenience features are “higher” level, thus working against that whole “low-level” thing.

D3D hasn’t had uniforms of that kind since D3D10. The only way to communicate values to D3D10+ is with a constant buffer, which is structurally identical to CL_MEM_READ_ONLY buffers. Similarly, Mantle doesn’t have uniforms, and neither does Metal; both of them require you to communicate to shaders with memory objects. And you can bet Vulkan won’t have them either.

The closest thing to uniforms that Mantle has are dynamic memory views, but even those are blocks of memory that you must allocate, manage, and attach to shaders before executing them. Exactly like CL_MEM_READ_ONLY buffers.

So the only APIs that have them are OpenGL (grandfathered, per the above) and CUDA. Thus, your suggestion that this is some universal feature that OpenCL must have to compete is simply untrue.

Get used to it; that’s the shape of the future. Why else do you think SPIR-V is around? They want you to write front-ends that compile to SPIR-V, which you then give to the driver to do its compilation and execution. That makes it easier for you to build higher-level layers, where you can provide all the convenience features you want.

That’s also why #line isn’t coming either. SPIR-V takes care of that with its various decoration opcodes.

New features are supposed to expand computation model or make it more flexible, like device-side dispatch. Syntactic sugar like this should remain in libraries, say CLUT. They might as well be Khronos driven and standartised.

Mantle
It is rather complicated here, but I’d say Mantle resource system is closer to OpenCL’s than OpenGL’s.

@Alfonse Reinheart

To clarify, I am not particularly concerned about the distinction between UBOs and uniforms (individual uniform variables). Either would work for my use-case. And D3D10+, mantle, OpenGL 4+, metal, CUDA et al do indeed have UBOs in some form or other.

“which is structurally identical to CL_MEM_READ_ONLY buffers” structurally – yes. But that wasn’t exactly my complaint about them… my complaint is that I don’t want to stink up function signatures everywhere with loads and loads of unnecessary parameters.

I’d also say there is a bit of a difference between “not making someones life easier” and “expecting someone to implement a compiler”. Most projects will be able to deal with an API that makes their life harder, but not nearly have the budget, manpower or time to implement a compiler.

I’d also say there is a bit of a difference between “not making someones life easier” and “expecting someone to implement a compiler”. Most projects will be able to deal with an API that makes their life harder, but not nearly have the budget, manpower or time to implement a compiler.

OpenCL 2.1 will ship with open-source CL C -> SPIR-V compiler. I don’t know how hard it is gonna be to hack in global variables there, but it is certanly possible. I’m still not sure how this

__kernel void qwe(__constant unistruect* uni ..
//
clCreateKernel("qwe");
clSetKernelArg(0, &uniformbuffer);

is considered

stink up function signatures everywhere with loads and loads of unnecessary parameters.

though. Yes, you have to write “uni->” all the time, but this is what autocompletion is for.