OpenCL 2.0 considerations

Hi!

I know I’m posting quite late about the OpenCL 2.0 ratification, but I hope it’s not too late. There are a few things I think should be taken into consideration when updating the specs.

I believe that the specs should place a larger burden on the implementors in terms of features. There are lots of things that everyone implements over and over and over… again, and most of them are very hard to implement in a performance-portable manner. Therefore I would suggest creating a lot more platform extensions that focus on use cases common to many libraries:

FFT, BLAS, …

Vendors implement these APIs anyway, since they are rather ‘core’ features that applications use, but each vendor’s interface is slightly different, which again makes users’ lives a living hell. If these libs were optional platform extensions (not vendor-specific, but one cl_khr_fft interface to all FFT implementations), users could get vendor-tuned features in a portable manner.

The same goes for algorithms. OpenCL (despite being a C API) should focus a lot more on STL conformance, or at least specify frontends to parallel versions of the STL algorithms that operate on buffers (similarly, cl_khr_algorithms). I think a lot of people would even kill for these features.

cl.hpp should be given a lot more attention. With VS2013 on the way with variadic template support, the 13k-line header should be reduced to a normal size, now that variadics no longer need to be written out by hand. cl.hpp increases compile times like crazy. The C++ documentation should not be a supplement to the C version, but a standalone document. It would simply rock the world if the aforementioned platform extensions got a C++ interface too.

C++ kernel language!!! It is a must in 2013. If it is too late to bring it into the specs, then at least adopt the C++ static kernel language implemented by AMD. It is just compiler magic atop the existing C language: a little syntactic sugar and TEMPLATES… The biggest pain is making truly performance-portable code, because types are determined at runtime, and templates would be a big advantage in making code portable. That applies to things such as vector widths (!) too, not just float/double and the like.
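To make the template point concrete, here is a rough host-side C++ sketch (not kernel code, and not AMD’s static C++ syntax) of what a single templated definition buys you. The name axpy and the use of std::array as a stand-in for OpenCL vector types are my own assumptions for illustration. In plain OpenCL C the usual workaround is macros fed through -D build options, one program build per type/width combination.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// One definition covers every element type and vector width; the compiler
// generates a concrete version for each (T, Width) pair that is actually used.
// std::array<T, Width> stands in for an OpenCL vector type such as float4.
template <typename T, std::size_t Width>
std::array<T, Width> axpy(T a, const std::array<T, Width>& x,
                          const std::array<T, Width>& y) {
    static_assert(Width > 0, "vector width must be positive");
    std::array<T, Width> out{};
    for (std::size_t lane = 0; lane < Width; ++lane) {
        out[lane] = a * x[lane] + y[lane];  // a * x + y, lane by lane
    }
    return out;
}
```

Calling axpy with std::array<float, 4> arguments and with std::array<double, 2> arguments instantiates two separate versions from the same source, which is exactly the part that has to be done by hand with macros today.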

Vector indexing (my_float4 = buf[my_int4];). Yet again, it’s no black magic. I really don’t know why it was forbidden in the first place.
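What I mean is a plain gather. Below is a small host-side C++ sketch of the semantics I would expect; Float4, Int4 and gather4 are made-up names for illustration, not anything from the spec. In today’s OpenCL C the same thing has to be spelled out lane by lane, e.g. (float4)(buf[i.x], buf[i.y], buf[i.z], buf[i.w]).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-ins for the OpenCL C vector types float4 and int4.
using Float4 = std::array<float, 4>;
using Int4 = std::array<int, 4>;

// Emulates the proposed buf[idx] with a vector index: each lane of the
// result is loaded from the position named by the matching lane of idx.
Float4 gather4(const std::vector<float>& buf, const Int4& idx) {
    Float4 out{};
    for (std::size_t lane = 0; lane < 4; ++lane) {
        // Every lane index must point inside the buffer.
        assert(idx[lane] >= 0 &&
               static_cast<std::size_t>(idx[lane]) < buf.size());
        out[lane] = buf[static_cast<std::size_t>(idx[lane])];
    }
    return out;
}
```

Nothing here needs hardware gather support; a compiler could lower the vector index to four scalar loads where that is all the device offers.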

cl_khr_printf device extension. Yet again, a common name for cl_amd/nv/intel_printf would be nice.

cl_khr_fp128 +1.

I read an Intel article about the possibility of introducing a __share memory namespace, a virtual namespace addressable by all devices in a context. That would be a kick@ss feature. +1 (+100 actually)

A SPIR precompilation tool would be nice indeed (or specifying the means of generating SPIR code from OpenCL C).

Sub-buffers should be multidimensional. The fact that sub-buffers must be contiguous in memory limits their usability. There should really be multidimensional versions, where one could take a 2D/3D part of a 2D/3D buffer as a sub-buffer. This feature is a must for distributed lattice calculations, where the border must be communicated across devices. (The introduction of the __share namespace by no means makes this feature unnecessary.) This also leads to the following features.

clCreateBuffer2D/3D as someone else has already mentioned. Parallel algorithms would also benefit from constructs like this (buffers that know their dimensions).

clCreateSubBuffer2D/3D.
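To show why contiguity is the problem for lattice borders, here is a small host-side C++ sketch. Region2D and pack_region are hypothetical names of my own; they only illustrate the kind of rectangular window a clCreateSubBuffer2D could describe, and the packing step that currently has to be done by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of a rectangular window into a row-major 2D buffer.
struct Region2D {
    std::size_t x, y;           // top-left corner, in elements
    std::size_t width, height;  // extent, in elements
};

// Copies the window into contiguous storage. Each row of the window starts
// `pitch` elements after the previous one, so the window is a single
// contiguous range only when it spans full rows.
std::vector<float> pack_region(const std::vector<float>& lattice,
                               std::size_t pitch, const Region2D& r) {
    assert(r.x + r.width <= pitch);
    assert((r.y + r.height) * pitch <= lattice.size());
    std::vector<float> out;
    out.reserve(r.width * r.height);
    for (std::size_t row = 0; row < r.height; ++row) {
        const std::size_t start = (r.y + row) * pitch + r.x;
        out.insert(out.end(), lattice.begin() + start,
                   lattice.begin() + start + r.width);
    }
    return out;
}
```

A top or bottom border is one full row, so it is contiguous and a sub-buffer covers it already. A left or right border has width 1 and touches one element per row, pitch elements apart, which a contiguous sub-buffer cannot express.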

Another feature implemented over and over and over… again is transpose. I cannot tell how many transpose kernels I’ve seen already; yet again, this is something much better left to the runtime implementors to write. clTransposeBuffer2D/3D are again trivial and commonly used routines which no one could implement as efficiently as the vendors themselves. (Or even better, it could be a standard algorithm operating on a cl_buffer or cl::Buffer.)
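For reference, this is roughly what everyone keeps rewriting, shown as a host-side C++ sketch rather than a kernel. The function name transpose2d and the tile size of 16 are arbitrary choices of mine; the right tile size is exactly the device-specific detail a vendor would tune.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Out-of-place transpose of a rows x cols row-major matrix, walked in
// TILE x TILE blocks so that reads and writes stay within a small window.
std::vector<float> transpose2d(const std::vector<float>& in,
                               std::size_t rows, std::size_t cols) {
    assert(in.size() == rows * cols);
    constexpr std::size_t TILE = 16;  // arbitrary; the best value is device-specific
    std::vector<float> out(in.size());
    for (std::size_t r0 = 0; r0 < rows; r0 += TILE) {
        for (std::size_t c0 = 0; c0 < cols; c0 += TILE) {
            const std::size_t r_end = std::min(r0 + TILE, rows);
            const std::size_t c_end = std::min(c0 + TILE, cols);
            for (std::size_t r = r0; r < r_end; ++r) {
                for (std::size_t c = c0; c < c_end; ++c) {
                    // Element (r, c) of the input becomes (c, r) of the output.
                    out[c * rows + r] = in[r * cols + c];
                }
            }
        }
    }
    return out;
}
```

On a device the tiles would typically be staged through local memory, and that staging is where the per-vendor differences show up.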

cl_complex is yet again a must.
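Without a standard type, people tend to reuse a two-component float vector and hand-write the arithmetic. A minimal host-side C++ sketch of that pattern follows; Complex2 and cmul are illustrative names of my own, not a proposed spec.

```cpp
#include <cassert>

// Stand-in for a two-component float vector reused as a complex number.
struct Complex2 {
    float re;
    float im;
};

// Complex product: (a.re + i a.im) * (b.re + i b.im).
Complex2 cmul(Complex2 a, Complex2 b) {
    return Complex2{a.re * b.re - a.im * b.im,
                    a.re * b.im + a.im * b.re};
}
```

The catch is that the built-in operator * on a vector type multiplies component-wise, which is not the complex product, so every code base ends up with its own cmul.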

Dynamic parallelism as optional device extension (cl_khr_dynamic_work_dimension or something).

In an interop environment it would be highly beneficial if one could link OpenCL kernels and OpenGL shaders. OpenCL kernels, as far as I can see, fit completely into the notion of OpenGL compute kernels. It would be nice if there were a clCreateGLShaderFromCLKernel whose result could be linked into OpenGL shader programs without the overhead of interop.

This is as much as I have thought of so far. Apart from the features that others have suggested, which I support too, I highly recommend thinking about the platform extensions part and integrating C++ more closely into the language, either as the native kernel language (or as a standard cl_khr_cpp_kernel platform extension) and/or by putting more effort into making cl.hpp a true standard, not something that one of the vendors maintains (somewhat laggy, mostly updated along with their SDK). STL conformance is a nice feature of the cl.hpp header (creating buffers from STL containers), but going a bit further than that would be nice (e.g. cl::Buffer being a transparent STL container itself that knows its size, etc.).

Just a final thought: OpenCL 2.0 will be a major revision of the specs, and thus it should reflect major changes, not just a few feature updates, but rather a new direction for OpenCL. FFT, BLAS and algorithms platform extensions (either mandatory or optional) could be one step in this direction, and C++ as a kernel language, or at least as a standard API interface, could be another.

Ideas?