Official OpenCL 2.1 Feedback thread

OpenCL 2.1 has been released today as a set of provisional specifications to enable feedback from OpenCL community before the specification is finalized - to ensure that we are properly targeting your needs and requests.

The provisional specification is in three parts. It should be noted that a requirement of OpenCL 2.1 is that it not add hardware requirements over OpenCL 2.0, so all feedback should be constrained within those bounds.

The OpenCL 2.1 API specification
[ul]
[li]Sub-groups have moved into core, we’ve adding support for copying kernel objects with their arguments across threads and low-latency timers to improve joint profiling of host and device code.[/li][/ul]

The SPIR-V specification
[ul]
[li]Earlier versions of SPIR were OpenCL-specific, and tied to LLVM. SPIR-V is a major step, defining a fully custom intermediate language for graphics and accelerated computing. SPIR-V is intended as a core requirement of both OpenCL and Vulkan platforms and as a target for shader languages, OpenCL C, OpenCL C++ kernel language, SYCL and other languages that may arise from the community.[/li][/ul]

The OpenCL 2.1 C++ kernel language specification
[ul]
[li]Defines a new C++ kernel language, compiled offline via SPIR or online via a compiler library, that implements the feature set of OpenCL C 2.0 in a new, efficient and composable static C++ syntax. The C++ kernel language replaces blocks with lambda functions, adds templates and inheritance and allows function overloading of user-defined functions.[/li][li]By adding the OpenCL C++ kernel language we do not obsolete OpenCL C. New features are implemented in the OpenCL C++ kernel language but earlier versions of OpenCL C remain supported both at runtime and through support in SPIR-V.[/li][/ul]

For these provisional specifications, which are far from final, we are looking for feedback to see if they satisfy the community’s needs.
[ul]
[li]The memory model has not been changed - does OpenCL 2.0’s memory model meet your needs or is something missing?[/li][li]Is the execution model well-enough defined or are there things that can be clarified without requiring hardware changes?[/li][li]Does SPIR-V satisfy the requirements of someone wishing to write a compiler from a new language that targets OpenCL runtimes? Is there anything missing in terms of features or specification that could improve this?[/li][li]Does the C++ kernel language match expectations?[/li][li]Is hiding address spaces in the kernel language the right solution? Explicit address spaces place the burden on the programmer and template engine, hidden address spaces place the burden on the compiler and hardware.[/li][/ul]

Page 20 of OpenCLC++ specification seems to have missing parentheses
It has:
float4 vb = static_cast<float4>u;
While it probably should be:
float4 vb = static_cast<float4>(u);

Restrict keyword seems to be missing. It is not in C++14 so it has to be separately added. printf function definition still uses restrict so it seems it’s implied to be present. On current CL1.2 implementations the restrict qualifier for kernel arguments is rather important. As an example on some implementations you get special caching only with const * restrict, because otherwise the compiler cannot know if same memory object is being used multiple times as an argument for the same kernel.

Geometric functions on page 64 are defined only for float, float2, float3 and float4 (and half and double variants). This is the same problem as with the previous specs.

As an example the new AVX instruction set supports float8 dot product natively. However the compiler is not smart enough to optimize float8 a,b; dot(a.hi, b.hi) + dot(a.lo, b.lo) into a single float8 dot product.

This is also a question of convenience. If the bigger vector types are to be supported at all they should be fully supported.

All of the geometric functions (with exception of cross that is meaningful only for gentype3 and 4) can be trivially implemented using the lower width vectors on hardware that does not natively support them. So it should not be an unreasonable burden for the implementers.

Thank you very much for your feedback. It will be discussed within the working group and we will provide you with a response after due consideration.

A few things I’ve noticed on the first read of the OpenCL C++ 1.0 draft:

  • a minor missing point is that there is no device property retrievable by clGetDeviceInfo about the supported OpenCL C++ version; I would suggest adding one (similar to CL_DEVICE_OPENCL_C_VERSION);

  • the lack of support for goto is a little annoying. It does mean that a lot of code with early exits that would be cleanly written with goto has to be written in terms of a do { ... } while(0); fake loop in order to allow the use of break instead, which is functionally equivalent but horrible in syntax;

  • concerning rounding modes, the specification claims (section 3.1, page 144 of revision 08 ) that (only) static selection of rounding mode is supported; however, nothing in the (documented) host API nor the device language specification allows setting the rounding mode ever; this used to be possible in OpenCL 1.0 (using a specific pragma in device code available when a specific extension was supported), but the feature has been removed in all subsequent revisions of OpenCL; the specification should either document the methods by which rounding modes can be set, or clearly state that rte is the only supported rounding mode for operations, and that the other rounding modes are only supported in the conversion methods; I would honestly hope for the standard to actually expose support for the rounding modes, as I’ve argued extensively here: Rounding modes in OpenCL

  • the new preference for the generic address space, aside from introducing a syntax for constant and local pointers which is of debatable taste at best, is also likely to lead to the generation of inefficient code; address spaces should be always resolvable at compile time, and I honestly have troubles finding a use for the generic address space beyond the creation of space-agnostic functions, which could be realized in terms of function templates, along the lines of template<mem_type AS1, mem_type AS2> somefunc(AS1 float const *p1, AS2 float *p2) { ... }, or even better with some syntax sugar by introducing a ‘generic’ keyword aside global, local, constant: somefunc(generic float *p1, generic float *p2) or anything else along these lines

  • as mentioned by sharpneli, the geometric functions are unnecessarily restricted in terms of which types they apply to; not only most of them make sense for wider vector widths than they are currently defined for, some of them (e.g. dot()) could be applied to integer types as well.

I have an additional feedback related to the local and constant address spaces.

The spec states that it’s possible to get a generic pointer out of local_ptr<T> using the data(), so you can give this pointer to a function that has generic address space arguments. The similar method does not work for constant_ptr. It’s not interchangeable. The constant_ptr could perhaps return const T*.

It should either be so that both of them are usable in generic address space or neither of them. Both due to consistency and due to HW implementations. As an example a platform that doesn’t have unified address space and has separate instructions for loading from local memory it requires the compiler to pack some extra info into the generic pointer and make every load compile into a branch for global and local memory loads depending on what kind of generic pointer was given to it. Because the compiler is basically forced to output branches and do weird packing due to the local_ptr it’s not that much of a hassle to force it to do the same for constant address space.

Personally I’d prefer return of the address spaces. One needs to have explicit control of these in order to write really performant code. However if it’s decided that generality is more important then it should apply everywhere.

P.S: Return goto. Just leave irreducible flow control out.

Adding more low-level functions, and leave high-level features to another layer - just create an official CLU or CLUT, next to SYCL. What I’ve seen the past 20 years is that many languages and libraries move from a core-language to a feature-rich does-it-all. My answer: no, no, NO! Keep the focus on making hardware features accessible, and never replace low-level access by functions that serve the lazy programmers. Those category can use SYCL, CLU, CLUT, the various libraries and the higher-level languages.

We at StreamComputing are happy to help develop an official CLU/CLUT (most of the actual work has been done by many devs around the world), if you promise not to solve the lazy-programmer problem at this layer. See the other remarks what is really needed at the OpenCL-layer.

I would like to point out a flaw in the C++ wrapper API. As far as I can tell, it wasn’t updated for 2.0, but uses deprecated APIs (clCreateSampler, clCreateCommandQueue, and clEnqueueTask). This leads to warnings for all users of the OpenCL 1.2 C++ Wrapper API, which is used in e.g. VexCL as well as clFFT.

I understand these warnings can be hidden by defining CL_USE_DEPRECATED_OPENCL_2_0_APIS, but essentially that is just putting your head in the sand.

The C++ wrapper API is very valuable as it contains a standard available RAII wrapper, which otherwise would have to be written time and time again by various C++ libraries. Please provide an updated header that uses the non-deprecated versions of the API.

For OpenCL 2.1+ it would be good to have integration between cl_event’s with a least-common-denominator wait system. This allows integration of multiple event notifications from sockets and IO devices, to timers, signals, and conditions. This system is poll on posix [ xttp://pubs.opengroup.org/onlinepubs/007908799/xsh/poll.html ] and WaitForMultipleObjects [ xttps://msdn.microsoft.com/en-us/library/windows/desktop/ms687025%28v=vs.85%29.aspx ] on Win32. It is worth noting that on posix, posix conditions are not integrated into this wait system - work arounds known as the self-pipe trick xttp://cr.yp.to/docs/selfpipe.html and on linux eventfd [ xttp://man7.org/linux/man-pages/man2/eventfd.2.html ]. The converse means we will have “wait” or future islands, see the boost ML threads xttp://lists.boost.org/Archives/boost/2015/01/219322.php , xttp://lists.boost.org/Archives/boost/2015/03/220752.php. If you don’t have such a least common denominator, prepare to jump through some hoops: You will either burn CPU with busy waits or have a fixed latency which can waste alot of cycles and accumulate throughout the system at different wait points or answer the problem with more threads which makes code harder, incurs substantial boiler plate in C/C++ especially and losses some guarantees assuming you are work thread-safe everything (thread-local storage can sabotage this assumption, legacy code may also perform more poorly or may not even be thread aware).

Sorry about the xttp’s, forum was giving rejections for links.

Example libraries/links who think this is a big deal (enough to specifically support or address it):
xttp://libusb.org/static/api-1.0/mtasync.html
xttp://damien.douxchamps.net/ieee1394/libdc1394/faq/#How_to_cleanly_stop_a_blocking_capture
xttp://docs.enlightenment.org/auto/eio/group__Ecore__Main__Loop__Group.html#ga111f77f35bf6f6065357dd0033d75e5c
xttps://rodgert.github.io/2014/12/24/boost-asio-and-zeromq-pt1/
xttp://stackoverflow.com/questions/24449936/integrate-boostasio-into-file-descriptor-based-eventloops-select-poll
xttp://nikhilm.github.io/uvbook/utilities.html#external-i-o-with-polling
xttp://zeromq.org/area:faq (search for integrate)
xttp://nanomsg.org/v0.5/nn_poll.3.html (quick disclaimer: nanomsg is from the author of zeromq).

I also wanted to note that the posix standard does not currently integrate posix condition variables into the poll waitable system and the OpenGroup won’t add it unless a larger / sponsoring institution such as Khronos or the IEEE opens the dialog - I’ve pinged them on it before and they’re very staunch about that.

Finally, noting that clSetUserEventStatus does not equate - it would only one-way notification from the least-common-denominator wait-system to notify cl_event’s. One still could not wait on a cl_event while also waiting on other pollable descriptors.

[QUOTE=VincentH;31274]Adding more low-level functions, and leave high-level features to another layer - just create an official CLU or CLUT, next to SYCL. What I’ve seen the past 20 years is that many languages and libraries move from a core-language to a feature-rich does-it-all. My answer: no, no, NO! Keep the focus on making hardware features accessible, and never replace low-level access by functions that serve the lazy programmers. Those category can use SYCL, CLU, CLUT, the various libraries and the higher-level languages.

We at StreamComputing are happy to help develop an official CLU/CLUT (most of the actual work has been done by many devs around the world), if you promise not to solve the lazy-programmer problem at this layer. See the other remarks what is really needed at the OpenCL-layer.[/QUOTE]

Absolutely agreed. An important case where the high-leve feature exposed in OpenCL C (or C++) would be better replaced by lower-level functions is that of work-group and subgroup scans and reductions. It would be much better if the actual hardware subgroup swizzling/exchange functions were exposed, since they would be useful in more general contexts (a longer presentation of this point can be found here http://oblomov.local/~oblomov/wok/tecnologia/gpgpu/opencl-high-vs-low-level/ )

An additional point, concerning the available device information:

  • as I mentioned, it would be better to have device info entry about the supported OpenCL C++ version; while currently there is only one, it’s quite likely different versions will be avaialble in the future, and rather than adding it later (as was done with CL_DEVICE_OPENCL_C_VERSION in OpenCL 1.1), we should add it now;

  • a clarification about what exactly should be used for CL_DEVICE_VENDOR_ID is sorely needed. Reading the current specifications (all of them, including 2.1) the text “A unique device vendor identifier. An example of a unique device identifier could be the PCIe ID.” would seem to hint that this should be the ID of the device by the vendor, but most current implementations use instead the (PCI) ID of the vendor of the device, and sometimes not even that, but rather the ID of the vendor of the platform. For example, for CPUs the AMD platform returns 0x1002 while Intel returns 0x8086, regardless of whether the CPU is GenuineIntel or AuthenticAMD, which is very confusing. Beignet (the open source OpenCL platform for Intel IGPs), on the other hand, uses the PCI device ID instead. This is all very confusing, since it’s not even clear if it should be the ID of the device or that of the vendor;

  • something which is long overdue in OpenCL is some form of UUID for devices regardless of platform. A CL_DEVICE_UUID property with well-defined production rules would be very useful.

Another point that needs clarification (aside from the meaning of CL_DEVICE_VENDOR_ID) is the behavior of sub-devices in terms of (pre-)existing contexts. I’ve opened a specific discussion about this elsewhere in the forum (this thread).

I will hijack that a bit, but if there is a single most important and annoying thing with all the OpenCL versions so far for us, this are the standard conformant implementations, that are obviously not compiling what they should. Everything else might be nice to have, but more or less useless without the basic stuff.
Here is what I mean in more detail - https://www.khronos.org/message_boards/showthread.php/9714-Differences-in-the-implementation-of-certified-OpenCL-vendors-makes-life-hard?p=31397
A place to report those will be nice to have, apparently nobody from Khronos reads (or cares) about the posts in this forum.

A specification cannot require conformance; or rather it can, but it has no power to enforce it. Even a conformance test only tests conformance to the tests, not to the spec. No conformance test can test everything in even a moderately complex specification.

The Khronos group, who governs the standard and owns these forums, cannot make people implement the standard correctly. They can ask, they can hope, but they can’t make it happen.

The only way that conformance can be improved is to go to the IHVs, the ones who write implementations, and file bug reports with them. Give them example code that fails. Cite the part of the specification that says how it’s supposed to work. And so on.

If you have a problem with the conformance test, if you think it’s not comprehensive enough, then by all means file a Khronos bug on the area(s) where it is deficient.

But complaining about quality of implementation issues on a forum owned by the makers of the specification will accomplish precious little.

[QUOTE=Alfonse Reinheart;31409]

But complaining about quality of implementation issues on a forum owned by the makers of the specification will accomplish precious little.[/QUOTE]

Clearly, I am complaining about the quality of the confmance tests and they are what should be fixed.
Thank you for the link to the bug report system, I will use it wisely.

I would like to propose a way to specify the required or maximum NDRange at kernel compile time. The need for this capability was highlighted when tasks were deprecated since there is now no way the compiler knows with absolute certainty that a kernel must be executed as a single work-item, which is extremely useful knowledge for a compiler to apply certain parallel optimizations when targeting pipeline architectures such as FPGAs. The number of effective compute units in such reprogrammable fabric architectures is also determined only after the kernel(s) have been loaded.

To fill the current gap, I would suggest adding the following paragraphs to the spec.

OpenCL (under Table 5.21 in 5.9.3):

CL_KERNEL_COMPILE_GLOBAL_WORK_SIZE

size_t[3]

Returns the NDRange size specified by the attribute((reqd_global_work_size(X, Y, Z))) qualifier. Refer to section 6.7.2. If the NDRange size is not specified using the above attribute qualifier (0, 0, 0) is returned.

CL_KERNEL_PREFERRED_GLOBAL_WORK_SIZE_MULTIPLE

size_t

Returns the preferred multiple of NDRange size for launch. This is a performance hint. Specifying a NDRange size that is not a multiple of the value returned by this query as the value of the global_work_size argument to clEnqueueNDRangeKernel will not fail to enqueue the kernel for execution unless the NDRange size specified is larger than the device maximum.

OpenCL (in 5.10):

The NDRange size to be used for kernel can also be specified in the program source using the attribute((reqd_global_work_size(X, Y, Z))) qualifier (refer to section 6.7.2). In this case the size of NDRange specified by global_work_size must match the value specified by the reqd_global_work_size attribute qualifier.

OpenCL C (in 6.7.2):

The optional attribute((reqd_global_work_size(X, Y, Z))) is the NDRange size that must be used as the global_work_size argument to clEnqueueNDRangeKernel. This allows the compiler to optimize the generated code appropriately for this kernel.

If Z is one, the work_dim argument to clEnqueueNDRangeKernel can be 2 or 3. If Y and Z are one, the work_dim argument to clEnqueueNDRangeKernel can be 1, 2 or 3.

The description of memory_order_relaxed on page the OpenCL 2.1 API spec (p. 42) contains some unclear examples of how it can be used: “It can be used to safely increment counters that are concurrently incremented […]”. Using that language seems to imply that a programmer or compiler writer can assume that the following code should not be racy with respect to accesses from other work-items:

*ptr = (*ptr)++;

However, this code likely compiles to separate load and store instructions, and memory_order_relaxed is not required to enforce that other work-items do not load from the same address between the load and store. This would result in a race.

Based on the note at the top of p. 49, it seems that the memory_order_relaxed description (p. 42) should read that it can be safely used with atomic (or read-modify-write) operations to memory locations that are concurrently modified with atomic operations from other work-items. If that is correct, the rest of the description paragraph may also need adjustment.

Apologies: The memory_order_relaxed brief description is on p. 42 of the OpenCL 2.0 API spec. It is on p. 44 of the OpenCL 2.1 API spec.

It is useful information, but I don’t see how clEnqueueTask helped with that. That function call is only known long past compilation and an implementation should be able to achieve the same thing with knowledge that the ndrange is 1,1,1 (or any other constant, for that matter). Tasks have never been available to the compiler.

We are looking at ways to integrate more information of that sort into the kernel language, though. The global range isn’t the only missing thing that would be useful to the compiler. One option would be to look at SYCL which does give the compiler that information as the host code that sets it is visible to the device compiler.

[QUOTE=jthestness;31546]The description of memory_order_relaxed on page the OpenCL 2.1 API spec (p. 42) contains some unclear examples of how it can be used: “It can be used to safely increment counters that are concurrently incremented […]”. Using that language seems to imply that a programmer or compiler writer can assume that the following code should not be racy with respect to accesses from other work-items:

*ptr = (*ptr)++;[/QUOTE]

You aren’t doing a relaxed increment here, you’re doing a non-atomic increment. The ++ operator isn’t applicable to atomic variables in OpenCL C 2.0. To use relaxed ordering you need to do:
atomic_fetch_add_explicit(ptr, 1, memory_order_relaxed);

It’s a little long-winded, but that’s the cost of the kernel language being C11-derived, for now. It’s consistent with the way you do it in C11.

It would be very helpful if both platforms and devices had some concise info that could be queried to correlate built binaries back to the target system for which the binary was built (either online or offline). The current difficulty I see, particularly for applications targeting embedded profiles where online compilation isn’t an option, is creating a resilient binary cache, i.e., one that can request the application vendor for the exact binary needed by a customer.

I’ve been burnt a couple times when someone built a binary on their system and I tried to use it on my system where some platform/device/driver versions were slightly off, or they matched but I later updated something for which the binary was not compatible. It was extremely hard to root cause this invalid binary error if you don’t have all the original platform and device info to compare the current system settings against. Even when I had the original system settings, that’s a lot of info to diff and I’m not entirely sure if some changed settings might not be reported.

A possible solution would be some kind of platform and device hash that could be obtained from clGetPlatformInfo and clGetDeviceInfo. Then a binary cache could be created and updated independently from the host program by saving binaries in some directory with the naming convention <CL_PLATFORM_HASH>_<CL_DEVICE_HASH>.bin. An OpenCL application could begin by quickly searching a directory for any necessary binaries, and if any were missing the host program could fail gracefully.

The restrict keyword will not be supported in the OpenCL C++ kernel language because it is not defined in C++14. The use of restrict in the specification is a mistake that will be corrected.

Geometric functions are intentionally not supported for all vector sizes. General geometric functions for N-dimensional vectors must be decomposed into powers of two, and there is no advantage to decomposing into vector widths wider than 4. OpenCL C++ provides the tools required for to write high-level libraries that can do this decomposition under the hood.

[QUOTE=sharpneli;31012]Page 20 of OpenCLC++ specification seems to have missing parentheses
It has:
float4 vb = static_cast<float4>u;
While it probably should be:
float4 vb = static_cast<float4>(u);

Restrict keyword seems to be missing. It is not in C++14 so it has to be separately added. printf function definition still uses restrict so it seems it’s implied to be present. On current CL1.2 implementations the restrict qualifier for kernel arguments is rather important. As an example on some implementations you get special caching only with const * restrict, because otherwise the compiler cannot know if same memory object is being used multiple times as an argument for the same kernel.

Geometric functions on page 64 are defined only for float, float2, float3 and float4 (and half and double variants). This is the same problem as with the previous specs.

As an example the new AVX instruction set supports float8 dot product natively. However the compiler is not smart enough to optimize float8 a,b; dot(a.hi, b.hi) + dot(a.lo, b.lo) into a single float8 dot product.

This is also a question of convenience. If the bigger vector types are to be supported at all they should be fully supported.

All of the geometric functions (with exception of cross that is meaningful only for gentype3 and 4) can be trivially implemented using the lower width vectors on hardware that does not natively support them. So it should not be an unreasonable burden for the implementers.[/QUOTE]