Official SYCL 2.2 Provisional feedback thread

April 18th 2016 – International Workshop on OpenCL, Vienna – The Khronos Group, an open consortium of leading hardware and software companies, announces the immediate availability of the OpenCL™ 2.2, SYCL™ 2.2 and SPIR-V™ 1.1 provisional specifications. OpenCL 2.2 incorporates the OpenCL C++ kernel language for significantly enhanced parallel programming productivity. SYCL 2.2 enables host and device code to be contained in a single source file, while leveraging the full power of OpenCL C++. SPIR-V 1.1 extends the intermediate representation defined by Khronos with native support for shader and compute kernel features to fully support the OpenCL C++ kernel language. These new specifications can be found at www.khronos.org and are released in provisional form to enable developers and implementers to provide feedback before finalization.
About SYCL 2.2
SYCL 2.2 enables the capabilities of OpenCL 2.2 to be leveraged while keeping host and device code in a single source file. SYCL aligns the hardware features of OpenCL with the direction of the C++ standard, so that developers can write C++ template libraries that exploit all the capabilities of compute devices, from the smallest OpenCL 1.2 embedded device to the most advanced OpenCL 2.2 accelerators, without writing proprietary or non-standard code. The open-source C++ 17 Parallel STL for SYCL, hosted by Khronos, enables the upcoming C++ standard to support OpenCL 2.2 features such as shared virtual memory, generic pointers and device-side enqueue.

OpenCL C++ and SYCL between them now provide developers the choice of two C++ approaches. For developers who want to separate their device-side kernel source code and their host code, the C++ kernel language can be the best option. This is the approach taken with OpenCL C today, as well as the widely-adopted approach taken by shaders in graphics software. The alternative approach, commonly called ‘single-source’ C++, is the approach taken by SYCL, OpenMP and the C++ 17 Parallel STL. By specifying both SYCL and the C++ kernel language, Khronos provides developers maximum choice, while aligning the two specifications so that code can be easily shared between these complementary approaches.

Questions and Community Feedback
Questions on using SYCL can be asked here. Also, the Khronos SYCL working group is continuing to push the standard forward to support future OpenCL versions and new standard C++ capabilities.

Hello. My question is about vector type conversions.
SYCL sits on top of OpenCL. OpenCL restricts implicit vector conversions and instead provides explicit functions such as convert_char4_sat.
SYCL, in turn, provides the conversion operator genvector(), so in principle we could write something like char4(float4(1.0f)). But the specification states that the conversion operator is only for OpenCL interoperability vector types. It also seems there is no way to specify saturation and rounding modes directly. Of course, we can apply clamp and std::numeric_limits before passing the argument to the conversion operator, but will that map exactly to OpenCL’s convert_vec_sat()? Perhaps it would be reasonable to clarify vector conversions in the specification and to define the conversion operator to also support SYCL vector types? The same applies to the SYCL 1.2 specification.
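To make it concrete, the manual route I mean looks something like this (just a sketch, assuming the cl::sycl vec element accessors and the clamp built-in; whether an implementation lowers this to convert_char4_sat is exactly what is unclear to me):

[CODE]
#include <CL/sycl.hpp>
#include <limits>
using namespace cl::sycl;

// Manually saturate a float4 into char range, then convert element-wise.
char4 to_char4_sat(float4 f) {
    const float4 lo(static_cast<float>(std::numeric_limits<char>::min()));
    const float4 hi(static_cast<float>(std::numeric_limits<char>::max()));
    const float4 c = clamp(f, lo, hi); // SYCL built-in clamp on vector types
    return char4(static_cast<char>(c.x()), static_cast<char>(c.y()),
                 static_cast<char>(c.z()), static_cast<char>(c.w()));
}
[/CODE]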
Thank you.

Hello again. I have another suggestion regarding the SYCL standard.
As far as I can tell, there is no way to enqueue an asynchronous read of a buffer into host memory. The host accessor constructor blocks until processing has finished and the data has been copied from device to host memory.
In many use cases this limitation imposes unnecessary stalls on the host side. For example, suppose we have executed several kernels and need to do some post-processing of the results on the host side. OpenCL lets us enqueue an asynchronous read or map command for each buffer, then wait until the first buffer has been copied and start host processing. While the host processes the first buffer, the runtime asynchronously copies the other buffers. Then we wait for the second buffer, and so on. This way we avoid unnecessary stalls on the host and thereby increase application performance.
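In plain OpenCL the pattern is roughly this (a sketch; the buffer names and the process() helper are illustrative):

[CODE]
cl_event ev[2];
// Enqueue non-blocking reads for both result buffers.
clEnqueueReadBuffer(q, buf0, CL_FALSE, 0, bytes, host0, 0, NULL, &ev[0]);
clEnqueueReadBuffer(q, buf1, CL_FALSE, 0, bytes, host1, 0, NULL, &ev[1]);

clWaitForEvents(1, &ev[0]); // block only until the first buffer has arrived
process(host0);             // host works while buf1 is still being copied
clWaitForEvents(1, &ev[1]);
process(host1);
[/CODE]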
As for SYCL, there is no such mechanism. We have to wait until all the buffers have been copied to the host and then process them one after another, or wait for one buffer, process it, and then wait again for the next. Thus we cannot do useful work on the host while the runtime is copying data. I think this lack of flexibility prevents applications from exploiting the full power of the underlying compute device.
Maybe it is possible to provide another access target, for example access::target::host_buffer_async, that makes the accessor constructor non-blocking, and to add a wait_data_ready() method that blocks until the data has actually been copied to the host.
Thank you.

And one more thought about the SYCL standard.
It seems impossible to access a buffer of scalars through vectors, or a buffer of vectors through vectors of a different width. The accessor has to be parameterized with exactly the same type as the buffer.
Suppose, for example, I have a gray-scale float image and want to process it using the GPU’s vector capabilities, accessing the data as float4. I have to either create a float4 buffer, which makes access on the host side inconvenient, or use a scalar float buffer, losing performance on the GPU.
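Both workarounds side by side (a sketch, assuming a pixels pointer with suitable alignment and a size divisible by four):

[CODE]
using namespace cl::sycl;

// Option 1: vector buffer -- float4 loads on the device, but awkward
// per-pixel indexing on the host (and an alignment requirement).
buffer<float4, 1> img4(reinterpret_cast<float4 *>(pixels),
                       range<1>(width * height / 4));

// Option 2: scalar buffer -- convenient on the host, but only scalar
// loads on the device, since the accessor must also be float.
buffer<float, 1> img(pixels, range<1>(width * height));
[/CODE]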
OpenCL allows access to data using any type you want. Admittedly that is not type safe, but the SYCL standard could relax the current restriction. For example, allowing vector accessors to be created for scalar buffers, or for buffers with a different vector width when the underlying element type is preserved, would provide much more flexibility.

[QUOTE=E.Peshkov;41113]Hello. My question is about vector type conversions.
SYCL sits on top of OpenCL. OpenCL restricts implicit vector conversions and instead provides explicit functions such as convert_char4_sat.
SYCL, in turn, provides the conversion operator genvector(), so in principle we could write something like char4(float4(1.0f)). But the specification states that the conversion operator is only for OpenCL interoperability vector types. It also seems there is no way to specify saturation and rounding modes directly. Of course, we can apply clamp and std::numeric_limits before passing the argument to the conversion operator, but will that map exactly to OpenCL’s convert_vec_sat()? Perhaps it would be reasonable to clarify vector conversions in the specification and to define the conversion operator to also support SYCL vector types? The same applies to the SYCL 1.2 specification.
[/QUOTE]

Yes, we have spotted that this is an omission in the SYCL 1.2 specification. We need to add the explicit conversion operators. Thank you for reminding us, and expect a fix for this.

[QUOTE=E.Peshkov;41153]Hello again. I have another suggestion regarding the SYCL standard.
As far as I can tell, there is no way to enqueue an asynchronous read of a buffer into host memory. The host accessor constructor blocks until processing has finished and the data has been copied from device to host memory.
In many use cases this limitation imposes unnecessary stalls on the host side. For example, suppose we have executed several kernels and need to do some post-processing of the results on the host side. OpenCL lets us enqueue an asynchronous read or map command for each buffer, then wait until the first buffer has been copied and start host processing. While the host processes the first buffer, the runtime asynchronously copies the other buffers. Then we wait for the second buffer, and so on. This way we avoid unnecessary stalls on the host and thereby increase application performance.
As for SYCL, there is no such mechanism. We have to wait until all the buffers have been copied to the host and then process them one after another, or wait for one buffer, process it, and then wait again for the next. Thus we cannot do useful work on the host while the runtime is copying data. I think this lack of flexibility prevents applications from exploiting the full power of the underlying compute device.
Maybe it is possible to provide another access target, for example access::target::host_buffer_async, that makes the accessor constructor non-blocking, and to add a wait_data_ready() method that blocks until the data has actually been copied to the host.
Thank you.[/QUOTE]

A few people have requested manual memory movement. We will have a look at ways to enable this in SYCL. The scheduling in SYCL is asynchronous: all kernels are enqueued and are non-blocking, so by default most code should be highly parallel. Host accessors are blocking, but only for the thread they are in. Accessors for command groups that execute on the host device are non-blocking: the host kernel is scheduled to execute on the host when data is ready, but the accessor construction does not block.

Currently, we believe that what you want can be achieved through multi-threaded programming or multiple host-side kernels. Each thread, or host kernel, will be scheduled so that it can execute when its buffer is available. This should achieve what you want, if you can design your code in either of these two ways. There are other, more complex, examples where the current SYCL methods cannot be easily adapted, such as loading a buffer into device memory, from a file, using double-buffering.
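As a sketch of the second approach (assuming a SYCL 1.2-style host_selector; the buffer and kernel names are illustrative):

[CODE]
using namespace cl::sycl;
queue host_q{host_selector{}};

// One host-side kernel per buffer: each is scheduled to run as soon as
// its buffer's data is available, so copies overlap with host work.
host_q.submit([&](handler &cgh) {
    auto a = buf0.get_access<access::mode::read>(cgh);
    cgh.single_task<class post0>([=] {
        float first = a[0]; // placeholder: post-process buf0 through a
        (void)first;
    });
});
host_q.submit([&](handler &cgh) {
    auto a = buf1.get_access<access::mode::read>(cgh);
    cgh.single_task<class post1>([=] {
        float first = a[0]; // placeholder: post-process buf1 through a
        (void)first;
    });
});
[/CODE]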

[QUOTE=E.Peshkov;41166]It seems impossible to access a buffer of scalars through vectors, or a buffer of vectors through vectors of a different width. The accessor has to be parameterized with exactly the same type as the buffer.
Suppose, for example, I have a gray-scale float image and want to process it using the GPU’s vector capabilities, accessing the data as float4. I have to either create a float4 buffer, which makes access on the host side inconvenient, or use a scalar float buffer, losing performance on the GPU.
OpenCL allows access to data using any type you want. Admittedly that is not type safe, but the SYCL standard could relax the current restriction. For example, allowing vector accessors to be created for scalar buffers, or for buffers with a different vector width when the underlying element type is preserved, would provide much more flexibility.[/QUOTE]

Buffers and accessors must have the same base datatype. However, once inside a kernel, on a device, the accessor can be safely cast to a pointer and then to a pointer of a different type. So, you should have no problem accessing data in a buffer using different pointer types.
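For example, something along these lines (only a sketch; the buffer name, kernel name and the &acc[0] idiom for obtaining a raw pointer are illustrative, not normative):

[CODE]
q.submit([&](handler &cgh) {
    auto acc = img.get_access<access::mode::read_write>(cgh);
    cgh.parallel_for<class vec_pass>(range<1>(width * height / 4),
        [=](id<1> i) {
            // Reinterpret the underlying float data as float4 on the device.
            float4 *v = reinterpret_cast<float4 *>(&acc[0]);
            v[i[0]] = v[i[0]] * 2.0f; // vectorized load/store
        });
});
[/CODE]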

[QUOTE=AndrewRichards;41178]A few people have requested manual memory movement. We will have a look at ways to enable this in SYCL. The scheduling in SYCL is asynchronous: all kernels are enqueued and are non-blocking, so by default most code should be highly parallel. Host accessors are blocking, but only for the thread they are in. Accessors for command groups that execute on the host device are non-blocking: the host kernel is scheduled to execute on the host when data is ready, but the accessor construction does not block.

Currently, we believe that what you want can be achieved through multi-threaded programming or multiple host-side kernels. Each thread, or host kernel, will be scheduled so that it can execute when its buffer is available. This should achieve what you want, if you can design your code in either of these two ways. There are other, more complex, examples where the current SYCL methods cannot be easily adapted, such as loading a buffer into device memory, from a file, using double-buffering.

Buffers and accessors must have the same base datatype. However, once inside a kernel, on a device, the accessor can be safely cast to a pointer and then to a pointer of a different type. So, you should have no problem accessing data in a buffer using different pointer types.[/QUOTE]

Yes, obviously scheduling in SYCL is mostly asynchronous, which makes it all the stranger that host accessors are necessarily blocking.
I understand that I can implement post-processing kernels and execute them on a CPU device, for example, but then I will still have to wait for all the buffers with the final results in series. The question is more about post-processing that requires complex algorithms which are inefficient to implement in kernels, or post-processing that uses third-party libraries. As for the multi-threaded approach, I think it is too complex and bulky for such a simple task, while the underlying OpenCL provides an efficient and simple mechanism for exactly these situations.
I assume that, in the host accessor constructor, every SYCL implementation built on top of OpenCL will either enqueue a blocking read or map, or enqueue a non-blocking read or map and then immediately wait on the corresponding event. Implementing a non-blocking accessor is then straightforward: in the constructor we enqueue a non-blocking read or map, and we add some blocking method, for example wait_data(). The user is required to call this method before accessing the data; until wait_data() is called, the accessor returns nullptr. wait_data() in turn waits on the corresponding event, after which the accessor returns a valid pointer.
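In terms of the raw OpenCL API an implementation might use underneath, the idea is roughly this (all names hypothetical, error handling omitted):

[CODE]
// Hypothetical non-blocking host accessor, sketched over the OpenCL C API.
class host_accessor_async {
    cl_event ev_;
    float   *host_;
    bool     ready_ = false;
public:
    host_accessor_async(cl_command_queue q, cl_mem buf,
                        float *host, size_t bytes) : host_(host) {
        // The constructor enqueues a non-blocking read and returns at once.
        clEnqueueReadBuffer(q, buf, CL_FALSE, 0, bytes, host, 0, NULL, &ev_);
    }
    float *wait_data() {            // blocks until the copy has completed
        clWaitForEvents(1, &ev_);
        ready_ = true;
        return host_;
    }
    float *get() const { return ready_ ? host_ : nullptr; }
};
[/CODE]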

As for casting a pointer to a pointer of a different type within a kernel: I had missed this possibility. Thank you for the advice.
But then I decided to look deeper. The specification clearly states: “Inside kernels, conversions between accessors to buffers, explicit pointer classes and C++ pointers are allowed as long as they reference the same datatype and have compatible qualifiers and address spaces.” Perhaps that covers only the conversion from an accessor to a pointer, but I have found no word about conversion between pointers of different types. Is it allowed and safe? At first sight it feels unsafe to cast between a scalar and a class, even if that class is a vector. And what about such casting on the host side? In my opinion this topic could be covered in more detail in the specification.

Hello? Is anyone tracking this thread? There are suggestions here, by the way.

Sorry, yes, we are tracking it. It takes a while for us to discuss these things.

The pointer conversion we can make clearer in an update to the spec. As in OpenCL, pointer conversion within a kernel is allowed.

The asynchronous issue is more complex. The synchronization in SYCL works across device and OpenCL context, so it isn’t quite so simple. We are discussing ways of making it work. We need to understand the use case a little more, so I will contact you directly.