Official SYCL 1.2 Provisional feedback thread

March 19, 2014 – San Francisco, Game Developers Conference – The Khronos Group today announced the release of SYCL™ 1.2 as a provisional specification to enable community feedback. SYCL is a royalty-free, cross-platform abstraction layer that enables the development of applications and frameworks that build on the underlying concepts, portability and efficiency of OpenCL™, while adding the ease of use and flexibility of C++. For example, SYCL can provide single-source development where C++ template functions can contain both host and device code to construct complex algorithms that use OpenCL acceleration, and then enable re-use of those templates throughout the source code of an application to operate on different types of data.

The SYCL 1.2 provisional specification supports OpenCL 1.2 and has been released to enable the growing community of OpenCL developers to provide feedback before the specification is finalized. The specification and links to feedback forums are available at: www.khronos.org/opencl/sycl.

While SYCL is one possible solution for high-level parallel programming that leverages C++ programming techniques, the OpenCL group encourages innovation in diverse programming models for heterogeneous systems, including building on top of the SPIR™ low-level intermediate representation, using the open source CLU libraries for prototyping, or through custom techniques.

“Developers have been requesting C++ for OpenCL to help them build large applications quickly and efficiently and there are lots of useful C++ libraries that want to port to OpenCL,” said Andrew Richards, CEO at Codeplay and chair of the SYCL working group. “SYCL makes this possible and we are looking forward to the community feedback to help drive the final release and future roadmap. We are especially keen to work with C++ library developers who want to accelerate their libraries using the performance of OpenCL devices.”

SYCL 1.2 Features
SYCL 1.2 will enable industry innovation in OpenCL-based programming frameworks:

[ul]
[li]API specifications for creating C++ template libraries and compilers using the C++11 standard;[/li]
[li]Easy-to-use, production-grade API that can be built on top of OpenCL and SPIR;[/li]
[li]Compatible with standard CPU C++ compilers across multiple platforms, as well as enabling new SYCL-based device compilers to target OpenCL devices;[/li]
[li]Asynchronous, low-level access to OpenCL features for high performance and low latency, while retaining ease of use;[/li]
[li]Khronos open royalty-free standard, to guarantee ongoing support and reciprocal IP coverage;[/li]
[li]OpenGL® integration to enable sharing of images and textures with SYCL as well as OpenCL;[/li]
[li]Development in parallel with OpenCL – future releases are expected to support upcoming OpenCL 2.0 implementations and track future OpenCL releases.[/li][/ul]

SYCL Homepage
An Overview of SYCL 1.2
OpenCL DevU at GDC 2014

Going through the specs slowly. Very high level feedback is that we need more examples. I already mentioned this to Andrew on twitter.
Also, compilation workflow is really unclear.
For example:

  1. Let us say I have my favourite C++11 compiler installed: GCC, VS2013, whatever. Let’s say I do NOT have any other compilers installed, nor any OpenCL drivers, and just want to compile SYCL code to (parallel) native code using my native compiler. I guess this will require you to release some header files so that classes such as cl::sycl::buffer are understood by the C++ compiler, allowing it to generate CPU code. This will be useful at least for development, and for porting code to platforms where OpenCL drivers are not available (e.g. WinRT).
    Will this be supported? If so, how do things work? Should we expect a royalty-free solution for this?

  2. Single-source SYCL compilers are easy to understand: they take all your source files, including both regular C++ and SYCL, and generate a single binary containing both host and device code. What about the multi-compiler solutions mentioned? Are those solutions likely to look like, say, nvcc? I.e. compiling the device code itself, inserting any required glue code for the host, and then passing all the original host code as well as the generated host code to an available C++ compiler such as GCC, VS, etc.?

Good questions, thank you.

Any SYCL implementation is required to support execution of any code on the host CPU using just the host compiler, as well as execution of device code on one or more OpenCL devices. A host-only implementation would not be conformant, but you could use a conformant implementation of SYCL to run code only on host.

SYCL is a royalty-free standard. Whether a specific implementation has licensing terms requiring payment or royalties is up to individual implementers.

How SYCL is compiled is not actually defined in the spec. This was a deliberate decision to allow implementers freedom. However, an implementation could operate like this:

You compile your source file with a SYCL device compiler and it produces a header file containing the compiled kernel and implementation-specific glue code to invoke the kernel on an OpenCL device. E.g. mysyclcompiler mysourcefile.cpp -omysyclheader.h

Then you could compile the same source file with your host compiler and tell it where the compiled kernel header is. E.g. gcc -c -DSYCLHEADER="mysyclheader.h" mysourcefile.cpp

The sycl header files and runtime sort out the rest.

Alternative approaches would still be valid.

Thank you for your feedback.

More examples would definitely help in describing the features of the SYCL specification and this is something that we are currently looking into. We will shortly be posting a series of blogs on the Codeplay website that will be aimed at describing the SYCL programming model and the available work flow solutions as well as providing more practical examples.

A couple of typographical issues:

p.14: “For a kernel to access local memory on a device, the user can either create a dynamically-sized local accessor object to the kernel as a parameter.” – typically “either” is followed by an “or”, and also “to the kernel as a parameter” seems like it is missing something ahead of it.

p.75: “the device.” is hanging there by itself with blank space above it. It seems like something is missing prior to it.

Thanks, we will have a look at these.

I am happy to see something like SYCL develop. The question of how we can best program accelerators is still unanswered. I doubt there is a universal answer at all. The more things we try, the closer we will get to a satisfying solution. So I applaud your efforts. I read the provisional specification and I have a few questions and comments:

  • SYCL is not something that can be implemented as a standard C++ library; it requires a compiler extension or an additional compiler, not unlike C++ AMP. Is that correct?

  • command_group: this concept seems to try to fuse memory transfer and compute together. With command_groups, are things like pipelines and double buffering still possible? How would one go about implementing overlapping copy and compute using the command_group concept?

  • The accessor seems interesting; its actual usefulness can best be assessed once we can implement code using SYCL. When can we expect a working prototype? I really dislike the name “accessor” though. C++ AMP calls this an array_view, which is a lot nicer.

  • I dislike the name of the queue concept; it is too generic and usually means something completely different. I know there is the namespace, but still. I obsessed over the exact name for the thing that is a stream or a command_queue, and came up with the concept of a ‘feed’ in my GPU library Aura.

Cheers,
Sebastian

Thank you for your feedback.

SYCL defines two components: a C++ runtime library and a device compiler. The SYCL runtime library uses C++11 features; however, as OpenCL functionality is exposed through the library API rather than through language extensions, no new language extensions are required.

As SYCL is asynchronous, all commands defined within a command_group are enqueued asynchronously. Double buffering can be achieved automatically by the runtime, provided that the command_groups or the individual commands are defined such that they can be executed in parallel.

There is currently no implementation of SYCL available; the only announced implementer is Codeplay. If you are interested in more practical examples of SYCL, a series of blogs will be posted on the Codeplay website; the first of these can be found here.

SYCL is still in the provisional stage of specification and is therefore still subject to change based on the feedback from potential developers and implementers so any feedback regarding naming and the programming model are appreciated and will be discussed within the Khronos working group.

At the moment I very much like using cl.hpp, and was wondering whether this will continue to be available for future versions of OpenCL, or whether SYCL will supersede it?

The SYCL header sycl.hpp includes cl.h, but does not include cl.hpp. SYCL is an alternative to the C++ wrappers and does not conflict with them; therefore they will continue to exist as part of OpenCL for as long as the OpenCL working group maintains them.

Do I take it correctly from the specs that structs or classes can now be passed to kernels? And what exactly is the difference between capturing variables and passing them as lambda parameters?

Yes, SYCL allows you to pass any struct that is POD and doesn’t contain pointers.

Variables that are captured by the lambda are kernel arguments; these can be accessors, samplers and POD data types which don’t contain pointers. The lambda parameters are specific types, constructed within the kernel, that are used to give host/device-compatible access to the current work item’s id information. For example, the parallel_for API takes an item object as the lambda’s parameter.

Is there any way to get early access to the reference implementation done by Codeplay? Or at least get an estimate, as to when it will be available? (Although I would be glad to give it a spin)

The Codeplay implementation of SYCL is not a reference implementation and is currently still in development; however, anyone looking to get involved with the development of SYCL can contact us to discuss it further. Additionally, there is an open source implementation of SYCL called triSYCL, which is still in the early stages of development; this can be found here.

What are the restrictions on how device functions and kernels can be declared and defined? It wasn’t clear to me from reading the spec.

For example, do you allow for forward declarations? e.g.

// a.cpp
int f(); // declaration

single_task(…
    f(); // kernel call
});

int f() {
    // definition
}

I imagine you couldn’t allow one to call a device function that’s defined in another translation unit.

By comparison, C++ AMP, which has “restrict(auto)” that seems to match how SYCL wants to implicitly treat all functions, only allows “restrict(auto)” on functions whose definition appears at their declaration. See section 13.1.1 of their specification.

blogs.msdn.com/b/nativeconcurrency/archive/2012/02/03/c-amp-open-spec-published.aspx

I would recommend also adding attributes that you can use if you want to define device functions in a different translation unit from where they’re called. Something like:

//a.h
int f() [[sycl::device_function]]; //declaration

//b.cpp
//definition
int f() [[sycl::device_function]]
{

}

Why require variables shared between host and device to be PODs? This is much more restrictive than either CUDA or C++ AMP. It would seriously suck, in my opinion, if you could not pass an object like

class A {
public:
A(int a) : _a(a) {}
private:
int _a;
};

from a host to a device kernel simply because it has a constructor and is therefore not a POD.

A common technique in C++ numerical computing is to build up small expression template objects that represent, for example, arithmetic with matrices. One would want to pass such objects to a kernel to evaluate them, and it would be inconvenient if such objects weren’t allowed to have constructors.

I would recommend instead mirroring CUDA’s restrictions, which are more permissive:
(see CUDA C++ Programming Guide)

Could you please clarify this clause:

“If the lambda function relies on template arguments, then the name of the lambda function must contain those template arguments”

If I take this kernel as an example

template<int N, class A>
void f() {
    A a;

    single_task(kernel_lambda<[name]>([=] {
        int x[N];
        do_something(a);
    }));
}

does this mean that the name has to be something like

template<int N, class T>
struct MyLambdaName {};

[name] = MyLambdaName<N, A>

Why are such requirements on the name necessary? C++ AMP doesn’t have anything comparable, and there’s already a project that implements C++ AMP using OpenCL:

bitbucket.org/multicoreware/cppamp-driver-ng/wiki/Home

so I would think you could also implement SYCL without them.

Thank you for your response. The provisional specification was released for developers to comment on, so these comments are very helpful to us. We will use all of this feedback to develop the specification further.

In response to comment #15:

There is no requirement in SYCL to declare functions as being “host” or “device”. That is auto-deduced by SYCL compilers. There is no equivalent in SYCL of C++ AMP’s “restrict(auto)” as there is no requirement to annotate functions in this way.

In the SYCL programming model, kernel functions can be defined by a functor object or a lambda expression. The definition of the kernel function must be within a command_group scope as it requires the accessors for the data that is accessed within the kernel function. This restriction only applies to kernel functions themselves and not to functions called from within a kernel.

It is not possible to forward declare the kernel functions themselves in the normal C++ sense, as they are functor objects or lambda expressions; however, you can forward-declare the functor type or define the functor object before it is executed.

In SYCL there are different options for build systems, i.e. you can have a single compiler or separate host and SYCL device compilers, so the specifics of a SYCL build system are implementation-defined. This means that the method of calling SYCL functions outside of the translation unit containing the kernel function is not defined in the SYCL specification and may vary from one implementation to another. This is complicated by the fact that in OpenCL, linking is done at runtime, not compile time.

In response to comment #16:

There are a couple of restrictions that SYCL needs in order to ensure that data can be copied between host and device. For OpenCL 1.x generation devices, which is what SYCL supports in its current provisional specification, data must be able to be copied between host and device memory. The copy is potentially performed by hardware, which means that calling constructors during copying cannot be guaranteed.

Also, SYCL is a shared source programming model, meaning that the source files are compiled with both a SYCL device compiler and a host compiler. This allows users of SYCL to use their host compiler of choice, for example VisualC, GCC or clang, making the programming model more flexible and portable and giving better integration with existing build systems. However in order for SYCL to provide this programming model, a SYCL runtime implementation must be able to make certain assumptions about the data layout of the functor or lambda that defines the SYCL kernel functions.

The current provisional specification restricts shared data to POD types. There is some variation between different C++ versions in what is meant by POD. We are considering alternative restrictions that provide the necessary guarantees on data movement that would allow sharing in SYCL, without over-restricting what is possible. We are also considering updating the specification to a more recent C++ standard than the one used in the SYCL 1.2 provisional specification.

In response to comment #17:

As SYCL is a shared source programming model, every kernel function requires a unique name, so that the host side of a SYCL runtime implementation is able to identify the binary and kernel argument information that is output from the SYCL device compiler. This allows the kernels to be compiled with a SYCL device compiler and the host code to be compiled with a CPU compiler of the user’s choice. It enables maximum flexibility for users because it is possible to choose different host and device compilers according to the devices and CPUs that you want to support. This is appropriate for SYCL because it is targeting a very wide range of vendors, CPUs, operating systems and devices. However, we do need to add in a naming system to allow lambda functions compiled with one compiler to be linked with CPU code compiled with another compiler.

The reason that the template argument is required for kernel functions defined by a lambda expression is that the C++ specification does not define a naming convention for lambda expressions, meaning that otherwise a SYCL runtime implementation wouldn’t be able to make assumptions about the name of the kernel function.

When you have a kernel function that is dependent on template types, i.e. defined within a template function or class, each instantiation of the corresponding context generates a separate kernel function definition. Each of those instantiated kernel functions must have a uniquely identifiable name, which a SYCL runtime implementation can use to differentiate them. As a result, a template kernel function must include those template arguments in the kernel function name, i.e. either in the lambda expression name or in the functor type name, in order to avoid ambiguities between different instantiations of template kernel functions.

This does create a little extra effort for the user, but allows far greater flexibility for supporting different CPUs, operating systems and devices. So we think that the effort is worth it. We are considering ways of making the naming of lambda functions easier for the user in future versions of the specification.

I hope this answers your questions.

The specification says that kernel functions can call non-kernel functions. If they do, then of course the LLVM IR code for those functions needs to be included in the SPIR binary as well. From section 5.2:

SYCL device compiler must compile only kernels for the device, as well as any functions that the kernels call.

Which means you absolutely do have issues with things like forward declarations. For this example:

//a.cpp
void f();

class MyKernel {
public:
    void operator()(cl::sycl::id<2> myId) {
        f();
    }
};

//b.cpp
void f() {
}

How does the device compiler know how to generate SPIR code for f when it’s compiling the file b.cpp? And for that matter, even if it did, how would the linker know to link in the SPIR code for f into the SPIR code containing MyKernel.

And should these functions include code that generates illegal SPIR instructions (e.g. exceptions, RTTI, virtual function calls, etc.), how will the compiler know to generate an error? Is it supposed to traverse the entire call graph of kernels to figure out which functions they call, then check each one?

There’s a reason other frameworks (C++ AMP and CUDA) require you to explicitly specify that such functions are going to be called on the device: implicitly determining this introduces a lot of complexity. I’m not saying there’s no solution (C++ AMP’s requirements on restrict(auto) might point to one), but I certainly don’t see any of the potential problems addressed in the specification, or even acknowledged.

The updated specs for SYCL and OpenCL 2.0 were announced by Neil Trevett, “president of the Khronos Group, chair of the OpenCL working group and vice president of mobile ecosystem at NVIDIA”. May I inquire whether SYCL will be portable if it builds atop SPIR? NVIDIA is known for evading all questions concerning OpenCL, and their OpenCL implementation has been pretty much abandoned. Both C++ AMP on Linux (named Clamp and done by Multicoreware) and SYCL suffer greatly from this fact.

Could we get an official statement from NVIDIA what their plans are for the future in terms of OpenCL?

It would be great to see it written in plain black & white that they WILL or WILL NOT support OpenCL in the near future. This could help all current SYCL/C++ AMP projects decide upon tools to use, so everyone could decide whether to start taking on the aggravation of porting their app to CUDA if they want to remain portable.