Using structs to pass optional features into kernels

I’m curious how much the spec says (I can’t find much) about using structs as kernel arguments to turn kernel features on and off. For example, I’ve found this tidbit in the ATI release notes:

AMD’s OpenCL runtime requires that structures are packed to native alignment, as specified by the OpenCL 1.0 specification. In particular, struct packing (and alignment) on Windows hosts, when compiling with MSVC, is the user’s responsibility.
For example, the following struct does not respect the required alignment for float4:

typedef struct _MonteCarloAttrib
{
    cl_int noOfSum;
    cl_float4 strikePrice;
    cl_float4 c1;
    cl_float4 c2;
    cl_float4 c3;
    cl_float4 initPrice;
    cl_float4 sigma;
    cl_float4 timeStep;
} MonteCarloAttrib;

The user must add explicit padding between noOfSum and strikePrice:

typedef struct _MonteCarloAttrib
{
    cl_int noOfSum;
    cl_int pad[3];
    cl_float4 strikePrice;
    cl_float4 c1;
    cl_float4 c2;
    cl_float4 c3;
    cl_float4 initPrice;
    cl_float4 sigma;
    cl_float4 timeStep;
} MonteCarloAttrib;

To allow code reuse in my OpenCL code I would like to implement a struct of options like the following:


typedef struct _Options
{
    cl_bool   feature1Enabled;
    cl_float4 feature1Args;
    __constant cl_float *feature1Data;

    cl_bool   feature2Enabled;
    __global cl_float *feature2Data;

    cl_bool   feature3Enabled;
    __read_only image3d_t feature3Image;
} Options;

How much of a world of hurt am I asking for? For example, I’ve already noticed that I shouldn’t use cl_bool.

Just trying to figure out what code abstractions I can use to make maintainable OpenCL host/device code.
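
For example (a minimal sketch of what I mean; all the names are made up for illustration), the idea would be to mirror the same layout by hand on the host and in the kernel, with cl_int flags instead of cl_bool and explicit padding so the float4 member lands on a 16-byte boundary:

/* Host side (C, using the cl_* types from cl_platform.h): */
typedef struct _FeatureOptions
{
    cl_int    feature1Enabled;   /* 0 or 1; matches the kernel-side int */
    cl_int    pad[3];            /* explicit padding up to 16 bytes */
    cl_float4 feature1Args;
} FeatureOptions;

/* Device side (OpenCL C), same layout, declared in the kernel source: */
typedef struct _FeatureOptions
{
    int    feature1Enabled;
    int    pad[3];
    float4 feature1Args;
} FeatureOptions;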

I’ve done a bit of research into this issue. Turns out it’s so complex it has turned me into a blogger: http://iheartcode.blogspot.com/2010/04/ … n-gpu.html

I’m curious what other people think about the problem. Am I just being a programming weenie? Or are other professional software developers worried about the maintainability of OpenCL?

Cheers,
Brian

Do you have a more concrete example of what you want to do?
At first sight, I think I would use the macro style to choose between the different paths, because the kernel will be lighter and that will make the compiler’s work easier.

Unfortunately, if I get any more “real” I would be giving away IP. :frowning:

What I can say is that I already have 3 global pointers, 1 local pointer, 1 image3d_t, and 5 scalars for the first component/feature. For the 2nd component/feature I’m looking at an additional 7 image3d_t arguments, and 2-3 constant pointers. And we know we will eventually need even more components/features. So any abstraction I can introduce at this time will be helpful. I can start by collecting scalars into structs, but the real abstraction comes with the complex memory objects since they require data transfers between host and device.

I’m actually leaning towards using the macros to turn off sections of code that won’t be used and just enumerating all the required memory objects as kernel arguments. Turning off those sections of code should be good enough to let the compiler do its thing. I just can’t get over how ugly the macros look when trying to figure out whether a comma is needed to separate the arguments. Especially since debugging this will be a big pain when I don’t have a standalone preprocessor I can run the source through.
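
For concreteness, the macro approach I’m describing would look roughly like this (the feature name, build option, and leading-comma trick are purely illustrative, not anything mandated by the spec):

/* Host side builds with, e.g.:
 *   clBuildProgram(program, 1, &device, "-DUSE_FEATURE2", NULL, NULL);
 */
__kernel void process(
    __global float *input,
    __global float *output
#ifdef USE_FEATURE2
    , __global float *feature2Data   /* leading comma keeps the argument list valid */
#endif
)
{
    size_t i = get_global_id(0);
    float v = input[i];

#ifdef USE_FEATURE2
    v += feature2Data[i];            /* feature path compiled in only when enabled */
#endif

    output[i] = v;
}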

I know I can pass NULL to clSetKernelArg for global and constant memory pointers. I also know I can’t pass NULL for image objects. But my guess is there shouldn’t be too much overhead in declaring an image3d_t argument and then setting a cl_mem object that’s only been initialized with clCreateImage3D (but not actually transferred to the device).
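
In other words, something like the following host-side sketch (error checking omitted; the argument indices and image size are made up):

cl_int err;

/* Disabled feature that takes a __global float* argument: pass a NULL buffer. */
cl_mem nullBuf = NULL;
clSetKernelArg(kernel, 3, sizeof(cl_mem), &nullBuf);

/* Disabled feature that takes an image3d_t argument: a tiny placeholder image
 * that is created but never written to or transferred. */
cl_image_format fmt = { CL_R, CL_FLOAT };
cl_mem dummyImage = clCreateImage3D(context, CL_MEM_READ_ONLY, &fmt,
                                    2, 2, 2, 0, 0, NULL, &err);
clSetKernelArg(kernel, 4, sizeof(cl_mem), &dummyImage);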

Two possibly useful comments:

  1. I don’t think you can pass global pointers in a struct from the host, since the address of a global pointer can only be set by clSetKernelArg.
  2. You can always use sprintf to build the kernel you want at runtime and then compile it. As long as you can accept the overhead of compiling it at that point, you can generate exactly what you want. (I’ve used this method a lot when doing tests; a rough sketch follows below.)
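
For example, a rough sketch of option 2 (the kernel text, buffer size, and scaleFactor variable are all arbitrary, and a working context and device are assumed):

char source[4096];
sprintf(source,
        "__kernel void process(__global float *in, __global float *out)\n"
        "{\n"
        "    size_t i = get_global_id(0);\n"
        "    out[i] = in[i] * %ff;\n"      /* bake the constant straight into the source */
        "}\n",
        scaleFactor);

const char *src = source;
cl_int err;
cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
clBuildProgram(program, 1, &device, "", NULL, NULL);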

Yup, that’s the feature I would like added. :slight_smile:

Interestingly enough, someone pointed out on the blog that clEnqueueNativeKernel has the ability to do what I want. The equivalent just isn’t available for NDRange kernels.

I’ve written a simple string replacement system in C++ that uses the Python string replacement format. The format “%(MYVARIABLE)s” is used in the source code, and then a std::map of key/value pairs is passed in to specify what the values should be. It’s so simple I can paste it here. :slight_smile:


#include <map>
#include <string>

using namespace std;

// Replaces every "%(KEY)s" token in src with lex[KEY].
// Returns false if a token is unterminated or its key is not in the map.
bool StringLexReplace(string &dst, const string &src, const map<string, string> &lex)
{
  dst = src;

  size_t bgnpos = 0;

  while ((bgnpos = dst.find("%(", bgnpos)) != string::npos)
  {
    size_t endpos = dst.find(")s", bgnpos);
    if (endpos == string::npos)      // "%(" without a closing ")s"
      return false;

    string key = dst.substr(bgnpos + 2, endpos - (bgnpos + 2));

    map<string, string>::const_iterator iter = lex.find(key);
    if (iter == lex.end())
      return false;

    dst.replace(bgnpos, (endpos - bgnpos) + 2, iter->second);
  }

  return true;
}
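
Illustrative usage (the keys and the kernel snippet here are made up):

map<string, string> lex;
lex["DATA_TYPE"]  = "float4";
lex["BLOCK_SIZE"] = "64";

string kernelSrc;
StringLexReplace(kernelSrc,
                 "#define BLOCK_SIZE %(BLOCK_SIZE)s\n"
                 "__kernel void process(__global %(DATA_TYPE)s *buf)\n"
                 "{ /* ... */ }\n",
                 lex);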


This avoids the annoyance of sprintf’s positional arguments: when you slightly reorganize the kernel code, you don’t have to go and reorder the arguments to sprintf as well.

Though alas, this is not an option for me since I work at a company that greatly values its source code. In fact, the port to the GPU I’m working on was already attempted by a graduate student (but he didn’t get anywhere near the speedup we’re seeing). So we need to be able to compile to binary before shipping the application, to hide the source code. (OpenCL source code is incredibly easy to extract: just break on clCreateProgramWithSource in a debugger.) Unfortunately, all the implementations at this time have left this as a feature to complete later (it’s planned for CUDA 3.2).

Thanks for the comments nonetheless.

compile to binary before shipping

How do you know that binaries you compile on your development platform will be compatible with your client’s hardware?

For the foreseeable future we’re going to have to do individual builds for each vendor’s hardware platform. We already do it for x86 vs x64 vs PPC vs SPARC. Ideally, we can make “universal” binaries that support the major implementations: NVIDIA and ATI.

What do the Apples and Adobes of the world do to hide their OpenCL source code? I imagine game companies run into the same problem with shader code: you don’t want your competitor to get it.

With the IBM OpenCL implementation for Cell machines (and the PlayStation 3 as well) you can compile and save the kernel binary.
What’s more, all the examples in their dev kit compile and save the kernels, so you can later load the binary with the --binary option. That’s because their runtime compiler is terribly SLOW.
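
Roughly, the save/reload path looks like this on any implementation that exposes program binaries (single-device case; error checking and file I/O omitted, and the program/context/device variables are assumed — this is just a sketch, not IBM-specific code):

cl_int err;

/* After clBuildProgram succeeds: query and save the device binary. */
size_t binSize;
clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binSize), &binSize, NULL);

unsigned char *binary = (unsigned char *)malloc(binSize);
unsigned char *binaries[1] = { binary };
clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binaries), binaries, NULL);
/* ...write 'binary' to disk and ship it instead of the source... */

/* Later, on the user's machine: rebuild the program from the saved binary. */
const unsigned char *loaded[1] = { binary };
cl_int binStatus;
cl_program fromBin = clCreateProgramWithBinary(context, 1, &device,
                                               &binSize, loaded, &binStatus, &err);
clBuildProgram(fromBin, 1, &device, "", NULL, NULL);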

By the way, you can “hide” the source code using a “code obfuscator”. You can write a simple one yourself, doing a search and replace of variable and function names and replacing them with “weird” names.

P.S.: I remember that with Doom 3 and Quake 4 the shader source code was available; you could find the files in the game directory. Most computer graphics algorithms are the work of researchers, so they are public. Game programmers often publish their algorithms and discuss them at the Game Developers Conference, for example.

Yes, I too find the restrictions on struct members to be a bit bothersome. The problem with pointers as members, though, seems to be endemic to the design of the host-device interface. (That is, because the host-device API apparently binds a memory address to a particular cl_mem handle in a way that makes it impossible to recognize a pattern of bits in a struct as a pointer into device memory, “float *foo” as a struct member isn’t going to work. The best you can do is pass in “int index_into_foo” as a member of the struct. Welcome to the reprise of FORTRAN.)
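
(Concretely, the index workaround I’m describing looks something like this; the names and layout are invented for illustration.)

/* Shared layout, declared identically on the host (with cl_int) and in the kernel: */
typedef struct _Feature
{
    int feature1Enabled;
    int feature1Offset;    /* index into the shared data buffer, not a pointer */
    int pad[2];            /* explicit padding to keep host and device layouts in sync */
} Feature;

/* Device side: the kernel gets one big buffer plus the struct by value
 * (the host fills a matching struct and passes it with
 *  clSetKernelArg(kernel, 1, sizeof(Feature), &opts)). */
__kernel void process(__global float *sharedData, Feature opts)
{
    if (opts.feature1Enabled)
    {
        __global float *feature1Data = sharedData + opts.feature1Offset;
        /* ...use feature1Data... */
    }
}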

And compiling the “options” into the kernel isn’t an option for a number of market areas. My clients have real-time performance targets, and launching a compilation op in the middle of a run isn’t going to pass muster. Never mind the issues with kernel source distribution.

Good luck! This is really hard to get right with the current API. E.g., on a Mac, if you compile on an SSE4.2 machine your saved kernel binary will crash on an SSE4.1 or SSSE3 machine with invalid instructions. This is unfortunately not a very well thought-out part of the design.