Passing variable length structure to kernel

Hi!

I’ve been working with OpenCL for a while now and I think it’s really great. I write C code on a Linux machine using an NVIDIA 8600GT.
I have some questions regarding structures and passing them as kernel arguments, though. I wouldn’t call myself an experienced C programmer, so forgive me if some of the answers are obvious…

  1. Structure definitions. When you define a structure you specify which variables it contains. But since you can’t import files or definitions into your CL code, do you need multiple definitions of the same structure (i.e. one in the host code and one in the device code)? You could keep the structure definition in a separate file, #include it in the C code and concatenate the file to the CL source before building the program, but is there another way…?

  2. Structure usage. Let’s say I have a structure

typedef struct components {
    int *Y1;
    int *Y2;
} components;

and I know the lengths of the arrays only at runtime. How could I pass this structure to a kernel? Currently I have a cl_mem memory buffer which I try to do a clEnqueueWriteBuffer on, but all I get is a segmentation fault.

  3. Also, is it possible to pass a two-dimensional array to a kernel? I haven’t gotten it to work… maybe these problems are related somehow (both variables are pointers to pointers). Like I said, my knowledge of C is a bit limited.

Can anyone please help me?

//Kristoffer

Kristoffer,
You are correct that you need to have the structure defined in two places (host and kernel) to use it across them. Make sure you use the cl_ types (e.g. cl_int) for the member definitions on the host, or you may have problems with the host’s native size (e.g., 64-bit vs. 32-bit int) being different from what OpenCL uses (which is fixed).
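A rough sketch of the "one definition, two users" idea from the question, in plain C so it runs without an OpenCL install. The names shared_defs, kernel_src, join_sources, elem_t and block_info are made up for illustration. The shared text uses the predefined __OPENCL_VERSION__ macro to pick the device or host type; on the host it assumes CL/cl.h has already been included.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical contents of a shared header. In practice this text would be
 * read from a file that the host code also #includes (after CL/cl.h). */
static const char *shared_defs =
    "#ifdef __OPENCL_VERSION__\n"
    "typedef int elem_t;      /* device side */\n"
    "#else\n"
    "typedef cl_int elem_t;   /* host side */\n"
    "#endif\n"
    "typedef struct { elem_t width; elem_t height; } block_info;\n";

/* Hypothetical kernel text that relies on block_info. */
static const char *kernel_src =
    "__kernel void k(__global const block_info *info) { }\n";

/* Returns a newly allocated string: defs followed by kernel.
 * The caller frees the result. Returns NULL if allocation fails. */
char *join_sources(const char *defs, const char *kernel)
{
    assert(defs != NULL && kernel != NULL);
    size_t n_defs = strlen(defs);
    size_t n_kernel = strlen(kernel);
    char *out = malloc(n_defs + n_kernel + 1);
    if (out == NULL)
        return NULL;
    memcpy(out, defs, n_defs);
    memcpy(out + n_defs, kernel, n_kernel + 1); /* includes the terminator */
    return out;
}
```

The joined string is what would be handed to clCreateProgramWithSource. That call also accepts an array of strings, so passing the two pieces as separate entries should work as well and avoids the manual concatenation.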

With regard to a structure of arrays, you will need to copy the data into your cl_mem object. E.g., something like the following (offsets and lengths are in bytes; queue is your command queue):
clEnqueueWriteBuffer(queue, my_mem, CL_TRUE, 0, lengthofY1, components.Y1, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, my_mem, CL_TRUE, lengthofY1, lengthofY2, components.Y2, 0, NULL, NULL);

In the example you gave, OpenCL is more targeted towards having two cl_mems, one for Y1 and one for Y2.
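To make the offset bookkeeping for the single-buffer variant concrete, here is a small host-only sketch. It assumes the elements are 32-bit ints (int32_t here, which is what cl_int is on the host). The names region, plan_regions and fake_write are invented for this example, and fake_write is only a memcpy stand-in for clEnqueueWriteBuffer so the code runs without a device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte range of one component inside the shared buffer. */
typedef struct {
    size_t offset; /* in bytes */
    size_t size;   /* in bytes */
} region;

/* Lay out Y1 and Y2 back to back. Lengths are element counts. */
void plan_regions(size_t len_y1, size_t len_y2, region *y1, region *y2)
{
    y1->offset = 0;
    y1->size = len_y1 * sizeof(int32_t);
    y2->offset = y1->size;
    y2->size = len_y2 * sizeof(int32_t);
}

/* Host-only stand-in for clEnqueueWriteBuffer: copies r.size bytes from
 * src to dst + r.offset, after checking that the range fits. */
void fake_write(unsigned char *dst, size_t capacity, region r, const void *src)
{
    assert(r.offset + r.size <= capacity);
    memcpy(dst + r.offset, src, r.size);
}
```

The kernel would then need to know where Y2 starts, for example through an extra integer argument holding the element count of Y1.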

Passing a 2D array to OpenCL is kind of possible, but OpenCL only deals with pointers. So you can simply copy the data into a cl_mem, but then you will have to do your own index calculations in the kernel (in elements, not bytes, so no sizeof is needed), e.g.:

value = kernel_input_array[array_width*y + x];
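The same row-major index calculation, written as ordinary C so it can be tried on the host. flat_index and read_2d are names chosen for this sketch; inside a kernel the expression would be identical, with x and y typically coming from get_global_id.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major index of element (x, y) in a 2D array stored as one flat
 * array with `width` elements per row. The result counts elements. */
size_t flat_index(size_t x, size_t y, size_t width)
{
    assert(x < width);
    return y * width + x;
}

/* Read element (x, y) from the flat array. */
int read_2d(const int *flat, size_t width, size_t x, size_t y)
{
    return flat[flat_index(x, y, width)];
}
```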

Thank you for the quick reply!

I was afraid of this. The reason I’m asking to begin with is that I’m trying to optimize the performance of my code. I’m rewriting a simple JPEG encoder to work with OpenCL and it’s going quite well. I’m currently sending four arrays of components (Y1, Y2, Cb and Cr) to a kernel (one call for each component) using four different cl_mem objects, and I am trying to minimize the memory transfer overhead. So instead of using four clEnqueueWriteBuffer calls, one for each component, I would like to use only one and place the components, for example, in a structure. Of course I could just copy each component value into a new array and then pass it over, but then I would lose performance on the host. However, according to your comment, this could also be done using a 2D array instead, since all components are integers. I will try this out and see what happens.

Just a thought: wouldn’t passing a 2D array be the same as passing a structure? I mean, I would have to do one clEnqueueWriteBuffer per first dimension of the array.
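Whether one write per row is needed depends on how the 2D array is allocated. If every row is a separate malloc, yes. If all rows live in one contiguous block, a single write of rows*cols elements covers everything. A minimal sketch of the contiguous variant follows, assuming all rows have the same length (which may not hold for downsampled chroma components). The type grid and its helper functions are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    int32_t *data;  /* rows*cols elements, contiguous: the part to transfer */
    int32_t **row;  /* row[i] points into data, host convenience only */
    size_t rows;
    size_t cols;
} grid;

/* Allocates the grid. Returns 0 on success, -1 on allocation failure. */
int grid_alloc(grid *g, size_t rows, size_t cols)
{
    assert(g != NULL && rows > 0 && cols > 0);
    g->data = malloc(rows * cols * sizeof *g->data);
    g->row = malloc(rows * sizeof *g->row);
    if (g->data == NULL || g->row == NULL) {
        free(g->data);
        free(g->row);
        return -1;
    }
    for (size_t i = 0; i < rows; ++i)
        g->row[i] = g->data + i * cols;
    g->rows = rows;
    g->cols = cols;
    return 0;
}

/* Number of bytes a single buffer write of g->data would have to cover. */
size_t grid_bytes(const grid *g)
{
    return g->rows * g->cols * sizeof *g->data;
}

void grid_free(grid *g)
{
    free(g->data);
    free(g->row);
}
```

On the host the data can still be addressed as g.row[y][x], while g.data and grid_bytes(&g) give the pointer and size for one transfer.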

I know the whole approach to this problem is wrong, I could do this much easier in other ways, but this is only an experiment to get to know how OpenCL works.

Anyway, thanks for the help!

//Kristoffer

clEnqueueWriteBuffer is just going to copy some chunk of data given by the pointer you pass in. If your compiler organizes your struct into such a contiguous chunk then it should work fine.
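A struct that holds two int pointers is not such a chunk: copying it transfers only the two host addresses, not the arrays. One way to get a variable-length structure that really is one contiguous chunk is a C99 flexible array member on the host, sketched below. packed_components, packed_size and pack are invented names, and int32_t stands in for cl_int. As far as I know OpenCL C does not accept flexible array members in kernel code, so the device side would have to treat the buffer as a plain __global int array and read the two lengths from its first two elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int32_t len_y1;  /* element count of Y1 */
    int32_t len_y2;  /* element count of Y2 */
    int32_t data[];  /* Y1 elements followed by Y2 elements */
} packed_components;

/* Total size in bytes of the header plus both arrays. */
size_t packed_size(int32_t len_y1, int32_t len_y2)
{
    assert(len_y1 >= 0 && len_y2 >= 0);
    return sizeof(packed_components)
         + ((size_t)len_y1 + (size_t)len_y2) * sizeof(int32_t);
}

/* Builds one contiguous chunk from two separate arrays.
 * The caller frees the result. Returns NULL if allocation fails. */
packed_components *pack(const int32_t *y1, int32_t len_y1,
                        const int32_t *y2, int32_t len_y2)
{
    packed_components *p = malloc(packed_size(len_y1, len_y2));
    if (p == NULL)
        return NULL;
    p->len_y1 = len_y1;
    p->len_y2 = len_y2;
    memcpy(p->data, y1, (size_t)len_y1 * sizeof(int32_t));
    memcpy(p->data + len_y1, y2, (size_t)len_y2 * sizeof(int32_t));
    return p;
}
```

The pointer p and packed_size(len_y1, len_y2) are then what a single buffer write would take. If the components are produced directly into data, the extra host-side copy in pack goes away.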

As for doing JPEG encode, you may want to pass in the RGB values and convert them to YCrCb on the GPU while doing the downsampling at the same time. There will be a tradeoff between the transfer overhead of more data and the computational cost of doing the conversion/downsampling.
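For reference, a sketch of the per-pixel conversion using the full-range JFIF coefficients, written as host C. Whether the encoder in this thread uses exactly these coefficients is an assumption, and the names ycbcr, clamp_u8 and rgb_to_ycbcr are made up. A kernel version would apply the same arithmetic per work-item.

```c
#include <stdint.h>

typedef struct {
    uint8_t y;
    uint8_t cb;
    uint8_t cr;
} ycbcr;

/* Round to nearest and clamp to the 0..255 range. */
static uint8_t clamp_u8(double v)
{
    if (v < 0.0)
        return 0;
    if (v > 255.0)
        return 255;
    return (uint8_t)(v + 0.5);
}

/* Full-range RGB to YCbCr conversion (JFIF coefficients). */
ycbcr rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b)
{
    ycbcr out;
    out.y  = clamp_u8(0.299 * r + 0.587 * g + 0.114 * b);
    out.cb = clamp_u8(128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b);
    out.cr = clamp_u8(128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b);
    return out;
}
```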

Either way, you are going to want to move your data into local memory for the DCT phase, and I’ll be very interested to hear how you handle the zero counting and Huffman!

I’m sorry to disappoint you, but I don’t think I will be doing any Huffman or zero counting on the GPU. I’m working on my master’s thesis, and my supervisor gave me, as a learning exercise, the task of rewriting an existing JPEG encoder to utilize the GPU with OpenCL. So as a design strategy I decided to make it as easy as possible for myself; therefore my code is structured as follows:
host only:

  • read bmp file and parse RGB array
  • extract component arrays and downsample

device AND host:
for each block in each component {
  • adjust
  • DCT
  • Quantization
  • ZigZag
}

host only:
  • (compare component array results)
  • rlc vlc encode
  • pad
  • write to output

I do the for-each loop twice, once on the host and once on the device. I then compare execution time and calculated values for that loop. With this setup I get a 30 times speedup with OpenCL (a Full HD bmp takes about 60 ms to encode, of which ~10*2 ms is data transfer and ~40 ms is kernel time), but I expect it to be even greater as soon as I get a newer graphics card :). I have a local work group size of 64, which equals one block. I can then vary the number of work groups to change how many blocks I run in parallel (just to see the difference in execution time). And of course each block that is processed is first copied into local memory. I found the Nvidia OpenCL Visual Profiler to be extremely helpful; there I can see what all the GPU power is used for. If you’re not registered with the Nvidia Developer Program I really recommend it (I think you can only get the profiler from there, and assuming you also use an Nvidia card).
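To illustrate how a work group of 64 maps onto one 8x8 block, here is a host-side sketch of the ZigZag step only. It is not the code from the encoder above. build_zigzag and zigzag_block are hypothetical names, and the loop variable id plays the role that get_local_id(0) would play in a kernel, so each "work-item" moves exactly one coefficient.

```c
#include <assert.h>

enum { BLOCK_DIM = 8, BLOCK_SIZE = BLOCK_DIM * BLOCK_DIM };

/* Fills order[] so that order[k] is the row-major index of the k-th
 * coefficient in JPEG zigzag order. Walks the 15 anti-diagonals,
 * alternating direction on each one. */
void build_zigzag(int order[BLOCK_SIZE])
{
    int n = 0;
    for (int s = 0; s <= 2 * (BLOCK_DIM - 1); ++s) {
        int lo = s > BLOCK_DIM - 1 ? s - (BLOCK_DIM - 1) : 0;
        int hi = s < BLOCK_DIM - 1 ? s : BLOCK_DIM - 1;
        if (s % 2 != 0) {
            /* odd diagonal: row index increases */
            for (int r = lo; r <= hi; ++r)
                order[n++] = r * BLOCK_DIM + (s - r);
        } else {
            /* even diagonal: row index decreases */
            for (int r = hi; r >= lo; --r)
                order[n++] = r * BLOCK_DIM + (s - r);
        }
    }
    assert(n == BLOCK_SIZE);
}

/* Reorders one block. Each iteration corresponds to one work-item. */
void zigzag_block(const int in[BLOCK_SIZE], const int order[BLOCK_SIZE],
                  int out[BLOCK_SIZE])
{
    for (int id = 0; id < BLOCK_SIZE; ++id)
        out[id] = in[order[id]];
}
```

Since the table never changes, it could be computed once on the host and passed to the kernel as a constant buffer rather than rebuilt per block.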

However, my actual thesis will (probably) be to examine (and rewrite?) the H.264/AVC video decoder with the SVC (Scalable Video Coding) extension. I could report back to you on that project, if you’re still interested. (I’ll probably have more questions later on anyway :).)

I’ll mess around a bit more with my existing code, just to see what more I can optimize. Thanks for your help!