OpenCL Newb Here - OpenCL Scoping

I was just designing my first ‘major’ OpenCL program and was wondering about scoping.

Now the spec says that the kernel code is a dialect of C (it’s based on C99), which is fine, but seeing as the file is just cracked open and the source fed into an OpenCL library routine, how exactly does the scoping work?

I mean, if I write a host program on the CPU that reads in the kernel source, the kernel code wouldn’t have access to a struct defined at the top of the host C/C++ code, would it? Could you even include a common header, given the memory differences? Or do you have to pass the kernel routines only primitive types?

The scoping isn’t really intuitive due to how the code is read in and I can’t find a lot of resources that explain it clearly.

Thanks!

The scoping in the spec refers to the kernel code itself: one kernel function calling another kernel function. As you correctly observed, the kernel code has no relation to the host code. But, as you said, it is possible to define common headers and include them in both host and kernel code. If there are any issues with structure padding, you can always use your compiler’s packing attributes, such as gcc’s __attribute__((packed)).

So any common structs that the kernel wants to access can be defined in a header file common to both?

Hopefully I won’t need to worry about structure padding :(. I’ve never dealt with it before, and I’d have a miserable time if it popped up.

Yep; I just include the same header file for OpenCL that I do in my C code that sets up the structure.

The only thing is, instead of
#include "myfile.h"
you have to say:
#include "the/full/path/to/myfile.h"

and it works just fine!

Okay, thx, maybe I can get this working now!

If you’re only accessing certain fields of long arrays of structures (e.g. each work-item is processing one object), you might want to read up on ‘structures of arrays vs arrays of structures’. Objects/structures can lead to sub-optimal memory access patterns on any CPU, but on a parallel device the cost can be quite high.

The ideal memory access pattern is consecutive addresses across work-items (i.e. use get_global_id() as the index), and for a GPU an item size of 16 bytes (e.g. float4) is usually best, although smaller sizes aren’t too bad either.

Yeah, I’ve read about the get_global_id() method of indexing before, that’s the way the tutorial showed it too.

Should be fine with the 16-byte deals; the code I’m parallelizing uses doubles, and those should be 8 bytes / 64 bits on most machines.

The problem is (this is a ray tracer) that there are many structs (i.e. pixels, rays, etc.) which typically hold RGB values or XYZ/DXDYDZ, so around three or six floating-point values each.

I’m not sure what the best setup for that is, but since I’m porting existing code to OpenCL and the data is already laid out as arrays of structs, that seems to be pretty much what I have to work with. It shouldn’t be terrible, I don’t think, but probably not as optimal as structs of arrays, am I right?

This probably won’t be optimal; I’m an OpenCL newb, and this project kind of has to get done rather than get done fast, you know what I mean? As long as it’s ray tracing in real time at a decent resolution I’ll probably leave it alone.

[i]I’m not trying to change the topic, sorry to jump off topic, but you seem to know a bit about OpenCL, so I figure I’ll just ask:

  • Can you perform recursion in OpenCL code? I know you can’t in earlier versions of CUDA for some reason. I can convert the process to iteration if I need to, but I’m pretty sure it’s set up with recursion right now, as that’s typical for ray tracing.[/i]

[quote]Yeah, I’ve read about the get_global_id() method of indexing before; that’s the way the tutorial showed it too.

Should be fine with the 16-byte deals; the code I’m parallelizing uses doubles, and those should be 8 bytes / 64 bits on most machines.

The problem is (this is a ray tracer) that there are many structs (i.e. pixels, rays, etc.) which typically hold RGB values or XYZ/DXDYDZ, so around three or six floating-point values each.[/quote]

At the moment you can’t pass pointers around, so if you have any of those you will have to flatten the data to another structure anyway.

And once you’re doing that, you may as well put it in a format that suits the device that will be using it.

[quote]I’m not sure what the best setup for that is, but since I’m porting existing code to OpenCL and the data is already laid out as arrays of structs, that seems to be pretty much what I have to work with. It shouldn’t be terrible, I don’t think, but probably not as optimal as structs of arrays, am I right?

This probably won’t be optimal; I’m an OpenCL newb, and this project kind of has to get done rather than get done fast, you know what I mean? As long as it’s ray tracing in real time at a decent resolution I’ll probably leave it alone.[/quote]

Hah, you don’t ask much :wink:

The problem with GPU code is that poor code can easily be 10-100x slower than good code (whereas on a CPU it might only be a 2-5x difference).

But if you can get adequate performance for your problem, then yes, don’t get too involved with it - but just keep it in mind if you are seeing particularly poor results.

[quote][i]I’m not trying to change the topic, sorry to jump off topic, but you seem to know a bit about OpenCL, so I figure I’ll just ask:

  • Can you perform recursion in OpenCL code? I know you can’t in earlier versions of CUDA for some reason. I can convert the process to iteration if I need to, but I’m pretty sure it’s set up with recursion right now, as that’s typical for ray tracing.[/i][/quote]

Originally GPUs had no return stack, and all code was basically unrolled into a linear stream of operations (with, at best, loops and branches only). The hardware simply didn’t support recursion, and recursion is never strictly necessary to implement an algorithm (i.e. you can always rewrite it with loops and an explicit stack).

I haven’t had the need for recursion myself, so I don’t know for sure; just Google it. From what I can tell from the first page of results, it isn’t available.


Yeah, flattening the data and copying it into the GPU buffers shouldn’t be a problem. I figured pointers wouldn’t work because you’re dealing with different memory pools.

It’s just difficult to tell how the data ‘suits the device’, lol. Easy to say, though :-P.

And yeah, I ask a lot, haha.

I didn’t know that about GPU code being slower if written poorly, but I guess I should have figured it would be. I’ll keep that in mind.

If the code does contain recursion (which it most likely does), converting it to iteration shouldn’t be an issue.

Erm, it could be the same #include line if you pass the path via the “-I” flag to the OpenCL compiler, just as you would with your C compiler.