3D Image or 2D Image Array?

I’m new to OpenCL (and GPGPU in general) and I was hoping to get some advice on my problem. I need to process an array of 2D images (each one exactly the same size) and I’m unclear if the 2D image array introduced in OpenCL 1.2 is the way to go or if I should use a 3D image. Are there any guidelines as to when to use each of these types? Here’s a quick breakdown of my problem:
[ul]
[li]My target device is an Intel HD 4400 with 20 CUs and memory shared with the host.
[/li][li]Each image is exactly the same size (~5 MP), and there is a fixed number of images (~20).
[/li][li]My kernel can execute independently for each 2D pixel position (row/col), but it depends on that pixel's value in every slice.
[/li][/ul]
Here’s a quick sketch of a non-parallel implementation if that helps:


for (int row = 0; row < maxRows; row++)
{
    for (int col = 0; col < maxCols; col++)
    {
        // The body of this loop is the kernel implementation; no further parallelism is allowed.
        output[row][col] = ChewData(images[0][row][col], images[1][row][col], ..., images[19][row][col]);
    }
}

Obviously, my code will be a bit more complex than that but this should give the general idea without any unneeded detail.
Here are my specific questions:
[ul]
[li]Is there any particular reason to favor a single 3D image over an array of 2D images?
[/li][li]With only 20 compute units and ~5 million work items (one per pixel) clearly each CU will be handling many pixels… Do I need to do anything to partition the tasks or will OpenCL handle that for me? I know I will need to specify the NDRange according to my problem set but I’m not sure what else I need to do. Perhaps there is a better way to partition the large input data set in memory that will improve performance…(?)
[/li][li]Catch all: Am I missing something key here? Is there a better way to solve this problem (based on the admittedly little information I’ve provided…)
[/li][/ul]
Thanks in advance! (and sorry for the clearly newbie level of questions :slight_smile: )
-Matt

Another question I should probably have asked… Should I even bother with images? Further reading seems to indicate that image objects are designed to cache neighboring pixels so that they are readily available to parallel operations. However, my calculation does not use neighboring pixels in 2D, only the corresponding pixel in each slice… Can I configure the image object cache to behave in a useful way? If I were to change to a simple buffer of 16-bit integers, could I pass an array of these buffers to my kernel? (one per image)
-Matt

Matt,

Lots of good questions here.

  • On the hardware I have, 2D image arrays perform as well as or better than 3D images. If you needed linear interpolation or clamping, you would be forced to use 3D images, but looking at your code, it appears you are not in that scenario. So I would suggest trying 2D image arrays.
  • You shouldn’t need to do anything special to distribute your work between the execution units. The runtime and hardware will do that for you.
  • The access pattern you show in your pseudo-code should work well with either buffers or images. The advantage of using buffers on your integrated GPU is that you can move data between the CPU and GPU without copying it. But then you lose the ability to pass an array the way you can with images: you will either have to pass in one big buffer and compute offsets into it (the best choice if it works for your algorithm), or pass in 20 separate buffers (which might still work).
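To make the "one big buffer and compute offsets" option concrete, here is a minimal sketch in plain C. The helper name `sliceOffset` and the slice-major layout are my own assumptions, not anything from Matt's code; the same arithmetic would apply inside an OpenCL kernel on a `__global ushort*` argument.

```c
#include <stddef.h>

/* Assumed layout: all ~20 slices packed back-to-back in one buffer,
 * slice-major, then row-major within a slice:
 *   index = slice * (maxRows * maxCols) + row * maxCols + col
 * A work-item at (row, col) reads this index once per slice. */
size_t sliceOffset(size_t slice, size_t row, size_t col,
                   size_t maxRows, size_t maxCols)
{
    return slice * maxRows * maxCols + row * maxCols + col;
}
```

Whether this beats 20 separate buffer arguments depends on your driver; the single-buffer version at least avoids hitting any limit on the number of kernel arguments.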

Hope this helps!

Aaron

My .02, from a graphics background.

[ul]
[li]Is there any particular reason to favor a single 3D image over an array of 2D images?[/li]In practice, no, because the two are not freely interchangeable. Slice selection in a 3D volume is not the same thing as resource selection in an array: it differs at the hardware level (on some hardware) and does not provide the same functionality, so it isn't the same at the high level either. You likely won't be given a choice. Use 3D if you need accelerated filtering.
[li]With only 20 compute units and ~5 million work items (one per pixel) clearly each CU will be handling many pixels… Do I need to do anything to partition the tasks or will OpenCL handle that for me?[/li]In theory you should not have any issues. In practice, I know some drivers have watchdogs and kill kernel execution if it takes too long; I know for sure an Intel driver does that on some Apple OSes. In practice the number of WIs is hardly relevant; what matters is how frequently they are dispatched to the hardware. As long as each kernel invocation stays below a certain threshold, you should have no issues.
[li] I know I will need to specify the NDRange according to my problem set but I’m not sure what else I need to do. Perhaps there is a better way to partition the large input data set in memory that will improve performance…(?)[/li]Not by just changing the NDRange parameters. If you assume 1 WI = 1 thread, you'll most likely consume far more memory bandwidth than needed… but since that's an Intel iGPU, it might well be just the opposite, as they basically have their own architecture.
[li]Should I even bother with images?[/li]It could be debated, as images have some advantages. The main one is that you get free filtering and wrap addressing through the texture units. Since they can be packed into arrays, they are slightly more flexible than buffers. In practice I haven't found these features very valuable so far; your mileage may vary.
[li] Further reading seems to indicate the image objects are designed to handle caching neighboring pixels so that they are readily available in parallel operations.[/li]This is truly implementation dependent. Images do have hardware constructs to avoid bank/channel conflicts (on some hardware). Some hardware caches everything no matter what. Instructions such as “gather4” might be interesting to you, but I don't recall CL 1.2 having one.
[li]Can I configure the image object cache to behave in a useful way?[/li]Nope; maybe through an extension, but I'm not aware of one.
[li]If I were to change to a simple buffer of 16 bit integers could I pass in an array of these buffers to my kernel? (one per image)[/li]Nope; as kunze notes, you cannot pass an array of buffer pointers. At the hardware level, buffer pointers must often be statically determined. If you can fit all the data into one buffer, you can be fairly sure that will work best.
[/ul]
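For what it's worth, the access pattern in Matt's pseudo-code reduces across slices at a fixed pixel. Below is a hedged plain-C sketch of what one work-item would compute with a single packed buffer; `chewPixel` and the sum reduction are illustrative stand-ins for the real `ChewData`, not anything from the original code.

```c
#include <stddef.h>

/* Stand-in for ChewData: one call per pixel (i.e., per work-item),
 * walking the same (row, col) across every slice of a buffer packed
 * slice-major, row-major. The sum is a placeholder reduction. */
unsigned chewPixel(const unsigned short *packed,
                   size_t row, size_t col,
                   size_t numSlices, size_t maxRows, size_t maxCols)
{
    unsigned acc = 0;
    for (size_t s = 0; s < numSlices; s++)
        acc += packed[s * maxRows * maxCols + row * maxCols + col];
    return acc;
}
```

Since each work-item touches 20 widely separated addresses, having work-items in the same work-group handle adjacent columns keeps each per-slice read coalesced.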