RGB images

To create an image object, the format must specify the order and data type of the channels. Table 5.4 in the spec lists the supported image channel order values and says, among other things:

CL_RGB. This format can only be used if channel data type = CL_UNORM_SHORT_565, CL_UNORM_SHORT_555 or CL_UNORM_INT_101010.
This leaves out the most common image format of all: good old 8-bit RGB. On the other hand, Nvidia’s implementation succeeds in creating an image object with a format of CL_RGB and CL_UNORM_INT8. Is that a bug, or is the spec supposed to say “must be used” instead of “can only be used”? I was hoping to create RGB8 and RGB32F image objects.

You can query the device for the formats it supports using clGetSupportedImageFormats(). I would suggest you do that and use only the formats it reports, as I’ve seen other people create images with unsupported formats without getting an error; that appears to be a bug in the Nvidia driver. If you want 8-bit images you will have to use CL_RGBA.
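A minimal sketch of that check might look like the helper below, which just scans the list that clGetSupportedImageFormats() fills in for a given (order, data type) pair. The struct layout matches cl_image_format from CL/cl.h, but the typedefs and enum values here are stand-ins so the sketch compiles on its own; real code should include CL/cl.h and populate the array from an actual context.

```c
#include <stddef.h>

/* Stand-ins for the types in CL/cl.h, so this sketch compiles on its own.
 * Real code should #include <CL/cl.h> and fill the array with
 * clGetSupportedImageFormats(context, flags, CL_MEM_OBJECT_IMAGE2D, ...). */
typedef unsigned int cl_channel_order;   /* e.g. CL_RGBA */
typedef unsigned int cl_channel_type;    /* e.g. CL_UNORM_INT8 */
typedef struct {
    cl_channel_order image_channel_order;
    cl_channel_type  image_channel_data_type;
} cl_image_format;

enum { CL_RGB = 0x10B4, CL_RGBA = 0x10B5 };          /* stand-in values */
enum { CL_UNORM_INT8 = 0x10D2, CL_FLOAT = 0x10DE };  /* stand-in values */

/* Return 1 if the (order, type) pair appears in the supported-format list,
 * 0 otherwise. */
static int format_supported(const cl_image_format *formats, size_t count,
                            cl_channel_order order, cl_channel_type type)
{
    for (size_t i = 0; i < count; ++i)
        if (formats[i].image_channel_order == order &&
            formats[i].image_channel_data_type == type)
            return 1;
    return 0;
}
```

With the reported list in hand you can fall back to CL_RGBA/CL_UNORM_INT8 whenever CL_RGB/CL_UNORM_INT8 is absent, rather than trusting clCreateImage2D() to fail cleanly.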

Bug indeed: the driver reports that it doesn’t support CL_RGB at all, and while it succeeds in creating the image object, it crashes when I attempt to map it.

What I want is 32-bit floating-point RGB, which I realise I can’t get from an image object. Should I waste memory on an unused alpha channel, or is it better to use a float buffer? I know textures have caching; is that faster than using local memory? Are there any other advantages or disadvantages to using buffers instead of textures?

Today texture caching is a major performance win if you have locality of access to your image. Local memory will be faster if used correctly, but the overhead of loading and accessing it is high. I would advise you to start with an RGBA image, and if that doesn’t give good enough performance or uses too much memory, then investigate a float buffer with local memory. It’s really a pain to manually tile image data into local memory if you need overlapping regions between work-groups, though.

And that’s exactly what I’d need, since I’m resampling the image with a 4x4 filter. Another problem is that the image is smaller than the NDRange (half the size). Sounds like texture caching is the easiest (and quite fast) way to go, but since I’d like to learn: what’s the right way to use local memory while avoiding uncoalesced memory accesses? I suppose I have to make sure only some threads do the reading while the others idle, since there are more threads than pixels.
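One common alternative to letting threads idle is to have the whole work-group loop over the tile cooperatively, each work-item starting at its local id and striding by the group size; then every work-item participates in the load regardless of how the tile size relates to the group size, and consecutive work-items touch consecutive addresses on every pass. Here is a plain-C simulation of just the index math (the tile and group sizes are made-up figures for illustration, not anything from the thread):

```c
#include <string.h>

/* Simulate one work-group cooperatively loading a tile plus its apron into
 * local memory.  A 4x4 filter needs 3 extra rows/columns, so a 16x16
 * output tile reads a 19x19 input region.  Each work-item starts at its
 * local id and strides by the group size, so no work-item sits idle and
 * consecutive work-items read consecutive addresses on every pass. */
#define GROUP_SIZE 256            /* 16x16 work-items, made-up size */
#define TILE       16
#define APRON      3              /* 4x4 filter: 3 extra pixels */
#define TILE_W     (TILE + APRON) /* 19 */
#define TILE_ELEMS (TILE_W * TILE_W)

/* Record how many times each tile element would be written. */
static void simulate_tile_load(int loads[TILE_ELEMS])
{
    memset(loads, 0, TILE_ELEMS * sizeof loads[0]);
    for (int lid = 0; lid < GROUP_SIZE; ++lid)           /* each work-item */
        for (int i = lid; i < TILE_ELEMS; i += GROUP_SIZE)
            loads[i]++;    /* stands in for: local_tile[i] = input[...]; */
}
```

Running the simulation shows every one of the 361 tile elements loaded exactly once: work-items 0–255 each load one element on the first pass, and work-items 0–104 load one more on the second.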

Memory accesses to local memory work a bit differently from global memory and depend a lot on the hardware. On current-generation Nvidia machines I believe there are generally 8 banks, which means 8 different work-items should be able to access 8 consecutive addresses at the same time. Optimizing local memory access consists of avoiding as many bank conflicts across your work-items as possible so they get the full bandwidth. There are a bunch of papers on how to do this, but basically it means figuring out what your data access patterns really look like and making sure they map well to the way the banks are split up. (Again, this is highly hardware-dependent, so it will vary from architecture to architecture.)
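Whether a given pattern conflicts can be checked on paper: with N banks and one 32-bit word per bank, word i lives in bank i % N, and the worst-case number of work-items landing in the same bank is the degree of the conflict. A small plain-C sketch, using the 8-bank figure from above (the real bank count and word size vary by architecture):

```c
/* With NUM_BANKS banks, consecutive 32-bit words map to consecutive banks,
 * so word index i lives in bank i % NUM_BANKS.  The worst-case number of
 * work-items hitting one bank is the conflict degree: 1 is conflict-free,
 * NUM_ITEMS means the accesses fully serialize. */
#define NUM_BANKS 8   /* per the figure above; varies by hardware */
#define NUM_ITEMS 8

static int conflict_degree(int stride)
{
    int hits[NUM_BANKS] = {0};
    int worst = 0;
    for (int item = 0; item < NUM_ITEMS; ++item) {
        int bank = (item * stride) % NUM_BANKS; /* item reads word item*stride */
        if (++hits[bank] > worst)
            worst = hits[bank];
    }
    return worst;
}
```

conflict_degree(1) is 1 (each work-item gets its own bank), while conflict_degree(8) is 8: every work-item hits bank 0 and the accesses serialize. Note that any odd stride is conflict-free with a power-of-two bank count, which is why padding a local array by one element is a popular fix.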

Coalesced memory accesses, on the other hand, are relevant only to global memory. There you just want to make sure the work-items in a work-group access consecutive memory locations when you read data into local memory or write it back out to global.