Constant Memory latency

We know that on GPUs (Nvidia’s specifically) global memory access is a lot slower than local storage. Does anybody know how the other memory spaces, in particular constant memory, compare?

I have some routines that calculate values from tables defined as static constants in the source, and I’m wondering whether copying these into local memory might give a speed increase. The OpenCL spec says that constants are allocated in a region of global memory, so should I expect to need the same caching techniques as I use with global memory, or do constants get loaded into a faster store?

I’d suggest looking at the Nvidia programming guides. I don’t remember off-hand where they are stored in hardware, but it isn’t the same physical location as global memory.

My understanding is that to get the absolute maximum performance for MADs you need to have one source come from local memory, one from registers, and one from constant memory, so that would suggest it’s different.

You could always write a simple copy kernel to see what is fastest. :)

Paul,

Section 3.2.5 (Constant Memory) of NVIDIA’s OpenCL Best Practices Guide says:

… The constant memory space is cached. As a result, a read from constant memory costs one memory read from device memory only on a cache miss; otherwise, it just costs one read from the constant cache. For all threads of a half warp, reading from the constant cache is as fast as reading from a register as long as all threads read the same address. …

I also recommend reading the whole Dr. Dobb’s article series “CUDA, Supercomputing for the Masses”, starting with Part 1.

… For all threads of a half warp, reading from the constant cache is as fast as reading from a register as long as all threads read the same address. …

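Thanks for the link. That sounds like it could be what’s causing my problem: each work-item (intentionally) accesses these tables at random, so the work-items in a half warp will rarely read the same address and I won’t get that register-speed path (and I may well be missing the cache too). It sounds like it’s worth experimenting with moving the tables to local memory.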
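To make the same-address condition from the quoted guide passage concrete, here is a minimal host-side sketch in plain C. It is not OpenCL and it measures nothing. It assumes the rule that, as far as I understand NVIDIA’s documentation for that hardware generation, reads to different constant addresses within a half warp are serialized, so the cost grows with the number of distinct addresses. The names HALF_WARP and constant_reads_for_half_warp are made up for illustration, and cache misses are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 16 threads form the half warp described in the quoted guide. */
#define HALF_WARP 16

/* Hypothetical helper: counts the distinct table indices requested by one
   half warp. Under the assumed serialization rule, this is the number of
   constant-cache reads the half warp needs for one lookup. */
int constant_reads_for_half_warp(const unsigned idx[], size_t n)
{
    assert(n > 0 && n <= HALF_WARP);

    int distinct = 0;
    for (size_t i = 0; i < n; ++i) {
        int seen_before = 0;
        /* Check whether an earlier thread already asked for this index. */
        for (size_t j = 0; j < i; ++j) {
            if (idx[j] == idx[i]) {
                seen_before = 1;
                break;
            }
        }
        if (!seen_before) {
            ++distinct;
        }
    }
    return distinct;
}
```

If all 16 work-items pass the same index, the function returns 1, which is the register-speed case from the guide. If they pass 16 different indices, it returns 16, which is roughly what random table access looks like under this model. That is the case where staging the tables in local memory is worth trying.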