Nature of private memory missing from OpenCL spec

As far as I can tell, there is no way to determine whether private memory is mapped to global memory or is true local memory that just isn’t shared across the workgroup. It would be strange for private memory to be mapped to global when local memory exists, but I see nothing in the spec that prevents it. Conversely, it is easy to imagine a device where fast private memory exists but local memory doesn’t, and again I don’t see how to determine this. Even the local/global distinction is of limited use without relative access times. In many matrix algorithms, fast local memory makes it worthwhile to copy a submatrix into shared local memory first and have the work-items read from that instead of global memory. But unless local memory is faster by some appropriate ratio, the copy is wasted effort. I don’t see how to query for this information.
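One rough way to put a number on "some appropriate ratio" is a deliberately simplified cost model. It is an assumption for illustration, not anything the spec states: suppose each tile element costs t_g to read from global memory and t_l to read or write in local memory, and is reused r times by the workgroup. Caches, bank conflicts, barrier cost and latency hiding are all ignored.

```latex
\underbrace{t_g + t_l}_{\text{copy into local}} + r\,t_l \;<\; r\,t_g
\quad\Longleftrightarrow\quad
\frac{t_g}{t_l} \;>\; \frac{r+1}{r-1} \qquad (r > 1)
```

In this model, with r = 16 the copy pays off once global access is more than 17/15 ≈ 1.13 times slower than local access, while with r = 2 it needs a factor of 3. Real devices will deviate from this, which is why the ratio would have to be measured or queried.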

The specification wouldn’t be the right place to give you this information.
Where private memory is mapped depends on the device and the compiler. On some devices (e.g. CPUs) there simply isn’t any software-managed local memory. Memory access times are obviously hardware dependent as well.
So the only sensible approach would be a way to query this kind of information at runtime. I don’t think that is possible at the moment, though.

Guess I wasn’t clear. I was suggesting that the spec should include a means to query for this information.

The speed depends so heavily on how the memory is used that a single number would be about as useful as the headline specs for the GPU in the first place. (You never reach the maximum theoretical numbers they quote.) The only rational way to decide whether to use a certain feature is to measure the performance you actually get. It’s easy enough to write a kernel that tests local memory performance, so if you’re trying to decide whether it is worth using, just run that kernel and decide on a per-GPU basis.
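A minimal sketch of such a test, assuming a simple "every work-item reads its whole group's slice" access pattern. The names kBenchSource, sum_global, sum_local and local_is_worthwhile are hypothetical, and the host code that builds and times the kernels is left out. One way to time them is event profiling with clGetEventProfilingInfo on a queue created with CL_QUEUE_PROFILING_ENABLE.

```c
/* OpenCL C source for two kernels that do the same arithmetic.
 * sum_global reads every operand from global memory; sum_local stages the
 * workgroup's slice in __local memory first. Pass this string to
 * clCreateProgramWithSource. Assumes the global size is a multiple of the
 * workgroup size and no global offset is used. */
const char *const kBenchSource =
    "__kernel void sum_global(__global const float *in,\n"
    "                         __global float *out)\n"
    "{\n"
    "    size_t lid  = get_local_id(0);\n"
    "    size_t n    = get_local_size(0);\n"
    "    size_t base = get_group_id(0) * n;\n"
    "    float acc = 0.0f;\n"
    "    /* each work-item reads the whole slice from global memory */\n"
    "    for (size_t i = 0; i < n; ++i)\n"
    "        acc += in[base + (lid + i) % n];\n"
    "    out[get_global_id(0)] = acc;\n"
    "}\n"
    "\n"
    "__kernel void sum_local(__global const float *in,\n"
    "                        __global float *out,\n"
    "                        __local float *tile)\n"
    "{\n"
    "    size_t lid  = get_local_id(0);\n"
    "    size_t n    = get_local_size(0);\n"
    "    size_t base = get_group_id(0) * n;\n"
    "    tile[lid] = in[base + lid];     /* one global read per work-item */\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);   /* wait until the tile is complete */\n"
    "    float acc = 0.0f;\n"
    "    for (size_t i = 0; i < n; ++i)\n"
    "        acc += tile[(lid + i) % n];\n"
    "    out[get_global_id(0)] = acc;\n"
    "}\n";

/* Decide from two measured kernel times (same units) whether the local
 * variant is worth keeping on this device. min_speedup is a threshold the
 * caller picks, e.g. 1.2 to require a 20% gain. Returns 1 or 0. */
int local_is_worthwhile(double global_ms, double local_ms, double min_speedup)
{
    if (global_ms <= 0.0 || local_ms <= 0.0)
        return 0;  /* missing or invalid timings: treat as not proven */
    return (global_ms / local_ms) >= min_speedup;
}
```

The __local argument of sum_local would be sized on the host with clSetKernelArg(kernel, 2, n * sizeof(float), NULL). A compiler or cache may blur the difference for a pattern this simple, so the access pattern of the real algorithm is the better thing to benchmark.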

Will there ever be an easy way to query this info?

There is a device property, CL_DEVICE_LOCAL_MEM_TYPE, which you can query with clGetDeviceInfo. The spec describes it as:

Type of local memory supported. This can be set to CL_LOCAL implying dedicated local memory storage such as SRAM, or CL_GLOBAL.

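I believe this is what you are looking for, although it applies to __local, not __private. Private memory is intended to be as ‘close’ to the processor as possible (typically registers), so it can usually be presumed to be the fastest memory available, though a compiler may spill large private arrays to slower memory.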
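The query described above can be sketched as follows. clGetDeviceInfo, CL_DEVICE_LOCAL_MEM_TYPE, CL_LOCAL and CL_GLOBAL are real OpenCL names. HAVE_OPENCL, query_local_mem_type and has_dedicated_local_mem are hypothetical names introduced here, and the fallback constants are only there so the helper compiles without an OpenCL SDK.

```c
#ifdef HAVE_OPENCL  /* hypothetical macro: define it when building against an OpenCL SDK */
#include <stddef.h>
#include <CL/cl.h>

/* Returns CL_LOCAL or CL_GLOBAL for `dev`, or 0 if the query fails. */
unsigned int query_local_mem_type(cl_device_id dev)
{
    cl_device_local_mem_type type = 0;
    cl_int err = clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_TYPE,
                                 sizeof(type), &type, NULL);
    return err == CL_SUCCESS ? (unsigned int)type : 0u;
}
#else
/* Fallback values mirroring cl.h, so the helper below builds standalone. */
#define CL_LOCAL  0x1
#define CL_GLOBAL 0x2
#endif

/* 1: the device reports dedicated local storage (e.g. SRAM), so staging
 *    data in __local memory may pay off.
 * 0: local memory is backed by global memory, or the type is unknown. */
int has_dedicated_local_mem(unsigned int local_mem_type)
{
    return local_mem_type == CL_LOCAL;
}
```

On a device that reports CL_GLOBAL, copying a tile into __local memory is unlikely to help, so this check is a cheap first filter before running a benchmark. It still says nothing about where __private data ends up.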