Wishes for OpenCL 1.1 and beyond!

Judging by the info coming out of GTC, it seems that Khronos is looking for feedback on OpenCL 1.1 and beyond.

Here is my dump :wink: of crazy ideas.
All of them are for GPU devices:

Make these DirectCompute 5.0 hardware features core:

*Atomics to global and local mem. (int32 base and extended extensions)
-> now that this is supported, it would be good to also add the following (see the atomics sketch after this list):
*Append/consume buffers (see AMD stuff), a global queue/stack accessible with no hazards…
*Byte-addressable store support.
*Half support (cl_khr_fp16)
*Require that local mem is not of type global (as on 4xxx cards, due to their LDS write restrictions…)
*Expanded DirectCompute 5.0 integer support (bit count, bit reverse, etc…)
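
Just for reference, a minimal sketch of what the already-ratified int32 atomics look like in kernel code today (OpenCL 1.0 atom_* spelling); the histogram kernel is a hypothetical example, assuming a work-group size of 256:

    #pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
    #pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable

    /* Hypothetical example: 256-bin histogram built with the int32
       base atomics extensions (assumes a work-group size of 256). */
    __kernel void histogram(__global const uchar *data,
                            __global uint *bins,
                            uint n)
    {
        __local uint local_bins[256];
        uint lid = get_local_id(0);
        uint gid = get_global_id(0);

        local_bins[lid] = 0;
        barrier(CLK_LOCAL_MEM_FENCE);

        if (gid < n)
            atom_inc(&local_bins[data[gid]]);      /* local atomic  */
        barrier(CLK_LOCAL_MEM_FENCE);

        atom_add(&bins[lid], local_bins[lid]);     /* global atomic */
    }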

As for doubles (cl_khr_fp64): since they are an optional feature of compute shaders, as the 57xx cards prove, no luck there…

Also, if it’s not currently required, require the following (a capability-query sketch follows this list):

*Image support for FULL profile.
*OpenGL interop for GPU devices.
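
All of this is already queryable, so here is a rough sketch of the capability check a portable app has to do right now; the wish is simply that a FULL-profile GPU device would be guaranteed to report CL_TRUE / list the extension:

    #include <stdio.h>
    #include <string.h>
    #include <CL/cl.h>

    /* Sketch: check image support and a few extensions on one device
       (error handling omitted for brevity). */
    static void check_caps(cl_device_id dev)
    {
        cl_bool has_images;
        char ext[4096];

        clGetDeviceInfo(dev, CL_DEVICE_IMAGE_SUPPORT,
                        sizeof(has_images), &has_images, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_EXTENSIONS, sizeof(ext), ext, NULL);

        printf("images:     %s\n", has_images ? "yes" : "no");
        printf("fp16:       %s\n", strstr(ext, "cl_khr_fp16") ? "yes" : "no");
        printf("fp64:       %s\n", strstr(ext, "cl_khr_fp64") ? "yes" : "no");
        printf("GL sharing: %s\n", strstr(ext, "cl_khr_gl_sharing") ? "yes" : "no");
    }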

Add extensions, or promote them to core, depending on whether the support is AMD/Nvidia specific or multivendor:

  • Multivendor:
    *Add support for accessing system host mem from GPU kernels:
    this is currently supported on both Nvidia and AMD devices and exposed in CUDA 2.2 and up and in CAL:
    the so-called pinned system mem (in CUDA 2.2 for GT200 devices) and host mem export (AMD CAL). (See the mapping sketch after this list.)
    *Implement DirectX interop (AMD already ships a header).
    *Expose info about integer support… whether there are native 24-bit int muls (CUDA devices before Fermi, and AMD 5xxx (every ALU)) or native int32 muls (Fermi, AMD 4xxx and 5xxx (only the 5th ALU))…
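
For comparison, the closest OpenCL gets to this today is allocating a buffer with CL_MEM_ALLOC_HOST_PTR and mapping it; whether that really gives you pinned, GPU-visible system memory is left to the implementation, which is exactly why an explicit extension would help. A rough sketch (error handling omitted, ctx/queue assumed to exist):

    /* Sketch: the implementation *may* back this with pinned host mem,
       but nothing in the current spec guarantees it. */
    cl_int err;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                size, NULL, &err);

    /* Map it to fill from the host; on current NVIDIA/AMD stacks this
       typically avoids an extra staging copy when the buffer is pinned. */
    void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                   0, size, 0, NULL, NULL, &err);
    /* ... fill ptr ... */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);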

AMD-proposed ones (some are hardware features mentioned in the 5xxx press kit, some have 4xxx hardware support):

*Global Data Share and wavefront sync support (GDS, etc…)
*Native SAD hardware support (expose a SAD instruction and the ability to query whether it’s supported by the hardware).
*Expose registers shared per SIMD… (shared registers are available in CAL compute shaders and allow doing reductions in a fixed number of steps, say 2 or 3, vs. log N; see the reduction sketch below)
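
For context, this is the standard log N reduction through local mem that those shared-per-SIMD registers would shortcut; just a baseline sketch in plain OpenCL C, not CAL/GDS code:

    /* Baseline: classic log2(work-group size) reduction through __local
       memory. The wish above is for shared registers that collapse this
       loop into a fixed 2-3 step combine. */
    __kernel void reduce_sum(__global const float *in,
                             __global float *partial,
                             __local float *scratch)
    {
        uint lid = get_local_id(0);
        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            partial[get_group_id(0)] = scratch[0];
    }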

Nvidia ones:

*Improve the memory API to support the CUDA 2.2 mem improvements: expand support for creating “shared pinned buffers” (in CUDA parlance), i.e. buffers of host mem that are pinned and usable from multiple GPUs as pinned mem (via DMA),
and also shared pinned system mem.

*Expose partial support for image mem objects with simultaneous read/write, under strict limitations: expose the current Direct3D 11 RWTexture abilities (as seen in the Direct3D SDK August 2009 OIT demo) and also those of the NV_texture_barrier OpenGL extension for reading from an already bound FBO texture:
basically reading the same texel before writing to it… possibly also some instruction in host or kernel code allowing the texture cache to be flushed…

*Expose interop with CUDA:
Code interop: support for interchanging PTX kernel code between CUDA functions and OpenCL functions with identical names and arguments (signature), loading it via clCreateProgramWithBinary… (see the binary-loading sketch below)
Mem interop: ability to use mem buffers allocated from CUDA in OpenCL or vice versa…
This should directly allow supporting the proposed “shared pinned buffers”.
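
The host-side binary path already exists in the API; a sketch of what loading a prebuilt module could look like, assuming the implementation accepts PTX as its “binary” format (ptx/ptx_len and the kernel name "my_kernel" are hypothetical):

    /* Sketch: load a precompiled module (PTX on NVIDIA, in this wish)
       into an OpenCL program object. ptx/ptx_len hold the module. */
    const unsigned char *bins[] = { (const unsigned char *)ptx };
    size_t lens[] = { ptx_len };
    cl_int bin_status, err;

    cl_program prog = clCreateProgramWithBinary(ctx, 1, &device,
                                                lens, bins,
                                                &bin_status, &err);
    err = clBuildProgram(prog, 1, &device, NULL, NULL, NULL);

    /* The kernel name and signature would have to match the CUDA side. */
    cl_kernel k = clCreateKernel(prog, "my_kernel", &err);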

*Fermi support. Provide new extensions supporting these features:

*Expose function pointer and stack support, which provides true function calls and recursion…
*Expose Fermi support for executing host code inside kernels
*Expose Fermi support for allocating mem in kernels (malloc/free functions)
*Expose the C++ language in kernels (?)
*Expose expanded information about ECC support: say, ECC-protected registers and mem (local/global), an ECC-protected path from the GPU <-> GDDR chips… also, if possible, info about the ECC codes: error detection capability (Fermi can detect 3-bit errors and recover from 1-bit errors for every xx bits…)
*Perhaps add some exception support (assuming no full C++ support as in CUDA 3.0) for managing/getting notified of irrecoverable errors (and where they occurred: in mem chips or registers) in kernel code… If that’s not possible in kernel code, at least finish the kernel and return this info to the host via some mechanism…

*Perhaps add some info about where atomics are implemented, so we know whether to expect high performance or not (say, whether they are handled in the L2/L3 caches (Fermi) or in the memory controllers or compute units (ALUs) (pre-Fermi))

Also, NVIDIA should implement some features that require no extension to the OpenCL API, since the API model already allows them… and let us query the device info for whether each one is available:

For example, using multiple command queues and events on hardware that supports it (see the sketch after this list):
*Concurrent mem transfers/kernel exec… CUDA 1.1 devices (G9x, GT200, Fermi) and AMD (?)
*Concurrent kernel execution… Fermi (also AMD on 5xxx)
*Concurrent H2D and D2H transfers… using Fermi's twin DMA engines.
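
All of this fits the existing API: just create more than one command queue on the same device and chain with events; whether copies and kernels actually overlap is up to the hardware/driver, which is why a query for it would be welcome. A rough sketch (ctx, dev, buffers and kernels assumed to exist):

    /* Sketch: two in-order queues on one device; the driver *may* overlap
       the chunk-1 upload with the chunk-0 kernel if the hardware supports
       concurrent copy + compute. */
    cl_int err;
    cl_command_queue q0 = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue q1 = clCreateCommandQueue(ctx, dev, 0, &err);

    cl_event up0, up1;
    clEnqueueWriteBuffer(q0, buf0, CL_FALSE, 0, chunk, src0, 0, NULL, &up0);
    clEnqueueWriteBuffer(q1, buf1, CL_FALSE, 0, chunk, src1, 0, NULL, &up1);

    /* Each kernel launch waits only on its own upload. */
    clEnqueueNDRangeKernel(q0, kern0, 1, NULL, &gsz, NULL, 1, &up0, NULL);
    clEnqueueNDRangeKernel(q1, kern1, 1, NULL, &gsz, NULL, 1, &up1, NULL);

    clFinish(q0);
    clFinish(q1);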

*Predication support (I have doubts?): equivalent to CMOV, avoiding use of the branching hardware. Basically, instead of branching, conditional code is handled by executing both paths and masking out the inactive one. (?)

I strongly agree with the C++ language support (at least the extensions already available in CUDA), especially function templates. It really bugs me to write a library of functions that should work on any floating-point type and have to copy-paste it as many times as there are types.
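
For what it's worth, the best workaround in OpenCL C today is the preprocessor, which is exactly that copy-paste, just automated; a sketch with a hypothetical axpy helper, to show why real templates would be nicer:

    /* Workaround sketch: "instantiate" one generic helper per type with
       the preprocessor, since OpenCL C has no templates. */
    #define DEFINE_AXPY(T)                                              \
        void axpy_##T(T a, __global const T *x, __global T *y, uint n)  \
        {                                                               \
            for (uint i = 0; i < n; ++i)                                \
                y[i] += a * x[i];                                       \
        }

    DEFINE_AXPY(float)   /* axpy_float  */
    DEFINE_AXPY(double)  /* axpy_double (needs cl_khr_fp64 enabled) */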

Exceptions (or some form of non-hanging errors) would also be nice, especially if a kernel is writing outside buffer boundaries.