My suggestions for OpenCL 2.0

I have published in my blog: http://bit.ly/YKa8sg
Just reposting here:
Hi,
I think it’s time to publish my OpenCL 2 requests so they can maybe get considered for inclusion:
I’m not requesting things that will hopefully be in it anyway, like the C++ extensions, etc…
Whether or not they plan to support existing GPUs will determine if some of these can be included; anyway, getting a cl_khr or cl_ext extension would be good…
But just before that, a good reminder of things that still remain to be implemented…

*Starting to see cl_ext_device_fission implemented on GPUs: it should be doable on AMD 7xxx GPUs, but on NVIDIA GK110 still not? Even better, the new AMD Sea Islands chips seem to support partitioning into up to 8 sub-GPUs…
*Implementing the new OCL 1.2 extensions, like the graphics ones for MSAA and depth access…

For OpenCL 2 it would be good to have:
*Atomic counters (cl_ext_atomic_counters_32) in core… they provide an order-of-magnitude improvement over global atomics, at least on old D3D11 HW (Fermi, AMD 5xxx series), and are the foundation of HW-accelerated queues.
*Kernels that can send interrupts to the CPU and/or initiate host system calls… that has seemed to be coming for a while (I think even the Fermi whitepaper suggested it) but it’s still not available… AMD SI supports SEND_MSG in its ISA, as Lottes suggests in his blog, so AMD should be able to do it too…
*Warp/wavefront vote functions: these have been in NVIDIA HW since the GTX 2xx (2008) and are useful, for example, in what is currently the best dynamic memory allocator for GPUs; see “Fast Dynamic Memory Allocator for Massively Parallel Architectures”, where the authors say:
“The used hardware must provide a voting function for an efficient implementation”, so it seems an OpenCL port will need that exposed…
*Dynamic parallelism: well, that should be expected now that GK110 is shipping, and it also seems SI could support some limited form of it, as shown in an AFDS session…
*Named barriers: well, these have been shipping in CUDA since the Fermi days and can be used for warp specialization, as in the CUDADMA project, which can bring better memory bandwidth exploitation in some apps; also, as shown in the HPP study, they can bring support for “true function composability”, i.e. GPU functions that use barriers can call other GPU functions that use barriers without breaking expected usage; see the HPP paper by Gaster et al.
*Cross-vendor multi-GPU support like CUDA’s P2P functionality: i.e. memory on one GPU addressable by another GPU directly from a kernel, without a previous copy (also present in cl_amd_bus_addressable in AMD’s OpenCL)
*Exposing some common intra-warp/wavefront ops? (like the existing NVIDIA shuffle… other ops like median and min/max could make sense too, and could be beneficial on platforms like Xeon Phi, though not on GPUs)
*Exposing some cross-vendor multimedia ISA extensions? i.e. a common subset of cl_amd_media_ops/cl_amd_media_ops2 and the PTX SIMD instructions… this could be good together with interop with video encoders and decoders for accelerated video processing, and NVIDIA even uses these ops in their fast ray tracing kernels…
*Finalize bringing parity with the existing compute exposure in graphics APIs like OGL 4.3/D3D11 compute shaders: as said, atomic counters were one thing…
->another being the new gather4 instructions…
->DispatchComputeIndirect: i.e. the ability to launch a kernel with the total workgroup count fetched from GPU memory… it’s more efficient for variable-work kernels that depend on work generated by a previous kernel, since we avoid a CPU round trip; but note this could also be done with the new dynamic parallelism, so perhaps it doesn’t need to be exposed…
->Promote into core MSAA and depth extensions
->MipMap support like in CUDA 5
->compressed texture format support
->a cross-vendor extension for bindless support (assuming it will get broad support in coming years)
->cross vendor ext for sparse texture/buffer support…

To finish up, also exposing advanced control of ld/st operations, such as cache modifiers and even using the texture path (as in GK110)…

Finally, it seems future GPUs could support unified register/local memory, so explicit control of the split sizes could be good for optimization. It also seems local memory could be allocated dynamically inside a kernel, via an extension to the barrier function’s argument, for better use of it; so an extension to the barrier operator could be good. Also, a scalar processor is present on recent architectures; although it may be intended for executing the common scalar code in a kernel (extracted by the compiler), it could also be exposed for direct programmability…

Coming later (not shortly?), in my view, with atomic counters possibly bringing very fast queues: exposing all the remaining graphics functionality in OpenCL kernels. As said above, the primary targets to expose are the rasterizer, the Z-buffer and ROP functionality…
*the most interesting for me is exposing the Z-buffer… the GPUDet paper shows a usage of it…
*exposing the rasterizer: what could that look like? Perhaps exposing, via a generalized dynamic parallelism, a function that takes a buffer of “geometry” to rasterize plus some kernel that would then be called over some specified grid size (8x8 tiles?) via dynamic parallelism… all in all it seems somewhat crazy…

More thoughts?