This is a continuation of the discussion that evolved here: Relaxing queue family index requirements on buffer/image barriers. · Issue #650 · KhronosGroup/Vulkan-Docs · GitHub
My apologies for the extended discussion on GitHub. I did not have a forum account, as I’m normally not vocal about issues like these.
There is, and has been, confusion about what I’m trying to accomplish and why.
Basic overview of requirements:
- The rendering engine must be able to recover from a “soft” device loss (e.g. resolution change) with the cooperation of the client application. Meaning: all objects are reconstructed, but the application must repopulate resources with data.
- Plug-ins must be able to interface with the rendering engine at the finest possible granularity (e.g. no one-size-fits-all “mesh” class).
- The rendering engine must be usable in two modes of operation: Free-threaded and staged.
- In free-threaded mode, entry points must not block client threads while a transaction completes on another thread, unless that operation is a Join. The only blocking permitted is the acquisition of whatever resource locks are required to assemble the transaction for submission.
- In staged mode, all resources are under the exclusive control of the rendering engine. No transactions may take place concurrently while a staged mode execution is completing. All entry points are allowed to block client threads until a staged mode execution completes.
- All image resources, with the exception of swapchain images, must be usable for any purpose, or it should be possible to configure the image up front at creation time.
- All buffer resources must be usable for any purpose (buffers are sub-allocated from arenas, so we don’t have a choice).
- The rendering engine may take advantage of an application-controlled thread pool for any long-running work, including command buffer submission. If provided a thread pool, the rendering engine must distribute work across as many threads as it has been configured to use (thread pool bindings come with task limits, etc.).
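The free-threaded blocking rule above can be sketched roughly as follows. This is a minimal illustration, not the engine's real code: `TaskEngine`, `Store`, and `Join` are hypothetical names standing in for actual entry points, and a plain counter stands in for the device-side work.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical resource with its own lock; names are illustrative only.
struct Resource {
    std::mutex lock;
    int generation = 0;
};

class TaskEngine {
public:
    TaskEngine() : worker_([this] { Run(); }) {}
    ~TaskEngine() {
        { std::lock_guard<std::mutex> g(qlock_); done_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    // Free-threaded entry point: blocks only long enough to take the resource
    // lock and enqueue the assembled transaction, never for its completion.
    void Store(Resource& r) {
        std::unique_lock<std::mutex> rl(r.lock);  // assemble under the lock
        auto work = [&r] { ++r.generation; };     // stand-in for the submission
        rl.unlock();
        { std::lock_guard<std::mutex> g(qlock_); queue_.push(work); }
        cv_.notify_one();                         // returns immediately
    }

    // Join: the one entry point allowed to wait for outstanding work.
    void Join() {
        std::unique_lock<std::mutex> g(qlock_);
        idle_.wait(g, [this] { return queue_.empty() && !busy_; });
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> work;
            {
                std::unique_lock<std::mutex> g(qlock_);
                cv_.wait(g, [this] { return done_ || !queue_.empty(); });
                if (done_ && queue_.empty()) return;
                work = std::move(queue_.front());
                queue_.pop();
                busy_ = true;  // set under the same lock as the pop
            }
            work();
            { std::lock_guard<std::mutex> g(qlock_); busy_ = false; }
            idle_.notify_all();
        }
    }

    std::mutex qlock_;
    std::condition_variable cv_, idle_;
    std::queue<std::function<void()>> queue_;
    bool done_ = false, busy_ = false;
    std::thread worker_;  // declared last so all state exists before Run starts
};
```

In staged mode the same entry points would simply hold the client thread until the staged execution completes; only the free-threaded path needs the enqueue-and-return shape.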
Some implementation details:
• A task is one of two things: A transaction between the host and device involving a resource, or a compiled display list.
• A display list is a sequence of scoped operations which include renderpass invocations, a series of bare compute program invocations, and device-local resource transfers. When compiled, a display list will automatically detect dependencies between its constituent scopes, and may re-order them to minimize the number of queue submissions. This includes determining where to insert resource and memory barriers. All internal command sequences and scope barriers are written into secondary command buffers, where only those affected by an external change of any resource are re-compiled prior to submission.
• A resource may only be accessed by the host if it is mapped. A resource must be “Fetched” prior to reading any of its contents on the host, and “Stored” if any changes made by the host need to be reflected on the device. This may involve only flushing part of a memory mapping, or a host<->device copy, which requires a set of queues.
• A “Fetch” or a “Store” is a transaction task (the API exposes two functions, Fetch and Store, which pull from various pools to assemble a transaction). The internal transaction object consists of a workspace containing multiple buffers of frequently needed API structures and, upon submission, a set of queue controls. Each queue control is endowed with VkCommandBuffers and a VkCommandPool for its family, but only if the task requests them: display lists carry their own command buffers, and only request additional ones from the task engine when resources need to undergo ownership/layout transitions prior to invocation. A task is given exactly one queue from every family it has requested.
• All tasks are guaranteed exclusive access to every VkCommandBuffer, VkCommandPool, and VkQueue they have requested, along with all related transients, for the duration of their onPush calls.
• The precise manner in which a task uses the queues it has requested is entirely task-specific. For example, a display list will simply iterate over all command buffers destined for each queue in an optimal order respecting dependencies between scopes.
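The dependency detection and reordering pass described above could be sketched like this. All names (`Scope`, `Dependencies`, `Order`) are hypothetical, and resources are reduced to integer ids; the real compiler also emits barriers, which this sketch omits:

```cpp
#include <cassert>
#include <set>
#include <vector>

// A display list scope, reduced to what the scheduler needs:
// its target queue family and the resources it reads and writes.
struct Scope {
    int queue;            // queue family this scope must run on
    std::set<int> reads;  // resource ids read by the scope
    std::set<int> writes; // resource ids written by the scope
};

// A scope depends on every earlier scope whose writes intersect its
// reads or writes (read-after-write / write-after-write hazards).
std::vector<std::set<int>> Dependencies(const std::vector<Scope>& s) {
    std::vector<std::set<int>> deps(s.size());
    for (int j = 0; j < (int)s.size(); ++j)
        for (int i = 0; i < j; ++i)
            for (int r : s[i].writes)
                if (s[j].reads.count(r) || s[j].writes.count(r))
                    deps[j].insert(i);
    return deps;
}

// Greedy topological order that prefers scopes on the queue of the scope
// just emitted, so runs on one queue coalesce into fewer submissions.
std::vector<int> Order(const std::vector<Scope>& s) {
    auto deps = Dependencies(s);
    std::vector<int> out;
    std::set<int> emitted;
    int lastQueue = -1;
    while ((int)out.size() < (int)s.size()) {
        int pick = -1;
        for (int j = 0; j < (int)s.size(); ++j) {
            if (emitted.count(j)) continue;
            bool ready = true;
            for (int d : deps[j])
                if (!emitted.count(d)) { ready = false; break; }
            if (!ready) continue;
            if (pick < 0 || s[j].queue == lastQueue) pick = j;
            if (s[pick].queue == lastQueue) break;  // best possible match
        }
        emitted.insert(pick);
        out.push_back(pick);
        lastQueue = s[pick].queue;
    }
    return out;
}
```

For example, three scopes on queues 0, 1, 0 where the second reads what the first wrote come out in the order 0, 2, 1: the two queue-0 scopes coalesce into a single submission instead of straddling the queue-1 scope.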
Problems:
• VkXXX structures are “rigid” in the sense that we can’t simply embed them in a larger structure and hand the API an array of those instead. For example, with VkXBarrier, the display list scheduler needs to track pipeline stage masks, object references, and various other pieces of dependency information in addition to the barrier data itself, and keeping those in a parallel array causes noticeable cache pollution.
• Pursuant to requirement #1, we need to keep a considerable number of VkXXX initialization structures around, which present similar difficulties to barrier structures when, for example, performing a validation pass on a display list.
• This goes all the way back to the first appearance of indexed drawing. When dealing with dynamic geometry (e.g. an adaptive mesh), we need to have more information in an element than just indices, and parallel arrays are a source of heap fragmentation and cache-related performance issues. Consequently, a very large amount of index data needs to be duplicated in a very cache-unfriendly way (3 shorts or ints do not line up nicely with cachelines). Having a stride on an indexed draw operation would have solved this problem.
• VkFences are halfway useful. We need a way to signal a fence from the host, or some kind of completion port or epoll handle to manage concurrent task completion. The fence-then-select paradigm currently leads to a situation where a task pop thread ends up waiting on a batch of fences obtained from long-running tasks and fails to respond to the completion of a short-running task. This is currently handled by having one pop thread per priority level (there are currently 8). The amount of work involved in a task pop is minuscule, and having multiple threads for this purpose is wasteful.
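The parallel-array problem from the first bullet can be made concrete with a small sketch. `ApiBarrier` here is a stand-in for a rigid API struct such as VkImageMemoryBarrier, and `BarrierNode`/`GatherForSubmit` are hypothetical scheduler-side names; the point is only the forced copy at the end:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for a rigid API barrier struct (VkImageMemoryBarrier in the real
// API); the API consumes a tightly packed array of these and nothing else.
struct ApiBarrier {
    uint32_t srcAccess, dstAccess;
    uint64_t image;
};

// What the scheduler wants: barrier data interleaved with its dependency
// metadata in one node, so one pass touches one cacheline per barrier.
struct BarrierNode {
    ApiBarrier barrier;
    uint32_t srcStageMask, dstStageMask;  // scheduler-side dependency info
    int scopeIndex;                       // scope that owns this barrier
};

// Because the API accepts only a contiguous ApiBarrier[], every submission
// must first gather the embedded barriers into a scratch array.
std::vector<ApiBarrier> GatherForSubmit(const std::vector<BarrierNode>& nodes) {
    std::vector<ApiBarrier> scratch;
    scratch.reserve(nodes.size());
    for (const BarrierNode& n : nodes)
        scratch.push_back(n.barrier);  // the copy a stride parameter would avoid
    return scratch;
}
```

If barrier-consuming entry points took a stride (here it would be `sizeof(BarrierNode)`), the API could walk the embedded barriers in place and both the scratch copy and the parallel-array layout would disappear; the same reasoning applies to the strided indexed draw mentioned above.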
Concerns:
• I feel there is too much focus on interactive media. Normally I’m not vocal about the concerns of other industries, but we’re looking at a future where the hardware we have to work with does not meet our requirements, in such a way that we must either paper over the inconsistencies with grossly inefficient patterns or rely on unpredictable libraries. I would like to have seen “graphics on compute” instead of “compute shoehorned into graphics”. There is an enormous application domain awaiting a tetrahedron voxelizer that can work directly with index quadruplets; rasterization just doesn’t cut it here.