Confused about image memory barriers

Fishbiter · March 29, 2016, 8:48am

My experience of Vulkan has been pretty good so far, but I’ve run into an issue I’ve failed to resolve after quite a few hours…

I’m trying to generate mipmaps for a cubemap colour attachment, but some of the faces end up with bad data. Mipmap generation is working in other places (including another cube map target!) so I think my approach is generally valid, but I guess the mipmap generation is happening before the render job is finished.

This seems like a classic place where I need a memory barrier. But I feel like I’ve tried every possible version, so I must be missing something… I’m not posting code here because (a) there’s quite a bit of it and (b) I’ve tried so many variations, it feels like the bug is on this side of the screen. I’m looking for a sanity check, basically.

The flow is:

Render to attachment
Change attachment format to TRANSFER_SRC_OPTIMAL using image memory barrier (this should wait for rendering to the attachment to finish?)
Change texture format to TRANSFER_DST_OPTIMAL using image memory barrier (this should wait for rendering using the texture to finish?)
Copy attachment to texture
Change texture format to TRANSFER_GENERAL using image memory barrier (this should wait for copy to finish?)
Blit from level 0 to the other levels
Change texture format to SHADER_READ_ONLY using image memory barrier (this should wait for blits to finish?)
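
For reference, the blit step in the flow above walks a mip chain, and each barrier has to name the mip levels and array layers it applies to. The following is only a minimal C sketch of that arithmetic, not code from this thread: the function names are made up for illustration, and it assumes a full mip chain on a cubemap with 6 layers.

```c
#include <assert.h>
#include <stdint.h>

/* Number of levels in a full mip chain: 1 + floor(log2(max(width, height))). */
uint32_t mip_level_count(uint32_t width, uint32_t height)
{
    assert(width > 0 && height > 0);
    uint32_t largest = width > height ? width : height;
    uint32_t levels = 1;
    while (largest > 1) {
        largest >>= 1;
        ++levels;
    }
    return levels;
}

/* Size of one dimension at a given level: halved per level, clamped to 1. */
uint32_t mip_extent(uint32_t base, uint32_t level)
{
    assert(level < 32);
    uint32_t extent = base >> level;
    return extent > 0 ? extent : 1;
}

/* Subresources to account for on a cubemap: 6 layers times all mip levels. */
uint32_t cube_subresource_count(uint32_t width, uint32_t height)
{
    return 6u * mip_level_count(width, height);
}
```

For a 512x512 cube face this gives 10 levels and 60 subresources. A barrier whose subresource range names only one level and one layer leaves the other 59 in whatever layout they had before.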

Also… In vkCmdPipelineBarrier there are essentially 3 parameters on the destination and source side: The pipeline command stage, the image formats and the access flags. I’m confused about the distinction between these three inputs. The image format seems to strongly imply what kind of tasks/access could be outstanding on the image (e.g. if it’s in TRANSFER_DST_OPTIMAL, it’s probably being used as the target of a STAGE_TRANSFER command, which needs TRANSFER_WRITE access). My code reflects this assumption (and example code I’ve seen does too) but it’s a redundancy in the API I don’t understand the motivation for, which makes me nervous. (I have tried just setting all the flags to see if it resolves the issue, and it didn’t!)

Or maybe I’ve just missed the point of the image memory barrier and I need some other synchronization primitive?

Any insight much appreciated!

krOoze · March 29, 2016, 8:31pm

Not really confident with the new synchronization elements myself. I recommend studying the texture sample from Sascha Willems’ GitHub for the basics.

Do you use the validation layers? Very useful things these…

Last time I had a problem like this, I forgot to post my CmdBuffer to the Queue :rolleyes:

The flow seems reasonable. It’s not “format”, it’s layout. The texture’s transition to TRANSFER_DST_OPTIMAL doesn’t have to “wait” for the rendering (the texture is not its target at all). It just has to be “before” the copy.

Not sure it is healthy to think about barriers this way. Better think of them as kind of dependency between two sets (yes, sets, not sequences) of commands. The five important parameters are:

position, where you record the barrier (splits to the two sets)
pipeline stage src and dst (no dst stage of anything in the second set can happen before the src stage of anything in the first set, and vice versa; if unsure, use src=ALL_COMMANDS and dst=TOP)
layout conversion - simply supply what was before and what you need now (or src=VK_IMAGE_LAYOUT_UNDEFINED if you want/don’t mind throwing away data)
access, which mostly mirrors the layout. It defines the kind of memory dependency (read-write, write-read…) on top of the stage dependency. Use the thing that matches the layout and specify read or write, or 0 (for VK_IMAGE_LAYOUT_UNDEFINED and when you do not care). You can try ACCESS_MEMORY_READ/WRITE instead, which should be stricter on what happens.
the queue family transfer - probably don’t want that so VK_QUEUE_FAMILY_IGNORED

Also:
Do you do all of that in a single CommandBuffer? Try spreading it over multiple ones and putting vkDeviceWaitIdle in between submissions to pinpoint the problem (and confirm it’s the synchronization). Do you change the layout for all subresources of the image (all mip levels and all 6 sides = layers)?

Fishbiter · March 30, 2016, 11:54am

Good idea - stalling the device between every mip blit convinced me it wasn’t a synchronization problem. I’m already running with clean validation.

I’ve figured out what’s going wrong, but I’m no nearer to knowing why. It looks like the image blit and the sampler disagree about the layout of mipmaps in memory. The “garbage” is actually deterministic - at mip level 0 all image layers are correct. At mip level 1, only image 0 is correct, but image 1 contains the mip tail of image 0, and a bit of image 1. It looks like one expects the mips to be interleaved and one expects them to be sequential. There’s some stuff about this in the spec, but it looks like it’s only relevant for sparse arrays, and none of the layouts look correct.

Unfortunately I’ve run out of time to look into it for now, but thanks anyway.

ScottD · March 31, 2016, 12:11pm

I have a similar issue and I am at the point of blaming the AMD drivers and/or the spec :razz:

I have 2 command buffers which each display an object. The app will show either object 1 or 2 or both. The render thread will actually just execute command buffer 1 and/or 2. Now of course the RenderPass must state the layout, so I tried to make it _GENERAL and it worked/works perfectly (after a lot of trial and error). Then I wanted to get sync done right and now try to use the _OPTIMAL layout, and that’s when problems started.

Just 2 examples:

The images are in layout UNDEFINED for the first run (only). So I need to transition the layout to _OPTIMAL. That’s what a 3rd command buffer shall do. Sounds like a trivial task… but it is not. What does the spec say:

“A pipeline barrier inserts an execution dependency and a set of memory dependencies between a set of commands earlier in the command buffer and a set of commands later in the command buffer.”

“VK_ACCESS_MEMORY_READ_BIT indicates that the access is a read via a non-specific unit attached to the memory. This unit may be external to the Vulkan device or otherwise not part of the core Vulkan pipeline. When included in dstAccessMask, all writes using access types in srcAccessMask performed by pipeline stages in srcStageMask must be visible in memory.”

“The VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT is useful for accomplishing memory barriers and layout transitions when the next accesses will be done in a different queue or by a presentation engine”

Yes, that’s confusing, not to say contradictory. But it should… or may work… Let’s see what happens:

t{0} vkBeginCommandBuffer(commandBuffer = 00000000052C20F0, pBeginInfo = 000000000020ECA8) = VK_SUCCESS
pBeginInfo (000000000020ECA8)
sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO
pNext = 0000000000000000
flags = 1
pInheritanceInfo = 0000000000000000

t{0} vkCmdPipelineBarrier(commandBuffer = 00000000052C20F0, srcStageMask = 8192, dstStageMask = 8192, dependencyFlags = 0, memoryBarrierCount = 0, pMemoryBarriers = 0000000000000000, bufferMemoryBarrierCount = 1, pBufferMemoryBarriers = 000000000020EF38, imageMemoryBarrierCount = 1, pImageMemoryBarriers = 000000000020EED0)
pImageMemoryBarriers[0] (000000000020EED0)
sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER
pNext = 0000000000000000
srcAccessMask = 0
dstAccessMask = 34304
oldLayout = VK_IMAGE_LAYOUT_UNDEFINED
newLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
srcQueueFamilyIndex = 4294967295
dstQueueFamilyIndex = 4294967295
image = 00000000054A3350
subresourceRange = 000000000020EF00
subresourceRange (000000000020EF00)
aspectMask = 1
baseMipLevel = 0
levelCount = 1
baseArrayLayer = 0
layerCount = 1
[…]

t{0} vkEndCommandBuffer(commandBuffer = 00000000052C20F0) = VK_SUCCESS
t{0} vkQueueSubmit(queue = 00000000051313F8, submitCount = 1, pSubmits = 000000000020EF90, fence = 00000000054A3550) = VK_SUCCESS
pSubmits[0] (000000000020EF90)
sType = VK_STRUCTURE_TYPE_SUBMIT_INFO
pNext = 0000000000000000
waitSemaphoreCount = 0
pWaitSemaphores = 0000000000000000
pWaitDstStageMask = 0000000000000000
commandBufferCount = 1
pCommandBuffers = 00000000054B31C0
signalSemaphoreCount = 0
pSignalSemaphores = 0000000000000000
pCommandBuffers[0].handle = 00000000052C20F0

t{0} vkDeviceWaitIdle(device = 0000000005130A10) = VK_SUCCESS
t{0} vkWaitForFences(device = 0000000005130A10, fenceCount = 1, pFences = 00000000054B3200, waitAll = 0, timeout = 18446744073709551615) = VK_SUCCESS
pFences[0] (00000000054B3200)
00000000054A3550
t{0} vkResetFences(device = 0000000005130A10, fenceCount = 1, pFences = 00000000054B3200) = VK_SUCCESS
pFences[0] (00000000054B3200)
00000000054A3550

AND NOW…

[…] draw etc. pp

ERROR: Cannot submit cmd buffer using image with layout VK_IMAGE_LAYOUT_UNDEFINED when first use is VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL.
ERROR: Cannot submit cmd buffer using image with layout VK_IMAGE_LAYOUT_UNDEFINED when first use is VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL.
t{1} vkQueueSubmit(queue = 00000000051313F8, submitCount = 1, pSubmits = 000000000708E730, fence = 00000000054A3550) = VK_SUCCESS

Voila… Nothing happens. So much for changing the layout. It’s undefined for the 1st object only, btw. So the layout is transitioned somewhere, somehow; I don’t know where. Is this correct behaviour or not? I do not know. Reading the spec, it could be… It just doesn’t produce the result I would like to see.

Even more confusing to me was this:

t{1} vkCmdEndRenderPass(commandBuffer = 00000000052C20F0)
t{1} vkCmdPipelineBarrier(commandBuffer = 00000000052C20F0, srcStageMask = 65536, dstStageMask = 65536, dependencyFlags = 0, memoryBarrierCount = 0, pMemoryBarriers = 0000000000000000, bufferMemoryBarrierCount = 0, pBufferMemoryBarriers = 0000000000000000, imageMemoryBarrierCount = 2, pImageMemoryBarriers = 000000000708E180)
pImageMemoryBarriers[0] (000000000708E180)
sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER
pNext = 0000000000000000
srcAccessMask = 256
dstAccessMask = 32768
oldLayout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL
newLayout = VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
srcQueueFamilyIndex = 4294967295
dstQueueFamilyIndex = 4294967295
image = 00000000054A3310
subresourceRange = 000000000708E1B0
subresourceRange (000000000708E1B0)
aspectMask = 1
baseMipLevel = 0
levelCount = 1
baseArrayLayer = 0
layerCount = 1

pImageMemoryBarriers[1] (000000000708E1C8 )
sType = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER
pNext = 0000000000000000
srcAccessMask = 1536
dstAccessMask = 2048
oldLayout = VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL
newLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL
srcQueueFamilyIndex = 4294967295
dstQueueFamilyIndex = 4294967295
image = 00000000054A3350
subresourceRange = 000000000708E1F8
subresourceRange (000000000708E1F8)
aspectMask = 1
baseMipLevel = 0
levelCount = 1
baseArrayLayer = 0
layerCount = 1
(its actually: )
{
VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT|VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT, //VkAccessFlags srcAccessMask;
VK_ACCESS_TRANSFER_READ_BIT, //VkAccessFlags dstAccessMask;
VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL, //VkImageLayout oldLayout;
VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL, //VkImageLayout newLayout;
VK_QUEUE_FAMILY_IGNORED, //uint32_t srcQueueFamilyIndex;
VK_QUEUE_FAMILY_IGNORED, //uint32_t dstQueueFamilyIndex;
}
with VK_PIPELINE_STAGE_ALL_COMMANDS_BIT

ERROR: Cannot copy from an image whose source layout is VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL and doesn’t match the current layout VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL.
t{1} vkCmdCopyImageToBuffer(commandBuffer = 00000000052C20F0, srcImage = 00000000054A3350, srcImageLayout = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL, dstBuffer = 00000000054A3380, regionCount = 1, pRegions = 000000000708E6D8 )
pRegions[0] (000000000708E6D8 )
bufferOffset = 0
bufferRowLength = 800
bufferImageHeight = 600
imageSubresource = 000000000708E6E8
imageOffset = 000000000708E6F8
imageExtent = 000000000708E704
imageExtent (000000000708E704)
width = 800
height = 600
depth = 1
imageOffset (000000000708E6F8)
x = 0
y = 0
z = 0
imageSubresource (000000000708E6E8)
aspectMask = 2
mipLevel = 0
baseArrayLayer = 0
layerCount = 1

t{1} vkEndCommandBuffer(commandBuffer = 00000000052C20F0) = VK_SUCCESS

Less error checking emphasizes the need for correct source code. Reading the spec, I just can’t see what’s wrong with the source. The result is wrong, though.

Switching the source layout to VK_IMAGE_LAYOUT_GENERAL works, while VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL does not. I’m not too happy with the solution, but for now it works. Maybe that can help with your issue as well.

krOoze · March 31, 2016, 4:42pm

BTW, there’s new non-beta fully-conformant AMD driver 16.3.2. (Also new 1.0.5.0 LunarG SDK for some time)

The images are in layout UNDEFINED for the first run (only). So I need to transition the layout to _OPTIMAL. That’s what a 3rd command buffer shall do.

Not necessarily. Two cmdBuffers usually suffice. Src=UNDEFINED also means “whatever the layout was before (trash data)”. Very useful (e.g. for the default framebuffer images), and likely the faster way.

“The VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT…”

There’s a little bit more to read than that (ch 6.4).

BTW2: You can disable smileys and/or embed your code/raw text in a CODE block to not mess it up. Also perhaps show the code (it’s hard to translate the enums by eye; lazy-ass layer not even translating that for poor humans…)

Suspicious:

srcStageMask = 8192, dstStageMask = 8192

BOTTOM; BOTTOM. Usually not what one wants.

dstAccessMask = 34304

Depth attachment read+write + MEMORY_READ? That looks excessive.

aspectMask = 1

COLOR_ASPECT. Should be DEPTH? Or was it meant to be Color all along? In that case the VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL was wrong.

vkDeviceWaitIdle;
vkWaitForFences;

That’s paranoiaaa!

srcStageMask = 65536, dstStageMask = 65536

ALL; ALL. excessive, but should work (I think)

srcAccessMask = 1536
dstAccessMask = 2048

READ in the src? Why a read-read dependency?

aspectMask = 1

COLOR. Definitely should be DEPTH this time.
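
Since the API dump prints the masks as plain numbers, the decoding above can be double-checked mechanically. This is only a rough C sketch: the tables copy the bit values of VkPipelineStageFlagBits, VkAccessFlagBits and VkImageAspectFlagBits from the Vulkan 1.0 headers into local arrays with shortened names, and the helper names (decode_flags and friends) are made up for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct flag_name { uint32_t bit; const char *name; };
#define COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* Shortened names for VK_PIPELINE_STAGE_*_BIT (Vulkan 1.0 values). */
static const struct flag_name stage_names[] = {
    {0x1, "TOP_OF_PIPE"}, {0x2, "DRAW_INDIRECT"}, {0x4, "VERTEX_INPUT"},
    {0x8, "VERTEX_SHADER"}, {0x10, "TESSELLATION_CONTROL_SHADER"},
    {0x20, "TESSELLATION_EVALUATION_SHADER"}, {0x40, "GEOMETRY_SHADER"},
    {0x80, "FRAGMENT_SHADER"}, {0x100, "EARLY_FRAGMENT_TESTS"},
    {0x200, "LATE_FRAGMENT_TESTS"}, {0x400, "COLOR_ATTACHMENT_OUTPUT"},
    {0x800, "COMPUTE_SHADER"}, {0x1000, "TRANSFER"},
    {0x2000, "BOTTOM_OF_PIPE"}, {0x4000, "HOST"},
    {0x8000, "ALL_GRAPHICS"}, {0x10000, "ALL_COMMANDS"},
};

/* Shortened names for VK_ACCESS_*_BIT (Vulkan 1.0 values). */
static const struct flag_name access_names[] = {
    {0x1, "INDIRECT_COMMAND_READ"}, {0x2, "INDEX_READ"},
    {0x4, "VERTEX_ATTRIBUTE_READ"}, {0x8, "UNIFORM_READ"},
    {0x10, "INPUT_ATTACHMENT_READ"}, {0x20, "SHADER_READ"},
    {0x40, "SHADER_WRITE"}, {0x80, "COLOR_ATTACHMENT_READ"},
    {0x100, "COLOR_ATTACHMENT_WRITE"},
    {0x200, "DEPTH_STENCIL_ATTACHMENT_READ"},
    {0x400, "DEPTH_STENCIL_ATTACHMENT_WRITE"},
    {0x800, "TRANSFER_READ"}, {0x1000, "TRANSFER_WRITE"},
    {0x2000, "HOST_READ"}, {0x4000, "HOST_WRITE"},
    {0x8000, "MEMORY_READ"}, {0x10000, "MEMORY_WRITE"},
};

/* Shortened names for VK_IMAGE_ASPECT_*_BIT (Vulkan 1.0 values). */
static const struct flag_name aspect_names[] = {
    {0x1, "COLOR"}, {0x2, "DEPTH"}, {0x4, "STENCIL"}, {0x8, "METADATA"},
};

/* Writes "A|B|C" into out. Bits missing from the table are appended as hex,
   and an empty mask is written as "0x0". */
const char *decode_flags(const struct flag_name *table, size_t count,
                         uint32_t mask, char *out, size_t out_size)
{
    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < count; ++i) {
        if ((mask & table[i].bit) == 0)
            continue;
        size_t used = strlen(out);
        snprintf(out + used, out_size - used, "%s%s",
                 used ? "|" : "", table[i].name);
        mask &= ~table[i].bit;
    }
    size_t used = strlen(out);
    if (mask != 0 || used == 0)
        snprintf(out + used, out_size - used, "%s0x%x",
                 used ? "|" : "", (unsigned)mask);
    return out;
}

const char *decode_stage(uint32_t mask, char *out, size_t out_size)
{
    return decode_flags(stage_names, COUNT_OF(stage_names), mask, out, out_size);
}

const char *decode_access(uint32_t mask, char *out, size_t out_size)
{
    return decode_flags(access_names, COUNT_OF(access_names), mask, out, out_size);
}

const char *decode_aspect(uint32_t mask, char *out, size_t out_size)
{
    return decode_flags(aspect_names, COUNT_OF(aspect_names), mask, out, out_size);
}
```

With a 128-byte buffer, decode_access(34304, …) gives DEPTH_STENCIL_ATTACHMENT_READ|DEPTH_STENCIL_ATTACHMENT_WRITE|MEMORY_READ and decode_stage(8192, …) gives BOTTOM_OF_PIPE, which agrees with the reading above. It also shows that the barrier uses aspectMask = 1 (COLOR) while the later copy uses aspectMask = 2 (DEPTH).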

ScottD · April 1, 2016, 9:57am

Yes, I know, and I’m using both versions.

I know. Additional paranoia was added when the thing did not work. Also the MEMORY and BOTTOM flags, since those look the most suitable to force an “immediate” layout change. Looking at the spec, I am still not sure if this is supposed to work, as I am clearly not inserting a dependency between commands in the command buffer. However…

Thanks for pointing that out. You are absolutely correct, the aspectMask was wrong. I tried re-using the VkImageSubresourceRange from the presentation part… bad idea.

Changing the aspect mask to DEPTH|STENCIL did change things. I also switched back from BOTTOM to ALL_COMMANDS.
Result: No errors from the debug callbacks and the result is correct. Both visually and depth values.

After those changes, skipping the initial command buffer (layout remains undefined) and executing the drawing command buffers, produces correct results now.
BUT I see all the error messages posted above for each layout change and the copy command. The spec states explicitly (in 11.4) “For either of these initial layouts, any subresources must be transitioned to another layout before they are accessed by the device.” so I don’t think one is supposed to skip the layout change.

krOoze · April 1, 2016, 11:01am

Glad to hear, things mostly work out.

“For either of these initial layouts, any subresources must be transitioned to another layout before they are accessed by the device.” so I don’t think one is supposed to skip the layout change.

That’s a misunderstanding - I didn’t mean skipping the layout transition. Just the redundant initial layout change command buffer (and its submission).

E.g. imagine the basic hello triangle example. I would use a single command buffer.
It would start with a barrier with a layout transition from UNDEFINED to COLOR_ATTACHMENT.
And it would end with a barrier with a layout transition from COLOR_ATTACHMENT to PRESENT.

Effect: On the first frame it would transition the new Image from UNDEFINED to defined. On subsequent frames it would transition from PRESENT (while throwing the data of the old frame I do not need).

Generally speaking, things should even be faster (in fact, Spec quote: “This may allow a more efficient transition, since the data may be discarded.”). So I use it whenever I intend to completely (re)write the resource content.

ScottD · April 2, 2016, 2:11am

Ah okay. The reason why I used a 3rd command buffer in this scenario is that the other 2 buffers are just recorded once. Everything dynamic is done via uniforms etc. So in fact the 3rd command buffer is not just executed once for the very first frame.

This is more of an experiment to see if/how it works with Vulkan and if command buffer re-use is maintainable. I am not quite sure what the cost of command buffer recording is; however, the downside of this approach seems to be that the layout must be stable and the render passes also must not clear the color/depth output. So you have to clear (and eventually switch layout) beforehand, and that obviously requires an additional command buffer which also (if I understood correctly) needs to be protected by a barrier AND an event/semaphore/fence.
I think so because of “In Vulkan, there are four forms of concurrency during execution: between the host and device (fence), between the queues (semaphore), between queue submissions (event), and between commands within a command buffer (pipeline barrier).”