Oddball memory setups.

I’ve been spending some time with the Vulkan database registry, comparing the various memory types used by different vendors. But there are some setups that I can’t wrap my head around.

Zero memory flags

Memory types have flags, defined by VkMemoryPropertyFlagBits. For each memory type, the propertyFlags field describing it must match one of the flag combinations that the specification allows.

One of these sets is 0. My question is… why? What does such a memory type actually mean and when would you allocate through it?

Such memory is not DEVICE_LOCAL, so allocating through it won’t achieve the best performance. Such memory is not HOST_VISIBLE, so mapping it is not possible. This means you have to treat it as though it were DEVICE_LOCAL in terms of access (i.e. you have to use staging and transfers), but you don’t get the performance out of it.
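For reference, this is roughly how such a type shows up when you enumerate them (a quick sketch; device creation and error handling omitted):

[code]
#include <vulkan/vulkan.h>
#include <stdio.h>

/* List every memory type a physical device exposes, along with its heap
 * and property flags. A type whose propertyFlags is 0 is the oddball
 * case in question. Assumes `phys` came from vkEnumeratePhysicalDevices. */
void list_memory_types(VkPhysicalDevice phys)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(phys, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        const VkMemoryType *t = &props.memoryTypes[i];
        printf("type %u: heap %u, flags 0x%x%s\n",
               i, t->heapIndex, t->propertyFlags,
               t->propertyFlags == 0 ? "  <- no flags at all" : "");
    }
}
[/code]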

Several pieces of NVIDIA hardware, using recent drivers, expose memory types that have no flags set. In these cases, the un-flagged memory types all are associated with the CPU-memory heap.

That last part suggests a purpose. It could be for streaming or times when there is contention. But this would only be reasonable if the non-HOST_VISIBLE types were in some way faster for the GPU to access or copy from; otherwise, you’d just make them HOST_VISIBLE and be done with it.

Anybody got any insight into why implementers expose such memory types?

Double-device local

AMD hardware presents another interesting memory setup. Lots of AMD hardware has two device-local memory heaps.

It seems like they carve 256MB out of their GPU memory and set it aside in a special memory pool. This pool is accessible through a memory type that is DEVICE_LOCAL and HOST_VISIBLE (and HOST_COHERENT).

The thing is, I’m not sure what you would use that for. Even though it’s directly visible, the fact that it’s DEVICE_LOCAL probably means that such accesses are not fast for the CPU. So staging would probably best be done using the CPU memory pool instead of this carved-out piece of GPU memory.

What’s the use case for this buffer? Images/buffers you frequently access on the CPU and GPU? Is this intended for things like UBOs, so that they can be in faster memory without you having to DMA data?

The “no flags” set could be for overflow. Where if the device_local memory starts to fill up you “page out” some memory to the “no flags” for a bit.

The AMD host-visible & device-local type is fast mapped memory that is ideal for accessing volatile state from the HOST. I think I saw in one of their presentations that they also use that memory for pushConstants.
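Roughly the selection pattern I have in mind (just a sketch; the helper below is mine, not something from AMD, and I’m assuming the type also carries the COHERENT bit as the reported ones do):

[code]
#include <vulkan/vulkan.h>
#include <stdint.h>

/* Pick the first memory type that is allowed by typeBits (from
 * VkMemoryRequirements::memoryTypeBits) and has all of the wanted
 * property flags. Returns UINT32_MAX if nothing matches. */
static uint32_t find_memory_type(const VkPhysicalDeviceMemoryProperties *props,
                                 uint32_t typeBits,
                                 VkMemoryPropertyFlags wanted)
{
    for (uint32_t i = 0; i < props->memoryTypeCount; ++i) {
        if ((typeBits & (1u << i)) &&
            (props->memoryTypes[i].propertyFlags & wanted) == wanted)
            return i;
    }
    return UINT32_MAX;
}

/* For per-frame constants you would ask for the small DEVICE_LOCAL pool
 * that is also host-visible, map it once, and keep writing into it every
 * frame with no staging copy:
 *
 *   uint32_t idx = find_memory_type(&memProps, req.memoryTypeBits,
 *       VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |
 *       VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
 *       VK_MEMORY_PROPERTY_HOST_COHERENT_BIT);
 */
[/code]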

0:
Spill-over memory? Something like a pagefile.sys on Windows?
Besides, it’s usually the same memory as the mappable one, except you hint to Vulkan that you won’t need the mapping feature. So why forcibly forbid that type if it’s possible to provide it?

mappable, device-local:
Maybe it’s a hint from them: “we can do a better job than you” at moving memory between host and device. I would use it for staging or data streaming and avoid one explicit copy that way.

EDIT: darn, your ninja skills Ratchet… :slight_smile:

The “no flags” set could be for overflow. Where if the device_local memory starts to fill up you “page out” some memory to the “no flags” for a bit.

Right, but why would you need a memory type for that? Sure, you’re not going to actually access it from the host either way, but it seems odd for the implementation to offer a specific memory type for such uses. Not unless there is a performance difference between memory allocated this way and host-visible memory.

The AMD host visible & device local is fast mapped memory that is ideal for accessing volatile state from HOST

So things like streaming vertex data and so forth.

So why forcibly forbid that type if it’s possible to provide it?

I’m not suggesting that it should be forbidden. I was asking what it semantically means or represents. Basically, I’m asking why someone would want to allocate memory that’s not fast to access by anybody.

[QUOTE=Alfonse Reinheart;40333]Right, but why would you need a memory type for that? Sure, you’re not going to actually access it from the host either way, but it seems odd for the implementation to offer a specific memory type for such uses. Not unless there is a performance difference between memory allocated this way and host-visible memory.
[/QUOTE]
Perhaps it’s a memory module on the GPU that’s not wired optimally into the main memory, and only the DMA engine has fast access to it.

I was under the impression it was more for low-latency per-frame data. Like view matrix and model transforms.

The vertex/texture data (which can tolerate higher latency) would go through a host staging buffer, with the transfer-only queue doing the copy so the DMA engine does the work.

On AMD’s memory model: http://developer.amd.com/tools-and-sdks/opencl-zone/amd-accelerated-parallel-processing-app-sdk/opencl-optimization-guide/#50401315_92101
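The staging path I mean, as a rough sketch (buffer creation, submission and synchronization omitted; the function name is just for illustration):

[code]
#include <vulkan/vulkan.h>

/* Record a copy from a HOST_VISIBLE staging buffer into a DEVICE_LOCAL
 * buffer. `cmd` is a command buffer allocated from a pool created for
 * the transfer-only queue family, so the dedicated DMA engine performs
 * the copy. */
void record_staged_upload(VkCommandBuffer cmd,
                          VkBuffer staging, VkBuffer deviceLocal,
                          VkDeviceSize size)
{
    VkCommandBufferBeginInfo begin = {
        .sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO,
        .flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT,
    };
    vkBeginCommandBuffer(cmd, &begin);

    VkBufferCopy region = { .srcOffset = 0, .dstOffset = 0, .size = size };
    vkCmdCopyBuffer(cmd, staging, deviceLocal, 1, &region);

    vkEndCommandBuffer(cmd);
}
[/code]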

[QUOTE=ratchet freak;40334]I was under the impression it was more for low-latency per-frame data. Like view matrix and model transforms.

The vertex/texture data (which can tolerate higher latency) would go through a host staging buffer, with the transfer-only queue doing the copy so the DMA engine does the work.[/QUOTE]

I said “streaming” vertex data, where it’s changing every frame. GUIs, CPU-generated particle systems, etc. You really want that stuff to be available this frame, just like matrices and such. So a DMA operation seems inefficient.

Oh right, I was talking there about mesh and texture loading, where you hopefully have some advance knowledge of which ones are needed.

The NVIDIA memory types with zero flags refer to unmappable host memory and can be used to preserve virtual address space for e.g. 32-bit applications. I’m not sure, though, why recent drivers seem to report these twice as opposed to older ones (which only had one).

Attaching myself to this thread with another question related to memory types.
I understand all the allowed flag combinations except the two below:

VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
VK_MEMORY_PROPERTY_HOST_CACHED_BIT

The way I read it is that it marks resources that reside in system memory only (CPU RAM), but if the GPU decides it wants to keep them locally in VRAM after all (pull them in), they need to be manually synced by the app? That doesn’t make a lot of sense (how would the app know?).

VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
VK_MEMORY_PROPERTY_HOST_CACHED_BIT |
VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

This one I have no clue.
Can anybody explain more deeply what the purpose of allowing those two flag combinations is?
Any hints are welcome.

The first is obviously host-side RAM (nothing weird about it - you don’t have direct access to it anyway without Map, so synchronization is not a problem - and Map has synchronization commands).

The second is the same as VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT, only with a cache (= faster host access through map). E.g. iGPU BIOS-reserved host-side RAM could fit the bill?

[QUOTE=Karol Gasinski;40420]Attaching myself to this thread with another question related to memory types.
I understand all the allowed flag combinations except the two below:

VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
VK_MEMORY_PROPERTY_HOST_CACHED_BIT

The way I read it is that it marks resources that reside in system memory only (CPU RAM), but if the GPU decides it wants to keep them locally in VRAM after all (pull them in), they need to be manually synced by the app?[/quote]

Remember, all memory heaps that are exposed by Vulkan can be accessed by the GPU directly. There is no “pull in” with Vulkan unless you specifically say so.

What this memory type refers to is memory that both the CPU and GPU can access (HOST_VISIBLE), and that the CPU will access through its caches (HOST_CACHED). However, because it does not have the HOST_COHERENT bit, the CPU must ensure that its caches are flushed. That is, while the GPU can access CPU-addressable memory, it cannot access the CPU’s caches.

That’s what the Vulkan flush commands are for.
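In code, the non-coherent path looks roughly like this (a sketch; allocation and nonCoherentAtomSize alignment are glossed over):

[code]
#include <vulkan/vulkan.h>
#include <string.h>

/* Write through a HOST_VISIBLE + HOST_CACHED mapping that lacks
 * HOST_COHERENT: map, write, then explicitly flush the CPU caches so
 * the GPU actually sees the data. */
void write_noncoherent(VkDevice device, VkDeviceMemory memory,
                       VkDeviceSize offset, VkDeviceSize size,
                       const void *src)
{
    void *dst = NULL;
    vkMapMemory(device, memory, offset, size, 0, &dst);
    memcpy(dst, src, (size_t)size);

    VkMappedMemoryRange range = {
        .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
        .memory = memory,
        .offset = offset,
        .size   = size, /* should be aligned to nonCoherentAtomSize */
    };
    vkFlushMappedMemoryRanges(device, 1, &range);

    vkUnmapMemory(device, memory);
}
[/code]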

[QUOTE=Karol Gasinski;40420]VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
VK_MEMORY_PROPERTY_HOST_CACHED_BIT |
VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

This one I have no clue.[/quote]

Given what I said before, the meaning is obvious. The CPU and GPU can access the memory (HOST_VISIBLE), the CPU can access the memory through the caches (HOST_CACHED), and the GPU does not need the CPU to flush the caches explicitly (HOST_COHERENT). The implication here being that the GPU can access the CPU’s caches.

Intel’s integrated GPUs have apparently mastered this, while many mobile GPUs have not.

The implication here being that the GPU can access the CPU’s caches.

That was the information I was missing. Thanks!

I also thought it strange it was reported twice. Now it’s reported 7 times. And there are 2 memory types associated with the device local heap – Types 7 and 8 – though I always seem to get a memory match for 8 and never 7 when allocating resources. Any idea why Nvidia is reporting memory types this way?

Another interesting finding:

GTX 1080, as well as e.g. AMD R9 390, reports the memory type:

VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

As Alfonse explained, this type of memory assumes that the GPU has direct access to the CPU cache (e.g. they share it). That makes sense on Intel integrated GPUs, but I’m wondering why NV and AMD report such a memory type?

Shouldn’t they report instead:
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
VK_MEMORY_PROPERTY_HOST_CACHED_BIT
?

Writing directly into memory is faster than using copy engines when every 0.1 ms of response time matters. It may not even be that useful for graphics per se, but when you want to use the GPU for GPGPU work, it allows you to read data back without a full cache invalidation.
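For comparison, this is roughly what the readback path looks like; with a COHERENT type the invalidate can simply be skipped (sketch only):

[code]
#include <vulkan/vulkan.h>

/* Map GPU-written results for the CPU to read. On a HOST_CACHED type
 * without HOST_COHERENT the CPU caches must be invalidated first; on a
 * coherent type that call is unnecessary. */
const void *map_for_readback(VkDevice device, VkDeviceMemory memory,
                             VkDeviceSize size, int isCoherent)
{
    void *ptr = NULL;
    vkMapMemory(device, memory, 0, size, 0, &ptr);

    if (!isCoherent) {
        VkMappedMemoryRange range = {
            .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
            .memory = memory,
            .offset = 0,
            .size   = VK_WHOLE_SIZE,
        };
        vkInvalidateMappedMemoryRanges(device, 1, &range);
    }
    return ptr;
}
[/code]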

For what you’re describing they report a separate type:

HOST_VISIBLE_BIT
HOST_COHERENT_BIT

which I suppose is the equivalent of write-around access.
I think that the combination:

VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

means write-through access, where data goes to both the cache and memory at the same time on a CPU write, thus making the memory coherent while still cached for faster local CPU reads.

VISIBLE|COHERENT means you can only use linear writes to stay on the fast path (according to AMD documentation, anyway). Random writes, random reads, and linear reads with CPU prefetch support all require the CACHED bit.
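Put differently, a small sketch of the access pattern (nothing Vulkan-specific here): with a VISIBLE|COHERENT mapping you want to write linearly and never read back through the pointer; anything else wants the CACHED bit.

[code]
#include <string.h>
#include <stddef.h>

/* Uploading through a write-combined (VISIBLE|COHERENT, not CACHED)
 * mapping: fill it front to back with a linear copy and keep a CPU-side
 * copy of anything you might need to read later. */
void fill_write_combined(void *mapped, const float *verts, size_t count)
{
    memcpy(mapped, verts, count * sizeof(float)); /* linear writes: fast path */

    /* Reading back through `mapped` (e.g. ((float *)mapped)[0]) would be
     * an uncached read and is very slow - don't do it. */
}
[/code]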