
Thread: OpenGL Compute Issue, Possibly Switching to Vulkan

  1. #1
    Junior Member
    Join Date
    Nov 2017
    Posts
    3

    OpenGL Compute Issue, Possibly Switching to Vulkan

    Hi,

    I've been having an issue with OpenGL where using a compute shader to copy a struct of around 300 bytes between buffers is hundreds of times slower than I expect. If I eliminate the double buffering and just update the struct in place, performance is real-time, though nothing special. If I copy the struct one line at a time, performance halves for each vec4 I write.

    If this is a driver issue and the shader is not being optimized, switching to Vulkan, perhaps together with the SPIR-V optimizer, could solve my problem. But this is a large project and I want to do some research before putting in all the legwork. Has anybody using Vulkan done anything similar and hit performance issues?

  2. #2
    Senior Member
    Join Date
    Mar 2016
    Posts
    232
    Why are you using compute shaders to do a simple copy between buffers? There is glCopyBufferSubData.
    Similarly, Vulkan has dedicated copy commands (vkCmdCopyBuffer). There's not much to optimize in a shader anyway -- copying 300 bytes is quite straightforward. You are probably looking at the overhead of setting up a compute pipeline, enqueueing the work onto a GPU queue, etc.
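    For reference, replacing the compute-shader copy with the dedicated copy command might look roughly like this (a sketch only: `srcBuf`, `dstBuf`, and the 300-byte size are placeholder values, and a real program needs a current GL context and already-created buffers):

    ```c
    /* Sketch: copy ~300 bytes between two existing buffer objects
     * without dispatching a compute shader. srcBuf and dstBuf are
     * assumed to be already-created, adequately sized buffer names. */
    glBindBuffer(GL_COPY_READ_BUFFER,  srcBuf);
    glBindBuffer(GL_COPY_WRITE_BUFFER, dstBuf);
    glCopyBufferSubData(GL_COPY_READ_BUFFER, GL_COPY_WRITE_BUFFER,
                        0, 0, 300);  /* readOffset, writeOffset, size */
    ```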

  3. #3
    Junior Member
    Join Date
    Nov 2017
    Posts
    3
    Quote Originally Posted by krOoze View Post
    Why are you using compute shaders to do simple copy between buffers? There is glCopyBufferSubData.
    Similarly, Vulkan has dedicated copy commands. There's not much to optimize in a shader anyway -- copying 300 bytes is quite straightforward.
    Because I've stripped away all the actual code in the shader to find where the performance issue is. The goal isn't to copy between buffers; that's just what drops my application to <1 fps. So I can't do any operation that requires double buffering without killing performance.

    Quote Originally Posted by krOoze View Post
    You are probably looking at the overhead of setting up a compute pipeline, enqueueing the work onto GPU queue, etc.
    I've tested that this is not the case: the execution speed scales with the number of instructions used to copy the struct, and with whether the copy goes between buffers.
    Last edited by krackaan; 11-01-2017 at 06:13 AM.

  4. #4
    Quote Originally Posted by krackaan View Post
    I've been having an issue with OpenGL where using a compute shader to copy a struct of around 300 bytes between buffers is hundreds of times slower than I expect. If I eliminate the double buffering and just update the struct in place, performance is real-time, though nothing special. If I copy the struct one line at a time, performance halves for each vec4 I write.
    None of this is surprising.

    Compute shaders are for computing, not shuffling data around; they're pretty terrible at copying data. Having each compute shader invocation read 300 bytes of memory and then write 300 bytes of memory is going to be extremely slow.

    Moving from reading and writing different locations to reading and writing the same location certainly should improve performance: at least then the memory addresses you're updating are already in the cache. Similarly, writing less data ought to improve performance, since memory traffic is what's driving your cost.

    The best optimization you could do is to stop copying data. In your OpenGL thread, you mentioned you were implementing a sort algorithm. Well, you don't have to copy data to do that. Your sort algorithm should sort indices, not the actual struct objects. That is, instead of copying a struct into its location in the sorted array, you copy an index to the struct in the sorted array of indices. And the CS doing the sort should only read the absolute minimum data from the struct that it needs to in order to do the comparison.

    And the comparison data ought to be in an array by itself. That is, you should make your data a struct of arrays (and your CS doing sorting should only access the array(s) that it needs to), not an array of structs.
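    To illustrate the idea on the CPU side, here is a minimal sketch (with made-up data; `depth` stands in for whatever per-element key your comparison actually uses) of sorting an index array over a struct-of-arrays layout, so the heavy payload is never moved:

    ```c
    #include <stdlib.h>
    #include <stdio.h>

    /* Struct-of-arrays layout: the sort key (a made-up "depth" here)
     * lives in its own array, so the sort never reads or writes the
     * rest of the ~300-byte per-element payload. */
    #define COUNT 5
    static const float depth[COUNT] = { 3.0f, 1.0f, 4.0f, 0.5f, 2.0f };

    /* Order two indices by the depth values they refer to (ascending). */
    static int by_depth(const void *a, const void *b)
    {
        float da = depth[*(const unsigned *)a];
        float db = depth[*(const unsigned *)b];
        return (da > db) - (da < db);
    }

    /* Sort indices instead of moving the structs themselves. */
    static void sort_indices(unsigned *order, size_t n)
    {
        qsort(order, n, sizeof order[0], by_depth);
    }

    int main(void)
    {
        unsigned order[COUNT] = { 0, 1, 2, 3, 4 };
        sort_indices(order, COUNT);

        /* Later passes read the payload indirectly: payload[order[i]]. */
        for (unsigned i = 0; i < COUNT; ++i)
            printf("%u ", order[i]);   /* prints: 3 1 4 0 2 */
        printf("\n");
        return 0;
    }
    ```

    A GPU version would follow the same shape: the compute shader compares keys from a dedicated key buffer and swaps 4-byte indices, and only a final gather pass (if one is needed at all) ever touches the full structs.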

  5. #5
    Junior Member
    Join Date
    Nov 2017
    Posts
    3
    Quote Originally Posted by Alfonse Reinheart View Post
    300 bytes of memory is going to be extremely slow.
    Great, this is what I needed to know. It might not be surprising to you, but it's surprising to me, because this information is hard to find and my data easily fits in L1 cache, so everything looked fine on paper.

    I'm sorting the structs so they're in coherent read order for later steps, but if I can't do that, that's fine; it may be possible to mitigate the cache incoherence another way. The real issue is how much data my algorithm requires writing. I will redesign and hope I can find a way to write only a single vec4 each time.
    Last edited by krackaan; 11-01-2017 at 11:15 AM.
