vulkan-tutorial.com-style app hangs on Linux with an NVIDIA Quadro K600

dhubbard · August 3, 2016, 4:07pm

I realize the NVIDIA DevTalk forums might be a more appropriate place for my question, but Linux questions seem to get more support here on the Khronos forums.

I have followed vulkan-tutorial.com all the way up to Rendering and presentation. The equivalent to my code would be:
https://vulkan-tutorial.com/code/hello_triangle.cpp

My code is available on GitHub at davidhubbard/v0lum3 (GPL3 Vulkan voxel library). It should compile on Windows, but I have not tested that: I have not yet created a project file or linked the external dependencies there. To build on Linux, clone the repo, run “build.sh”, then run “make”. “build.sh” will ask you to type “export PKG_CONFIG_PATH=$PWD/vendor/lib/pkgconfig” in your shell.

The issue I am seeing is that if the main loop runs for 665 iterations, vkDeviceWaitIdle hangs for a while but exits cleanly. (I bisected the number of iterations by changing the test if (count > 1000) break; in main.cpp.)

If the main loop runs for 666 iterations, vkDeviceWaitIdle fails with VK_ERROR_DEVICE_LOST.
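
The bisection mentioned above amounts to a binary search over the iteration limit. The following is only a sketch: firstFailingCount and runFails are names made up for this example, runFails stands in for "rebuild with this limit, run, and check whether vkDeviceWaitIdle returned VK_ERROR_DEVICE_LOST", and the search assumes that every count above the threshold also fails.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Returns the smallest iteration count in (lo, hi] for which runFails(count)
// is true. Assumes runFails(lo) is false, runFails(hi) is true, and that
// failure is monotonic in the count (hypothetical helper, not from the repo).
std::uint32_t firstFailingCount(
        std::uint32_t lo, std::uint32_t hi,
        const std::function<bool(std::uint32_t)>& runFails) {
    assert(lo < hi);
    while (hi - lo > 1) {
        const std::uint32_t mid = lo + (hi - lo) / 2;
        if (runFails(mid)) {
            hi = mid;  // mid already fails: the threshold is at or below mid
        } else {
            lo = mid;  // mid still passes: the threshold is above mid
        }
    }
    return hi;
}
```

Starting from a passing limit of 0 and a failing limit of 1000, this needs about ten runs to find the threshold.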

Setting VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_standard_validation to enable the standard validation layers does not reveal any useful information about why the NVIDIA driver is pausing for so long. The following output is easy to match to the code in main.cpp:

vkQueuePresentKHR done, end of loop 665
vkAcquireNextImageKHR
vkQueueSubmit
D MEM: code0: Details of Memory Object list (of size 0 elements)
D MEM: code0: =============================
D MEM: code0: Details of CB list (of size 3 elements)
D MEM: code0: ==================
D MEM: code0:     CB Info (0x0x2581dd0) has CB 0x0x2588720
D MEM: code0:     CB Info (0x0x257ecf0) has CB 0x0x2587900
D MEM: code0:     CB Info (0x0x2592be0) has CB 0x0x257f5d0
vkQueuePresentKHR
vkQueuePresentKHR done, end of loop 666
vkDeviceWaitIdle
W ParameterValidation: code9: vkDeviceWaitIdle: returned VK_ERROR_DEVICE_LOST, indicating that the logical device has been lost
vkDeviceWaitIdle returned -4
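
The -4 in the last line is the numeric value of VK_ERROR_DEVICE_LOST. A small helper along the following lines can make such logs easier to read. The name resultName is made up for this sketch, it covers only the core Vulkan 1.0 result codes, and the numeric values are hard-coded only so that the snippet builds without the Vulkan headers; real code would switch on the VkResult enum from <vulkan/vulkan.h> instead.

```cpp
#include <cstdint>
#include <string>

// Maps a raw VkResult value to its enumerator name (core Vulkan 1.0 codes).
// Hypothetical helper for logging; not part of the Vulkan API.
std::string resultName(std::int32_t result) {
    switch (result) {
        case 0:   return "VK_SUCCESS";
        case 1:   return "VK_NOT_READY";
        case 2:   return "VK_TIMEOUT";
        case 3:   return "VK_EVENT_SET";
        case 4:   return "VK_EVENT_RESET";
        case 5:   return "VK_INCOMPLETE";
        case -1:  return "VK_ERROR_OUT_OF_HOST_MEMORY";
        case -2:  return "VK_ERROR_OUT_OF_DEVICE_MEMORY";
        case -3:  return "VK_ERROR_INITIALIZATION_FAILED";
        case -4:  return "VK_ERROR_DEVICE_LOST";
        case -5:  return "VK_ERROR_MEMORY_MAP_FAILED";
        case -6:  return "VK_ERROR_LAYER_NOT_PRESENT";
        case -7:  return "VK_ERROR_EXTENSION_NOT_PRESENT";
        case -8:  return "VK_ERROR_FEATURE_NOT_PRESENT";
        case -9:  return "VK_ERROR_INCOMPATIBLE_DRIVER";
        case -10: return "VK_ERROR_TOO_MANY_OBJECTS";
        case -11: return "VK_ERROR_FORMAT_NOT_SUPPORTED";
        default:  return "unrecognized VkResult " + std::to_string(result);
    }
}
```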

This is my first attempt at a Vulkan project. I am trying to stick close to vulkan-tutorial.com, but I have wrapped many of the function calls in a class hierarchy. I suspect one of two causes:

First, I may be passing an incorrect parameter in one of the initialization calls. I have double- and triple-checked the values and can’t find a wrong one, and standard_validation reports no errors. I have also run the app under valgrind and don’t believe I have any memory errors.
Second, I may be missing a necessary function call or getting the sequencing wrong. It looks as though I am leaking a resource within the graphics driver, and the driver “crashes” after the 666th iteration. Googling VK_ERROR_DEVICE_LOST suggests that this error generally means the driver crashed and was restarted. When I run the app, the X server freezes for a few seconds and the following appears in /var/log/Xorg.0.log, which seems to mean the GPU was reset and reinitialized:

[709440.056] (--) NVIDIA(GPU-0): CRT-0: disconnected
[709440.056] (--) NVIDIA(GPU-0): CRT-0: 400.0 MHz maximum pixel clock
[709440.056] (--) NVIDIA(GPU-0): 
[709440.059] (--) NVIDIA(GPU-0): DFP-0: disconnected
[709440.059] (--) NVIDIA(GPU-0): DFP-0: Internal TMDS
[709440.059] (--) NVIDIA(GPU-0): DFP-0: 330.0 MHz maximum pixel clock
[709440.059] (--) NVIDIA(GPU-0): 
[709440.059] (--) NVIDIA(GPU-0): DFP-1: disconnected
[709440.059] (--) NVIDIA(GPU-0): DFP-1: Internal TMDS
[709440.059] (--) NVIDIA(GPU-0): DFP-1: 165.0 MHz maximum pixel clock
[709440.059] (--) NVIDIA(GPU-0): 
[709440.060] (--) NVIDIA(GPU-0): HP Z30i (DFP-2): connected
[709440.060] (--) NVIDIA(GPU-0): HP Z30i (DFP-2): Internal DisplayPort
[709440.060] (--) NVIDIA(GPU-0): HP Z30i (DFP-2): 960.0 MHz maximum pixel clock
[709440.060] (--) NVIDIA(GPU-0): 
[709440.066] (--) NVIDIA(GPU-0): CRT-0: disconnected
[709440.066] (--) NVIDIA(GPU-0): CRT-0: 400.0 MHz maximum pixel clock
[709440.066] (--) NVIDIA(GPU-0): 
[709440.069] (--) NVIDIA(GPU-0): DFP-0: disconnected
[709440.069] (--) NVIDIA(GPU-0): DFP-0: Internal TMDS
[709440.069] (--) NVIDIA(GPU-0): DFP-0: 330.0 MHz maximum pixel clock
[709440.069] (--) NVIDIA(GPU-0): 
[709440.069] (--) NVIDIA(GPU-0): DFP-1: disconnected
[709440.069] (--) NVIDIA(GPU-0): DFP-1: Internal TMDS
[709440.069] (--) NVIDIA(GPU-0): DFP-1: 165.0 MHz maximum pixel clock
[709440.069] (--) NVIDIA(GPU-0): 
[709440.070] (--) NVIDIA(GPU-0): HP Z30i (DFP-2): connected
[709440.070] (--) NVIDIA(GPU-0): HP Z30i (DFP-2): Internal DisplayPort
[709440.070] (--) NVIDIA(GPU-0): HP Z30i (DFP-2): 960.0 MHz maximum pixel clock
[709440.070] (--) NVIDIA(GPU-0):

Here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.35                 Driver Version: 367.35                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro K600         Off  | 0000:04:00.0      On |                  N/A |
| 25%   47C    P8    N/A /  N/A |    187MiB /   979MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 0000:05:00.0     Off |                  N/A |
| 22%   34C    P8    15W / 250W |      1MiB / 12206MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3817    G   /usr/bin/X                                      96MiB |
|    0      5543    G   ...iveTaskBlocking/Enabled/StrictSecureCooki    89MiB |
+-----------------------------------------------------------------------------+

krOoze · August 3, 2016, 6:24pm

666? seriously? :twisted:

Sometimes the layers themselves are at fault; try running with them off.
A driver bug or a bug in your code are possible culprits too.

I am not quite ready to go over your whole code base yet. Update your drivers AND the SDK. Test whether other programs show similar behavior on your hardware. Try to make a minimal reproducible example.

dhubbard · August 3, 2016, 8:19pm

I’ve tried it without any validation layers, and vkDeviceWaitIdle still returns the same VK_ERROR_DEVICE_LOST.

dhubbard · August 3, 2016, 8:47pm

I looked at the Intel tutorial, focusing only on how its VkSubpassDependency and VkAttachmentDescription usage differs from vulkan-tutorial.com. After testing, and based on explanations here on the forum, I decided the Intel tutorial isn’t going to help much.

So Sascha Willems has a triangle example. I’m going to start comparing his calls to Vulkan with mine and try to bisect where things come out differently.

dhubbard · August 5, 2016, 1:11pm

I noticed today that if I run the code under valgrind, I need 805 iterations to get the error instead of 666. Since the biggest difference under valgrind is that the code runs slower, I am beginning to think my problem is that I am not synchronizing with the GPU correctly.

This will help as I look at the differences in Sascha Willems’ triangle demo.

If I am not synchronizing correctly, and the NVIDIA Linux driver just buffers up the commands and processes them when it can, then vkDeviceWaitIdle() would block while the backlog was processed, and presumably fail if that took too long.

And I have noticed that vkDeviceWaitIdle() pauses a lot longer when I have run the main loop for more iterations.
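
The hypothesis above can be illustrated with a toy model. This is not a measurement and says nothing about how the driver really works: pendingWorkAtExit and its cost parameters are invented for the sketch, time is counted in arbitrary units, and the GPU is assumed to retire queued work at a constant rate while the CPU prepares the next frame.

```cpp
#include <algorithm>
#include <cstdint>

// Toy model: returns the GPU work (in time units) still queued when the main
// loop exits, i.e. roughly what a final wait-for-idle would have to sit
// through. maxPendingFrames == 0 means the CPU never waits for the GPU.
std::uint64_t pendingWorkAtExit(std::uint32_t iterations,
                                std::uint64_t cpuCostPerIteration,
                                std::uint64_t gpuCostPerFrame,
                                std::uint64_t maxPendingFrames) {
    std::uint64_t pending = 0;  // queued GPU work, in time units
    for (std::uint32_t i = 0; i < iterations; ++i) {
        // While the CPU spends cpuCostPerIteration on this iteration, the
        // GPU retires that much of the backlog (never going below zero).
        pending -= std::min(pending, cpuCostPerIteration);

        if (maxPendingFrames != 0) {
            // Stand-in for a blocking wait: stall until there is room for
            // one more frame under the bound.
            const std::uint64_t limit = (maxPendingFrames - 1) * gpuCostPerFrame;
            pending = std::min(pending, limit);
        }

        pending += gpuCostPerFrame;  // submit this iteration's frame
    }
    return pending;
}
```

With a GPU cost of 3 units per frame and a CPU cost of 1 unit per iteration, the unthrottled backlog grows linearly with the iteration count, which matches the longer vkDeviceWaitIdle() pauses after longer runs. Raising the CPU cost, as valgrind would, makes the backlog grow more slowly, which is at least consistent with the threshold moving from 666 to 805. Bounding the frames in flight, for example by waiting on a VkFence passed to vkQueueSubmit, keeps the backlog constant in this model.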

dhubbard · August 6, 2016, 10:46pm

Here is a vktrace of the program: github

Here is the text dump of the trace: gist

krOoze · August 7, 2016, 6:47pm

Oh! It’s good ol’ triangle.

Well, you have:

if (count > 805) break;

in your code, so…

Anyway, otherwise it worked fine for me on Win10+AMD.

dhubbard · August 7, 2016, 8:26pm

Yes, I have the if (count > 805) break; because if I let it run indefinitely, I can’t diagnose this issue, which is specific to NVIDIA on Linux…

Thank you, however, for testing it! Good to know it’s (mostly) there!

Do you have an nvidia card? I am looking for some help. What if I sent one to you?

Sascha_Willems · August 8, 2016, 1:59pm

Can you try adding a fence for the command buffer submission and check if that helps?

I’m currently rearranging my examples a bit (cleaning up code, removing unnecessary post- and pre-present barriers) and have also added fences to sync submission.

dhubbard · August 9, 2016, 8:24am

Great suggestion, Sascha. Thanks!

Unfortunately I will be on business-related travel until Aug 15, so I will have to wait until then to implement your suggestion. I will post the results.

dhubbard · August 23, 2016, 8:37am

I have resolved the problem by upgrading my Linux NVIDIA driver to 370.23.

Is anyone still interested in what happens if I try some of the very good suggestions? If so, I can always check out commit 4c0ecee of davidhubbard/v0lum3 (GPL3 Vulkan voxel library), downgrade my Linux NVIDIA driver to 367.35, and re-test.