Parallel compute shaders execution in Vulkan?

Question

I have several compute shaders (let's call them compute1, compute2 and so on) that have several input bindings (defined in shader code as layout (...) readonly buffer) and several output bindings (defined as layout (...) writeonly buffer). I'm binding buffers with data to their descriptor sets and then trying to execute these shaders in parallel.

What I've tried:

vkQueueSubmit() with VkSubmitInfo.pCommandBuffers holding several primary command buffers (one per compute shader);
vkQueueSubmit() with VkSubmitInfo.pCommandBuffers holding one primary command buffer that was recorded using vkCmdExecuteCommands() with pCommandBuffers holding several secondary command buffers (one per compute shader);
Separate vkQueueSubmit()+vkQueueWaitIdle() calls from different std::thread objects (one per compute shader): each command buffer is allocated from a separate VkCommandPool and submitted to its own VkQueue with its own VkFence; the main thread waits using threads[0].join(); threads[1].join(); and so on;
Separate vkQueueSubmit() calls from different detached std::thread objects (one per compute shader): each command buffer is allocated from a separate VkCommandPool and submitted to its own VkQueue with its own VkFence; the main thread waits using vkWaitForFences() with pFences holding the fences that were used in vkQueueSubmit() and with waitAll set to true.

What I've got:

In all cases the total time is almost the same (the difference is less than 1%) as when calling vkQueueSubmit()+vkQueueWaitIdle() for compute1, then for compute2 and so on.

I want to bind the same buffers as inputs for several shaders, but the timing is the same even when each shader is executed with its own VkBuffer+VkDeviceMemory objects.

So my question is:

Is it possible to somehow execute several compute shaders simultaneously, or does command buffer parallelism work for graphics shaders only?

Update: The test application was compiled using LunarG Vulkan SDK 1.1.73.0 and runs on Windows 10 with an NVIDIA GeForce GTX 960.

What makes you think that none of these are executing the shaders in parallel? Equally importantly... why do you care? What matters is how quickly the work gets done, not whether they execute "in parallel", right? If the GPU has 20 compute units, and each dispatch would require say 60 compute units, then it won't be any faster by executing each compute operation over 10 units (for parallel execution) than to execute them over 20 units. — Nicol Bolas, Jun 19 '18 at 13:22
You are right: the only thing that I actually want is the highest performance that I can achieve. I design algorithms in such a way that they can easily be run in parallel, so I'm trying to maximize the benefit. — zedrian, Jun 19 '18 at 14:33
Have you tried using one command buffer, with back-to-back dispatches in it? As long as there aren't barrier/event dependencies between them, they will *begin* in order but can progress in parallel after that, e.g. if the first dispatch doesn't fill up all execution units the second dispatch can fill in the holes (if the hardware is capable of this). I believe most hardware can support this level of parallelism even if they don't support multiple independent queues -- it allows them to keep utilization high as one dispatch finishes and the next begins. — Jesse Hall, Jun 19 '18 at 15:05
@JesseHall, could you please provide an example of the idea or a link? I don't completely understand what you mean, and googling "back-to-back dispatch" does not help. — zedrian, Jun 19 '18 at 15:11
I just mean begin a command buffer, bind descriptor set(s) for compute1, dispatch compute1, bind descriptor set(s) for compute2, dispatch compute2, ..., end command buffer. — Jesse Hall, Jun 19 '18 at 15:25
@JesseHall, I've tried what you proposed. For extra small dispatch grids (N=10) it gives a speedup of nearly +100%, for small grids (N=100) the boost is about +6%, and for medium (N=1000) and large grids (N=10000) the boost is predictably tiny. So for small dispatch grids this trick can be really useful, thank you! — zedrian, Jun 19 '18 at 16:18
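
The reasoning in the comments above (dispatches either draining one at a time, or a later dispatch filling idle execution units) can be illustrated with a toy occupancy model. This is only a sketch under loud assumptions: the GPU is reduced to a fixed number of "slots", every workgroup occupies one slot for exactly one time step, and nothing else (memory, scheduling, launch overhead) is modelled. The function names `stepsSerialized` and `stepsBackToBack` are hypothetical and not part of Vulkan.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Toy model, not a measurement: `slots` must be nonzero, and each workgroup
// is assumed to keep one slot busy for exactly one time step.

// Time steps when every dispatch has to drain before the next one starts,
// as with vkQueueSubmit() + vkQueueWaitIdle() per shader.
std::uint64_t stepsSerialized(const std::vector<std::uint64_t>& groups,
                              std::uint64_t slots) {
    std::uint64_t steps = 0;
    for (std::uint64_t g : groups) {
        steps += (g + slots - 1) / slots;  // ceil(g / slots) for this dispatch
    }
    return steps;
}

// Time steps when later dispatches may occupy slots that earlier dispatches
// leave idle (back-to-back dispatches with no barriers between them).
std::uint64_t stepsBackToBack(const std::vector<std::uint64_t>& groups,
                              std::uint64_t slots) {
    const std::uint64_t total =
        std::accumulate(groups.begin(), groups.end(), std::uint64_t{0});
    return (total + slots - 1) / slots;  // ceil(total / slots)
}
```

With the hypothetical 20 units from the first comment, two dispatches of 10 workgroups take 2 steps serialized but 1 step back-to-back, while two dispatches of 60 workgroups take 6 steps either way. In this model the gain comes only from partially filled steps, which would be consistent with the benefit shrinking as the grids grow.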

score 1 · Answer 1 · answered Jun 19 '18 at 11:47

This depends on the hardware you are running your application on. Hardware exposes queues which process submitted commands. Each queue, as the name suggests, executes commands in order, one after another. So if you submit multiple command buffers to a single queue, they will be executed in the order of their submission. Internally, the GPU can try to parallelize execution of some parts of the submitted commands (for example, separate stages of the graphics pipeline can be processed at the same time). But in general, a single queue processes commands sequentially, and it doesn't matter whether you are submitting graphics or compute commands.

In order to execute multiple command buffers in parallel, you need to submit them to separate queues. But the hardware must support multiple queues: it must have separate, physical queues in order to be able to process them concurrently.

More importantly, I've read that some graphics hardware vendors simulate multiple queues in their drivers. In other words, they expose multiple queues in Vulkan, but internally these are processed by a single physical queue. I think that's what is happening in your case, and the results of your experiments seem to confirm it (though I can't be sure, of course).

It seems that you are right about the simulation of multiple queues. But let's wait for other answers - I still hope that there is some magic somewhere among the millions of Vulkan parameters that I've missed :) — zedrian, Jun 19 '18 at 11:56
@zedrian For example, here is information about simulating multiple queues: https://www.reddit.com/r/vulkan/comments/7ynlcl/cost_of_including_queues_in_logical_device/ This part is especially interesting: *"None of the vendors have more than 1 hardware graphics queue AFAIK, only multiple compute queues. (AMD for example has this, NVidia emulates it in the driver)."* — Ekzuzy, Jun 19 '18 at 12:14
Yes, it's interesting. But where are these multiple compute queues? As my experiments show, it looks like there is only one compute queue. — zedrian, Jun 19 '18 at 12:32
@zedrian According to http://vulkan.gpuinfo.org Nvidia 960 indeed has 2 queue families with 16 general queues and 1 transfer queue. But newer GPUs (like 1050) have 3 queue families (with additional 8 compute queues). — Ekzuzy, Jun 20 '18 at 06:27
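
To check which queues a device exposes, an application can inspect the queue family list that Vulkan returns from vkGetPhysicalDeviceQueueFamilyProperties(). The sketch below is dependency-free so it can be compiled without the SDK: `QueueFamily` is a hypothetical stand-in for the `queueFlags` and `queueCount` fields of VkQueueFamilyProperties, the `k...Bit` constants mirror the numeric values of VK_QUEUE_GRAPHICS_BIT, VK_QUEUE_COMPUTE_BIT and VK_QUEUE_TRANSFER_BIT, and both function names are made up for illustration. As the answer points out, a queue being exposed does not prove that the hardware runs it concurrently.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the two VkQueueFamilyProperties fields used here.
struct QueueFamily {
    std::uint32_t queueFlags;
    std::uint32_t queueCount;
};

// Same numeric values as the corresponding VK_QUEUE_*_BIT flags.
constexpr std::uint32_t kGraphicsBit = 0x1;
constexpr std::uint32_t kComputeBit  = 0x2;
constexpr std::uint32_t kTransferBit = 0x4;

// Total number of exposed queues that accept compute work.
std::uint32_t computeQueueCount(const std::vector<QueueFamily>& families) {
    std::uint32_t total = 0;
    for (const QueueFamily& f : families) {
        if (f.queueFlags & kComputeBit) {
            total += f.queueCount;
        }
    }
    return total;
}

// Index of the first family that supports compute but not graphics,
// or -1 if the device exposes no such family.
int dedicatedComputeFamily(const std::vector<QueueFamily>& families) {
    for (std::size_t i = 0; i < families.size(); ++i) {
        const std::uint32_t flags = families[i].queueFlags;
        if ((flags & kComputeBit) && !(flags & kGraphicsBit)) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

For a layout like the one reported above for the GTX 960 (one general family with 16 queues plus one transfer-only family), `computeQueueCount` gives 16 and `dedicatedComputeFamily` gives -1, so every compute-capable queue there shares a family with graphics.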
