Enqueuing multiple kernels with the same input buffers

Dear all,

According to the OpenCL standard, is it OK to sequentially enqueue (from a single host thread) a number of kernels (clEnqueueNDRangeKernel) where:
-) all refer to the same context
-) all input data buffers are shared (i.e. they refer to the same cl_mem objects); "input" means I only read data from them, though the kernel function's buffer arguments are not declared as const
-) each enqueued kernel has its own separate cl_kernel, cl_command_queue and output-buffer cl_mem objects ("output" means data get written to them)
-) the first explicit synchronization barrier on my side occurs only after several (e.g. a handful of) such kernels have been enqueued (data are copied to the host in a blocking manner, and each copy uses the same command queue as the kernel which calculated the data)
-) the kernel code is the same for all; all command queues are in-order (though that should be irrelevant here)
What I effectively do is call clEnqueueNDRangeKernel for different work-item ranges, so that host-side processing of the results of the first kernel invocations (after copying the data) runs in parallel with the subsequent kernel calculations, fully using both device (GPU) and host (CPU) resources.
[I am using the C++ wrapper API, but again that shouldn't matter here.] The OpenCL version is 1.1.
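For concreteness, the pattern could look roughly like this (a minimal sketch using the OpenCL 1.1 C++ wrapper; the kernel name "my_kernel", the sizes and the function layout are all hypothetical, and it of course needs a real OpenCL platform/device to actually run):

```cpp
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>   // OpenCL 1.1 C++ wrapper
#include <vector>

// Hypothetical sketch of the usage pattern described above.
void run_batches(cl::Context& ctx, cl::Device& dev, cl::Program& prog,
                 cl::Buffer& sharedInput,   // read by all kernels, never written
                 size_t n, int numBatches)
{
    std::vector<cl::CommandQueue> queues;
    std::vector<cl::Buffer> outputs;

    // Enqueue phase: each batch gets its own kernel, in-order queue
    // and output buffer; only the input buffer is shared.
    for (int b = 0; b < numBatches; ++b) {
        queues.push_back(cl::CommandQueue(ctx, dev));
        outputs.push_back(cl::Buffer(ctx, CL_MEM_WRITE_ONLY,
                                     n * sizeof(float)));
        cl::Kernel kernel(prog, "my_kernel");      // same kernel code each time
        kernel.setArg(0, sharedInput);
        kernel.setArg(1, outputs[b]);
        queues[b].enqueueNDRangeKernel(kernel, cl::NullRange,
                                       cl::NDRange(n), cl::NullRange);
    }

    // Sync phase: a blocking read on the SAME queue that ran the kernel
    // is ordered after that kernel (in-order queue semantics).
    std::vector<float> host(n);
    for (int b = 0; b < numBatches; ++b) {
        queues[b].enqueueReadBuffer(outputs[b], CL_TRUE, 0,
                                    n * sizeof(float), &host[0]);
        // ... host-side processing of `host` happens here and overlaps
        //     with kernels still running on the later queues ...
    }
}
```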

thanks !

And… what is your question?

Sorry if my question was not explicit enough. The question is whether the described usage pattern is OK according to the OpenCL specification, i.e. whether it violates anything, introduces a race, relies on implementation-specific properties, or otherwise triggers undefined or implementation-defined behaviour.
The reason I ask is that I observe runtime problems with AMD's implementation on Tahiti cards (mostly spurious driver hang-ups; these occur non-deterministically and relatively rarely, e.g. millions of kernel calls may go fine until a hang-up occurs, but persistently enough to rule out lengthy simulations), while no problems are observed on Nvidia.

The usage pattern is “fine” in the sense that it is allowed by the spec.
If you synchronize correctly using events, it will also work as intended.
I honestly cannot see how you can use the same buffers AND have concurrent kernels operating on them (sub-buffers perhaps?), as proper synchronization would (in general) prevent that from happening.

The AMD implementation has a few quirks. On some drivers there seems to be a problem protecting memory from overflows. Sometimes the compiler will trip over itself and forget a barrier… What I can say is that in my experience valid kernels usually work, with the exception of the compiler not doing its job, which is not very common.

It is entirely possible you're hitting a driver bug, but I'm not currently inclined to believe that.

Just in case: AMD does not support out-of-order queues. Maybe they do with the latest driver, but I honestly find those quite redundant.

If “operating on” refers to buffers that are both read and written (specifically the outputs of kernel calculations), then indeed my code and its synchronization do not allow any concurrent kernel execution on those. But this does not hold for buffers from which data are only read, i.e. concurrent kernels may read data from the same input buffer(s). I ensure that these input-only buffers are in a proper state before enqueuing the first kernel of the sequence of concurrent kernels (the values in these input buffers were themselves calculated by independent kernels invoked earlier), but otherwise these input buffers are not considered any further during subsequent synchronizations. To me, reading values from a buffer is also a kind of operating on a buffer, so just to clarify that we merely have a language issue here: I may access these input-only buffers from concurrent kernels without any need to synchronize anything for them (what could I synchronize for them at all?), right?

If you’re sure the synchronization with respect to the completion of the previous kernels is OK, you’ll be fine. This is the same as previously noted; just make sure the previous kernel has fully completed.
Just because the kernels were enqueued earlier does not imply they have also completed (that only holds when using a single in-order command queue).
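When ordering across two different queues is actually required, events are the mechanism the spec provides for that. A rough sketch with the C++ wrapper (queue/kernel names hypothetical; this needs a real OpenCL context to run):

```cpp
// Kernel B on queueB must not start before kernel A on queueA completes.
// Separate queues give no implicit ordering, so chain them with an event.
cl::Event aDone;
queueA.enqueueNDRangeKernel(kernelA, cl::NullRange, cl::NDRange(n),
                            cl::NullRange, NULL, &aDone);

std::vector<cl::Event> waitList(1, aDone);
queueB.enqueueNDRangeKernel(kernelB, cl::NullRange, cl::NDRange(n),
                            cl::NullRange, &waitList, NULL);

// Or, to be certain on the host side that A has fully completed:
aDone.wait();
```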