Force-stopping a launched kernel

I’ve been wondering about this for some time now and haven’t seen anyone else ask it. Is there any way to force-stop a launched kernel? What happens if, for whatever reason, a kernel gets stuck in an infinite loop or takes much longer than anticipated? I guess it would be like a dequeue as opposed to an enqueue. Thanks!

Is there any way to force-stop a launched kernel?

There’s no way to do that in the core specification. I don’t know of any extension that allows that either.

Cancelling a command mid-execution would necessarily leave the contents of any memory objects that were in use in an undefined state. It’s not clear that an application can recover from that cleanly. It’s not very different from cancelling a worker thread while it’s busy doing something.

What happens if for whatever reason a kernel gets stuck in an infinite loop or takes much longer than anticipated?

The OpenCL specification doesn’t currently define what happens in that scenario. In practice on Windows systems you may expect the WDDM watchdog to reset the GPU after a few seconds.
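A common workaround (not something the specification provides) is to split one long launch into several short ones and have the host check a cancel flag between them, so that no single launch runs long enough to matter. Below is a minimal sketch of that host-side loop in plain C. The names `run_chunked`, `launch_fn`, `demo_state` and `demo_launch` are made up for illustration; in a real program the callback would presumably wrap `clEnqueueNDRangeKernel` (using its `global_work_offset` argument) followed by `clFinish`, and a multithreaded host would want a proper atomic flag rather than a `volatile bool`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: launches work-items [offset, offset + count) and
 * blocks until they finish. In a real OpenCL host program this could wrap
 * clEnqueueNDRangeKernel (offset -> global_work_offset, count ->
 * global_work_size) followed by clFinish. */
typedef void (*launch_fn)(size_t offset, size_t count, void *user);

/* Runs `total` work-items in slices of at most `chunk`, checking `*cancel`
 * before each slice. Returns how many work-items were actually launched,
 * so the caller knows how far the job got before it was cancelled. */
size_t run_chunked(size_t total, size_t chunk, launch_fn launch,
                   void *user, const volatile bool *cancel)
{
    assert(chunk > 0 && launch != NULL && cancel != NULL);
    size_t done = 0;
    while (done < total) {
        if (*cancel)
            break;                      /* stop between slices, never mid-slice */
        size_t remaining = total - done;
        size_t count = remaining < chunk ? remaining : chunk;
        launch(done, count, user);
        done += count;
    }
    return done;
}

/* Stand-in launcher for demonstration only: it counts slices and requests
 * cancellation once `cancel_after` slices have run (0 means never). */
struct demo_state {
    size_t slices;
    size_t cancel_after;
    bool cancel;
};

void demo_launch(size_t offset, size_t count, void *user)
{
    struct demo_state *s = user;
    (void)offset;
    (void)count;
    s->slices++;
    if (s->slices == s->cancel_after)
        s->cancel = true;
}
```

Note that this only ever stops between slices, so each slice still has to finish on its own; it does nothing for a kernel that is genuinely stuck in an infinite loop.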

Cancelling a command mid-execution would necessarily leave the contents of any memory objects that were in use in an undefined state. It’s not clear that an application can recover from that cleanly. It’s not very different from cancelling a worker thread while it’s busy doing something.

If the memory object is read-only, then its state should be well defined. For read-write and write-only memory objects, the state would be undefined, as you said; however, if the command was successfully terminated, it should be possible to reset those objects. These seem like the same state descriptions that apply to a running kernel, but maybe it is the successful termination that is troublesome.

The OpenCL specification doesn’t currently define what happens in that scenario. In practice on Windows systems you may expect the WDDM watchdog to reset the GPU after a few seconds.

Thanks for the info. However, that seems to only work for CPUs and accelerators, and of course on Windows only. Could you bring this issue up (again?) in the OpenCL specification discussions?

Thanks!

Sean,

Believe it or not, most of the feature requests that we see on these message boards were already discussed months ago in the Working Group. For example, adding a linking stage was already committed to CL 1.2 before anybody brought it up in the forums.

Cancelling kernels that are going to be executed or, even worse, are already mid-execution is nontrivial. Notice, for example, that WDDM will reset the whole GPU, affecting all applications using it, when a single kernel takes too long to run. Imagine how you would feel in the CPU world if your computer had to be reset whenever a single rogue application entered an infinite loop. It’s obviously not a good solution, and it is a consequence of how GPUs are designed today.

The OpenCL Working Group has to create a standard that works across all kinds of devices. This necessarily means that some features that might be implemented reasonably well on one class of devices don’t make it into the standard because there is another important class of devices that can’t support them. The same applies whenever people request that feature X from API Y, which is only supported in hardware from one vendor, be added to a standard that is supported by dozens of devices from a multitude of vendors.

I hope the above gives some confidence that the Working Group is not only aware of the limitations of the standard but also proactively working on them. If somebody’s favorite feature Z was left out of the standard, there is a good chance that it’s because it is substantially more complicated to implement across a variety of devices than people might think, not because the Group didn’t think of it or because the members are lazy.

Sorry for the rant :)