ThreadSafety: One place where OpenCL should learn from CUDA

tqchen · February 28, 2017, 12:49pm

I used CUDA for quite a while for deep learning and recently I started to look into OpenCL. In general the OpenCL spec looks great, except for one place, which I think CUDA actually did a better job.
It is about thread-safety. OpenCL is thread-safe everything except for KernelSetArg, and that is the place which creates pain.

To put it short, OpenCL’s kernel calling function relies on KernelSerArg, which is not thread-safe, while CUDA’s calling convention is thread-safe(because arguments are allocated by caller), which uses void** to pass the arguments.

cudaLaunchKernel(kernel, grid-configurations, void** args);

I understand clearly the OpenCL’s guideline said that a separate kernel should be create per-thread. Specifically, it is hard to expose a PURE function interface in a thread-safe way. For example, consider the following code.

class CLWorkspace {
  public:
    static CLWorkspace* Global() {
      static CLWorkspace inst;
      return &inst; 
   }
   std::function<Tensor(Tensor, Tensor)> GetFunc(const std::string source) {
     cl_kernel k =  // compile logic
     return [k](Tensor, Tensor) {
        // launch logic
    });
  }
};

int main() {
  auto myadd = CLWorkspace::Global()->GetFunc("myadd.cl"); 
  Tensor a, b;
  Tensor c = myadd(a, b);
  return 0;
}

The above code is quite valid and can represent a way to quickly get std::function that can add two tensors together. It works great for front-end user because it abstract away the details such as kernel handle(no body want to get a list of kernel handles when performing multiple operations). Unfortunately, it is not thread safe, and the corresponding myadd can only work on one thread.
This makes it hard for the users. Because user will simply assume that the created resource is a function, and can be called from multiple threads. In advanced deep learning system where multi-threaded scheduling is used, this causes trouble when the function gets called from different threads.

Add a function that like CUDA will resolve this problem, and make the entire API threadsafe.

Salabar · March 1, 2017, 2:22am

In this particular use-case, wouldn’t it make sense to use thread::id to store a per-thread kernel? Also, why not simply create a kernel object from cl_program directly in the lambda and destroy it at the end? I don’t believe this is significantly different from CUDA way overhead-wise. (I mean, binding new buffers per launch is suboptimal regardless, I doubt there’d be much harm with this approach. No data to back it up though).

tqchen · March 1, 2017, 9:25am

Both your proposals will involves a lookup in the hash-map(because thread_id is not continuous and it cannot be an array loopup) or a lookup by name(in case of create new kernel per launch) every call. What I ended up doing is more complicated. I created a per thread kernel table in the Thread-local storage, for every kernel created a new kernel id is allocated(which is continuous) and it is stored in lambda, each call will lookup the kernel from the thread local kernel table. I also have to manage destruction carefully so destruct of a lambda will detach the touched kernel from each kernel table correctly.

I think the problem here is tying the parameter buffer to the kernels. If instead, you have something like cl_ArgBuffer, and you do

static thread_local cl_ArgBuffer buf
cl_SetArg(buf, i, my argument)
launch_kernel(my_kernel, buf)

The situation would be much easier. Because you can simply put the cl_ArgBuffer on thread-local storage, and re-used it across launch of multiple kernels. For CUDA style calls, you do not even need things like an opaque cl_ArgBuffer. I assume the callee can simply allocate the copy buffer on stack. The reason why thread-local solution for kernel is hard is because we may not know how many kernels are used ahead of time.

Either approaches can be used to support CUDA style call, and I think they can be as efficient as current OpenCL calls, while making the thread-safety easier for the user.

Salabar · March 1, 2017, 10:55am

I’ve actually proposed something similiar some time ago. Though my reasoning was to allow changes of parameters for a family of kernels in one call.
Also, there is a third option, granted somebody besides Intel will make an effort to support OpenCL 2.1. I’m kinda afraid we may confirm OpenCL as dead, cause neither NVIDIA nor AMD expressed any plans of supporting newer versions.

std::function<Tensor(Tensor, Tensor)> GetFunc(const std::string source) {
const cl_kernel k = // compile logic//I assume you store this object somewhere you can reach it to dispose of.
return [k](Tensor, Tensor) {
cl_kernel k_ = clCloneKernel(k);
// launch logic
clDestroyKernel(k_);
});