are clCreateProgramWithSource & clBuildProgram thread-safe?

I am working on multi-threaded code to run an OpenCL kernel on several compute devices simultaneously. The code first launches N threads, with N equal to the user-specified working-device count (CPU, GPU1, GPU2, …). Each thread runs the same host code: it creates a context, creates a command queue, creates buffers, compiles a program, etc.

Is this overall structure safe and sound in a multi-threaded environment? Does the specification say which CL functions are thread-safe and which are not?

Moreover, if GPU1 and GPU2 are of the same kind, say the two cores in a Radeon 6990, is there a way to save one of the clBuildProgram() calls? I ask because the build process is fairly long at this point: it takes about 5 seconds to compile a 650-line CL file with ATI Catalyst 11.3.

Each thread runs the same host code: it creates a context, creates a command queue, creates buffers, compiles a program, etc.

Is this overall structure safe and sound in a multi-threaded environment? Does the specification say which CL functions are thread-safe and which are not?

It depends on whether you are targeting OpenCL 1.0 or 1.1. In OpenCL 1.1, all functions are thread-safe except for clSetKernelArg(), so it is strongly recommended to have different threads use different kernel objects – you simply call clCreateKernel() multiple times, once for each thread. Note that it’s okay to have multiple kernel objects for the same kernel function in your source code.
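
For illustration, a minimal sketch of the per-thread kernel objects (assuming “program” has already been built; “simulate” and “inputbuf” are placeholder names):

// each host thread creates its own kernel object from the shared program,
// so concurrent clSetKernelArg() calls never touch the same object
cl_int err;
cl_kernel kernel = clCreateKernel(program, "simulate", &err);
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputbuf);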

In OpenCL 1.0 the only thread-safe functions are clCreateXXX, clRetainXXX and clReleaseXXX.

Moreover, if GPU1 and GPU2 are of the same kind, say the two cores in a Radeon 6990, is there a way to save one of the clBuildProgram() calls? I ask because the build process is fairly long at this point: it takes about 5 seconds to compile a 650-line CL file with ATI Catalyst 11.3.

If you have multiple devices from the same vendor, I recommend putting all of them into the same context. That way you can share the same program object(s) across all of them and you only need to call clBuildProgram() once. Notice that when you call clBuildProgram() you pass a list of device IDs. On a sensible OpenCL implementation, passing multiple devices should be faster than calling clBuildProgram() multiple times.
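
As a rough sketch (untested; error checking omitted, “source” is the kernel source string, and “context” is assumed to have been created with all the devices you care about):

// gather every device in the context, then build the program once for all of them
size_t devbytes;
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &devbytes);
cl_device_id *devs = (cl_device_id *)malloc(devbytes);
clGetContextInfo(context, CL_CONTEXT_DEVICES, devbytes, devs, NULL);
cl_uint numdevs = (cl_uint)(devbytes / sizeof(cl_device_id));

cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
err = clBuildProgram(program, numdevs, devs, NULL, NULL, NULL);  // one build covers every device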

thanks!

Just to make sure I understand: by “thread-safe” I mean calling the same function(s) simultaneously from multiple threads. Is that the definition you are using in your comment?

For example, if I have a computer with a CPU, a Radeon 4xxx GPU and a 6xxx GPU, are you suggesting that I use two threads to build the program: one for the CPU and one for the two Radeons?

Just to make sure I understand: by “thread-safe” I mean calling the same function(s) simultaneously from multiple threads. Is that the definition you are using in your comment?

Thread safety in OpenCL is defined in Appendix A.2. The idea is: any number of threads can call any of the thread-safe functions at any time and it must work, including things like multiple threads calling the same function simultaneously.

For example, if I have a computer with a CPU, a Radeon 4xxx GPU and a 6xxx GPU, are you suggesting that I use two threads to build the program: one for the CPU and one for the two Radeons?

I’m not actually suggesting that. I mean, it will work but I doubt it will buy you much so why do it?

Generally speaking, what’s the reason you implemented this app with multiple threads? Do you expect to see any benefits?

Threads = headaches in my book :slight_smile: OpenCL driver developers go through quite a bit of pain to enable developers to write apps that scale on multicore systems without having to write multithreaded code. That’s one of the big reasons why OpenCL exists.

hi David, thanks again for the insightful comments.

My motivation for doing so is based on an observation: when a simulation kernel runs on the GPU, my CPU load drops very low. That means that if I can have another thread keep a kernel running on the CPU at the same time, I might be able to maximize simulation performance. This naive view extends to multiple CPUs/GPUs on a heterogeneous system.

When writing this code, I first wrote a host function to compile/launch/post-process data for a single device at a time. To make it work for multiple devices simultaneously, I came up with a lazy (but portable) approach where I added OpenMP constructs to parallelize the host function, something like:

...
#ifdef _OPENMP
     omp_set_num_threads(activedev);
#endif

#pragma omp parallel private(threadid)
{
#ifdef _OPENMP
     threadid=omp_get_thread_num();
#endif
     // this launches the simulation
     run_simulation(&mcxconfig,threadid,activedev,fluence,&totalenergy);
}
...

where “activedev” is the number of working devices that I want to use for my simulation; for each working device, I launch an OpenMP thread to execute run_simulation(), which builds/launches/post-processes on that device. Not surprisingly, several OMP barrier/critical sections are needed in run_simulation(), but overall this approach let me extend the simulation to multiple devices with minimal coding.

I wrote this code in early 2010 without reading deeply into OpenCL, and I did not follow the progress of OpenCL 1.1. I am not sure whether there are easier-to-use OpenCL mechanisms that can drive multiple devices without involving multi-threading.

Although it works, I did observe low efficiency when running the above code on both the CPU and GPU at the same time, as you suspected. The time saving is not attractive at all; sometimes it even takes longer than running purely on the GPU. I am still investigating what’s going on.

In your opinion, is my approach justified? Are you aware of any more robust alternatives that can fully exploit the computational power of a heterogeneous platform? Or is this outside the design scope of OpenCL?

My motivation for doing so is based on an observation: when a simulation kernel runs on the GPU, my CPU load drops very low. That means that if I can have another thread keep a kernel running on the CPU at the same time, I might be able to maximize simulation performance. This naive view extends to multiple CPUs/GPUs on a heterogeneous system.

You don’t need multiple threads for that. All you need is multiple command queues, one for each device. That will easily keep all devices at 100% usage.
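
In code that is just something like this (a sketch; “devs”/“numdevs” are the devices queried from the shared context, and the fixed-size array is only for illustration):

// one command queue per device, all sharing the same context
cl_command_queue queues[16];
for (cl_uint i = 0; i < numdevs; i++)
    queues[i] = clCreateCommandQueue(context, devs[i], 0, &err);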

I can’t think of any case where it makes sense to mix OpenMP with OpenCL. OpenCL basically does everything OpenMP does and then some… at the cost of being quite a lot more complex to program for, of course.

Ah, looks like I haven’t really understood OpenCL :oops:

I will definitely do more reading on this, but to get me started, could you give me some quick suggestions on what needs to be modified to achieve what you recommended?

The following is a pseudo-code outline of what was done inside the run_simulation() function in my old implementation:

run_simulation(int devid,int iscpu){
  context=clCreateContextFromType(iscpu?CL_DEVICE_TYPE_CPU:CL_DEVICE_TYPE_GPU)
  clGetContextInfo(context,...,devices)
  commands=clCreateCommandQueue(context,devices[devid])
  var1=clCreateBuffer(context,...)
  var2=clCreateBuffer(context,...)
  ...
  program=clCreateProgramWithSource(context)
  status=clBuildProgram(program)
  kernel = clCreateKernel(program)

  clSetKernelArg(kernel,0,var1)
  clSetKernelArg(kernel,1,var2)
  ...
  clEnqueueNDRangeKernel(commands,kernel)
  clEnqueueReadBuffer(commands,var1)
  ...
}

As you can see, it is designed for a single device per call. To make it work for all available OpenCL devices, I imagine I need to do the following:

• In clCreateContextFromType, I should use CL_DEVICE_TYPE_ALL.
• In the clCreateCommandQueue call, I loop over all members of the devices[] array to create commands[i].
• It looks like clCreateBuffer, clCreateProgramWithSource, clBuildProgram and clSetKernelArg are only associated with the context and not with a specific device, so I don’t need to do anything there.
• For the clEnqueueNDRangeKernel() call, I need another loop, launching my kernel on each commands[i]; after launching, I will wait for the results with a blocking or non-blocking read.

do the above changes look reasonable to you?

What if I want to launch kernels with different arguments on each device (for example, kernel(10) on the CPU and kernel(1000) on a GPU, for load-balancing purposes)?

thank you again for your kind suggestions.

Don’t worry, we are all learning one thing or another.

Answering your question, what the code looks like will depend on how you want to split the work across devices and read the results back. In the simplest case it could look like this:


run_simulation()
{
  context=clCreateContextFromType(CL_DEVICE_TYPE_ALL)
  clGetContextInfo(context,...,devices)

  program=clCreateProgramWithSource(context)
  status=clBuildProgram(program, all devices in context)

  for each device in context
  {
    commands=clCreateCommandQueue(context,devices[devid])
    // having different buffers for each device can prevent unnecessary
    // data transfers between devices
    var1=clCreateBuffer(context,...)
    var2=clCreateBuffer(context,...)
    ...
    // it's fine to create multiple kernel objects
    // all associated with the same kernel function
    // (or different ones for that matter)
    kernel = clCreateKernel(program)

    clSetKernelArg(kernel,0,var1)
    clSetKernelArg(kernel,1,var2)
    ...
    clEnqueueNDRangeKernel(commands,kernel)
    clEnqueueReadBuffer(commands, NON_BLOCKING, var1)
    ...
  }
  clWaitForEvents(all the ReadBuffer events above)
}

If you have multiple incompatible platforms on your system, such as an NVidia GPU and an x86 CPU (assuming NVidia’s OpenCL driver doesn’t support x86) then you will have to create one context for each device.
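
In plain C, the flow above might look roughly like this (an untested sketch: “simulate” is a placeholder kernel name, only one buffer per device is shown, hostbuf is assumed to hold numdevs*nphoton floats, and error checking is omitted):

#include <CL/cl.h>
#include <stdlib.h>

void run_simulation(const char *source, size_t nphoton, float *hostbuf)
{
    cl_int err;
    cl_context context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL,
                                                 NULL, NULL, &err);

    /* query the devices that ended up in the context */
    size_t devbytes;
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &devbytes);
    cl_uint numdevs = (cl_uint)(devbytes / sizeof(cl_device_id));
    cl_device_id *devs = (cl_device_id *)malloc(devbytes);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, devbytes, devs, NULL);

    /* build once for every device in the context */
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
    clBuildProgram(program, numdevs, devs, NULL, NULL, NULL);

    cl_event *done = (cl_event *)malloc(numdevs * sizeof(cl_event));

    for (cl_uint i = 0; i < numdevs; i++) {
        cl_command_queue queue = clCreateCommandQueue(context, devs[i], 0, &err);

        /* a per-device buffer avoids implicit transfers between devices */
        cl_mem var1 = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                     nphoton * sizeof(float), NULL, &err);

        /* a per-device kernel object, so arguments can differ per device */
        cl_kernel kernel = clCreateKernel(program, "simulate", &err);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &var1);

        size_t global = nphoton;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(queue, var1, CL_FALSE, 0, nphoton * sizeof(float),
                            hostbuf + i * nphoton, 0, NULL, &done[i]);
    }
    clWaitForEvents(numdevs, done);  /* wait until every device has read back its results */
}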

Thanks a lot David. I briefly tested the workflow suggested in your previous post and it seems to work very well!

I have one more question: in the kernel parameter list, I have a scalar input that controls how many photons to simulate (this application is designed for optical imaging; the CUDA version is at http://mcx.sf.net ). When launching this kernel on different devices, I came up with a load-balancing scheme that sets a different photon number for each device: a smaller number for the CPU and a larger number for the GPU (possibly as a function of core count, etc.). From the code outline, the clSetKernelArg() calls only deal with a kernel object, which does not seem to distinguish devices. To achieve the load balancing, do I have to define another kernel for the CPU target and set its arguments separately? Or do I have to pass the full workload and total device/core counts to the kernel and compute each device’s share inside the kernel?

Is there any other mechanism in the OpenCL specification that I could use?

Monte Carlo simulations of light transport! That’s one of my favorite topics :smiley:

From the code outline, the clSetKernelArg() calls only deal with a kernel object, which does not seem to distinguish devices. To achieve the load balancing, do I have to define another kernel for the CPU target and set its arguments separately? Or do I have to pass the full workload and total device/core counts to the kernel and compute each device’s share inside the kernel?

Since we have split the input data into different buffer objects for different devices, you will need different kernel objects for different devices as well. For example, if we have assigned buffers B1 and B2 to device D1, you create a kernel object K1, set its arguments to B1 and B2, and then enqueue an NDRange using K1 on D1. Do the same for buffers B3 and B4 assigned to device D2: create a kernel K2, set its arguments to B3 and B4, then enqueue an NDRange using K2 on D2. And so on.
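
In other words, something along these lines (a sketch; buffers[], queues[], nphoton[] and globalsize[] are hypothetical per-device arrays filled by your load-balancing scheme, and “simulate” is a placeholder kernel name):

// a separate kernel object per device lets each device receive its own workload
for (cl_uint i = 0; i < numdevs; i++) {
    cl_kernel k = clCreateKernel(program, "simulate", &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buffers[i]);   // this device's buffer
    clSetKernelArg(k, 1, sizeof(cl_int), &nphoton[i]);   // e.g. small for the CPU, large for the GPU
    size_t global = globalsize[i];
    clEnqueueNDRangeKernel(queues[i], k, 1, NULL, &global, NULL, 0, NULL, NULL);
}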

thank you again for your quick reply, but I am still not quite clear.

Let me ask it this way: in your pseudo-code, the context and program are created for all devices, but a kernel is created for each device. However, in clCreateKernel() I cannot find an argument that specifies which device to associate it with. I can only give “program” as the first argument, but that is already associated with all devices.

Can you explain a little bit more?

Let me ask it this way: in your pseudo-code, the context and program are created for all devices, but a kernel is created for each device. However, in clCreateKernel() I cannot find an argument that specifies which device to associate it with. I can only give “program” as the first argument, but that is already associated with all devices.

Right. Kernel objects are associated with contexts, not with specific devices. However, kernel objects are nothing more than containers for the arguments that you will pass to the kernel function.

Notice the sentence “you will enqueue an NDRange using K1 on D1”. What that means is that when you call clEnqueueNDRangeKernel() you will pass it K1 and a command queue created for D1.

It’s important to distinguish between kernel objects (containers of function arguments), kernel functions (the actual code that executes on the device), and NDRanges, which describe how much work will be executed and on which device.
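
Schematically (where q1 is a command queue created for D1, and “global” is whatever amount of work you assign to that device):

// K1 carries the arguments, the command queue selects the device,
// and the NDRange (global/local sizes) says how much work to launch
size_t global = work_for_D1;   // placeholder
clEnqueueNDRangeKernel(q1, K1, 1, NULL, &global, NULL, 0, NULL, NULL);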

Sorry for not reading your comments carefully enough. Now I think I get it. I modified the code to create a separate kernel object for each device, set its arguments separately, and pass it to clEnqueueNDRangeKernel(). My first test went smoothly, except for some compile issues with the NVIDIA driver (which I got around by tweaking the CL code a bit).

Compared with the structure you recommended, there is one difference: I moved all the clCreateBuffer() calls outside of the device loop, since a buffer is only associated with the context. Is this OK? Will these buffers get messed up when I feed them to different kernel objects?

Also, I saw your comments on memory transfers between multiple devices; can you explain this a bit more, or point me to resources that explain what might be happening? Thanks.

Compared with the structure you recommended, there is one difference: I moved all the clCreateBuffer() calls outside of the device loop, since a buffer is only associated with the context. Is this OK? Will these buffers get messed up when I feed them to different kernel objects?

The problem with creating the buffers outside of the device loop is that if any device writes to a buffer while other devices are reading from or writing to it, the results are undefined. If all devices only read from the same buffers, that’s okay.

That’s why I recommended the simple approach where each device has its own buffer objects.

I saw your comments on memory transfers between multiple devices; can you explain this a bit more?

It’s pretty simple, actually. Let’s say you have a context with two different devices, and let’s say that these devices have physically separate memory, like for example a discrete PCI-Express GPU and a CPU. According to OpenCL, buffers are shared among all the devices in the context, right? However, these devices have physically separate memory, so they can’t actually share buffer objects.

What’s going on then? The driver will detect when a device needs to access a buffer and it will transfer it to that device. If later a different device needs to access the same buffer, the driver will transfer it to this second device. The application doesn’t need to do anything: from the application’s point of view buffer objects exist in global memory and all devices access that same global memory.

Notice that there are platforms like cell phones where all devices share the same physical memory so these memory transfers are unnecessary.

Thank you so much David! That is really helpful.
I followed your advice and created separate buffers for each device, and now the code works beautifully!

Just to make sure: I leave the read-only (including __constant) variables outside of the loop; as you pointed out, they are sent to each device on demand, so this does not seem to introduce additional overhead.

I also have a __local variable; I call clSetKernelArg() per device to set the local buffer size for each device. Is this the right way to do it?
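
Concretely, what I do for each device is something like this (the argument index and the localbytes[] array are just placeholders):

// a __local argument is set by passing its byte size and a NULL pointer
clSetKernelArg(kernel[i], 2, localbytes[i], NULL);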

thanks again.

I call clSetKernelArg() per device to set the local buffer size for each device. Is this the right way to do it?

Yes :slight_smile: It looks like you have it all figured out already :slight_smile: Congrats!