NVIDIA Multi Device Command Queue Concurrency Issue

I’m struggling to understand why the execution of OpenCL enqueue* function calls is seemingly sequential in a multi-GPU environment with two independent command queues. I have two GTX 780s. I’m using CUDA SDK 7.0 for the source files and the nvidia-346 driver’s libOpenCL.so.0.

Here is what I am talking about: http://i.imgur.com/nWXzrLo.png

This code is part of a project I’m working on to showcase the feature set of OpenCL. The kernel is blindingly simple: it reads one value each from two READ_ONLY buffers, sums them, and writes the result to a WRITE_ONLY buffer. When there are multiple devices, the original vector is cut into pieces and the iterators are offset by the size of the vector divided by the number of devices. To compile and execute my code, please do the following:
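For context, the kernel is of roughly this shape (an illustrative sketch, not the exact source from the repo; the names here are made up):

__kernel void vector_sum(__global const float *one,
                         __global const float *two,
                         __global float *out)
{
  // each work item handles exactly one element
  size_t id = get_global_id(0);
  out[id] = one[id] + two[id];
}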

$ git clone https://github.com/stevenovakov/learnOpenCL.git
$ cd learnOpenCL
$ git checkout simple_events
$ make clean program

Please use the branch simple_events and take a look at how the environment is constructed in oclenv.cc and how the calls are executed in main.cc. You can use my scripts to do profiling by calling the program with

$ ./profiling.sh <args>

I suggest args like

$ ./profiling.sh -datasize=50 -chunksize=25

for nice and fast execution. When that completes, look at the data with

$ nvvp cuda_profile.log

Please see the README for more detailed compilation/execution/profiling instructions.

I would expect the profiler to put out something like this: http://imgur.com/fsYiihk (excuse the Paint hack job), with only some minimal time required for the host to execute the looping etc. between enqueue* calls. The command queues are out-of-order and the enqueue* calls only depend on events. This has been plaguing me for some time and I would appreciate any help. Thanks!
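To be explicit about the dependency structure I’m aiming for, here is a rough per-device sketch (a simplification of what main.cc does, with illustrative variable names like in_one_ptr/out_ptr, assuming out-of-order queues):

  // per device d: writes -> kernel (waits on writes) -> read (waits on kernel)
  std::vector<cl::Event> write_events(2);
  cqs.at(d)->enqueueWriteBuffer(ones.at(d), CL_FALSE, 0, buffer_mem_size,
                                in_one_ptr, NULL, &write_events.at(0));
  cqs.at(d)->enqueueWriteBuffer(twos.at(d), CL_FALSE, 0, buffer_mem_size,
                                in_two_ptr, NULL, &write_events.at(1));

  cl::Event kernel_event;
  cqs.at(d)->enqueueNDRangeKernel(*kerns.at(d), offset, compute_range,
                                  cl::NullRange, &write_events, &kernel_event);

  std::vector<cl::Event> kernel_done(1, kernel_event);
  cqs.at(d)->enqueueReadBuffer(outs.at(d), CL_FALSE, 0, buffer_mem_size,
                               out_ptr, &kernel_done, NULL);
  cqs.at(d)->flush(); // submit device d's commands before moving on to the next device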

NOTE: I also tried an approach with multiple contexts (one per device), and it generated exactly the same type of output. If you want to try this, check out the single_thread_multi_context branch and run it the same way:
$ ./profiling.sh -datasize=50 -chunksize=25
You will have to remove the empty context logs from cuda_profile.log for it to display properly in nvvp.

If both kernels write to the same buffer on different devices, that is undefined behavior (it makes sense for READ_ONLY buffers to be shareable, but I’m not sure what the spec says). You are lucky NVIDIA’s runtime seems to be smart enough to make the operations sequential, but this behavior is not guaranteed across platforms. You should create separate WRITE buffers for the two devices. I don’t think OpenCL exposes any nice way to concatenate two buffers when you need to use the results in further computations, though.


UPD Oh wait, there is:
https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueCopyBuffer.html
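Something like this should let you gather the two per-device output buffers into one (a rough sketch using the C++ wrapper; the buffer names and sizes are made up):

  // copy each device's half of the result into one combined buffer
  cl::Buffer combined(context, CL_MEM_READ_WRITE, 2 * half_size_bytes);
  cq0.enqueueCopyBuffer(out0, combined, 0, 0, half_size_bytes);
  cq1.enqueueCopyBuffer(out1, combined, 0, half_size_bytes, half_size_bytes);
  cq0.finish();
  cq1.finish();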

Thanks for the reply. As far as I understand, I’m already doing what you suggested, no? Check out lines 115-117 and 148-171 in main.cc of branch simple_events.


  std::vector<cl::Buffer> ones;
  std::vector<cl::Buffer> twos;
  std::vector<cl::Buffer> outs;

and



  for (uint32_t d = 0; d < gpus.size(); d++)
  {
    cqs.push_back(env.GetCq(gpus.at(d)));
    kerns.push_back(env.GetKernel(gpus.at(d)));

    ones.push_back(cl::Buffer(  (*cntxt), // cl::Context &context
                                CL_MEM_READ_ONLY, // cl_mem_flags
                                buffer_mem_size, // size_t size
                                NULL, // void *host_ptr
                                &err // cl_int *err
                             ));
    if (CL_SUCCESS != err)
      env.Die(err);

    // Set up data container OpenCL buffers

    twos.push_back(cl::Buffer((*cntxt), CL_MEM_READ_ONLY, buffer_mem_size,
      NULL, &err));
    if (CL_SUCCESS != err)
      env.Die(err);
    outs.push_back(cl::Buffer((*cntxt), CL_MEM_WRITE_ONLY, buffer_mem_size,
      NULL, &err));
    if (CL_SUCCESS != err)
      env.Die(err);

    // Set the kernel arguments
    kerns.back()->setArg(0, ones.back());
    kerns.back()->setArg(1, twos.back());
    kerns.back()->setArg(2, outs.back());

  }

I create one buffer set for each device: two READ_ONLY, one WRITE_ONLY per device.

Whoops, my bad. Try in-order queues and remove any events and flushes. Maybe you messed up the dependencies. And then add at 254:
cqs.at(0)->flush();
cqs.at(1)->flush();
cqs.at(0)->finish();
cqs.at(1)->finish();

[QUOTE=Salabar;38128]Whoops, my bad. Try in-order queues and remove any events and flushes. Maybe you messed up the dependencies. And then add at 254:
cqs.at(0)->flush();
cqs.at(1)->flush();
cqs.at(0)->finish();
cqs.at(1)->finish();[/QUOTE]

Nope, exactly the same thing. The edits I made to simple_events:



    this->ocl_device_queues.push_back(
      cl::CommandQueue(this->ocl_context, this->ocl_devices[k], CL_QUEUE_PROFILING_ENABLE));
        //CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_PROFILING_ENABLE)); // note: these are bit flags, so combine with | rather than ||



  for (uint32_t c = 0; c < n_chunks; c++)
  {
    // Write to the input buffers
    for (uint32_t d = 0; d < gpus.size(); d++)
    {
      err = cqs.at(d)->enqueueWriteBuffer(
        ones.at(d), // address of relevant cl::Buffer
        CL_FALSE, // non blocking
        static_cast<uint32_t>(0), // offset (bytes)
        buffer_mem_size, // total write size (bytes)
        &input_one.at(d * n_gpu + c * n_chunk), // pointer to root of data array
        NULL, // no events to wait on
        NULL//&kernel_events->at(d).at(0) // output event info
      );
      if (CL_SUCCESS != err)
        env.Die(err);

      err = cqs.at(d)->enqueueWriteBuffer(
        twos.at(d), // address of relevant cl::Buffer
        CL_FALSE, // non blocking
        static_cast<uint32_t>(0), // offset (bytes)
        buffer_mem_size, // total write size (bytes)
        &input_two.at(d * n_gpu + c * n_chunk), // pointer to root of data array
        NULL, // no events to wait on
        NULL //&kernel_events->at(d).at(1) // output event info
      );
      if (CL_SUCCESS != err)
        env.Die(err);

      //cqs.at(d)->flush();
    }

    // execute the kernel
    for (uint32_t d = 0; d < gpus.size(); d++)
    {
      err = cqs.at(d)->enqueueNDRangeKernel(
        (*kerns.at(d)), // address of kernel
        offset, // starting global index
        compute_range, // ending global index
        cl::NullRange, // work items / work group (just 1)
        NULL, //&kernel_events->at(d), // wait on these to be valid to execute
        NULL //&read_events.at(d).at(0) // output event info
      );

      if (CL_SUCCESS != err)
        env.Die(err);

      //cqs.at(d)->flush();
    }

    // so that all subsequent enqueueNDRange calls wait on the read to finish
    if (c==0)
      kernel_events = &kernel_events_2;

    // read back the data
    for (uint32_t d = 0; d < gpus.size(); d++)
    {
      err = cqs.at(d)->enqueueReadBuffer(
        outs.at(d), // address of relevant cl::Buffer
        CL_FALSE, // non blocking
        static_cast<uint32_t>(0), // offset (bytes)
        buffer_mem_size, // total read size (bytes)
        &output.at(d * n_gpu + c * n_chunk), // pointer to root of data array
        NULL, //&read_events.at(d), // wait until kernel finishes to execute
        NULL //&kernel_events->at(d).at(2) // no events to link to for status updates
      );

      if (CL_SUCCESS != err)
        env.Die(err);

      //cqs.at(d)->flush();
    }
  }

  cqs.at(0)->flush();
  cqs.at(1)->flush();
  cqs.at(0)->finish();
  cqs.at(1)->finish();

Should I jump to multithreading? Give an std::thread object control of a single command queue each? I thought the spec said that every command queue already gets its own execution thread, and that you can issue non-conflicting enqueue* calls concurrently.
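In other words, something roughly like this (just a sketch of the idea, not code from the repo; enqueue_chunks() is a hypothetical stand-in for the per-device write/kernel/read loop):

  #include <thread>

  // hypothetical: issues all writes/kernels/reads for one device's share of the data
  void enqueue_chunks(cl::CommandQueue &cq, cl::Kernel &kern);

  std::vector<std::thread> workers;
  for (uint32_t d = 0; d < gpus.size(); d++)
  {
    // one host thread drives one command queue / device
    workers.emplace_back([&, d]() {
      enqueue_chunks(*cqs.at(d), *kerns.at(d));
      cqs.at(d)->finish();
    });
  }
  for (auto &w : workers)
    w.join();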

Multithreading definitely won’t work. I don’t see any particular reason why one GPU has to stall; I’m only guessing it’s your PC or NVIDIA’s runtime. Are there any multi-GPU samples in NVIDIA’s SDK? CUDA or OpenCL, it’s irrelevant. Try them out.

Yeah, I could rewrite it in CUDA; that won’t take too long (a day). Or just try a multi-GPU example, sure. It’s frustrating because the spec (https://www.khronos.org/registry/cl/specs/opencl-1.1.pdf) says right at the bottom of p. 25 that I shouldn’t be having this problem…

It is most likely a runtime issue. Install a newer driver, I suppose (or, on the contrary, find a driver from before the “screw OpenCL, lock in to CUDA” age).

Yeah, I think this confirms that the CUDA runtime is running as it’s supposed to:

http://imgur.com/a/TPoL1

So basically, my libOpenCL.so is broken. I’m going to try 352.30 for Linux x64. Thanks for the help; I think we can conclude that this is a bug in the nvidia-346 driver’s libOpenCL.so.

I’m going to make a post on the NVIDIA dev forums. They are supposedly supporting OpenCL 1.2 for W10 so…

Yeah, it’s still messed up. Just to confirm, I installed a new driver today:


 $ ldd program
	linux-vdso.so.1 =>  (0x00007ffd147f7000)
	libOpenCL.so.1 => /usr/lib/x86_64-linux-gnu/libOpenCL.so.1 (0x00007f5f94de5000)
	libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f5f94ae1000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f5f947da000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f5f945c4000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5f943a6000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5f93fe0000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f5f93ddc000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f5f95011000)
$ ls -l /usr/lib/x86_64-linux-gnu | grep libOpenCL.so
lrwxrwxrwx  1 root root       14 Aug  1 18:46 libOpenCL.so -> libOpenCL.so.1
lrwxrwxrwx  1 root root       16 Aug  1 18:46 libOpenCL.so.1 -> libOpenCL.so.1.0
lrwxrwxrwx  1 root root       18 Aug  1 18:46 libOpenCL.so.1.0 -> libOpenCL.so.1.0.0
-rwxr-xr-x  1 root root    26432 Aug  1 18:46 libOpenCL.so.1.0.0
$ grep -i "x driver" /var/log/Xorg.0.log
[    21.652] (II) NVIDIA dlloader X Driver  352.30  Tue Jul 21 18:01:20 PDT 2015

So even as of the latest driver, this is a problem… There must be tons of people using NVIDIA hardware and writing OpenCL software (because they want to support multiple platforms); how did this go unnoticed?

EDIT: I’m just going to leave this here to confirm it’s not an issue of my compute/memory transfer times being too small or something: http://i.imgur.com/6drkHyT.png. Those are 200 MB buffers, and I made the kernel instructions loop 10x to prolong the execution time.

Also, could someone with two or more AMD GPUs please try my code out and confirm that there is concurrency? A profiler output would be nice, either text or graphical.

I only have an APU laptop, but a similar kernel seems to work concurrently on a GPU/CPU combo.

__kernel void test (__global int* in, __global int* out){
	int id = get_global_id(0);
	printf(" %d	", id);
	out[id] = in[id] + id;
}


void runTest(cl_context context, cl_command_queue commandQueue, cl_program program){
  cl_mem read = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(int) * 1000, NULL, NULL);
  cl_mem write = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(int) * 1000, NULL, NULL);
  cl_kernel initKernel = clCreateKernel(program, "test", NULL);
  int data[1000] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
  // blocking write of the host array (it is a local, so it must not outlive the transfer)
  clEnqueueWriteBuffer(commandQueue, read, CL_TRUE, 0, sizeof(int) * 500, data, 0, NULL, NULL);
  clSetKernelArg(initKernel, 0, sizeof(cl_mem), (void *)&read);
  clSetKernelArg(initKernel, 1, sizeof(cl_mem), (void *)&write);
  size_t global_work_size[1] = { 500 };
  clEnqueueNDRangeKernel(commandQueue, initKernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
}

// and then, in main():
  runTest(context, commandQueue, program);
  runTest(context, commandQueue1, program);
  clFlush(commandQueue);
  clFlush(commandQueue1);
  clFinish(commandQueue);
  clFinish(commandQueue1);

Except I don’t do any cleanup whatsoever, but neither do you; no destructors are called before the very end of the function.
Also, here is the full code of my program (a slightly altered AMD sample).


#include <CL/cl.h>
#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <fstream>
#include <time.h>

#define SUCCESS 0
#define FAILURE 1

using namespace std;

/* convert the kernel file into a string */

ifstream cin("data.in");
ofstream cout("output.out");
int convertToString(const char *filename, std::string& s)
{
	size_t size;
	char*  str;
	std::fstream f(filename, (std::fstream::in | std::fstream::binary));

	if (f.is_open())
	{
		size_t fileSize;
		f.seekg(0, std::fstream::end);
		size = fileSize = (size_t)f.tellg();
		f.seekg(0, std::fstream::beg);
		str = new char[size + 1];
		if (!str)
		{
			f.close();
			return 0;
		}

		f.read(str, fileSize);
		f.close();
		str[size] = '\0';
		s = str;
		delete[] str;
		return 0;
	}
	cout << "Error: failed to open file
:" << filename << endl;
	return FAILURE;
}

void runTest(cl_context context, cl_command_queue commandQueue, cl_program program){

	cl_mem read = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(int) * 1000, NULL, NULL);
  cl_mem write = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(int) * 1000, NULL, NULL);
	cl_kernel initKernel = clCreateKernel(program, "test", NULL);
  int data[1000] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
  // blocking write of the host array (it is a local, so it must not outlive the transfer)
  clEnqueueWriteBuffer(commandQueue, read, CL_TRUE, 0, sizeof(int) * 500, data, 0, NULL, NULL);
  clSetKernelArg(initKernel, 0, sizeof(cl_mem), (void *)&read);
	clSetKernelArg(initKernel, 1, sizeof(cl_mem), (void *)&write);
	size_t global_work_size[1] = { 500 };
	clEnqueueNDRangeKernel(commandQueue, initKernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);
}

int main(int argc, char* argv[])
{

	/*Step1: Getting platforms and choose an available one.*/
	cl_uint numPlatforms;	//the NO. of platforms
	cl_platform_id platform = NULL;	//the chosen platform
	cl_int	status = clGetPlatformIDs(0, NULL, &numPlatforms);
	if (status != CL_SUCCESS)
	{
		cout << "Error: Getting platforms!" << endl;
		return FAILURE;
	}

	/*For clarity, choose the first available platform. */
	if (numPlatforms > 0)
	{
		cl_platform_id* platforms = (cl_platform_id*)malloc(numPlatforms* sizeof(cl_platform_id));
		status = clGetPlatformIDs(numPlatforms, platforms, NULL);
		platform = platforms[0];
		free(platforms);
	}

	/*Step 2: Query the platform for its devices (this test needs at least two).*/
	cl_uint				numDevices = 0;
	cl_device_id        *devices;
  status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);

  devices = (cl_device_id*)malloc(numDevices * sizeof(cl_device_id));

  clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);
	/*Step 3: Create context.*/
	cl_context context = clCreateContext(NULL, 2, devices, NULL, NULL, NULL);

	/*Step 4: Creating command queue associate with the context.*/
	cl_command_queue commandQueue = clCreateCommandQueue(context, devices[0], CL_QUEUE_PROFILING_ENABLE, NULL);
  cl_command_queue commandQueue1 = clCreateCommandQueue(context, devices[1], CL_QUEUE_PROFILING_ENABLE, NULL);
  /*Step 5: Create program object */
	const char *filename = "test.cl";
	string sourceStr;
	status = convertToString(filename, sourceStr);
	const char *source = sourceStr.c_str();
	size_t sourceSize[] = { strlen(source) };

	cl_program program = clCreateProgramWithSource(context, 1, &source, sourceSize, NULL);


	/*Step 6: Build program. */
	status = clBuildProgram(program, 2, devices, "-w", NULL, NULL);

	if (status == CL_BUILD_PROGRAM_FAILURE) {
		// Determine the size of the log
		size_t log_size;
		clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);

		// Allocate memory for the log
		char *log = (char *)malloc(log_size);

		// Get the log
		clGetProgramBuildInfo(program, devices[0], CL_PROGRAM_BUILD_LOG, log_size, log, NULL);

		// Print the log
		printf("%s
", log);
	}
  runTest(context, commandQueue, program);
  runTest(context, commandQueue1, program);
  clFlush(commandQueue);
  clFlush(commandQueue1);
  clFinish(commandQueue);
  clFinish(commandQueue1);

	return SUCCESS;
}

That’s not my impression. Whenever I go to the NVIDIA developer forums, I see almost zero OpenCL discussion activity… most of the users who write there seem to be CUDA developers who also have some OpenCL knowledge. With Khronos (supposedly) devoted to Vulkan, NVIDIA devoted to CUDA, and Apple devoted to Metal (shame on Apple, one of the companies that pushed hard for OpenCL), the only remaining players seem to be AMD and Intel.

Computing companies seem to have decided that this time of crisis calls for proprietary, non-standard APIs for the survival of their business. Very tough times for those of us who never liked being tied to any vendor’s custom technology.

I’m also an OpenGL developer, and IMHO OpenCL has never come close to OpenGL in terms of company support. I’m happy OpenGL was a success, and I’ll continue using it for many years (maybe through compatibility translation layers in the future, but it’s a solid, good API that you can use on whatever platform you wish to support). I cannot say the same for OpenCL. I have no idea if I’ll be using it a year from now: drivers have a hacky, non-solid feel, there’s lots of unexpected behavior when you try the same kernel on different GPUs, and each company is trying to push its own custom proprietary stuff… very, very sad :(

Vulkan is going to be primarily a graphics API, though an extension that lets you feed Vulkan an OpenCL kernel could definitely find some users (there are weirdos who write kernels in assembly, after all). And don’t forget about the FPGA manufacturers on the OpenCL side. Those are too expensive to buy just to toy around with, but they are getting popular in HPC and there is currently no viable alternative to OpenCL for those things. And, by the way, OpenGL has the exact same problems, especially in the mobile segment; it’s just that because the OpenGL infrastructure is more mature, bugs get squashed a bit more actively.
Hopefully, SPIR-V will make a difference. They can’t mess up the compilation of byte-code, can they?

Despite the marketing slogan, Vulkan is not “primarily” a graphics API; it is a graphics API, period. Compute shaders exist, but just as in Direct3D and OpenGL, they exist to supplement graphics-related tasks. You’re still doing graphics work with them; you’re not directly rendering. You can use a compute shader to decompress textures or perform frustum culling or whatever. You would use compute shaders instead of OpenCL to avoid the expensive context switch that’s necessary when going from a compute API to a graphics API.

But compute shaders have all of the limitations of graphical shaders. They have lower precision than OpenCL requires. They have more restrictions on memory access (logical addressing means no pointer arithmetic, no pointers to pointers, etc). And so forth.

Metal is a bit different, in that the Metal shading language has both higher-precision functions and pointer arithmetic. Indeed, precision is just a matter of how it’s compiled. So Metal could be effectively used as an OpenCL alternative.

I don’t see anything in particular that stops us from adding high-precision functions to GLSL as an extension (or I thought they were already there?). And the memory model is not really a limitation for things like linear algebra and computational math in general, which I imagine is the most popular application for OpenCL. Moreover, hardware seems to discourage using the more advanced features. CUDA supports recursion? The stack is very limited, performance hit. Pointers to functions? Performance hit. Pointer arithmetic and pointer assignment in OpenCL? They block some useful optimizations, performance hit. But of course it is not always about efficiency, and there are reasons why OpenCL requires a richer computation model than OpenGL’s.

[QUOTE=Salabar;38141]Except I don’t do any cleanup whatsoever, but neither do you; no destructors are called before the very end of the function.
[/QUOTE]

With the C++ wrapper it’s not necessary: the wrapper classes in cl.hpp call the corresponding clRelease* functions in their destructors.
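For example, something like this (illustrative only) never needs an explicit release:

  {
    cl::Buffer tmp(context, CL_MEM_READ_ONLY, buffer_mem_size);
    // ... use tmp in enqueue* calls ...
  } // tmp's destructor calls clReleaseMemObject here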

Thanks for running that, but I would still like to see an AMD multi-GPU setup (2+ PCIe devices) run my code specifically. I’m 99% sure the problem is not on my end, but I’m ready to be surprised.

RE: the OpenCL vs CUDA/other APIs discussion:

There should be unambiguous OpenCL support at this point: GPU, CPU (ARM, x86, PowerPC), APU, FPGA, DSP. OpenCL is a general SIMD language, and we don’t have a dominant one of those (in the way that C is a fairly generic abstraction over assembly).

The bug report was submitted, by the way; I’ll update when I get a reply from engineering.

Hello everyone,

I’m writing to confirm that I am having the same issue: I cannot achieve concurrent GPU operation on NVidia hardware. I have had screaming success with OpenCL on a variety of AMD GPUs and CPUs (Opteron, FX, Athlon, A6, R9 270x, Fury X), where concurrent execution occurred with a single context and multiple command queues. I recently tested my kernel on a dual NVidia 980 Ti setup and was quickly disappointed to find that they ran serially. I am using multiple OpenMP threads to call clEnqueueReadBuffer() with separate buffers for each device, and a single thread for each clEnqueueNDRangeKernel(); this last function seems to be blocking on NVidia hardware, for some reason. I have not tested on Intel. I have CUDA Toolkit 7.5.18 and driver version 359.00 on Win8.1.
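Roughly, the structure looks like this (a simplified sketch, not my actual code; enqueue_device_work() is a hypothetical stand-in for the real per-device enqueues, and the queues/kernels/outs/host_results arrays and num_devices are assumed to be set up per device):

// hypothetical helper: enqueue the kernel and the non-blocking read for one device
void enqueue_device_work(cl_command_queue q, cl_kernel k, cl_mem out, void *host_dst);

#pragma omp parallel for num_threads(num_devices)
for (int d = 0; d < num_devices; d++)
{
  // each OpenMP thread drives its own queue/device with its own buffers
  enqueue_device_work(queues[d], kernels[d], outs[d], host_results[d]);
  clFlush(queues[d]);
}

for (int d = 0; d < num_devices; d++)
  clFinish(queues[d]);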

Is there any solution for this yet? I’ve posted in the NVidia devtalk forums but their OpenCL discussions seem a bit stale.

Other info:

  • The host is an AMD CPU
  • I have AMDAPPSDK3.0 installed, as well as CUDA TK
  • I do not have an SLI cable between the NVidia GPUs (waiting on a delivery just in case this is the problem, but I didn’t need a CrossFire cable for AMD)
  • I have not tried the multiple context approach, but from the discussion above it seems not too fruitful

If anyone has ideas about how to achieve true multi-GPU OpenCL kernel execution on NVidia hardware, they would be much appreciated.

Thanks