Task parallel programming

(I’m cross-posting this message from the NVIDIA OpenCL forums to reach a wider audience. It’s slightly modified from the original.)

According to the OpenCL spec, section 3.4.2, task parallel programming (TPP) (aka “concurrent kernels”, as in the Fermi white paper) should be possible by executing a single work-item in a single work-group, i.e., an NDSpace of {1,1,1}. To check this, I wrote a program that exercises TPP and used it to test various platforms and devices for support.

What I’m finding is:

- NVIDIA’s OpenCL does not seem to support TPP;
- AMD OpenCL supports TPP for my quad-core CPU in one way (see below);
- AMD OpenCL supports TPP for my GPU in a completely different way (not good if you want to write device-independent code).

NVIDIA’s CUDA Runtime API supports TPP (which I verified). I’m more interested in OpenCL, but at least the card implements TPP.

(My tests use the latest driver, 280.19 for NVIDIA OpenCL 1.1 on Windows 7.)

I’ve seen claims in published, peer-reviewed papers that TPP works for NVIDIA GPUs in OpenCL, but I’m wondering whether the authors are being straight with us or just haven’t actually tested it. Does anyone have example code with TPP working in NVIDIA OpenCL on a Fermi? Or is TPP just not supported in NVIDIA OpenCL?

Ken

My test code:

The kernel for this test simply increments, in a for-loop, an element of global memory at a task-dependent “position”.


__kernel void aaa(__global int * v, int times, int position)
{
	for (int i = 0; i < times; ++i)
	{
		v[position]++;
	}
}

My code uses one kernel for TPP. In a really thorough example, each task would be a different kernel, but I didn’t want to write multiple kernels and use atomics and synchronization for TPP, because the OpenCL memory model relaxes consistency between work-items in different work-groups. This kernel spins for a long but finite time, which is perfect for measuring the overall run time of scheduling tasks. Yes, this kernel could be executed in SIMD / data-parallel fashion by simply running multiple work-items in one work-group. But, again, that’s not what I want to do.

In my program, the kernel is called three ways: “sequential”, “concurrent”, and “single command queue”.

The sequential code follows:

void sequential(cl_platform_id platform, cl_device_id device, cl_program program, cl_context context,
    int write_distance)
{
    cl_int err;

    struct _timeb  t1;
    struct _timeb  t2;
    std::cout << "Starting sequential...";
    _ftime_s(&t1);
 
    // Create memory for writes in the kernels.
    size_t asize = sizeof(cl_int) * write_distance * opt->children;
    cl_int * a = (cl_int*) malloc(asize);
    memset(a, 0, asize);

    cl_int r1;
    cl_mem da = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, asize, a, &r1);
    CHECK(r1);

    // Set up NDSpace, basically one work-item in the entire NDSpace.
    size_t tile[3] = {1, 1, 1};
    size_t tiles[3]  = {1, 1, 1};
    int max_dimensionality = 3;

    // Create kernels with arguments, command queue, queue kernel, flush, and wait, ALL SEQUENTIALLY.
    cl_kernel * kernels = (cl_kernel *)malloc(opt->children * sizeof(cl_kernel));
    cl_command_queue * queues = (cl_command_queue *)malloc(opt->children * sizeof(cl_command_queue));
    cl_event * events = (cl_event *)malloc(opt->children * sizeof(cl_event));
    for (int i = 0; i < opt->children; ++i)
    {
        kernels[i] = clCreateKernel(program, "aaa", &err);
        CHECK(err);
        err = clSetKernelArg(kernels[i], 0, sizeof(cl_mem), (void *) &da);
        CHECK(err);
        err = clSetKernelArg(kernels[i], 1, sizeof(cl_int), (void *) &opt->times);
        CHECK(err);
        int position = i * write_distance;
        err = clSetKernelArg(kernels[i], 2, sizeof(cl_int), (void *) &position);
        CHECK(err);
        queues[i] = clCreateCommandQueue(context, device, 0, &err);
        CHECK(err);
        err = clEnqueueNDRangeKernel(queues[i], kernels[i], max_dimensionality, NULL, tiles, tile, 0, NULL, &events[i]);
        CHECK(err);
        err = clFlush(queues[i]);
        CHECK(err);
        err = clWaitForEvents(1, &events[i]);
        CHECK(err);
        err = clReleaseCommandQueue(queues[i]);
        CHECK(err);
        err = clReleaseKernel(kernels[i]);
        CHECK(err);
    }
    // read output array
    cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &err);
    CHECK(err);
    err = clEnqueueReadBuffer(cmd_queue, da, CL_TRUE, 0, asize, a, 0, NULL, NULL);
    CHECK(err);
    err = clReleaseCommandQueue(cmd_queue);
    CHECK(err);
    err = clReleaseMemObject(da);
    CHECK(err);
 
    _ftime_s(&t2);
    std::cout << (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))/1000) << " s.\n";
}


In essence, this routine creates “children” tasks, each on its own cl_command_queue, which is clFlush’ed immediately. This code forces serialization of the tasks because each task is not created and enqueued until the previous one has finished executing on the device.

The concurrent code follows:


void concurrent(cl_platform_id platform, cl_device_id device, cl_program program, cl_context context,
    int write_distance)
{
    cl_int err;

    struct _timeb  t1;
    struct _timeb  t2;
    std::cout << "Starting concurrent...";
    _ftime_s(&t1);
 
    // Create memory for writes in the kernels.
    size_t asize = sizeof(cl_int) * write_distance * opt->children;
    cl_int * a = (cl_int*) malloc(asize);
    memset(a, 0, asize);

    cl_int r1;
    cl_mem da = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, asize, a, &r1);
    CHECK(r1);

    // Set up NDSpace, basically one work-item in the entire NDSpace.
    size_t tile[3] = {1, 1, 1};
    size_t tiles[3]  = {1, 1, 1};
    int max_dimensionality = 3;

    // Create kernels with arguments, command queue, queue kernel, flush, and wait, ALL CONCURRENTLY.
    cl_kernel * kernels = (cl_kernel *)malloc(opt->children * sizeof(cl_kernel));
    cl_command_queue * queues = (cl_command_queue *)malloc(opt->children * sizeof(cl_command_queue));
    cl_event * events = (cl_event *)malloc(opt->children * sizeof(cl_event));
    for (int i = 0; i < opt->children; ++i)
    {
        kernels[i] = clCreateKernel(program, "aaa", &err);
        CHECK(err);
        err = clSetKernelArg(kernels[i], 0, sizeof(cl_mem), (void *) &da);
        CHECK(err);
        err = clSetKernelArg(kernels[i], 1, sizeof(cl_int), (void *) &opt->times);
        CHECK(err);
        int position = i * write_distance;
        err = clSetKernelArg(kernels[i], 2, sizeof(cl_int), (void *) &position);
        CHECK(err);
        queues[i] = clCreateCommandQueue(context, device, 0, &err);
        CHECK(err);
    }
    for (int i = 0; i < opt->children; ++i)
    {
        err = clEnqueueNDRangeKernel(queues[i], kernels[i], max_dimensionality, NULL, tiles, tile, 0, NULL, &events[i]);
        CHECK(err);
    }
    for (int i = 0; i < opt->children; ++i)
    {
        err = clFlush(queues[i]);
        CHECK(err);
    }
    for (int i = 0; i < opt->children; ++i)
    {
        err = clWaitForEvents(1, &events[i]);
        CHECK(err);
    }
    for (int i = 0; i < opt->children; ++i)
    {
        err = clReleaseCommandQueue(queues[i]);
        CHECK(err);
        err = clReleaseKernel(kernels[i]);
        CHECK(err);
    }
    // read output array
    cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &err);
    CHECK(err);
    err = clEnqueueReadBuffer(cmd_queue, da, CL_TRUE, 0, asize, a, 0, NULL, NULL);
    CHECK(err);
    err = clReleaseCommandQueue(cmd_queue);
    CHECK(err);
    err = clReleaseMemObject(da);
    CHECK(err);
 
    _ftime_s(&t2);
    std::cout << (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))/1000) << " s.\n";
}

This code creates “children” tasks all at once. Each task is placed on its own queue, and each queue is then clFlush’ed. This code can execute the tasks concurrently because each task runs much longer than the time it takes to place it on a command queue.

The “single command queue” follows:


void one_queue(cl_platform_id platform, cl_device_id device, cl_program program, cl_context context,
    int write_distance)
{
    cl_int err;

    struct _timeb  t1;
    struct _timeb  t2;
    std::cout << "Starting one queue...";
    _ftime_s(&t1);
 
    // Create memory for writes in the kernels.
    size_t asize = sizeof(cl_int) * write_distance * opt->children;
    cl_int * a = (cl_int*) malloc(asize);
    memset(a, 0, asize);

    cl_int r1;
    cl_mem da = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, asize, a, &r1);
    CHECK(r1);

    // Set up NDSpace, basically one work-item in the entire NDSpace.
    size_t tile[3] = {1, 1, 1};
    size_t tiles[3]  = {1, 1, 1};
    int max_dimensionality = 3;

    // Create kernels with arguments, enqueue all kernels on ONE shared command queue, flush once, and wait.
    cl_kernel * kernels = (cl_kernel *)malloc(opt->children * sizeof(cl_kernel));
    cl_command_queue * queues = (cl_command_queue *)malloc(opt->children * sizeof(cl_command_queue));
    cl_event * events = (cl_event *)malloc(opt->children * sizeof(cl_event));
    queues[0] = clCreateCommandQueue(context, device, 0, &err);
    CHECK(err);
    for (int i = 0; i < opt->children; ++i)
    {
        kernels[i] = clCreateKernel(program, "aaa", &err);
        CHECK(err);
        err = clSetKernelArg(kernels[i], 0, sizeof(cl_mem), (void *) &da);
        CHECK(err);
        err = clSetKernelArg(kernels[i], 1, sizeof(cl_int), (void *) &opt->times);
        CHECK(err);
        int position = i * write_distance;
        err = clSetKernelArg(kernels[i], 2, sizeof(cl_int), (void *) &position);
        CHECK(err);
    }
    for (int i = 0; i < opt->children; ++i)
    {
        err = clEnqueueNDRangeKernel(queues[0], kernels[i], max_dimensionality, NULL, tiles, tile, 0, NULL, &events[i]);
        CHECK(err);
    }
    err = clFlush(queues[0]);
    CHECK(err);
    for (int i = 0; i < opt->children; ++i)
    {
        err = clWaitForEvents(1, &events[i]);
        CHECK(err);
    }
    for (int i = 0; i < opt->children; ++i)
    {
        err = clReleaseKernel(kernels[i]);
        CHECK(err);
    }
    err = clReleaseCommandQueue(queues[0]);
    CHECK(err);
    // read output array
    cl_command_queue cmd_queue = clCreateCommandQueue(context, device, 0, &err);
    CHECK(err);
    err = clEnqueueReadBuffer(cmd_queue, da, CL_TRUE, 0, asize, a, 0, NULL, NULL);
    CHECK(err);
    err = clReleaseCommandQueue(cmd_queue);
    CHECK(err);
    err = clReleaseMemObject(da);
    CHECK(err);
 
    _ftime_s(&t2);
    std::cout << (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))/1000) << " s.\n";
}

This code creates “children” tasks all at once. Each task is placed on a single queue shared among all tasks, and that queue is then clFlush’ed. This code could execute the tasks concurrently because each task runs much longer than the time it takes to empty the command queue of all tasks.

My program accepts command-line parameters to choose the platform and device, the number of tasks (“children”) to create, the number of iterations of the for-loop in the kernel, and how far apart consecutive tasks’ memory writes are spread.

The complete MSVC++ 2010 solution is here.

Results:

When I run this on my quad-core CPU using the AMD OpenCL platform, I get a nice speed-up of around 4x, which is what I would expect:


$ ocl-task-parallel.exe --platform 1 --children 10 --times 123456789
Number of platforms = 2
Platform profile: FULL_PROFILE
Platform version: OpenCL 1.1 AMD-APP-SDK-v2.5 (684.212)
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
Platform extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_
khr_d3d10_sharing
devices = 1
            Device [0]
                type                          = CL_DEVICE_TYPE_CPU
                name                          = Intel(R) Core(TM)2 Quad CPU
      @ 2.40GHz
Starting sequential...3.459 s.
Starting sequential...3.459 s.
Starting sequential...3.456 s.
Starting sequential...3.465 s.
Starting sequential...3.461 s.
Starting sequential...3.462 s.
Starting concurrent...0.924 s.
Starting concurrent...0.972 s.
Starting concurrent...0.94 s.
Starting concurrent...0.923 s.
Starting concurrent...0.926 s.
Starting concurrent...0.929 s.

However, on an NVIDIA GTX 470, the concurrent tasks run in the same time as the sequential tasks:


$ ocl-task-parallel.exe --platform 0 --children 4 --times 1234567
Number of platforms = 2
Platform profile: FULL_PROFILE
Platform version: OpenCL 1.1 CUDA 4.0.1
Platform name: NVIDIA CUDA
Platform vendor: NVIDIA Corporation
Platform extensions: cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing
cl_nv_d3d9_sharing cl_nv_d3d10_sharing cl_khr_d3d10_sharing cl_nv_d3d11_sharing
cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
devices = 1
            Device [0]
                type                          = CL_DEVICE_TYPE_GPU
                name                          = GeForce GTX 470
Starting sequential...1.988 s.
Starting sequential...1.987 s.
Starting sequential...1.988 s.
Starting sequential...1.987 s.
Starting sequential...1.988 s.
Starting sequential...1.987 s.
Starting concurrent...1.986 s.
Starting concurrent...1.988 s.
Starting concurrent...1.987 s.
Starting concurrent...1.985 s.
Starting concurrent...1.988 s.
Starting concurrent...1.987 s.

Finally, on an ATI HD 6450 GPU, the “one command queue” tasks run faster than either the sequential tasks or concurrent tasks:


$ ocl-task-parallel.exe --device 0 --times 123456
Number of platforms = 1
Platform profile: FULL_PROFILE
Platform version: OpenCL 1.1 AMD-APP-SDK-v2.5 (684.212)
Platform name: AMD Accelerated Parallel Processing
Platform vendor: Advanced Micro Devices, Inc.
Platform extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_
khr_d3d10_sharing
devices = 2
            Device [0]
                type                          = CL_DEVICE_TYPE_GPU
                name                          = Caicos
Starting sequential...0.528 s.
Starting sequential...0.517 s.
Starting sequential...0.512 s.
Starting sequential...0.512 s.
Starting sequential...0.505 s.
Starting sequential...0.515 s.
Starting concurrent...0.518 s.
Starting concurrent...0.522 s.
Starting concurrent...0.52 s.
Starting concurrent...0.521 s.
Starting concurrent...0.526 s.
Starting concurrent...0.519 s.
Starting one queue...0.441 s.
Starting one queue...0.44 s.
Starting one queue...0.436 s.
Starting one queue...0.442 s.
Starting one queue...0.443 s.
Starting one queue...0.44 s.

Concurrent kernels on Fermi should be possible in CUDA. To test that, I wrote a program using the CUDA Runtime API, similar to the OpenCL solution. This code demonstrates that concurrent kernels do work on a Fermi (see the CUDA Runtime API solution here).


$ cuda-r-task-parallel.exe --children 10 --times 1234567
devices = 1
Starting sequential...0.818 s.
Starting sequential...0.777 s.
Starting sequential...0.777 s.
Starting sequential...0.777 s.
Starting sequential...0.777 s.
Starting sequential...0.777 s.
Starting concurrent...0.079 s.
Starting concurrent...0.08 s.
Starting concurrent...0.079 s.
Starting concurrent...0.079 s.
Starting concurrent...0.08 s.
Starting concurrent...0.08 s.

Consider rerunning your various programs, assuming your implementation and device allow it, with an out-of-order command queue created using CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE. The default is in-order, which causes your individual enqueues to execute serially; that is, each NDRange command must wait until the previous NDRange completes execution before the next one can start.
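
For reference, the queue creation in your loops would change to something like this (an untested sketch; context, device, queues, err, and CHECK are the names from your listings):

// Sketch: request an out-of-order queue so the runtime is allowed to
// overlap independent NDRange commands; fall back to the default
// in-order queue if the device rejects the property.
cl_command_queue_properties props = CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
queues[i] = clCreateCommandQueue(context, device, props, &err);
if (err == CL_INVALID_QUEUE_PROPERTIES)
{
    // Out-of-order execution is not supported on this device.
    queues[i] = clCreateCommandQueue(context, device, 0, &err);
}
CHECK(err);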

I will simply elaborate a bit on what Brian Watt already correctly pointed out.

First, it would be a good idea to measure the overhead of creating and destroying multiple kernels and queues, flushing the queues, etc. We don’t know how many “children” tasks there are, so it may well be significant and perhaps explain why “single command queue” runs faster than “concurrent” on an AMD GPU.
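
As a rough illustration (just a sketch, reusing program, context, device, opt, and the CHECK macro from the listings above), the create/release cost could be timed on its own, with nothing enqueued at all:

// Sketch: time only kernel and queue creation/destruction, no launches.
struct _timeb t1, t2;
_ftime_s(&t1);
for (int i = 0; i < opt->children; ++i)
{
    cl_int err;
    cl_kernel k = clCreateKernel(program, "aaa", &err);
    CHECK(err);
    cl_command_queue q = clCreateCommandQueue(context, device, 0, &err);
    CHECK(err);
    err = clReleaseCommandQueue(q);
    CHECK(err);
    err = clReleaseKernel(k);
    CHECK(err);
}
_ftime_s(&t2);
std::cout << "Create/release overhead: "
          << (double)(t2.time - t1.time + ((double)(t2.millitm - t1.millitm))/1000)
          << " s.\n";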

Second, while the “concurrent” version of the code does allow multiple tasks to be executed concurrently, the “single command queue” version does not. It creates a single in-order queue, which according to the spec must sequentially run one task at a time.

When we write our own synthetic microbenchmarks it’s easy to draw incorrect conclusions.

You are right. It never occurred to me that the overhead of creating queues and kernels would be large. I changed my code to make the same number of OpenCL calls (clCreateCommandQueue, etc.) in the “single command queue” and “concurrent” cases, and the overall run-times for the AMD GPU are the same. So my conclusion that the AMD GPU implements TPP is wrong. When running the program with a “null” kernel, i.e., where the kernel for-loop count “times” is 0, the overhead is quite shocking. Fortunately, I’m now at a point where I can use a profiling tool to see more detail here.

Second, while the “concurrent” version of the code does allow multiple tasks to be executed concurrently, the “single command queue” version does not. It creates a single in-order queue, which according to the spec must sequentially run one task at a time.

Yeah, I didn’t understand what was meant by “executed in-order” in the spec, but I now see that the glossary defines it. Unfortunately, passing CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE at command-queue creation does not make the kernels run concurrently, on either AMD or NVIDIA. Maybe I have to try clEnqueueTask() instead.
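
If I do try that, the change would just swap the enqueue call in the concurrent loop, something like this (sketch only):

// Sketch: clEnqueueTask is defined as equivalent to clEnqueueNDRangeKernel
// with work_dim = 1, global_work_size[0] = 1, and local_work_size[0] = 1.
for (int i = 0; i < opt->children; ++i)
{
    err = clEnqueueTask(queues[i], kernels[i], 0, NULL, &events[i]);
    CHECK(err);
}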

When we write our own synthetic microbenchmarks it’s easy to draw incorrect conclusions.

True, but that’s why it’s good to have others look at your work. :)