Converting from Loops to ND Ranges

Hello,
I am trying to finish up my thesis code this week, and I’m completely stumped on one aspect that I thought was pretty intuitive. My code solves the 2D Compressible Navier-Stokes equations using an explicit method using OpenCL. My first step was translating a Fortran code (used 2D arrays) to a C++ code (using 1D arrays). The main solution step involves several thousand iterations of 8 different functions. I have been able to convert 7 of these 8 with no problem. It is this last one that I am having problems with, and unfortunately, it’s also the most important one of the code. It’s also the one where the largest benefits from parallelization could be seen.

My problem domain/grid is a rectangular grid of size 181x65, I.E., there are 181 columns of data, and 65 rows of data. These are termed imax and jmax, respectively.

This function involves really 2 main loops.

The first loop goes over the row

for(int j=2;j<=jmax;j++)
{
...Calculate Stuff On Rows...
}

and


for(int i=2; i<=imax;i++)
{
...Calculate More stuff, on columns, here...
}

I was able to basically copy-paste all of my functions from the C++ code into a kernel, and launch them through clEnqueueTask, and obtain the exact same results that the serial CPU code used. I then converted the functions one by one into kernels that would run using an NDRangeKernel, with a 2D range. After each one, I tested thoroughly, and was able to get 7 of the 8 done.

I broke the previous kernel into two kernels, that remained working as tasks, so long as the first one finished before the second one. [My goal is eventually to get them running side by side and then combine the results later, but I need this first step to work before I attempt that]. So, for example, my first task looked something like:


__kernel void calcFirstThing(arguments)
{
for(int j=2;j<=jmax;j++)
{
...Calculate Stuff on rows here...
}
}

This kernel has the outer loop over the rows, and then has a nested loop that goes over the columns. I have a function that takes my two indices and converts them into a memory address. Again, this works fine as a task. The math only cares about things on the same row, and doesn’t jump to different rows.

My problem arose when I tried to remove the outer loop from the kernels, and use a 1D NDRange to replace the loop. So, for example, my kernel would look like:


__kernel void calcFirstThing(arguments)
{
int j = get_global_id(0);
if(j < 2) return;               // Maintain lower bound of loop

...Calculate Stuff on rows here...

}

and I would enqueue the kernel via:


oclError = clEnqueueNDRangeKernel(queue, kernel, 1, offset, globalworksize, localworksize, 0, NULL, &event);
clEnqueueBarrier(queue);
clFlush(queue);
clFinish(queue);
// Check error...

where offset = 0, globalworksize=jmax+1, localworksize=1.

From my understanding, this should launch one kernel at a time until the global worksize is achieved, so it should launch kernels with a global location from 0 through 65, and because of the conditional inside the kernel, should return if the global location is 0 or 1 (I have boundary cell references and I don’t want to step out of bounds on them). However, what I notice is that the results are nowhere near the same…after anywhere from one or two times through the main loop to several hundred times, depending on my input conditions, my answer “blows up” and starts dividing by zero. This is confounding me, because other than the outer loop of the kernel being removed, I have not changed anything else.

So, now that my problem description is described to you all, I wonder the following things:

  1. The counters for the NDRange start at zero and go up to size-1, right? So if I want to run up to, and include, imax, my global work size should be imax+1 ?

  2. Is my logic even remotely sound on switching out loops for an NDRange? I’m not a computer scientist, and most of my programming experience is Fortran/basic C++, and I’m still getting my head around OpenCL.

  3. Lastly, I have five arrays in the OpenCL kernel, declared as:
    __local float4 MyArray[200]; (similarly for the other arrays).
    I know local memory is limited, and is probably hardware dependent, but if I am launching with a local work size of 1, this Array should only be present once, correct? I’m using this array to store some temporary values from each row or column depending on which kernel I am in.

So I’m hoping that someone might have some advice on these matters. I’d be extremely grateful if someone could help me out! Thank you so much!

Two observations for you:
[ul]
[li] The NDRange does not go from 0 to size-1… that implies a sequence to the processing of these indices. Instead the NDRange will process all the values from 0 to size-1, but it may do these in any order and they can happen sequentially or in parallel. You cannot make assumptions about the order or timing of the work-group’s execution.[/:m:1yc3ou4c][/li][] I’m not sure, but it sounds like you’re expecting your __local arrays to exist just once across your kernel’s execution. This is incorrect. It will exist once for each work-group. Since it sounds like you have only 1 work-item per work-group, each work-item will have its own copy of these arrays and they will begin in an undefined state as each work-item starts executing.[/*:m:1yc3ou4c][/ul]

Andrew,
Thank you for the response.
[ul][li]I think I understand the NDRange issue now. Essentially, my outer loop was set up to incrementally process either a row or column of data. In reality, it doesn’t matter to me what order I do it in. If I am understanding correctly, I can set up my NDRange to go from 0 to imax+1, and as long as the Kernel checks its location then I can make it ‘skip’ calculating those rows of data. I also understand that Offsetting can allow me to skip the first number of items in the global NDRange. So, I think I am ‘safe’ on these regards. All I care about is that they all get processed and get stored before I move onto my next step.[/:m:1rm97yuu][/li][li]Your description of what I wanted my __local arrays to do was correct. I do want one of these __local arrays per work-item. One of the very first things I do is initialize them to zero in a loop shortly after they are declared. [/:m:1rm97yuu][/ul][/li]One further question related to this - if I wanted to maintain this behavior, but increase the local work size to 2 or 3 or larger, would I need to change my __local arrays to __private to keep them independent of the other work items in the work group?

Thank you again!

Note that the EnqueueNDRangeKernel offset parameter is not supported in OpenCL 1.0, support for non-zero offsets was added in 1.1.

If you increase your work-group size and assign them different indices in your local arrays, you could use the get_local_id() functions just like you’re already using get_global_id(). Your note above indicates you need about 16Kbytes of space per work-item, which might be more than some (most?) devices have.

How are you ensuring that the results are all stored before you move onto your next step? What is your next step?

I put the following at the end of each kernel:
barrier(CLK_GLOBAL_MEM_FENCE)

also, for debugging right now, after each clEnqueueNDRangeKernel, I put:
clEnqueueBarrier(queue);
clFlush(queue);
clFinish(queue);

My next step is typically to run a different kernel on the updated data. I run about 20 kernels on mostly global memory before I read back an array to check the status of the solver - how close I am to achieving a good answer. However, this kernel is the first one that runs, and even it runs the very first time, there is a discrepancy between a CPU (serial-only) result and the OpenCL result.

This only synchronizes between work-items of the same work-group, and thus is redundant in your use case.

also, for debugging right now, after each clEnqueueNDRangeKernel, I put:
clEnqueueBarrier(queue);
clFlush(queue);
clFinish(queue);

Okay, that ought to ensure the results are written. And are you then, for testing purposes, calling clEnqueueMapBuffer and examining the data pointed at by pointer that comes back?

My next step is typically to run a different kernel on the updated data. I run about 20 kernels on mostly global memory before I read back an array to check the status of the solver - how close I am to achieving a good answer. However, this kernel is the first one that runs, and even it runs the very first time, there is a discrepancy between a CPU (serial-only) result and the OpenCL result.

Are you seeing the problem even if you use the OpenCL CPU device? Have you tried printing from the kernel when its running on the CPU?

I am using clEnqueueReadBuffer on the memory object, along with the flush/finish calls. I also have it set up to be blocking. What is the difference between clEnqueueMapBuffer and clEnqueueReadBuffer?

I have not tried printing from the CPU - I will test that and see if that comes up with anything usable. However, the problem occurs on both CPU and GPU when run with OpenCL. There are very minor differences in the results - but they’re on the order of 10^-7, so nothing major on float variables.

Interestingly, I just switched to the new NVIDIA drivers for Parallel NSight 1.5 (260.93, I think they are)…and I get nothing but junk as an output from running on my GPU. CPU still works fine. All of my arrays from the GPU are showing as :1.#QNAN0. Guess I’ll got hunt down this later. The price of living on the cutting edge, I suppose.

Mapping a buffer gives you a pointer to the buffer’s contents mapped into host-addressable memory. Read buffer will work fine too.

I have not tried printing from the CPU - I will test that and see if that comes up with anything usable. However, the problem occurs on both CPU and GPU when run with OpenCL. There are very minor differences in the results - but they’re on the order of 10^-7, so nothing major on float variables.

Its a good thing that the problem is consistent between the devices – otherwise the problem would be harder to identify. I don’t know what platform you’re using, but I think printf working on the CPU devices is typical.

Interestingly, I just switched to the new NVIDIA drivers for Parallel NSight 1.5 (260.93, I think they are)…and I get nothing but junk as an output from running on my GPU. CPU still works fine. All of my arrays from the GPU are showing as :1.#QNAN0. Guess I’ll got hunt down this later. The price of living on the cutting edge, I suppose.

I typically remove everything from my kernel, and then start adding one thing at a time and examine the results until it goes bad.

Hey, thank you for all your help tonight. I’ve re-started the conversion of the kernel from the very beginning, where it was but a task with the two loops. I was a lot more delicate this time with it, and left the other kernels as their task-based forms - so only 1/8 is in NDRange kernels, the other 7/8 are still tasks.

I discovered that, with a 1D Global ND Range, that if I set the

Global Worksize = imax+1
Local Worksize = imax+1
Offset = NULL (not using it)

and using

int i = get_local_id(0);
if(i<2) return;

, then my code works on the CPU - an Intel i7 920, with the AMD Stream SDK 2.2, OpenCL driver 1.1.

However, moving over to the GPU - an NVIDIA GeForce 460 GTX, I get the 1.#QNAN0 as a result of the calculation. There’s a couple of things that I think might be causing this - and I think it might be a difference in the ways arrays are handled between AMD and NVIDIA, but I will work on it a bit more tomorrow. I’m going to read over the NVIDIA OpenCL Programming Guide and see if there’s something I might be missing - maybe the way their implementation works is different than the AMD/ATI one.

I think I’m at a good stopping point for the night where my computer gets to live another day :smiley: Thank you again andrew.brownsword. I really learned a lot this evening, and I definitely appreciate your help. Maybe one day I’ll be good enough with OpenCL to help others out :slight_smile:

Hey everyone,
I was able to resolve my last problem with this program. It turned out to be a combination of an algorithm error on my part, as well as a change in hardware that forced me to use a smaller local work size for the kernel - changing from a GTX460 to a 9500GT (NVIDIA) really put a damper on what I was allowed to use for calculations.

I’m going to consider this case closed :slight_smile: Thank you for all your help. I’ve now got to finish writing a thesis in 2 weeks. Good news is that my parallel GPU code is 28X faster than my serial CPU only code. Couldn’t have done it without everyone’s help here. Hopefully I can continue to stay involved and help out others!