Bug in NV OpenCL compiler

Hello,

I already posted this information on the NV forums, but it seems that NV people don't read them.

We’ve found a bug in the NV compiler. Here is the text I wrote in the NV forums, and the test application I made to reproduce the problem.

"[i]Hello,

I think I’ve found a bug in the NV OpenCL compiler. The symptoms are: I have a double for loop, and after some operations (ifs, maths, etc.) inside that loop, I write both for-loop control variables to a buffer. After reading the buffer back in the host application, the second for-loop control variable is wrong (always the same value), but if I use that variable in another expression (for example: ++y; --y;) then the variable has its correct value. It seems as if the optimizer had removed that variable when it shouldn’t have. Another symptom: if I call clBuildProgram with optimizations disabled, then the results are correct, as expected.

I’m preparing a test application to reproduce this bug, but it would be nice if someone could tell me whether this is the right place to file a bug report.

Thanks,
Jacobo.[/i]"
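To illustrate, the pattern described above looks roughly like this (a minimal OpenCL C sketch with hypothetical names; the real kernel is in the test project below):

__kernel void repro(__global int2* outBuf)   /* run with a single work-item */
{
    int idx = 0;
    for (int x = 0; x < 16; ++x)             /* outer for loop */
    {
        for (int y = 0; y < 16; ++y)         /* inner for loop */
        {
            /* ... ifs, maths, etc. ... */

            /* Workaround: uncommenting this dummy pair makes the
               optimizer keep 'y' alive and the value below correct. */
            /* ++y; --y; */

            outBuf[idx].x = x;   /* always read back correctly */
            outBuf[idx].y = y;   /* wrong (always the same value) with optimizations on */
            ++idx;
        }
    }
}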

"[i]We’ve finished a test application that shows the problem. It is attached to this post.

The application performs the following steps:

  • Initializes OpenCL: here we use a macro to select the CPU or the GPU platform. On our systems, the CPU has index 1 and the GPU has index 0. If it is different on your system, just go to the function InitPlatform() and choose any NVidia platform there (of course, NV hardware is required to see the error).

  • Reads a raw image from disk and sets up three buffers (one for reading and two for writing).

  • Sets up a kernel for execution. There are two clBuildProgram lines here (one is commented out; both are sketched after the summary below). The one with optimizations disabled makes the kernel run without problems, but if you use the one that doesn’t specify anything about optimizations, they will be enabled and the problem will appear.

  • Executes the kernel. To see the error, inspect the buffer g_outputBuffer with the Visual Studio debugger once it has been read back. If clBuildProgram was called with optimizations enabled, you will see that the Y component of every structure element of the array always has the same value. If you disable the optimizations, the values become correct.
    Besides that, if you run the kernel with optimizations enabled BUT comment out lines 97, 98, 99, and 100 of the kernel, the results are correct. Those lines are dummy operations that should not affect the final result at all, but in fact they do.

So, in summary: disabling optimizations always fixes the results; with optimizations enabled, the result is correct only when lines 97-100 are compiled in. Otherwise, the Y components of the structures in the outputBuffer are incorrect (always the same value).
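For reference, the two build lines look roughly like this (a sketch; the variable names may differ slightly from the actual source):

/* Default options: the NV compiler optimizations are enabled and the
   bug appears. */
clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

/* The commented-out alternative: disabling the optimizations makes the
   kernel produce correct results. */
/* clBuildProgram(program, 0, NULL, "-cl-opt-disable", NULL, NULL); */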

It would be nice to have some feedback about this issue.
Jacobo.[/i]"

Hmm, it seems that I can't attach a zip on this forum. Download the VS2008 project from this URL: http://www.parallel-games.com/TestReport.zip

Protip: if you want to get the attention of developers, write a 20-line application that reproduces the problem and include the source directly in the post.

david.garcia, please don't assume that everybody is a noob. I have been sending bug reports to card manufacturers since the OpenGL 1.1 days. The Visual Studio project I've prepared is the result of a whole day spent stripping binaries, C++, and OpenCL code out of a VS solution with 91 projects, keeping the bug alive, until the test project couldn't be reduced any further.
Also, this application is really straightforward to read and has everything a HW driver developer needs to reproduce the problem in less than 5 minutes. Of course, they have to read what I wrote.

Sometimes the problem isn't easy enough to reproduce for a test application to fit in fewer than 20 lines, especially when the problem is related to the underlying code optimization, where a single changed line in the OpenCL code changes the whole shape of the optimized code.

I didn't post this information here expecting that somebody could help me with this matter, mainly because I have already found the problem and the workaround. I posted it here because this information could save someone from spending 4 working days (as we did) trying to figure out why these weird things are happening in their code.

NVIDIA has a registered developer login thingy too where bugs can be sent, so you could always try that (there are links on the front developer.nvidia.com page).

I recall one message on the NVIDIA forums, I think (4-6 months ago?), where someone stated NVIDIA refused to help them unless they were using CUDA, so whether it would make any difference is another matter.

I just started OpenCL programming and might have the same problem.

The original code looks something like:


... some code ...

unsigned int array[48];

for(i=0; i<32; i++) {
   for(j=0; j<47; j++)  {   /* shift the array down by one element */
        array[j] = array[j+1];
   }
   function1(array[0],array[2]);
   function2(array[8],array[6]);
   function2(array[23],array[45]);
}

... some code ...

Because this section does not produce the correct result, I retrieved the results of each iteration step. The first thing I did was to eliminate the inner for loop; after that, the results were correct for the inner loop and for multiple iterations (up to 8) of the outer loop. Beyond that, however, the results are wrong and I have no idea why.

How is it possible that the inner loop MUST be unrolled manually just to get correct results for the first few iterations? I tested several optimization levels (-O0, -Os), but nothing changed.
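For reference, by eliminating the inner for loop I mean unrolling it by hand into explicit shift statements, roughly:

/* Manually unrolled inner loop: the same effect as the original loop,
   shifting the array down by one element. */
array[0] = array[1];
array[1] = array[2];
array[2] = array[3];
/* ... same pattern for the remaining elements ... */
array[46] = array[47];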

Any ideas where that might come from?

System: MacBook Pro, early 2011, AMD Radeon HD 6750M 1024 MB, Mac OS X 10.7 (Lion)

EDIT: When I run the code on my CPU (as the OpenCL device), the results are correct.

@vci, try disabling OpenCL compiler optimizations in the clBuildProgram call, using a line like this: clBuildProgram(program, 0, NULL, "-cl-opt-disable", NULL, NULL);
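If the build fails or the driver rejects the option, it also helps to dump the build log. A minimal sketch, assuming 'program' and 'device' handles already exist (needs <stdio.h>, <stdlib.h>, and CL/cl.h):

cl_int err = clBuildProgram(program, 0, NULL, "-cl-opt-disable", NULL, NULL);
if (err != CL_SUCCESS) {
    /* Query the size of the build log, then fetch and print it. */
    size_t logSize = 0;
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &logSize);
    char* log = (char*)malloc(logSize);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          logSize, log, NULL);
    printf("Build log:\n%s\n", log);
    free(log);
}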

@notzed, my company has always worked with NV because of their great developer tools and support, but OpenCL is a key technology for us now. We simply can't keep coding our algorithms in GLSL, because they have reached a complexity point where the graphics pipeline is too heavy a burden, so if we sadly have to switch to AMD, we will.