How to debug an OpenCL kernel program running on a GPU?

I can debug my program on the CPU using printf or gdb, and the result is correct, but when I set the device to the GPU the result is incorrect. Does anyone know how I can debug the kernel code on the GPU? printf and gdb do not work there. How do you usually debug kernel code for the GPU, for example to check intermediate values?

Well, my technique for detecting syntax errors is to comment out code and run it to check whether it builds OK. To see your results you can put them in a buffer and check them outside the kernel.

Have a look at clGetProgramBuildInfo() with CL_PROGRAM_BUILD_LOG. It basically gives you the compiler output, which shows you where your syntax errors are.
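
For example, here is a minimal sketch of fetching the build log after clBuildProgram() fails (error checking omitted; program and device are assumed to be the cl_program and cl_device_id you already have):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Print the compiler output for a program whose build just failed. */
static void print_build_log(cl_program program, cl_device_id device)
{
    size_t log_size = 0;

    /* First query the size of the log, then fetch the text itself. */
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);

    char *log = (char *)malloc(log_size + 1);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    log[log_size] = '\0';

    printf("Build log:\n%s\n", log);
    free(log);
}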

[quote]

Have a look at clGetProgramBuildInfo() with CL_PROGRAM_BUILD_LOG. It basically gives you the compiler output, which shows you where your syntax errors are.[/quote]

I know how to get the syntax errors using that API; my question is how to check the intermediate values in the kernel code, because the result I get is not right.

One way is to create a debug output buffer, pass it as an argument to the kernel, and write the value you want to inspect into it. Then copy the values back to the host and compare them with a CPU-based implementation. Depending on whether they are equal or not, move the debug write up or down through the kernel source in a binary-search fashion until you find the place where CPU and GPU differ.
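
Roughly like this (only a sketch: my_kernel, debug_buf, debug_mem, host_debug, cpu_reference and n are made-up names for illustration, and error checking is omitted):

/* Kernel side: write the intermediate value you want to inspect into a debug buffer. */
__kernel void my_kernel(__global const float *input,
                        __global float *output,
                        __global float *debug_buf)
{
    int gid = get_global_id(0);

    float intermediate = input[gid] * 2.0f;   /* some value you want to check */
    debug_buf[gid] = intermediate;            /* stash it so the host can read it */

    output[gid] = intermediate + 1.0f;
}

/* Host side: read the debug buffer back and compare with a CPU reference. */
clEnqueueReadBuffer(queue, debug_mem, CL_TRUE, 0, n * sizeof(float),
                    host_debug, 0, NULL, NULL);
for (size_t i = 0; i < n; ++i)
    if (host_debug[i] != cpu_reference[i])
        printf("mismatch at %zu: gpu=%f cpu=%f\n", i, host_debug[i], cpu_reference[i]);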

[quote=“ibbles”]

One way is to create a debug output buffer, pass it as an argument to the kernel, and write the value you want to inspect into it. Then copy the values back to the host and compare them with a CPU-based implementation. Depending on whether they are equal or not, move the debug write up or down through the kernel source in a binary-search fashion until you find the place where CPU and GPU differ.[/quote]

This piece of code is part of my kernel. The other parts are independent and parallel, so they can be executed by each work item, but this part is serial, so my idea is to have one work item do it while the other work items do nothing when they reach this step. So I wrote the following, using work item 0 to finish the computation:

// tid is the work-item's local id; tB and m are pointers to local memory.
// Basically I need to derive array m from array tB recursively: one element of m is
// computed on each iteration of the outer loop. The value of m is correct when I
// execute the kernel on the CPU, but wrong on the GPU. Is it because the
// synchronization goes wrong on the GPU? Or do you have suggestions to make it work
// correctly on the GPU? Thank you so much!

if (tid == 0)
{
    for (i = 0; i < 34; i++)
    {
        m[i] = tB[i];

        for (j = i + 1; j < 34; j++)
        {
            tB[j] = mod_subtract(tB[j], tB[i], baseB[j]);
            tB[j] = mod_mul(tB[j], Bm[33*i + j - 1 - i*(i+1)/2], baseB[j]);
        }
    }
}
To debug OpenCL kernels you can use the AMD CPU implementation, whose image/texture function emulation matches the AMD GPU implementation: http://suhorukov.blogspot.com/2011/12/o … -host.html
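
If you go that route, switching the device type at clGetDeviceIDs() is usually all it takes on the host (a sketch, assuming the platform you pick actually exposes a CPU device):

cl_platform_id platform;
cl_device_id device;

clGetPlatformIDs(1, &platform, NULL);

/* Ask for a CPU device so printf/gdb-style debugging works; switch
   back to CL_DEVICE_TYPE_GPU once the kernel gives correct results. */
clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);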

Just to check: when using local memory you can only share data between work-items
in the same work-group. Also, did you set a barrier (barrier(CLK_LOCAL_MEM_FENCE))
before you started this serial section?
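
Something along these lines, as a sketch only (it assumes every work-item in the group reaches this point, since a barrier must be hit by all of them):

/* Make sure every work-item has finished writing tB to local memory
   before work-item 0 starts the serial pass. */
barrier(CLK_LOCAL_MEM_FENCE);

if (tid == 0)
{
    /* ... the serial computation of m from tB ... */
}

/* Make sure the serial results are visible to all work-items
   before anyone reads m or tB again. */
barrier(CLK_LOCAL_MEM_FENCE);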


jason