Inner loops with OpenCL

Hello

I am new to OpenCL and want to parallelize some looping code that performs LU factorization. The exact loop structure is shown below:

for(int k = 0; k < N-1; k++)
{
    for(int i = k+1; i < N; i++)
        S[i*N + k] = S[i*N + k] / S[k*N + k];

    for(int j = k+1; j < N; j++)
        for(int i = k+1; i < N; i++)
            S[i*N + j] -= S[i*N + k] * S[k*N + j];
}
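For checking a kernel against this loop, it can help to have the serial version as a standalone function together with a way to confirm the factors. The following is a minimal sketch, assuming double elements, a row-major N x N matrix, and no pivoting (as in the loop above). The names lu_inplace and lu_product_entry are made up for illustration.

```c
#include <stddef.h>

/* In-place LU factorization without pivoting, same loop structure as the
   snippet above. Assumption: S is a row-major N x N array of doubles.
   Afterwards the strict lower triangle holds L (unit diagonal implied)
   and the upper triangle, including the diagonal, holds U. */
void lu_inplace(double *S, int N)
{
    for (int k = 0; k < N - 1; k++) {
        /* Scale column k below the pivot. */
        for (int i = k + 1; i < N; i++)
            S[i*N + k] /= S[k*N + k];

        /* Update the trailing submatrix. */
        for (int j = k + 1; j < N; j++)
            for (int i = k + 1; i < N; i++)
                S[i*N + j] -= S[i*N + k] * S[k*N + j];
    }
}

/* Rebuild entry (i, j) of the product L*U from the packed factors,
   so the result can be compared with the original matrix. */
double lu_product_entry(const double *S, int N, int i, int j)
{
    int m = (i < j) ? i : j;
    double sum = 0.0;
    for (int p = 0; p <= m; p++) {
        double l = (p == i) ? 1.0 : S[i*N + p];  /* L has a unit diagonal */
        sum += l * S[p*N + j];
    }
    return sum;
}
```

Running lu_inplace on a copy of the input and comparing lu_product_entry(S, N, i, j) with the original entry (within a tolerance for general data) confirms the serial result before it is compared with the GPU output.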

I have written a simple OpenCL kernel that uses single work-items (no grouping). It is as follows:

  int IDx = get_global_id(0);
  int IDy = get_global_id(1);

  for(int k = 0; k < n-1; k++)
  {
    barrier(CLK_GLOBAL_MEM_FENCE);

    if(IDy > k && IDx == k)
        matrix[IDy*n + IDx] = matrix[IDy*n + IDx] / matrix[IDx*n + IDx];

    barrier(CLK_GLOBAL_MEM_FENCE);

    for(int j = k+1; j < n; j++)
    {
        if(IDy > k && IDx == j)
            matrix[IDy*n + IDx] -= matrix[IDy*n + k] * matrix[k*n + IDx];
    }
  }

But I don't get correct results when compared with the serial code. This is my own attempt at an OpenCL kernel, and I am still learning how the data-parallel scheme in OpenCL works. Can you point out what I am doing wrong in the kernel?

Anyone there?

Aren’t the answers from the AMD forums good enough for you? And by the way, cross-posting is rarely acceptable behaviour.
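One likely cause of the mismatch, offered as a sketch rather than a confirmed diagnosis since the launch configuration is not shown: barrier() only synchronizes work-items within the same work-group. If the NDRange is split into more than one work-group, work-items in other groups can read column k before it has been scaled. A common remedy is to move the k loop to the host and enqueue one kernel per phase for each k, because commands in an in-order command queue run one after another. The plain C below emulates that structure: each *_item function stands in for a kernel body, with the ids passed as parameters instead of coming from get_global_id. All function names and the float element type are assumptions.

```c
/* Body of a hypothetical "scale" kernel for step k: one work-item per row.
   In OpenCL, idy would come from get_global_id(0). */
static void lu_scale_item(float *matrix, int n, int k, int idy)
{
    if (idy > k)
        matrix[idy*n + k] /= matrix[k*n + k];
}

/* Body of a hypothetical "update" kernel for step k: one work-item per
   matrix element. It only reads row k and column k, which no work-item
   of this launch writes, so the work-items are independent. */
static void lu_update_item(float *matrix, int n, int k, int idx, int idy)
{
    if (idy > k && idx > k)
        matrix[idy*n + idx] -= matrix[idy*n + k] * matrix[k*n + idx];
}

/* Host-side driver. Each loop nest below stands in for one kernel launch;
   a launch finishes completely before the next one starts, which is the
   global synchronization that barrier() cannot provide across groups. */
void lu_by_steps(float *matrix, int n)
{
    for (int k = 0; k < n - 1; k++) {
        /* Launch 1: scale column k. */
        for (int idy = 0; idy < n; idy++)
            lu_scale_item(matrix, n, k, idy);

        /* Launch 2: update the trailing submatrix. */
        for (int idy = 0; idy < n; idy++)
            for (int idx = 0; idx < n; idx++)
                lu_update_item(matrix, n, k, idx, idy);
    }
}
```

The trade-off is 2*(n-1) kernel launches instead of one, which costs launch overhead but keeps each launch free of cross-group dependencies.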