Problem with nVidia GPU

Hello,
I wrote a program with kernels on an ATI graphics card and all the results are correct, but my ATI card is very slow, so I tried to run the program on nVidia.
But there are many problems. First, the same program gives different results than it does on ATI. What can I do to fix this?
The second problem is with work groups and the number of operations in the queue. The maximum work-item sizes on my nVidia graphics card are 1024x1024x64. I run my kernel with a work group of 1024x1x1 and that is OK. The problem is when I make the global work size equal to or greater than 1024*64; in that case the graphics driver crashes and I get no results at all. How is it possible that the results on nVidia are wrong while they are correct on ATI, and that the ATI driver never crashes no matter how large I make the global size?

I am writing the program in C with Visual Studio.
The difference between the machines is the hardware: the incorrect results are on an nVidia GT 630 with an Intel Core i7 and Windows 7, and the correct results are on an ATI card with an AMD CPU and Windows 10.

Where is the problem and how can I fix it?

To receive any advice other than a lucky guess, you need to share:

  • The version of OpenCL you are using (1.2?)
  • Your kernel code
  • Your host code, in particular the code where you determine and set the group dimensions

I am using OpenCL v2.2.
The problem occurs on one GPU only, with global memory and without using work groups.

This is my kernel code. Every item in the work group computes on different data, and item 0 copies the input data from global memory to local memory for faster access.

__kernel __attribute__((reqd_work_group_size(32, 1, 1))) __attribute__((vec_type_hint(short)))
void correlationMultiplication(__global short *inBuffer, __global int *outBuffer, __constant short *globalMask,
const int maskLen, __local short *mask, __local short *data)
{
//Private variable for result of correlation
int result = 0;

//Information about the work-item IDs
int bufferLen = get_global_size(0);
uint startPlace = get_global_id(0);
uint sequence = get_global_id(1);
uint localId = get_local_id(0);
uint localIdY = get_local_id(1);

//First workItem in workGroup writing data to local buffer
if(localId == 0)
loadData(inBuffer, globalMask, sequence, startPlace, mask, data);
else
barrier(CLK_LOCAL_MEM_FENCE);

    for(int i = 0; i < MASKLEN; i++){
              //computing iterations; in each iteration item 0 in the workGroup loads new data into local memory
             barrier(CLK_LOCAL_MEM_FENCE);
    }
    //Save result to global memory
    barrier(CLK_GLOBAL_MEM_FENCE);

}

My host code defines the variables. This is the part of the host program that sets up and calls the kernel. If I use a global work size bigger than 64K, the display goes black and the nVidia driver restarts afterwards.

//Parameters for the kernel: input data and output data, a different buffer per iteration, using a structure with events
ret = clSetKernelArg(gpuControlData->correlKernel, 0, sizeof(cl_mem), (void *)&(ioData[bufferNumber].inBuffer));
ret = clSetKernelArg(gpuControlData->correlKernel, 1, sizeof(cl_mem), (void *)&(ioData[bufferNumber].outBuffer));

  //Write all constant parameters, which are set at program start
  //Parameter: pointer to the array of masks, and the number of samples per bit
  ret = clSetKernelArg(gpuControlData->correlKernel, 2, sizeof(cl_mem), (void *)&(mask));
  ret = clSetKernelArg(gpuControlData->correlKernel, 3, sizeof(int), &(transmitorData->bitLenght));
  
  //These parameters declare the size of the local buffers in the work group
  ret = clSetKernelArg(gpuControlData->correlKernel, 4, sizeof(short) * MASK_LEN, NULL);
  ret = clSetKernelArg(gpuControlData->correlKernel, 5, sizeof(short) * 10240, NULL);
  //Wait for all input data to be written and the previous correlation to complete
  clFinish(gpuControlData->cmdQueue);
  clFinish(gpuControlData->maskCorrelQueue);
  
  //Use only 2D: the first dimension is the buffer length, the second is the number of transmitters
  size_t globaId[3] = { 2048, 2, 0 };
  size_t localId[3] = { 64, 1, 0 };
  ret = clEnqueueNDRangeKernel(gpuControlData->maskCorrelQueue, gpuControlData->correlKernel, 2, NULL, globaId, localId, 0, NULL, &(ioData[bufferNumber].resultReady));
  clWaitForEvents(1, &(ioData[bufferNumber].resultReady));

One last question I have is about events. I have declared an event and I wait on it for the calculation to complete. But I want to do the waiting in a different CPU thread for faster processing.
However, in the new thread the wait on the event never returns. How can I resolve this? And how can I better chain the iterations? Because every iteration enqueues a new command and then waits, the GPU is not working, just waiting.
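To be clear, here is roughly what I am trying to do (a simplified sketch, not my real code; the struct and function names are only placeholders):

#include <windows.h>
#include <process.h>
#include <CL/cl.h>

//Placeholder struct passed to the worker thread
typedef struct {
    cl_event resultReady;    //event returned by clEnqueueNDRangeKernel
} WaitArgs;

//Worker thread: block until the kernel has finished, then process the result
static unsigned __stdcall waitThread(void *p)
{
    WaitArgs *args = (WaitArgs *)p;
    clWaitForEvents(1, &args->resultReady);   //wait here instead of in the main thread
    //...read back and process the output buffer...
    return 0;
}

//In the main thread, after clEnqueueNDRangeKernel(...) has filled resultReady:
//clRetainEvent(args->resultReady);   //keep the event valid for the worker thread
//_beginthreadex(NULL, 0, waitThread, args, 0, NULL);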

Thanks for the help.

//First workItem in workGroup writing data to local buffer
if(localId == 0)
loadData(inBuffer, globalMask, sequence, startPlace, mask, data);
else
barrier(CLK_LOCAL_MEM_FENCE);

This is a very bad bug: you have a barrier inside a conditional, so not all work-items reach it (work-item 0 takes the other branch). This is not allowed and makes no sense.
Every work-item in the work group must reach any barrier, so they can all wait on it together; what you are doing leads to undefined behaviour.
You can put a barrier inside conditional code only if you make sure every work-item gets to it.
In your case, just remove the else: work-item 0 will do the load and all work-items will then wait at the barrier until the data is there.
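Something like this (a sketch based on your kernel; only the barrier handling changes):

//First workItem in workGroup writes data to local buffer
if(localId == 0)
    loadData(inBuffer, globalMask, sequence, startPlace, mask, data);

//Unconditional barrier: every work-item in the group reaches it and
//waits here until work-item 0 has filled the local buffers
barrier(CLK_LOCAL_MEM_FENCE);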