I am using OpenCL 2.2. The problem occurs on a single GPU with global memory, without using work groups.
This is my kernel code. Every item in the work group computes on different data, and item 0 copies the input data from global memory to local memory for faster access.
__kernel __attribute__((reqd_work_group_size(32, 1, 1)))
__attribute__((vec_type_hint(short)))
void correlationMultiplication(__global short *inBuffer, __global int *outBuffer,
                               __constant short *globalMask, const int maskLen,
                               __local short *mask, __local short *data)
{
    // Private variable for the correlation result
    int result = 0;
    // IDs of this work item
    int bufferLen = get_global_size(0);
    uint startPlace = get_global_id(0);
    uint sequence = get_global_id(1);
    uint localId = get_local_id(0);
    uint localIdY = get_local_id(1);
    // The first work item in the work group copies data to the local buffer
    if (localId == 0)
        loadData(inBuffer, globalMask, sequence, startPlace, mask, data);
    // All work items must reach this barrier; a barrier inside a divergent
    // branch (the original "else") is undefined behavior in OpenCL
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < maskLen; i++) {
        // Compute one iteration; each iteration work item 0 loads new data
        // into local memory, so every iteration must end on a barrier
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    // Save the result to global memory
    barrier(CLK_GLOBAL_MEM_FENCE);
}
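As an aside on the single-loader pattern: letting every work item copy a strided slice of the data is usually faster than funneling the whole copy through item 0, because the reads coalesce. A hedged, illustrative sketch in OpenCL C (device code, not host-runnable; it reuses the `mask`, `globalMask`, and `maskLen` names from my kernel):

```c
// Illustrative OpenCL C: every work item copies a strided slice of the
// mask into local memory instead of serializing the copy through item 0.
for (int i = get_local_id(0); i < maskLen; i += get_local_size(0))
    mask[i] = globalMask[i];
// Reached uniformly by all work items in the group.
barrier(CLK_LOCAL_MEM_FENCE);
```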
My host code defines the variables, and this is the part of the host program that sets up and launches the kernel. If I use a global work size larger than 64K, the display goes black and the NVIDIA driver restarts afterwards.
//Kernel parameters: input and output data, a different buffer per iteration,
//tracked in a structure together with its events
ret = clSetKernelArg(gpuControlData->correlKernel, 0, sizeof(cl_mem), (void *)&(ioData[bufferNumber].inBuffer));
ret = clSetKernelArg(gpuControlData->correlKernel, 1, sizeof(cl_mem), (void *)&(ioData[bufferNumber].outBuffer));
//Constant parameters, set once at program start:
//pointer to the mask array and the number of samples per bit
ret = clSetKernelArg(gpuControlData->correlKernel, 2, sizeof(cl_mem), (void *)&(mask));
ret = clSetKernelArg(gpuControlData->correlKernel, 3, sizeof(int), &(transmitorData->bitLenght));
//These parameters declare the size of the local buffers in each work group
ret = clSetKernelArg(gpuControlData->correlKernel, 4, sizeof(short) * MASK_LEN, NULL);
ret = clSetKernelArg(gpuControlData->correlKernel, 5, sizeof(short) * 10240, NULL);
//Wait for all input data to be written and the previous correlation to finish
clFinish(gpuControlData->cmdQueue);
clFinish(gpuControlData->maskCorrelQueue);
//Only 2D: first dimension is the buffer length, second the number of transmitters
size_t globalSize[2] = { 2048, 2 };
size_t localSize[2]  = { 32, 1 };  //must match reqd_work_group_size in the kernel
ret = clEnqueueNDRangeKernel(gpuControlData->maskCorrelQueue, gpuControlData->correlKernel, 2, NULL, globalSize, localSize, 0, NULL, &(ioData[bufferNumber].resultReady));
clWaitForEvents(1, &(ioData[bufferNumber].resultReady));
One last question I have is about events. I have declared an event, and I wait on it for the calculation to complete. But I want the waiting to happen on a different CPU thread, for faster processing.
In the new thread, however, the wait on the event never returns. How can I resolve this? And how can I chain the iterations better? Because every iteration enqueues a new record to the command queue, the GPU is not working, just waiting.
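For the cross-thread wait, one common pattern is to register a completion callback with `clSetEventCallback` instead of blocking a second thread in `clWaitForEvents`; the runtime invokes the callback on one of its own threads when the kernel finishes. A hedged sketch (error handling omitted; `resultReady` and `maskCorrelQueue` are the names from my code above, and the surrounding OpenCL setup is assumed):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Called by the OpenCL runtime on an internal thread when the
 * kernel's event reaches CL_COMPLETE. */
static void CL_CALLBACK onResultReady(cl_event ev, cl_int status, void *user)
{
    (void)ev; (void)user;
    if (status == CL_COMPLETE)
        printf("correlation finished, safe to read outBuffer\n");
}

/* After clEnqueueNDRangeKernel(...):
 *   clSetEventCallback(resultReady, CL_COMPLETE, onResultReady, NULL);
 *   clFlush(maskCorrelQueue);  // ensure the command is actually submitted;
 *                              // without a flush the event may never advance,
 *                              // which could also explain a wait that never
 *                              // returns on another thread
 */
```

This also helps the chaining problem: instead of blocking between iterations, the next kernel can be enqueued with the previous `resultReady` event in its wait list, keeping the GPU fed.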
Thanks for the help.