Hi,
My GPU is a ATI Mobility Radeon HD 5450; specifications for the 5470 (which are nearly identical) can be found here. I’ve encountered a problem using multiple host threads (using OpenMP) and OpenCL. What I’m doing is the following:
I’ve got one class containing the context, program, queue, etc.:
class OclMain
{
...
...
cl::Device device;
cl::Context context;
cl::CommandQueue queue;
cl::Program program;
cl::Program::Sources sources;
...
...
}
For each OpenMP thread I have an instance of the following class, which each contains its own cl::Kernel because setArg(…) isn’t thread safe. Each of these also handles its own buffer creation, execution of the kernel, etc. If, for example, I have a maximum of 16 threads using #pragma omp parallel for num_threads(16), I create 16 of these objects and each thread has its own Ocl object. When a thread is done, the Ocl object is reused for a next iteration of the aforementioned for loop and setEpi() is called again to upload the new data to the device. Each thread handles one cv::Mat epi.
class Ocl
{
Ocl(OclMain *oclm) : oclm(oclm) { kernel = cl::Kernel(oclg->program,"kernel"); }
...
// this data doesn't change during execution of an openmp thread, so I only upload it once to the device after thread creation
void setEpi(cv::Mat &epi)
{
...
img_epi = cl::Image2D(oclm->context, CL_MEM_READ_ONLY, cl::ImageFormat(CL_RGBA, CL_FLOAT), epi.cols, epi.rows, 0, 0);
cl::size_t<3> origin;
origin[0] = 0; origin[1] = 0, origin[2] = 0;
cl::size_t<3> region;
region[0] = epi.cols; region[1] = epi.rows; region[2] = 1;
oclg->queue.enqueueWriteImage(img_epi, CL_TRUE, origin, region, 0, 0, epi.data, 0, 0);
...
// enqueue some more WriteBuffers here (only small buffers)
...
...
}
...
// gets called multiple times (maximum epi.rows times, on a row per row basis;
// the algorithm I need to implement works this way)
void runKernel(...)
{
// will contain result of kernel computation
cl::Buffer buff(oclg->context,CL_MEM_WRITE_ONLY,sizeof(float)*dmax*epi.cols);
// set kernel args here
// ...
// enqueue kernel
oclg->queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(epi.cols/64*64+64),cl::NDRange(64));
// get results from device
oclg->queue.enqueueReadBuffer(buffer,CL_TRUE,0,sizeof(float)*dmax*epi.cols,result.data());
}
...
...
cl::Kernel kernel;
OclMain *oclm
cl::Image2D img_epi
}
If epi.rows and epi.cols is small (e.g.: 240x100) this works without a problem.
If epi.rows and epi.cols is big(ger) (e.g.: 863x100), it does not, unless I only use ONE OpenMP thread. If I use more threads, the program will freeze after executing the first few threads. As far as radeontop is concerned; at this point there is nothing going on on the GPU, 0% for all statistics.
The problem seems to be that the call to oclg->queue.enqueueWriteImage(img_epi, CL_TRUE, origin, region, 0, 0, epi.data, 0, NULL) never finishes. If I change the call to non-blocking, the program continues running, untill the next blocking call which won’t return either.
I tried flushing; didn’t help much either. HOWEVER, I don’t seem to have this problem if I upload the data using the CL_MEM_COPY_HOST_PTR method instead of enqueueing a WriteImage. What’s going on?