Problem using multiple host threads

Hi,

My GPU is a ATI Mobility Radeon HD 5450; specifications for the 5470 (which are nearly identical) can be found here. I’ve encountered a problem using multiple host threads (using OpenMP) and OpenCL. What I’m doing is the following:

I’ve got one class containing the context, program, queue, etc.:

class OclMain
{
	...
	...
	cl::Device       device;
	cl::Context      context;
	cl::CommandQueue queue;
	cl::Program      program;
	cl::Program::Sources sources;
	...
	...
}

For each OpenMP thread I have an instance of the following class, which each contains its own cl::Kernel because setArg(…) isn’t thread safe. Each of these also handles its own buffer creation, execution of the kernel, etc. If, for example, I have a maximum of 16 threads using #pragma omp parallel for num_threads(16), I create 16 of these objects and each thread has its own Ocl object. When a thread is done, the Ocl object is reused for a next iteration of the aforementioned for loop and setEpi() is called again to upload the new data to the device. Each thread handles one cv::Mat epi.

class Ocl
{
	Ocl(OclMain *oclm) : oclm(oclm) { kernel = cl::Kernel(oclg->program,"kernel"); }
	...
	// this data doesn't change during execution of an openmp thread, so I only upload it once to the device after thread creation
	void setEpi(cv::Mat &epi)
	{
	    ...
	    img_epi = cl::Image2D(oclm->context, CL_MEM_READ_ONLY, cl::ImageFormat(CL_RGBA, CL_FLOAT), epi.cols, epi.rows, 0, 0);

	    cl::size_t<3> origin;
	    origin[0] = 0; origin[1] = 0, origin[2] = 0;
	    cl::size_t<3> region;
	    region[0] = epi.cols; region[1] = epi.rows; region[2] = 1;

	    oclg->queue.enqueueWriteImage(img_epi, CL_TRUE, origin, region, 0, 0, epi.data, 0, 0);
	    ...

            // enqueue some more WriteBuffers here (only small buffers)
            ...
            ...
	}
	...
	// gets called multiple times (maximum epi.rows times, on a row per row basis; 
	// the algorithm I need to implement works this way)
	void runKernel(...)
	{
		// will contain result of kernel computation
		cl::Buffer buff(oclg->context,CL_MEM_WRITE_ONLY,sizeof(float)*dmax*epi.cols);

		// set kernel args here
		// ...

		// enqueue kernel
		oclg->queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(epi.cols/64*64+64),cl::NDRange(64));

		// get results from device
		oclg->queue.enqueueReadBuffer(buffer,CL_TRUE,0,sizeof(float)*dmax*epi.cols,result.data());		
	}

	...
	...
	cl::Kernel  kernel;
	OclMain    *oclm
	cl::Image2D img_epi

}

If epi.rows and epi.cols is small (e.g.: 240x100) this works without a problem.
If epi.rows and epi.cols is big(ger) (e.g.: 863x100), it does not, unless I only use ONE OpenMP thread. If I use more threads, the program will freeze after executing the first few threads. As far as radeontop is concerned; at this point there is nothing going on on the GPU, 0% for all statistics.

The problem seems to be that the call to oclg->queue.enqueueWriteImage(img_epi, CL_TRUE, origin, region, 0, 0, epi.data, 0, NULL) never finishes. If I change the call to non-blocking, the program continues running, untill the next blocking call which won’t return either.

I tried flushing; didn’t help much either. HOWEVER, I don’t seem to have this problem if I upload the data using the CL_MEM_COPY_HOST_PTR method instead of enqueueing a WriteImage. What’s going on?

Try protecting the call to enqueueWriteImage with a mutex (OpenMP has a lock mechanism - see here: http://stackoverflow.com/questions/2396430/how-to-use-lock-in-openmp ) so that only one thread will execute it simultaneously. If that works, it means that the implementation you’re using is not exactly thread-safe, as defined by the spec… (probably some shared resource down the driver stack is not properly protected)