GL-CL interoperability performances

whites11 · February 11, 2010, 6:51am

Hi.
i’m developing a sample app that has to do the following:

render a single frame to an offscreen framebuffer
analyze pixel by pixel the generated image (in different manners).

i developed a standard openGL app, which draws offscreen, calls the glReadPixels to retrieve rendering result and then do its stuff. this one takes about 0,15 seconds to perform 100 runs on a small rendering (300x300).

then, i developed an opencl app that:

prepares an opengl context
prepares an opencl buffer from the framebuffer
computes the rendered image on the GPU side (there is no explicit data copy between RAM and VRAM)
retrievs the result of the evaluation from the GPU memory (this is just one float number)

this app takes about 4 seconds to run!!!

a simplified snippet of code:


glutInit(...)

glGenFramebufferEXT(...)
glBindFramebufferEXT(...)

glGenRenderBufferEXT(1, &colorId);
glBindRenderbufferEXT(GL_RENDERBUFFER_EXT, colorId);
glRenderbufferStorageEXT(GL_RENDERBUFFER_EXT, GL_RGBA8, 300, 300)

glFramebufferRenderbufferEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT, GL_RENDERBUFFER_EXT, colorId);

CGLContextObj kCGLContext = CGLGetCurrentContext();              
CGLShareGroupObj kCGLShareGroup = CGLGetShareGroup(kCGLContext);
		
cl_context_properties properties[] = { 
	CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE, 
	(cl_context_properties)kCGLShareGroup, 
	0 
};
        
// Create a context from a CGL share group
context = clCreateContext(properties, NULL, NULL, NULL, NULL, &err);

cl_device_id devices[1];
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, devices, NULL);

program = clCreateProgramWithSource(context, 1, strings, lengths, &err);
queue = clCreateCommandQueue(context, devices[0], NULL, &err);
m_Kernel = clCreateKernel(program, "Evaluate", &err);
res_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float), NULL, &err);
clSetKernelArg(m_Kernel, 1, sizeof(cl_mem), &res_buffer);	
m_Shared = clCreateFromGLRenderbuffer(context, CL_MEM_READ_ONLY, colorId, &err);
clSetKernelArg(m_Kernel, 0, sizeof(cl_mem), &m_Shared);	

Draw(); // openGL code to draw to the framebuffer
glFinish();

clEnqueueAcquireGLObjects(queue, 1, &m_Shared, NULL, NULL, NULL);
const size_t number = 1;
clEnqueueNDRangeKernel(queue, m_Kernel, 1, NULL, &number, &number, 0, NULL, NULL);
clEnqueueReleaseGLObjects(queue, 1, &m_Shared, 0, 0, 0)

float results[1];
clEnqueueReadBuffer(queue, res_buffer, CL_TRUE, 0, sizeof(float), results, NULL, NULL, NULL);

the kernel:


__kernel void Evaluate(
    read_only image2d_t framebuffer,
    __global int *results){
	
int matching = 0;
for(int i=0;i<3000;i++)
{
	for(int j=0;j<3000;j++)
	{
		float4 pixel = read_imagef(framebuffer, CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST, (int2)(i,j));
		if(pixel matches)
                       matching++;
	}
results[0] = matching;
}

i can’t understand why performances are so bad.
any help would be really appreciated

if you have any questions, please ask

thanks a lot,
christian

dbs2 · February 11, 2010, 10:46am

It looks like your global work size is 1, which means you’re only using 1/16th of 1 of the streaming processors on the GPU. (Which is far slower than 1 core on a CPU.)

You should set your global work size to 3000, 3000 and remove the for-loops in your kernel. (BTW – do you really mean 3k by 3k or do you want 300 by 300? The image is only 300x300.) At the end you then use an atomic add to increment the number matching if there is a match. Actually, since the atomic add is going to be very slow, you might want to set your global size to be much smaller (say 100, 100 for a 300x300 image) and then count up to 9 matches per work-item. That would reduce the number of atomic adds you’d have to do.

Make sure you verify that atomics are supported on the device you’re using, though!

whites11 · February 11, 2010, 12:07pm

hi and thank you for the answer.
actually images are 300x300. i saw i made many mistakes in the last post but i cannot edit it.
tomorrow i’ll surely try what you suggested.
thanks again

christian

whites11 · February 12, 2010, 5:53am

Hi,
i tryied using several work sizes like suggested by you.
i got the best performances using small computation areas for each kernel, such as 100 pixels for each kernel for a 300x300 image.
now i have execution times of about 0,3seconds in openCL vs 0,1 seconds for the CPU algorithm which uses getPixels to retrieve the whole framebuffer from the VRAM.
do you think this is right or not?

dbs2 · February 12, 2010, 7:00am

You should be able to execute this (depending of course on the details of “pixel matches”) far faster on the GPU. There’s something wrong.

Can you post your new kernel for doing the matching pixels counts? Are you using a global atomic add to have each parallel count added together?

whites11 · February 12, 2010, 7:12am

this is the kernel:


__kernel void Evaluate(
    read_only image2d_t framebuffer,
	int width,
	int height,
	int block_size, 
	__global int *results
	){
	
		int index = get_global_id(0);
		
		int reds = 0;
		int a = index*block_size;
		int b = a+block_size;
		for(int i=a;i<b;i+=1)
		{
			int row = trunc((float)i / (float)width);
			int column = i - (row * width);
			
			float4 pixel = read_imagef(framebuffer, CLK_ADDRESS_NONE | CLK_FILTER_NEAREST, (int2)(row,column));
			if(pixel.x == 1 && pixel.y == 0 && pixel.z == 0)
			{
				reds+=1;
			}
		}
		results[index] = reds;
	}

as you can see, each work item save in a different location in the results array, so there is no atomic code in the kernels.
the final sum is made in CPU land, after reading back data.
note that this is not actually a bottleneck (even with this part of code commented out performances are mostly the same).

dbs2 · February 12, 2010, 9:20am

What is your global size? In general you need 1000+ global work-items to hide memory latencies.

whites11 · February 12, 2010, 11:26am

by now i set global work items number to 900 and local size to 1.
what should i set?

pobelzile · February 12, 2010, 12:49pm

Local work size seems pretty small to me. On NVidia architecture, it must be a multiple of 32 for maximum efficiency. When no collaboration is required inside work group threads (ie no communication through shared mem), the local work size should be optimized using CUDA occupancy calculator (an excel spreadsheet available part of the SDK). What you want to do is maximize multiprocessor occupancy factor.

I hope it helps…

P-O

whites11 · February 13, 2010, 3:23am

hi and thanks for your answer.
so, let’s say i need a total amount of 900 work items.
which should be the global and local numbers in your opinion?

dbs2 · February 13, 2010, 9:53am

For only 900, you should definitely have a global size of 900. Even that is a bit low. I’d recommend 1000-2000 at a minimum, more for larger GPUs.

As for local size, if you have no synchronization requirements, then it will be device-dependent and you’ll have to play with it. To start you can just pass in NULL for the local size. If you have synchronization requirements then that will dictate what you need.