Hello and pls help :D

Hello all, I’m a computer science student at Edinboro University of PA working on a senior project and I’m running out of time to debug this.

I am struggling to understand the work group sizes, global size, local size, and all that.

The situation is that I have an 800*800 image I need to work on. Originally I tried to set my global work size in clEnqueueNDRangeKernel to 800*800. This caused display driver crashes. Then I found out about max kernel work group sizes, so I tried enqueuing over and over, one work-group-sized chunk at a time, for as many iterations as I needed. This worked, but it was very, very slow.

I am currently trying to set global items to my iterations and local items to my workgroup size and I’m having problems.

How does one divide an 800*800 project up so that I can do one enqueue? Multiple enqueues result in very bad performance, negating the reason for using OpenCL.

Here is my code to grab max work group sizes based on the kernel.

error = clGetKernelWorkGroupInfo(createImageKernelCL, platformDevices[0]->deviceIds[0], CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &maxWorkgroupSize, NULL);

Here is my enqueuing code.


unsigned int iterations = (xyPixels * xyPixels) / (unsigned int)maxWorkgroupSize;
size_t leftOverWorkgroupSize = (xyPixels * xyPixels) % (unsigned int)maxWorkgroupSize;

for (unsigned int i = 0; i < iterations; i++)
{
	globalIDOffset = i * maxWorkgroupSize;
	clEnqueueNDRangeKernel(commandQueueIds[0], createImageKernelCL, 1, &globalIDOffset, &maxWorkgroupSize, NULL, 0, NULL, NULL);
}
if (leftOverWorkgroupSize)
{
	globalIDOffset = iterations * maxWorkgroupSize;
	clEnqueueNDRangeKernel(commandQueueIds[0], createImageKernelCL, 1, &globalIDOffset, &leftOverWorkgroupSize, NULL, 0, NULL, NULL);
}

This is very slow. Please help a noob understand these sizes and how to break up problems.

So I think I’ve learned you can put whatever you want for the global work size, the local work size just has to be able to evenly divide into it without being larger than the kernel max work items.

Is this true?

If it is, then my kernel is what is in truth failing, and can you take a look at it? For some context, I’m drawing a Mandelbrot fractal and using a coloring algorithm afterwards.


const char* CreateImageSource[] = {
"__kernel void CreateImage(__global float* image, __global float* fractalProperties, __global unsigned int* fpOperations)",
"{",
	//"unsigned int n = get_global_id(0) * (*workSize) + get_local_id(0);",
	"unsigned int n = get_global_id(0);",
	"int iterations = 0;",
	"float realC = fractalProperties[0] + (n % 800) * fractalProperties[4];", //2
	"float imaginaryC = fractalProperties[2] + (int)(n / 800) * fractalProperties[5];", // 2
	"float realA = 0;",
	"float imaginaryA = 0;",
	"float magnitude = sqrt(pow(realC, 2) + pow(imaginaryC, 2));", //5
	"float hue = 0;",
	"float value = .556789;",
	"float saturation = 1;",
	"float red = 0;",
	"float green = 0;",
	"float blue = 0;",
	"int i = 0;",
	"float f = 0;",
	"float p = 0;",
	"float q = 0;",
	"float t = 0;",
	"fpOperations[n] = 0;", // 9 up to this point
	"while(magnitude >  fractalProperties[7] && magnitude < fractalProperties[8] && iterations < fractalProperties[6])", // 3
	"{",
		"float tempRealA = (realA * realA) - (imaginaryA * imaginaryA);", // 3
		"imaginaryA = (realA * imaginaryA) + (imaginaryA * realA);", // 3
		"realA = tempRealA + realC;", // 1
		"imaginaryA = imaginaryA + imaginaryC;", // 1
		"magnitude = sqrt(pow(realA, 2) + pow(imaginaryA, 2));", // 3
		"iterations += 1;",
		"fpOperations[n] += 14;",
	"}",
	"if(iterations < 20 || (int)(iterations / fractalProperties[6]) == 1)",
	"{",
		"image[n*3] = 0.0;",
		"image[n*3+1] = 0.0;",
		"image[n*3+2] = 0.0;",
	"}",
	"else",
	"{",
		"hue = 360 * ((float)iterations / (float)fractalProperties[6]);",	// 2
		"hue /= (float)60;", // 1
		"i = (int)floor(hue);", // 1
		"f = hue - i;", // 1
		"p = value * (1 - saturation);", // 2
		"q = value * (1 - saturation * f);", // 3
		"t = value * (1 - saturation * (1 - f));", //4
		"switch(i)",
		"{",
			"case 0:",
			"{",
				"red = value;",
				"green = t;",
				"blue = p;",
				"break;",
			"}",
			"case 1:",
			"{",
				"red = q;",
				"green = value;",
				"blue = p;",
				"break;",
			"}",
			"case 2:",
			"{",
				"red = p;",
				"green = value;",
				"blue = t;",
				"break;",
			"}",
			"case 3:",
			"{",
				"red = p;",
				"green = q;",
				"blue = value;",
				"break;",
			"}",
			"case 4:",
			"{",
				"red = t;",
				"green = p;",
				"blue = value;",
				"break;",
			"}",
			"default:",
			"{",
				"red = value;",
				"green = p;",
				"blue = q;",
				"break;",
			"}",
		"}",
		"image[n*3] = red;",
		"image[n*3+1] = green;",
		"image[n*3+2] = blue;",
	"}",
"}"
};
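For anyone wanting to sanity-check a kernel like this on the host, a plain C reference of the escape-time loop can be handy. The following is only a minimal sketch and not the kernel above: it assumes the textbook Mandelbrot test (start at z = 0, stop when |z| exceeds 2) rather than the configurable bounds in fractalProperties[7] and [8], and it compares the squared magnitude so no sqrt or pow call is needed. The name escape_iterations is made up for illustration.

```c
#include <assert.h>

/* Hypothetical host-side reference, assuming the standard escape radius 2.
 * Returns how many iterations of z = z*z + c ran before |z| exceeded 2,
 * capped at max_iter (points that never escape return max_iter). */
static int escape_iterations(float c_real, float c_imag, int max_iter)
{
    assert(max_iter >= 0);
    float z_real = 0.0f;
    float z_imag = 0.0f;
    int iterations = 0;

    /* |z|^2 <= 4 is the same test as |z| <= 2, without the square root. */
    while (z_real * z_real + z_imag * z_imag <= 4.0f && iterations < max_iter) {
        float next_real = z_real * z_real - z_imag * z_imag + c_real;
        z_imag = 2.0f * z_real * z_imag + c_imag;
        z_real = next_real;
        iterations++;
    }
    return iterations;
}
```

Comparing a few pixels from the device against a reference like this is one way to tell a wrong result apart from a crash caused by something else.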

How does one divide an 800*800 project up so that I can do one enqueue?

So I think I’ve learned you can put whatever you want for the global work size, the local work size just has to be able to evenly divide into it without being larger than the kernel max work items.

Is this true?

Yes, it is true.


cl_int errcode;
size_t work_size[2] = {800, 800};

errcode = clEnqueueNDRangeKernel(myQueue, myKernel,
    2, // Dimensions of the NDRange that you want.
    NULL, // Starting offset of the NDRange. Not used here.
    work_size, // 800x800 (see above)
    NULL, // Remember: you don't need to choose a work-group size explicitly.
    0, // No wait list necessary for in-order command queues (the default).
    NULL, // No wait list necessary.
    NULL); // No return event necessary (same reason as above).
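If you do want to pass an explicit local size instead of NULL, the rule about it dividing the global size evenly is usually handled by rounding the global size up. Here is a minimal sketch of that arithmetic in plain C; round_up_global is a made-up helper name, not part of OpenCL, and the kernel would then need its own bounds check (for example a hypothetical extra argument holding the real item count) so the padding work-items do nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smallest multiple of local_size that is >= total_items.
 * The result can be passed as the global work size together with local_size. */
static size_t round_up_global(size_t total_items, size_t local_size)
{
    assert(local_size > 0);
    size_t groups = (total_items + local_size - 1) / local_size;
    return groups * local_size;
}

/* Kernel-side guard that would go with a padded global size
 * (total_items being an assumed extra kernel argument):
 *
 *     if (get_global_id(0) >= total_items) return;
 */
```

For 640000 items a local size of 256 already divides evenly, so nothing is padded; a local size of 768 would round the global size up to 640512.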
    

If it is then my kernel is what is in truth failing and can you take a look at it.

Could you please rephrase the question?

Thank you for your help! :D

I’m just using a one-dimensional 800*800 array.

So I just put 1 dimension where you put 2 and put 640000 for my global work size.

My array actually holds 3 values per pixel: red, green, and blue. So in my code I index it with global_id * 3 (+0, 1, or 2) depending on the value I want to adjust.
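To spell out that indexing, here is a small sketch in plain C of how a 1-D global id maps to a pixel position and to its three float slots, following the n % 800, n / 800, and n * 3 expressions in the kernel. The struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: where pixel n lives in the image and in the
 * flat float array that stores red, green, blue back to back. */
typedef struct {
    size_t x, y;             /* column and row of the pixel */
    size_t red, green, blue; /* indices into the float array */
} PixelSlots;

static PixelSlots pixel_slots(size_t n, size_t width)
{
    assert(width > 0);
    PixelSlots s;
    s.x = n % width;      /* same as (n % 800) in the kernel */
    s.y = n / width;      /* same as (int)(n / 800) in the kernel */
    s.red = n * 3;        /* image[n*3]   */
    s.green = n * 3 + 1;  /* image[n*3+1] */
    s.blue = n * 3 + 2;   /* image[n*3+2] */
    return s;
}
```

With 640000 pixels the float buffer therefore needs 1920000 elements, and the last work-item writes indices 1919997 through 1919999.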

So I am at least enqueuing correctly. Thank you so much for confirming that :) You are my hero.

My earlier question was whether you saw anything in my kernel that would cause the driver to crash.

Is there a max run time on kernels?

Something I’m doing is causing my program to crash after the first zoom. On each zoom I just change the fractal values and run the kernels again. So it runs fine once, then the second time it will cause my kernel to crash.

I’m suspecting an NVIDIA problem with the 8800 GTS, as NVIDIA seems to have several issues with implementing the specification.

Can anyone just breeze through my kernel real quick and see if there is anything that would make it crash? I posted it above.

And thanks again.

I think I may have narrowed down the problem. Does anyone know of any issues with the GLUT library and OpenCL working together? It seems to only crash when I swap buffers in GLUT, which would imply my kernel isn’t breaking the display driver, GLUT is.

Ok I think I have solved the issue and it was 2 things both involving memory.

First, in my GLUT mouse function I was making a non-blocking write call to memory and immediately calling clEnqueueNDRangeKernel. I figured that since I do operations in order, the memory write would be done before the enqueue. Well, it will start the operation first, sure, but not finish it. So I was making a call to memory, then calling the kernels which use that memory. Bad stuff. I changed it to a blocking write.

The other bad thing was that I was calling a GLUT display command and then, right after it, an OpenCL read buffer command. These 2 do not know about each other in the least, and I’m quite certain the GLUT. For safety reasons I made the OpenCL buffer read blocking and placed it before the GLUT commands. This seems to have resolved my issue. I’ll fill you guys in.

UPDATE:

Ok so it still breaks sometimes, just not as often lol. Back to square 1 I suppose. I can easily break it by jacking up the number of iterations my while loop does inside the kernel.

I am guessing kernels have a maximum run time or something… I am so utterly lost on this problem lol.

Last post on this issue I promise.

I have FINALLY put this baby to rest.

I am encountering the watchdog timer. The watchdog timer will kill a kernel if it runs too long. Basically I need to find a way to split up my kernels to stay under the time limit, or lower my iterations.
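One way to split the work is to cut the 640000 work-items into a handful of large chunks and enqueue each chunk separately. This is a middle ground between a single huge launch and one launch per work group. Below is a rough sketch of just the chunk arithmetic in plain C; Chunk and plan_chunks are made-up names, and the right chunk size is an assumption you would have to tune against your own time limit. Note that the global_work_offset parameter of clEnqueueNDRangeKernel must be NULL in OpenCL 1.0, so on a 1.0 implementation the offset would have to be passed as a kernel argument and added to get_global_id(0) instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one launch: first work-item and item count. */
typedef struct {
    size_t offset;
    size_t count;
} Chunk;

/* Fill `out` with consecutive chunks of at most chunk_items items covering
 * [0, total). Returns the number of chunks written (at most max_chunks). */
static size_t plan_chunks(size_t total, size_t chunk_items,
                          Chunk *out, size_t max_chunks)
{
    assert(chunk_items > 0);
    size_t n = 0;
    for (size_t offset = 0; offset < total && n < max_chunks;
         offset += chunk_items) {
        size_t remaining = total - offset;
        out[n].offset = offset;
        out[n].count = remaining < chunk_items ? remaining : chunk_items;
        n++;
    }
    return n;
}
```

Each planned chunk would then become one enqueue with count as the global work size, with a clFinish or a blocking read between chunks if each one needs to finish before the next starts.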

Thanks to the guy up there who answered some of my questions, if not for you I would still think I was enqueuing wrong.

See this previous thread on WDDM timeout