Global work space size

I’m currently having some issues with my OpenCL code when I try to pass in a large (is it?) global work size.

This is my kernel


kernel void intersection(constant REAL2 *polygon1, int numPoints1, constant REAL2 *polygon2, int numPoints2, global REAL2 *offsets, int numOffsets, double tolerance, REAL linearPrecision, BOOL doingInner, REAL area1, REAL area2, global BOOL *intersects, global BOOL *hasInside, global BOOL *hasOutside, global REAL *dummy)
{
// nothing in here!!
}


Now, when I use a globalws of, say, (128, 128, 16), my code runs fine.
But when I use a large globalws of, say, (1000000, 10000000, 1), waiting for the queue to finish fails with an invalid command queue error.


// as an example:

cl::NDRange globalws(1000000, 10000000, 1), localws = cl::NullRange;  // this range fails at finish()
// cl::NDRange globalws(128, 128, 16), localws = cl::NullRange;       // this range works
              
// run the kernel
cl_int err = queue->enqueueNDRangeKernel(intersectKernel, cl::NullRange, globalws, localws);

The code works fine for smaller numbers. I make sure that I’m not trying to access an array element that doesn’t exist. In fact, I’ve removed all code from the kernel to see if I still get these queue failures, and indeed I do (so it’s not the kernel code).

I’m assuming it’s a memory problem… ??

Without seeing how you allocate memory via clCreateBuffer, it is hard to tell whether it’s a memory issue, because the number of global work-items can easily be 2^32 or 2^64 in one dimension. Have you really checked the error codes from all OpenCL calls?

OK, this is how I upload the data. I’ve tried to keep things to a minimum, but you can see I’m checking every call where I can…

Poly1 and poly2 are polygons, each stored as an array of double2 (pairs of doubles).
They are properly aligned etc.



size_t vertexPt_sz = sizeof(double) * 2;
size_t int_sz = sizeof(int);

cl::Buffer buf_poly1(*context, CL_MEM_READ_ONLY, vertexPt_sz * poly1.numPoints());
cl::Buffer buf_poly2(*context, CL_MEM_READ_ONLY, vertexPt_sz * poly2.numPoints());
cl::Buffer buf_result(*context,  CL_MEM_READ_WRITE, int_sz * numOffsets);
cl::Buffer buf_offsets(*context, CL_MEM_READ_ONLY,  vertexPt_sz * numOffsets);
cl::Buffer buf_intersects(*context, CL_MEM_READ_WRITE, int_sz * numOffsets);
cl::Buffer buf_insde(*context, CL_MEM_READ_WRITE, int_sz * numOffsets);
cl::Buffer buf_outside(*context, CL_MEM_READ_WRITE, int_sz * numOffsets);
cl::Buffer buf_dummy(*context,   CL_MEM_READ_WRITE, sizeof(REAL));

cl_int status;
status = queue->enqueueWriteBuffer(buf_poly1,   CL_TRUE, 0, vertexPt_sz * poly1.numPoints(), (void*) poly1.pts.Buffer);
if (status != CL_SUCCESS)
  Message(MU_TEXT("%s"), OpenCL().GetErrorText(status).str());

status = queue->enqueueWriteBuffer(buf_poly2,   CL_TRUE, 0, vertexPt_sz * poly2.numPoints(), (void*) poly2.pts.Buffer);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
status = queue->enqueueWriteBuffer(buf_offsets,  CL_TRUE, 0, vertexPt_sz * numOffsets, (void*) offsets);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));

status = queue->enqueueWriteBuffer(buf_intersects, CL_TRUE, 0, int_sz * numOffsets, (void*) intersects);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
  
status = queue->enqueueWriteBuffer(buf_insde, CL_TRUE, 0, int_sz * numOffsets, (void*) hasInside);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
  
status = queue->enqueueWriteBuffer(buf_outside, CL_TRUE, 0, int_sz * numOffsets, (void*) hasOutside);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
    
status = queue->enqueueWriteBuffer(buf_dummy,   CL_TRUE, 0, sizeof(REAL), (void*) dummy);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));


/// now create the kernel
cl::Kernel intersectKernel(*OpenCL().GetProgram(),"intersection");
              
// set the arguments
status = intersectKernel.setArg(0, buf_poly1);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));

status = intersectKernel.setArg(1, poly1.numPoints());
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));

status = intersectKernel.setArg(2, buf_poly2);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
              
status = intersectKernel.setArg(3, poly2.numPoints());
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
              
status = intersectKernel.setArg(4, buf_offsets);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
  
status = intersectKernel.setArg(5, numOffsets);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
  
status = intersectKernel.setArg(6, 0.05);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
  
status = intersectKernel.setArg(7, 0.001);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
  
status = intersectKernel.setArg(8, doingInner);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
  
status = intersectKernel.setArg(9, poly1.GetArea());
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));

status = intersectKernel.setArg(10, poly2.GetArea());
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
  
status = intersectKernel.setArg(11, buf_intersects);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
  
status = intersectKernel.setArg(12, buf_insde);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));
              
status = intersectKernel.setArg(13, buf_outside);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));

status = intersectKernel.setArg(14, buf_dummy);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));

// create the ndrange
cl::NDRange globalws(poly1.numPoints(), poly2.numPoints(), numOffsets);
cl::NDRange localws(1,1,1);

// run the kernel
status = queue->enqueueNDRangeKernel(intersectKernel, cl::NullRange, globalws, localws);
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));

// make sure the work has finished before progressing
status = queue->finish();
if (status != CL_SUCCESS)
  Message(TEXT("%s"), OpenCL().GetErrorText(status));


So when I run this code and then read the results back in, I don’t have any problems unless the number of points is increased dramatically.

I’ve mostly been working with polygons of fewer than 50 points (though some are larger) and a small number of offsets.
But when I started to increase this, with polygons of more than 1000 points and offsets in the 100s, it fails on the last call to finish, before I read anything back.

Please forgive any typos in the above code; I’ve tried to copy, paste, and edit it so that it’s more readable.

I have previously created the device, queue and program.
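
The repeated status checks in the code above could be folded into one helper that stops at the first failing call, which makes it easier to see whether the error reported by finish() is the first failure or a follow-on from an earlier one. This is only a sketch: statusName and checkCL are hypothetical names, and the integer constants are stand-ins that mirror the values in CL/cl.h so the snippet compiles without the OpenCL headers.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in values for a few OpenCL status codes, mirroring CL/cl.h so this
// sketch compiles without the OpenCL headers. Real code should use the macros.
constexpr int kSuccess               = 0;    // CL_SUCCESS
constexpr int kOutOfResources        = -5;   // CL_OUT_OF_RESOURCES
constexpr int kInvalidCommandQueue   = -36;  // CL_INVALID_COMMAND_QUEUE
constexpr int kInvalidWorkGroupSize  = -54;  // CL_INVALID_WORK_GROUP_SIZE
constexpr int kInvalidGlobalWorkSize = -63;  // CL_INVALID_GLOBAL_WORK_SIZE

// Hypothetical helper: map a status code to a readable name.
inline std::string statusName(int status) {
    switch (status) {
        case kSuccess:               return "CL_SUCCESS";
        case kOutOfResources:        return "CL_OUT_OF_RESOURCES";
        case kInvalidCommandQueue:   return "CL_INVALID_COMMAND_QUEUE";
        case kInvalidWorkGroupSize:  return "CL_INVALID_WORK_GROUP_SIZE";
        case kInvalidGlobalWorkSize: return "CL_INVALID_GLOBAL_WORK_SIZE";
        default:                     return "OpenCL error " + std::to_string(status);
    }
}

// Hypothetical helper: throw on the first failing call and name the call site.
inline void checkCL(int status, const char* what) {
    assert(what != nullptr);
    if (status != kSuccess) {
        throw std::runtime_error(std::string(what) + ": " + statusName(status));
    }
}
```

With something like this, each call shrinks to one line, for example checkCL(queue->finish(), "finish"), and the exception message names the call that failed first.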


A global work size of (1000000, 10000000) is 10^13 items, which is roughly 2^43. Perhaps your driver uses 32-bit counters for the global work size, not expecting more than 4 billion of them in a single kernel invocation. Based on the execution speed of your smaller tests, how much time do you estimate 10^13 items should take? I have some kernels that execute 10^6 items in 1 ms, so extrapolating that to 10^13 items they would take 10^7 ms, or 10^4 seconds, which is about 2.8 hours. That seems rather long for an OpenCL kernel. Note that some GPUs (those that are also used for the GUI/desktop) will reset after 30 seconds or so of GPU compute (unless you change certain OS flags).
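
The arithmetic in this reply can be checked with a few lines of C++. This is a minimal sketch, not part of the original code: the function names are made up for illustration, the 32-bit limit is the guess made above rather than a documented driver property, and the throughput figure is the 10^6 items per millisecond quoted in the reply.

```cpp
#include <cstdint>
#include <limits>

// Total work-items in a 3-D range, computed in 64 bits.
// Assumes the product itself fits in 64 bits (true for the sizes in this thread).
constexpr std::uint64_t totalWorkItems(std::uint64_t x, std::uint64_t y, std::uint64_t z) {
    return x * y * z;
}

// True if the total fits in a 32-bit counter (the assumed driver limit).
constexpr bool fitsIn32Bits(std::uint64_t total) {
    return total <= std::numeric_limits<std::uint32_t>::max();
}

// Extrapolated run time in seconds for a measured throughput in items per millisecond.
constexpr double estimateSeconds(std::uint64_t total, double itemsPerMs) {
    return static_cast<double>(total) / itemsPerMs / 1000.0;
}

// The two ranges from the question:
static_assert(totalWorkItems(1000000, 10000000, 1) == 10000000000000ULL, "10^13 work-items");
static_assert(!fitsIn32Bits(totalWorkItems(1000000, 10000000, 1)), "does not fit in 32 bits");
static_assert(fitsIn32Bits(totalWorkItems(128, 128, 16)), "262144 work-items fit easily");
```

For the large range, estimateSeconds(totalWorkItems(1000000, 10000000, 1), 1e6) evaluates to 10000 seconds.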

Yeah, I figured this out on Friday. I read somewhere that the maximum size is determined by the maximum value of size_t on the device. Although I have a 64-bit OS, my graphics card uses 32-bit addresses, so I can launch a maximum of 2^32 threads.
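
A rough way to apply that limit before enqueueing is sketched below. It assumes the address width has already been read from the device (clGetDeviceInfo with CL_DEVICE_ADDRESS_BITS reports it); here it is passed in as a plain number so the snippet stands alone. The function names are hypothetical, and checking the total as well as each dimension is a conservative assumption based on what was observed above, not something taken from the specification.

```cpp
#include <cstdint>
#include <limits>

// Largest value a device-side size_t can hold for a given address width.
// addressBits is assumed to come from CL_DEVICE_ADDRESS_BITS (typically 32 or 64).
constexpr std::uint64_t maxDeviceSizeT(unsigned addressBits) {
    return addressBits >= 64
        ? std::numeric_limits<std::uint64_t>::max()
        : (std::uint64_t{1} << addressBits) - 1;
}

// True if each dimension and the total work-item count are representable on
// the device. Empty dimensions are rejected in this sketch, which also avoids
// dividing by zero below.
constexpr bool rangeFitsDevice(std::uint64_t x, std::uint64_t y, std::uint64_t z,
                               unsigned addressBits) {
    return x != 0 && y != 0 && z != 0
        // per-dimension check
        && x <= maxDeviceSizeT(addressBits)
        && y <= maxDeviceSizeT(addressBits)
        && z <= maxDeviceSizeT(addressBits)
        // total check, written with divisions so x * y * z never overflows
        && x <= maxDeviceSizeT(addressBits) / y
        && x * y <= maxDeviceSizeT(addressBits) / z;
}
```

With these definitions, rangeFitsDevice(1000000, 10000000, 1, 32) is false while rangeFitsDevice(128, 128, 16, 32) is true, which matches the behaviour described in this thread.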

As for the speed, it’s not too bad if I queue up the 2^32 threads. We’re probably talking maybe 30 seconds. Unfortunately, the calcs I’m doing don’t look strenuous enough to make running them on the GPU worthwhile. But I’m going to keep playing around with it, trying to optimise things even more and getting a good feel for it. Just by playing around I’ve learnt a great deal, so I’m quite pleased with my progress thus far.

Thanks for the heads-up on the 30-second rule. That will come in handy, I’m sure.
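
One way to stay under both the work-size limit and the time limit is to split a large range into several smaller launches. The sketch below only does the host-side bookkeeping for a 1-D range; Chunk and splitRange are hypothetical names, and how large each launch can safely be is something to measure on the actual device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One slice of a 1-D global range: where it starts and how many work-items it covers.
struct Chunk {
    std::uint64_t offset;
    std::uint64_t size;
};

// Split [0, total) into consecutive chunks of at most maxPerLaunch work-items.
std::vector<Chunk> splitRange(std::uint64_t total, std::uint64_t maxPerLaunch) {
    assert(maxPerLaunch > 0);
    std::vector<Chunk> chunks;
    std::uint64_t offset = 0;
    while (offset < total) {
        const std::uint64_t size = std::min(maxPerLaunch, total - offset);
        chunks.push_back({offset, size});
        offset += size;  // never exceeds total, so this cannot overflow
    }
    return chunks;
}
```

Each chunk would then become one enqueueNDRangeKernel call, with the chunk's offset passed as the offset argument (cl::NullRange in the code earlier in the thread) and its size as the global range. Global work offsets need OpenCL 1.1 or later; on older implementations the offset could be passed as an extra kernel argument instead.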