Strange issue when dealing with Bitmasks

EDIT:
Solved

I am in the process of porting a graph-based genetic algorithm and I keep coming across a strange problem. I generate my chromosomes on the CPU and offload them to the GPU. One of the steps of my fitness function is to determine how many bits are set to 1 (which would indicate the inclusion of a node). When I try to verify the results on the CPU, the numbers do not match up. First I figured my local caching was the issue, so I switched to using global memory, to no avail. Next I figured maybe the integer modulus and division were the problem, so I tried re-implementing it using floating-point operations and casts. Still not working. It seems to me that the chromosome isn’t being copied properly for gid > 0. Has anyone used bitmasks effectively on the GPU?

Here, chrome_local is an array:
uchar chrome_local[CHROM_SIZE_BYTES];

		totalChromOff = gid*popSize*CHROM_SIZE_BYTES+i*CHROM_SIZE_BYTES;

		//copy the chromosome
		for (int n = 0; n < CHROM_SIZE_BYTES; n++) {
			chrome_local[n] = InputChroms[totalChromOff+n];
		}

		sSize = 0;
		
		//count the vertices whose bit is set (i.e. not zero)
		for (unsigned int item = 0; item < numVerts; item++) {
			if (!isBitZero(chrome_local, item)) 
			{
				sSize++;
			}
		}
		OutputFitness[gid*popSize+i] = (float)sSize;
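
(For reference, my isBitZero helper isn’t shown above; a minimal sketch of the kind of test it performs, assuming bit b lives in byte b/8 at position b%8, would be something like this:)

//hypothetical sketch, not my exact helper: test whether a given bit of the chromosome is zero
bool isBitZero(const uchar *chrome, unsigned int bit)
{
	uchar byte = chrome[bit / 8];           //byte that holds the bit
	uchar mask = (uchar)(1 << (bit % 8));   //mask for the bit within that byte
	return (byte & mask) == 0;              //true when the bit is not set
}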

As a note: This is my first attempt at using OpenCL and I love the power. I just need to learn all the tricks of the trade :wink:
If you have any questions or need more information please let me know!

Note: It seems that the first work-item in the group calculates all of its sizes properly, but every other work-item is off. Might this have to do with memory access?

Also, I’ve tried copying the chromosomes back after they are written and recalculating, and everything comes back correct. So the problem lies either in the conditional being executed incorrectly for gid > 0, or in chrome_local not holding the correct data for gid > 0. The address gets calculated properly as far as I know. I’ll try eliminating the conditional using a lookup table. If that doesn’t fix it, chrome_local must not be copied correctly. Otherwise I guess I’m just crazy.

Okay, now I’ve changed the code to use a lookup table and it still doesn’t work…


totalChromOff = (gid*POP_SIZE+i)*CHROM_SIZE_BYTES;	
sSize = 0;

//copy the chromosome and count its set bits via the lookup table
for (int n = 0; n < CHROM_SIZE_BYTES; n++) {
	chrome_loc[n] = InputChroms[totalChromOff+n];
	sSize += LookupTable[ chrome_loc[n] ];
}

OutputFitness[gid*POP_SIZE+i] = (float)sSize; //testing

Here is my host code:


Buffer bufferMyChroms = Buffer(context, CL_MEM_READ_WRITE, CHROM_SIZE_BYTES*numTotalChromosomes * sizeof(cl_uchar));
...
queue.enqueueWriteBuffer(bufferMyChroms, CL_TRUE, 0, sizeChrom*numTotalChromosomes, chromosomes);
...
kernelGA.setArg(1, bufferMyChroms);

I don’t see the point in copying InputChroms array into a local or global buffer. You can directly access InputChroms in your loop since it is already in global memory.
You also don’t give information about the way you split your kernel execution into work-groups. This can explain potential synchronization troubles.
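
For instance, something along these lines should produce the same count without the intermediate copy (just a sketch, reusing the names from your kernel):

//sketch: count the set bits straight from global memory, no private copy
sSize = 0;
for (int n = 0; n < CHROM_SIZE_BYTES; n++) {
	sSize += LookupTable[ InputChroms[totalChromOff+n] ];
}
OutputFitness[gid*POP_SIZE+i] = (float)sSize;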

[QUOTE=utnapishtim;30073]I don’t see the point in copying InputChroms array into a local or global buffer. You can directly access InputChroms in your loop since it is already in global memory.
You also don’t give information about the way you split your kernel execution into work-groups. This can explain potential synchronization troubles.[/QUOTE]

My apologies. I copy the chromosome to local memory because I access it a lot in subsequent calculations (hundreds of times for each chromosome). I have tried, however, using global memory (InputChroms) directly for this calculation, and I get the same faulty result for gid > 0.

Now this is the part I am unsure of. My current global NDRange is set to 2 (eventually I will increase this once I figure out the issue) and I omit the local range (which I guess means the driver will decide how to split up the job locally).


// Make kernel
Kernel kernelGA(program, "ComputeFitness");
	
// Create memory buffers
Buffer bufferMyGraph = Buffer(context, CL_MEM_READ_WRITE, MAX_DEGREE* numverts * sizeof(cl_uint));
Buffer bufferMyChroms = Buffer(context, CL_MEM_READ_WRITE, CHROM_SIZE_BYTES*numTotalChromosomes * sizeof(cl_uchar));
Buffer bufferMyQueues = Buffer(context, CL_MEM_READ_WRITE, NUM_STREAM_PROCESSORS*sizeof(queue));
Buffer bufferMyLookupTable = Buffer(context, CL_MEM_READ_WRITE, 256*sizeof(cl_int));
Buffer bufferMyFitness = Buffer(context, CL_MEM_READ_WRITE, numTotalChromosomes * sizeof(cl_float));
	
// Copy graph and initial chromosomes to the buffer

queue.enqueueWriteBuffer(bufferMyChroms, CL_TRUE, 0, CHROM_SIZE_BYTES*numTotalChromosomes, chromosomes);
queue.enqueueWriteBuffer(bufferMyGraph, CL_TRUE, 0, MAX_DEGREE* numverts * sizeof(cl_uint), adj);
queue.enqueueWriteBuffer(bufferMyQueues, CL_TRUE, 0, NUM_STREAM_PROCESSORS*sizeof(queue), myQueues);
queue.enqueueWriteBuffer(bufferMyLookupTable, CL_TRUE, 0, 256 * sizeof(cl_int), lookupTable);

// Set arguments to kernel
kernelGA.setArg(0, bufferMyGraph);
kernelGA.setArg(1, bufferMyChroms);
kernelGA.setArg(2, bufferMyQueues);
kernelGA.setArg(3, bufferMyFitness);
kernelGA.setArg(4, bufferMyLookupTable);
kernelGA.setArg(5, MAX_DEGREE);

// Run the kernel on specific ND range
NDRange global(NUM_STREAM_PROCESSORS); // this is currently 2
//NDRange local(1);
			
queue.enqueueNDRangeKernel(kernelGA, NullRange, global);

So might this be an issue with my memory access?
If there are any good resources (besides the api documentation, it didn’t seem too informative to me) let me know.
Thanks for the help by the way. I’ve been tearing my hair out trying to figure this out.


Additional clarification:
My current kernel function contract looks like this

kernel void ComputeFitness(__global int* InputGraph,
			 __global uchar* InputChroms,
			 __global queue* InputQueues,
			 __global float* OutputFitness,
			 __global int* LookupTable,
			 const unsigned int maxDegree)

Each input has the following format (most are 2- or 3-dimensional arrays flattened into one piece of contiguous memory):

InputGraph[GRAPH_SIZE, maxDegree]
-This holds an adjacency list with edges for each vertex. The first index is the vertex id, and the second index is the edge
-I access it as InputGraph[vert*maxDegree+edge], where vert and edge are the indices

InputChroms[NUM_WORK_ITEMS, POP_SIZE, CHROM_SIZE_BYTES]
-This array stores contiguous chromosomes(each of length CHROM_SIZE_BYTES uchar)
-Each work item is allocated POP_SIZE number of chromosomes
-I access this as InputChroms[(gid*POP_SIZE+cur_chrom)*CHROM_SIZE_BYTES+byteOffset]
**where gid is the id of the work-item
**(gid*POP_SIZE*CHROM_SIZE_BYTES) is the offset to the beginning of the current work-item’s block of chromosomes
**(cur_chrom*CHROM_SIZE_BYTES) is the offset into that block for the chromosome of interest
**I factored this to get ((gid*POP_SIZE+cur_chrom)*CHROM_SIZE_BYTES) as the offset to the beginning of the chromosome
**byteOffset is the particular byte I want to access (if I want the 30th bit in a chromosome, it is in byte 30/8 = 3)

InputQueues[NUM_WORK_ITEMS] and I access it using InputQueues[gid]
-this is an array of queues, one for each work-item (these are used later to perform a BFS)

OutputFitness[NUM_WORK_ITEMS, POP_SIZE]
-this is the array I write back to after evaluating the fitness (and I am currently using it to get the GPU-calculated sSize that is causing me issues)
-OutputFitness[gid*POP_SIZE+cur_chrom]

LookupTable[256]
-this is something I just added, and it stores the number of 1 bits for every number from 0-255. I do this now to count 8 bits at once in the chromosome. All my tests show that the counting still happens correctly (for the first work-item, gid=0), no matter how many chromosomes that one work-item actually counts. (A sketch of how I fill this table is at the end of this list.)

int maxDegree
-just stores the maximum degree of an item
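
For completeness, since the lookup table is just the popcount of every possible byte value, this is roughly how I fill it on the host (a sketch; lookupTable matches the name used in the host code above):

// sketch: precompute the number of set bits for every byte value 0-255
cl_int lookupTable[256];
for (int v = 0; v < 256; v++) {
	int bits = 0;
	for (int b = 0; b < 8; b++)
		bits += (v >> b) & 1;
	lookupTable[v] = bits;
}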

I think there is a problem with the way you handle local memory.
Don’t forget that local memory is memory shared among work-items in the same work-group.

When you write:


for (int n = 0; n < CHROM_SIZE_BYTES; n++) {
    chrome_local[n] = InputChroms[totalChromOff+n];
}

this is extremely dubious, because each work-item will write to the same memory locations.

Generally speaking, when you use local memory in a kernel, you have to explicitly set the local work-size in clEnqueueNDRangeKernel() because it has to match the amount of local memory allocated in the kernel.
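
For example, with the C++ bindings you are using, that would look something like this (a sketch; the local size must evenly divide the global size):

NDRange global(NUM_STREAM_PROCESSORS);
NDRange local(1);  // explicit work-group size: one work-item per group
queue.enqueueNDRangeKernel(kernelGA, NullRange, global, local);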

Maybe you mean private memory instead of local memory?
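
To illustrate the difference inside a kernel body (a sketch, not your actual code; groupChromOff is a made-up per-group offset):

uchar chrome_private[CHROM_SIZE_BYTES];          // private memory: one copy per work-item
__local uchar chrome_shared[CHROM_SIZE_BYTES];   // local memory: ONE copy shared by the whole work-group

// a shared copy would have to be filled by part of the group only,
// then fenced before anyone reads it:
if (get_local_id(0) == 0) {
	for (int n = 0; n < CHROM_SIZE_BYTES; n++)
		chrome_shared[n] = InputChroms[groupChromOff+n];
}
barrier(CLK_LOCAL_MEM_FENCE);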