OpenCL bit-matrix multiplication (implementing tiling in local memory)

Hello all

I am performing multiplication between two square bit-matrices using a specific formula (not the usual row-by-column dot product; I will describe it below). My implementation works correctly, but it falls apart when I attempt tiling in local memory. Even though I understand the principles behind it (i.e. staging some of my data in local arrays, then using barriers to synchronize my work-items), I keep getting wrong results.

This is an example of how I set up the matrices:


int N = 256; // number of bits in a row/col (or any number that is evenly divisible by 64)
unsigned long long A = new unsigned long long[N*N/64]; // stores bits in 64-bit integers
unsigned long long B = new unsigned long long[N*N/64]; // for example, one row consists of 256 bits and uses 4x64-bit integers to store them
int *C = new int[N*N];

This is the actual formula:
Row i of matrix A is XORed with row j of matrix B. Yes, it is a row-to-row operation, not row-to-column. Then I count the number of 1 bits in the result and assign that count to C[i][j].
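For reference, here is how the formula looks as a plain host-side loop (a minimal C++ sketch of the same computation, using the GCC/Clang builtin __builtin_popcountll for the bit count; the function name is just for illustration):

```cpp
#include <cstdint>
#include <vector>

// Host-side reference for the formula: C[i*N + j] is the number of
// 1-bits in (row i of A) XOR (row j of B). Each row of N bits is
// packed into N/64 64-bit words.
void bitProductRef(int N,
                   const std::vector<uint64_t>& A,
                   const std::vector<uint64_t>& B,
                   std::vector<int>& C)
{
    const int words = N / 64;            // 64-bit words per row
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < words; k++)
                sum += __builtin_popcountll(A[i * words + k] ^ B[j * words + k]);
            C[i * N + j] = sum;
        }
}
```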

This is how I launch the kernel:


const size_t global[2] = { N, N };
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, &event);
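(Note for the tiled version: there the local work size can no longer be left as NULL, because the kernel's tile size has to match the work-group size, and it must divide the global size evenly. A hypothetical launch for a 16x16 work-group would be:)

```c
const size_t global[2] = { N, N };
const size_t local[2]  = { 16, 16 };  /* must divide {N, N} evenly */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &event);
```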

This is the actual kernel, which works correctly:

    __kernel void BitProduct(const int N, const __global ulong* A, const __global ulong* B, __global int* C)
    {
    	const int i = get_global_id(0);
    	const int j = get_global_id(1);

    	ulong sum = 0;

    	for (int k = 0; k < N/64; k++)
    		sum += popcount( A[ i*(N/64) + k ] ^ B[ j*(N/64) + k ] );

    	C[ i * N + j ] = (int) sum;
    }
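For context, the kind of structure I am aiming for looks roughly like this (a sketch, not my actual failing kernel; it assumes a 16x16 work-group, N a multiple of 16*64, and the two __local buffers being sized from the host via clSetKernelArg). Since my formula is row-to-row, each work-group computing a TILE x TILE block of C needs TILE whole rows of A and TILE whole rows of B, which are loaded cooperatively once:

```c
#define TILE 16

__kernel void BitProductTiled(const int N,
                              const __global ulong* A,
                              const __global ulong* B,
                              __global int* C,
                              __local ulong* Arows,  /* TILE * (N/64) ulongs */
                              __local ulong* Brows)  /* TILE * (N/64) ulongs */
{
	const int i     = get_global_id(0);  /* row of A */
	const int j     = get_global_id(1);  /* row of B */
	const int li    = get_local_id(0);
	const int lj    = get_local_id(1);
	const int words = N / 64;            /* 64-bit words per row */

	/* Cooperative load: the TILE*TILE work-items stripe the
	   TILE*words words of each block of rows between them. */
	for (int w = li * TILE + lj; w < TILE * words; w += TILE * TILE) {
		Arows[w] = A[get_group_id(0) * TILE * words + w];
		Brows[w] = B[get_group_id(1) * TILE * words + w];
	}

	/* Every work-item in the group must reach this barrier before
	   any of them reads the local buffers; it must not sit inside
	   divergent control flow. */
	barrier(CLK_LOCAL_MEM_FENCE);

	ulong sum = 0;
	for (int k = 0; k < words; k++)
		sum += popcount(Arows[li * words + k] ^ Brows[lj * words + k]);

	C[i * N + j] = (int) sum;
}
```

On the host the local buffers are sized with clSetKernelArg(kernel, 4, TILE * (N/64) * sizeof(cl_ulong), NULL) (and likewise for argument 5), and the launch must use a {TILE, TILE} local work size. Two classic causes of wrong results with this pattern are a barrier placed inside divergent control flow (some work-items skip it) and a missing second barrier when the load/compute steps sit inside a loop that refills the tiles.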

I made a typo in the first post: I forgot to add the * when declaring A and B. Of course I am initializing pointers to dynamically allocated memory.

int N = 256; // number of bits in a row/col (or any number that is evenly divisible by 64)
unsigned long long *A = new unsigned long long[N*N/64]; // stores bits in 64-bit integers
unsigned long long *B = new unsigned long long[N*N/64]; // for example, one row consists of 256 bits and uses 4x64-bit integers to store them
int *C = new int[N*N];