Hello all

I am multiplying two square bit-matrices using a specific formula (not the dot product of row/col; I describe it below). My implementation works correctly, but it breaks when I attempt tiling in local memory. Even though I understand the principles behind it (i.e. putting some of my data in local arrays, then using barriers to synchronize my work-items), I keep getting the wrong results.

This is an example of how I set up the matrices:

Code :
int n = 256; // number of bits in a row/col (or any number that is evenly divisible by 64)
unsigned long long* A = new unsigned long long[n*n/64]; // stores bits in 64-bit integers
unsigned long long* B = new unsigned long long[n*n/64]; // for example, one row consists of 256 bits and uses 4x64-bit integers to store them
int* C = new int[n*n];
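To make the packed layout concrete, here is a minimal sketch of how a single bit at (row, col) maps onto the 64-bit words. The helper names (wordIndex, setBit, getBit) are hypothetical and only for illustration, and the bit order inside a word (least significant bit first) is an assumption. The XOR/popcount result does not depend on that order as long as A and B use the same one.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: index of the 64-bit word that holds bit (row, col)
// of an n x n bit-matrix stored row by row, n/64 words per row.
inline std::size_t wordIndex(int n, int row, int col)
{
    return static_cast<std::size_t>(row) * (n / 64) + col / 64;
}

// Hypothetical helper: set bit (row, col).
// Assumption: bit 0 of each word is the lowest column in that word.
inline void setBit(std::uint64_t* M, int n, int row, int col)
{
    M[wordIndex(n, row, col)] |= std::uint64_t{1} << (col % 64);
}

// Hypothetical helper: read bit (row, col) under the same assumption.
inline bool getBit(const std::uint64_t* M, int n, int row, int col)
{
    return ((M[wordIndex(n, row, col)] >> (col % 64)) & 1u) != 0;
}
```

For n = 256, bit (1, 70) lands in word 1*4 + 1 = 5, at bit position 6 of that word.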

This is the actual formula:

Row i of matrix A is XORed with row j of matrix B. Yes, it is a row-to-row operation. Then I count the number of 1s and assign the sum to C[i][j].
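For checking GPU output against a known-good result, the formula can be sketched as a plain CPU loop. This is only a reference sketch, not part of my OpenCL code: the function name bitProductReference is hypothetical, and it assumes n is a multiple of 64 and that both matrices are packed row by row with n/64 words per row.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference: C[i*n + j] = number of 1 bits in
// (row i of A) XOR (row j of B). Assumes n is a multiple of 64.
std::vector<int> bitProductReference(int n,
                                     const std::vector<std::uint64_t>& A,
                                     const std::vector<std::uint64_t>& B)
{
    const std::size_t words = static_cast<std::size_t>(n) / 64; // words per row
    std::vector<int> C(static_cast<std::size_t>(n) * n);

    for (std::size_t i = 0; i < static_cast<std::size_t>(n); ++i) {
        for (std::size_t j = 0; j < static_cast<std::size_t>(n); ++j) {
            int sum = 0;
            for (std::size_t k = 0; k < words; ++k) {
                std::uint64_t x = A[i * words + k] ^ B[j * words + k];
                // Portable popcount: clear the lowest set bit until none remain.
                while (x != 0) {
                    x &= x - 1;
                    ++sum;
                }
            }
            C[i * n + j] = sum;
        }
    }
    return C;
}
```

Comparing every element of this result with the buffer read back from the device should show which (i, j) entries a tiled kernel gets wrong, which often hints at whether the problem is in the tile indexing or in the synchronization.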

This is how I launch the kernel:

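Code :
const size_t global[2] = { (size_t)n, (size_t)n }; // one work-item per element of C
// local work size is NULL, so the runtime chooses the work-group size
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL, 0, NULL, &event);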

This is the actual kernel which works correctly:

Code :
__kernel void BitProduct(const int N, const __global ulong* A, const __global ulong* B, __global int* C)
{
    const int i = get_global_id(0);
    const int j = get_global_id(1);
    ulong sum = 0;
    for (int k = 0; k < N/64; k++)
        sum += popcount( A[ i*(N/64) + k ] ^ B[ j*(N/64) + k ] );
    C[ i * N + j ] = (int) sum;
}