Extremely slow when returning value from device to host

Hi all,

I have a simple OpenCL kernel that generates a 32-bit CRC hash from a given byte array. The application is for illustration purposes only.

I launched the kernel with 2 billion work items and measured the time it took for all work items to complete. This is what I observed:

- If I do not return any value to the host, the entire process takes about 2-3 seconds to complete.
- If I return the 32-bit CRC value to the host, the process takes over 50 seconds to complete.

I do not believe that copying a 32-bit int from private to global memory should take an order of magnitude longer!

Is there something that I am doing wrong or not understanding? I would appreciate it if someone could explain why it takes so long for the kernel to return an int value to the host.

Many thanks in advance!


// Generate CRC from a given byte array. The hash function itself is not important; it is for illustration purposes only.
inline uint murmurHash(const __global char * key, short len) {
	const unsigned int seed = 0x9747b28c;
	const unsigned int m = 0x5bd1e995;
	const int r = 24;

	unsigned int h;
	h = seed ^ len;

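	const __global unsigned char * data = (const __global unsigned char *)key;
	
	while(len >= 4)
	{
		// Read 4 bytes at a time; the pointer must stay in the __global address space
		unsigned int k = *(const __global unsigned int *)data;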

		k *= m; 
		k ^= k >> r; 
		k *= m; 
		
		h *= m; 
		h ^= k;

		data += 4;
		len -= 4;
	}
	
	// Mix in the remaining 1-3 tail bytes; the cases fall through intentionally
	switch(len)
	{
	case 3: h ^= data[2] << 16;
	case 2: h ^= data[1] << 8;
	case 1: h ^= data[0];
	        h *= m;
	}

	h ^= h >> 13;
	h *= m;
	h ^= h >> 15;
	
	return h;
} 

__kernel void Search(const __global char * key, short len, __global uint * output) {

	uint hash;
	hash = murmurHash(key, len);

	int id = get_global_id(0);
	if (id == 0) {
		*output = hash;
		// Without the above assignment, the code executes 2 billion work items in 2 seconds.
		// With the assignment, it takes over 50 seconds to complete, even though the value returned is only 32 bits.
	}
}

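Without the assignment to *output, the hash value is never used, so the compiler performs dead code elimination and removes the call to murmurHash entirely. The 2-3 second run therefore most likely measures work items that do almost nothing, while the 50 second run includes the cost of all 2 billion work items actually computing the hash.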
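
The same effect can be illustrated on the host in plain C. This is only a minimal sketch, assuming an optimizing compiler (for example with -O2); the names mix, run_unused and run_used are made up for illustration and are not part of the original code. mix reuses the final mixing steps of the hash above as stand-in work.

```c
#include <stdint.h>

/* Final mixing steps of the hash in the post, used here as stand-in work.
   (Hypothetical helper, not part of the original kernel.) */
uint32_t mix(uint32_t h) {
    const uint32_t m = 0x5bd1e995u;
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
    return h;
}

/* The result of mix() is never observed, so an optimizing compiler
   is free to delete the call and then the whole loop. */
void run_unused(uint32_t n) {
    for (uint32_t i = 0; i < n; ++i) {
        uint32_t h = mix(i);
        (void)h; /* silences the unused-variable warning, does not keep h alive */
    }
}

/* The result feeds the return value, so the work cannot be dropped
   as long as the caller actually uses what is returned. */
uint32_t run_used(uint32_t n) {
    uint32_t acc = 0;
    for (uint32_t i = 0; i < n; ++i) {
        acc ^= mix(i);
    }
    return acc;
}
```

Timing run_unused and run_used with a large n in an optimized build would be expected to show a similar gap: the first may finish almost instantly because nothing is left to execute, while the second pays for every call. In the kernel, the write to *output plays the role of the return value here.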