Thread: Extremely slow when returning value from device to host

  1. #1
    Join Date
    Nov 2013

    Hi all,

    I have a simple OpenCL kernel that generates a 32-bit CRC hash from a given byte array. The application is for illustration purposes only.

    I launched the kernel with 2 billion work items and measured the time it took for all work items to complete. This is what I observed:

    - If I do not return any value to the host, the entire process took about 2-3 seconds to complete.
    - If I try to return the 32-bit CRC value to the host, the process took over 50 seconds to complete.

    I do not believe copying a 32-bit int from private to global memory would take an order of magnitude longer!

    Is there something that I am doing wrong or not getting? I would appreciate it if someone could explain why it takes so long for the kernel to return an int value to the host.

    Many thanks in advance!

    Code :
    // Generate CRC from a given byte array - This piece of code is not important and is for illustration purposes only
    inline uint murmurHash(const __global char * key, short len) {
        const unsigned int seed = 0x9747b28c;
        const unsigned int m = 0x5bd1e995;
        const int r = 24;
        unsigned int h;
        h = seed ^ len;
        // Keep the __global qualifier: OpenCL 1.x does not allow casting a pointer to a different address space
        const __global unsigned char * data = (const __global unsigned char *)key;
        // Mix the input into the hash 4 bytes at a time
        while(len >= 4) {
            unsigned int k = *(const __global unsigned int *)data;
            k *= m;
            k ^= k >> r;
            k *= m;
            h *= m;
            h ^= k;
            data += 4;
            len -= 4;
        }
        // Handle the last 1-3 bytes (the cases fall through on purpose)
        switch(len) {
        case 3: h ^= data[2] << 16;
        case 2: h ^= data[1] << 8;
        case 1: h ^= data[0];
                h *= m;
        }
        // Final mixing of the hash
        h ^= h >> 13;
        h *= m;
        h ^= h >> 15;
        return h;
    }

    __kernel void Search(const __global char * key, short len, __global uint * output) {
        uint hash;
        hash = murmurHash(key, len);
        int id = get_global_id(0);
        if (id == 0) {
            *output = hash;
            // without the above assignment, the code executes 2 billion work items in 2 seconds.
            // however with the above assignment, the code took over 50 seconds to complete. the size of the int to be returned is only 32 bits.
        }
    }
    Last edited by thecoder; 11-20-2013 at 07:29 AM.

  2. #2
    Senior Member
    Join Date
    Oct 2012
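    Without the assignment to *output, the hash value is never used, so the compiler performs dead code elimination and removes the call to murmurHash. The 2-3 second run is therefore most likely timing a kernel that does almost no work, not the cost of the hash minus one store.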
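
    A rough plain-C sketch of that effect, assuming a host-side port of the routine from the first post. The names murmur_hash, work_item_discard and work_item_store are hypothetical stand-ins, not part of the original kernel, and whether a particular OpenCL compiler removes the call depends on its optimizer:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Plain-C port of the murmurHash routine from the question (assumed port). */
uint32_t murmur_hash(const unsigned char *data, size_t len) {
    const uint32_t seed = 0x9747b28c;
    const uint32_t m = 0x5bd1e995;
    const int r = 24;
    uint32_t h = seed ^ (uint32_t)len;

    while (len >= 4) {
        uint32_t k;
        memcpy(&k, data, sizeof k); /* alignment-safe 4-byte load */
        k *= m;
        k ^= k >> r;
        k *= m;
        h *= m;
        h ^= k;
        data += 4;
        len -= 4;
    }
    switch (len) {
    case 3: h ^= (uint32_t)data[2] << 16; /* fall through */
    case 2: h ^= (uint32_t)data[1] << 8;  /* fall through */
    case 1: h ^= data[0];
            h *= m;
    }
    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;
    return h;
}

/* Stand-in for the kernel WITHOUT the store: the hash has no observable
   effect, so an optimizing compiler is free to delete the whole call. */
void work_item_discard(const unsigned char *key, size_t len) {
    uint32_t hash = murmur_hash(key, len);
    (void)hash; /* result is never observed */
}

/* Stand-in for the kernel WITH the store: the hash can reach memory that
   the caller reads, so the computation has to be kept. */
void work_item_store(const unsigned char *key, size_t len,
                     size_t id, uint32_t *output) {
    uint32_t hash = murmur_hash(key, len);
    if (id == 0) {
        *output = hash;
    }
}
```

    In other words, the two timings compare "no hashing at all" against "hashing in the work items", so the extra time is the hash computation itself and not the 32-bit copy back to the host.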
