Cuckoo Hash- Table in GPGPU

Hallo GPGPU hackers!
In my supercool Opencl progect i need to store data in a global hash table, but with relative
small size so it can be copied to shared memory.
The threads must be able to perform common put, get, delete operations from the hash table.
The cuckoo was chossen because of the worst case of 2 for any lookup.
I might guess, that for effiecency the hash table must be first edited locally and then the results must be
redused globally, but there can be many issues with this. For example, if an entry is putted und then deleted, but the threads makes that in different order: first delete then put, and value is there , but it dont suppose to ((. Does anyone knows how to do the task properly?
//----------------------------------
The hash table is in form of many arrays, each of them represents some element of a “sturture”
what would be array of structures or pointers for example in common c.

You’d have to use atomic operations, which are not fast. Also, how would you deal with the case of needing to rehash?

Yes thats the problem, and i m looking for answer. First, it must be done that with no rehash!
Second, it must be done with 2 kernel : first atomicks to reserve the position of “primekeys”, then second kernel comes without race conditions because places are reserved.
But i have no idea how to make that with local memory optimisation. Thats answers i am lookung for … here :smiley: