Example for Random Number Generator?

Hello,
I’m writing a kernel that needs to generate a random float4 value between 0 and 1.
I know that OpenCL doesn’t have a built-in rand() function, but since it’s a common problem, I hope someone has already implemented one and can guide me on how to do it.

Can anybody please explain or give an example of a simple random number generator that produces a float4 with values between 0 and 1?

Thanks in advance,
Yoav

Look up Mersenne Twister algorithms for OpenCL (or you could port a CUDA or DirectX 11 compute shader “kernel”). There are various implementations available.

You could also pass in a buffer of random values, and access them using some form of indexing or hashing with the get_global_id(…) values.

If you need to generate them on the fly and don’t care too much about the quality of the pseudo-randomness, you can try one of the simple algorithms available online (one of many examples)
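If you go the precomputed-buffer route, a minimal kernel-side sketch might look something like this (the buffer name, its size, and the hashing constant are all just placeholders; the host would fill randoms with uniform values in [0, 1) before launching):

// Pull pre-generated random values out of a global table filled by the host.
__kernel void use_random_table(__global const float4 *randoms,
                               const uint n_randoms,
                               __global float4 *out)
{
    size_t gid = get_global_id(0);
    // Simple multiplicative hash of the global id to spread work-items
    // across the table; any mapping without obvious correlations will do.
    uint idx = (uint)((gid * 2654435761u) % n_randoms);
    out[gid] = randoms[idx];
}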

You could also pass in a buffer of random values

I agree, something like that is what I’ve been doing. Sure, it’s just unoptimized study code, but passing 16*10^6 values isn’t necessarily the critical time-wasting issue, as you can see below. Unless my clock() measuring is off. Judging by the speed of the printout, the longest wait is clearly for the allocation; it almost feels like a second or two… …mmmmhhh…

printout:


...
------------------------------------------
using GeForce 9600M GT
info in 0.000047 seconds
------------------------------------------
creating context...      done in 0.005928 seconds
adding program...        done in 0.349487 seconds
adding kernel...         done in 0.000010 seconds
creating queue...        done in 0.025236 seconds
allocating memory...     done in 0.138350 seconds
setting kernel args...   done in 0.000004 seconds
executing clProg...      done in 0.198150 seconds
writing back data...     done in 0.013004 seconds
cleaning up...           done in 0.164929 seconds
------------------------------------------
n = 16 million times so-and-so-many mixed float-operations done (+,-,*,/,cos,sin):
cpu = 15.725804 seconds (1 core, for-looped)
cpu =  1.781080 seconds (openCL)
gpu =  0.895446 seconds (openCL, not optimized i guess...)
------------------------------------------
...

Be careful how you measure time. I’m not saying you’ve done it wrong, but it’s easy to do so.

clEnqueueXXX functions are asynchronous and return before they’ve done any significant work. I recommend creating command queues with the CL_QUEUE_PROFILING_ENABLE flag enabled and using clGetEventProfilingInfo() to get the timing data.
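For example (host side; this assumes context, device, kernel and global_size are already set up, and error checking is omitted):

/* Create the queue with profiling enabled. */
cl_int err;
cl_command_queue queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err);

/* Launch the kernel and keep the event it returns. */
cl_event evt;
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, &evt);
clWaitForEvents(1, &evt);   /* make sure the kernel has actually finished */

/* Both timestamps are in nanoseconds. */
cl_ulong start, end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,   sizeof(end),   &end,   NULL);
printf("kernel time: %f seconds\n", (end - start) * 1e-9);
clReleaseEvent(evt);

That way you measure the time the kernel actually spent on the device, not just how long the enqueue call took to return.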

http://en.wikipedia.org/wiki/Mersenne_twister

Mersenne twister is a good one, statistically pretty good, not outrageously expensive. But it takes memory space for lots of state (which prevents me from using it - for example - in a graphics shader) and it’s still not all THAT cheap.

But if you don’t care too much, there is a truly awful one that dates back to the dawn of computing. It works well enough for non-critical applications. Von Neumann (yes, him!) came up with the “Middle Square” method - he basically squared the seed number and used some digits from the middle of the result! In floating point 0…1 range, you can do it even more cheaply like this:

//   0 < seed < 1
float ipart;
seed = fract(seed * seed * 1024.0f, &ipart);   // keep only the fractional part (OpenCL's fract() needs the second argument)

…not great - but definitely the cheapest! The picky will want to clamp the result to just INSIDE the 0…1 range…meh.
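Expanded to the float4 case, a per-work-item sketch might look something like this (the per-work-item seeding scheme is just an illustration; same quality caveats as above):

// A truly cheap "middle square"-style step: square the seed, scale, and keep
// only the fractional part.  Seeds must start strictly inside (0, 1).
float4 rand_cheap(float4 *seed)
{
    float4 ipart;
    float4 s = *seed;
    *seed = fract(s * s * 1024.0f, &ipart);
    return *seed;
}

__kernel void fill_random_cheap(__global float4 *out)
{
    size_t gid = get_global_id(0);
    // Derive a per-work-item starting seed somewhere inside (0, 1) - illustrative only.
    float  s    = 0.1f + 0.8f * (float)(gid % 1024u) / 1024.0f;
    float4 seed = (float4)(s, 1.0f - 0.5f * s, 0.25f + 0.5f * s, 0.9f - 0.3f * s);
    out[gid] = rand_cheap(&seed);
}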

I’m probably going to get an angry mob of statisticians with flaming torches and pitchforks showing up any time now!

– Steve

It seems like a simple PRNG can be adequate for my needs, since I need to generate several random numbers per work-item, and their statistical properties are not critical.

However, most of the simple PRNGs require storing (at least) one state per thread (or “core”, meaning the things that actually execute in parallel).
Storing a seed per work-item will do the job, but it will be inefficient in terms of memory consumption.

Is it possible to maintain a small array of states, one per physical core and use each of them in one core at a time? and if so, how?
(actually, my question is how do I get the id of the core that my kernel is running on)

Thanks in advance,
Yoav

Is it possible to maintain a small array of states, one per physical core and use each of them in one core at a time?

That is called local memory.

actually, my question is how do I get the id of the core that my kernel is running on

There is no way to do that. However, each “core” has its own piece of local memory so you don’t need to worry about it either.
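Concretely, that could look something like this sketch (names are illustrative); the host sizes the __local buffer without backing it with a cl_mem object:

// The state array lives in __local memory, one entry per work-item in the
// work-group.  On the host it is sized (but not backed by a buffer) with:
//
//     clSetKernelArg(kernel, 1, local_work_size * sizeof(cl_ulong4), NULL);
//
__kernel void my_kernel(__global float4 *out, __local ulong4 *states)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);

    // Seed each entry from the global id so every work-item starts differently (illustrative).
    states[lid] = (ulong4)(gid + 1, gid + 2, gid + 3, gid + 4);

    // ... update states[lid] with whatever PRNG step you settle on ...
    out[gid] = (float4)(0.0f);   // placeholder output
}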

Thanks!
Here is the rand() function that I want to use (I took it from OpenCV, and I use it in my CPU implementation):

Option 1:


///
/// Generate a random float4 vector of values between 0 and 1
///
float4 rand_next(__local ulong4* states)
{
    size_t tid = 0; // A single states vector shared by all work-items. The next line should be atomic, right?
    // Multiply-with-carry step (from OpenCV's RNG); OpenCL needs convert_*() rather than vector casts.
    states[tid] = convert_ulong4(convert_uint4(states[tid])) * 4164903690UL + (states[tid] >> 32);
    return convert_float4(convert_uint4(states[tid])) * 2.3283064365386962890625e-10f; // scale by 2^-32 into [0, 1)
}

Option 2:


///
/// Generate a random float4 vector of values between 0 and 1
///
float4 rand_next(__local ulong4* states)
{
    size_t tid = get_local_id(0); // Used when there is one states vector per work-item.
    states[tid] = convert_ulong4(convert_uint4(states[tid])) * 4164903690UL + (states[tid] >> 32);
    return convert_float4(convert_uint4(states[tid])) * 2.3283064365386962890625e-10f; // scale by 2^-32 into [0, 1)
}

If I use the first option, I’ll have to make the second line “atomic”, and it will run sequentially. If so, I can’t really make it fast without consuming more memory and having a states vector per work-item, right?

Thanks again,
Yoav

If so, I can’t really make it fast without consuming more memory and having a states vector per work-item, right?

Right.

If so, I can’t really make it fast without consuming more memory and having a states vector per work-item, right?

Not right: it’s possible to avoid the memory overhead by using a counter-based RNG:
http://www.openclblog.com/2011/11/gpus- … ation.html
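A toy sketch of that idea (not one of the generators from the linked post; just a simple integer hash of the global id and a per-call counter, mapped into [0, 1)):

// Counter-based approach: no stored state at all.  Each (id, counter) pair is
// hashed independently, so work-items never share or update anything.
uint hash2(uint a, uint b)
{
    // A few rounds of integer mixing; purely illustrative, not a
    // statistically validated generator.
    uint h = a * 0x9E3779B9u + b;
    h ^= h >> 16;  h *= 0x85EBCA6Bu;
    h ^= h >> 13;  h *= 0xC2B2AE35u;
    h ^= h >> 16;
    return h;
}

float4 rand_counter(uint id, uint counter)
{
    // Four decorrelated lanes from the same (id, counter) pair.
    uint4 h = (uint4)(hash2(id, 4u * counter + 0u),
                      hash2(id, 4u * counter + 1u),
                      hash2(id, 4u * counter + 2u),
                      hash2(id, 4u * counter + 3u));
    return convert_float4(h) * 2.3283064365386962890625e-10f;   // scale by 2^-32 into [0, 1)
}

__kernel void fill_random_counter(__global float4 *out, const uint counter)
{
    size_t gid = get_global_id(0);
    out[gid] = rand_counter((uint)gid, counter);
}

Passing a different counter value on each launch gives a fresh set of numbers without keeping any per-work-item state in memory.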