Random number generator on GPU

Practical and theoretical implementation discussion.
shiqiu1105
Posts: 138
Joined: Sun May 27, 2012 4:42 pm

Random number generator on GPU

Post by shiqiu1105 » Sat Apr 06, 2013 3:54 am

I am currently writing my own GPU tracer.

And I now need random numbers, or quasi-random numbers, for sampling.

But how do I get random numbers with CUDA? What do people usually do?

Generate a bunch of random numbers on the CPU and pass them to the kernel? Use some library to generate numbers in the kernel directly? Or implement your own RNG in the kernel?
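
For instance, I gather cuRAND has a device API that generates numbers per thread inside the kernel. A minimal sketch of what I mean (untested, and the kernel names and launch setup are just placeholders):

Code: Select all

#include <curand_kernel.h>

// One cuRAND state per thread, initialized once and reused every launch.
__global__ void setupStates(curandState *states, unsigned long long seed)
{
	int id = blockIdx.x * blockDim.x + threadIdx.x;
	// Same seed, distinct subsequence per thread -> independent streams.
	curand_init(seed, id, 0, &states[id]);
}

__global__ void sampleKernel(curandState *states, float *out)
{
	int id = blockIdx.x * blockDim.x + threadIdx.x;
	curandState local = states[id];   // work on a register copy
	out[id] = curand_uniform(&local); // uniform float in (0,1]
	states[id] = local;               // save state for the next launch
}

Is something like that the usual approach, or do people roll their own for speed?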

keldor314
Posts: 10
Joined: Tue Jan 10, 2012 6:56 pm

Re: Random number generator on GPU

Post by keldor314 » Sat Apr 06, 2013 5:24 am

Just write your own. This is the one I use:

Code: Select all

// Per-lane Tausworthe parameter sets, selected by threadIdx.x & 3.
__constant__ unsigned int shift1[4] = {6, 2, 13, 3};
__constant__ unsigned int shift2[4] = {13, 27, 21, 12};
__constant__ unsigned int shift3[4] = {18, 2, 7, 13};
__constant__ unsigned int offset[4] = {4294967294u, 4294967288u, 4294967280u, 4294967168u};

// One 32-bit state per thread, shared across the 32 threads of a
// warp-sized block. Seed from the host before first use (see below).
__shared__ unsigned int randStates[32];

// Advances one Tausworthe state in place, using the parameter set
// selected by this thread's lane index modulo 4.
__device__ unsigned int TausStep(unsigned int &z)
{
	int index = threadIdx.x & 3;
	unsigned int b = ((z << shift1[index]) ^ z) >> shift2[index];
	return z = ((z & offset[index]) << shift3[index]) ^ b;
}

// Steps this thread's state, then XORs it with three neighbors' states,
// so each output draws on 128 bits of shared state.
__device__ unsigned int randInt()
{
	TausStep(randStates[threadIdx.x & 31]);
	return randStates[ threadIdx.x      & 31]
	     ^ randStates[(threadIdx.x + 1) & 31]
	     ^ randStates[(threadIdx.x + 2) & 31]
	     ^ randStates[(threadIdx.x + 3) & 31];
}

// Returns a random float in [0,1) and updates the state: the 23 random
// mantissa bits with exponent field 0x3F8 give a float in [1,2),
// then subtracting 1.0f shifts it to [0,1).
__device__ float randFloat()
{
	unsigned int y = randInt();
	return __int_as_float((y & 0x007FFFFF) | 0x3F800000) - 1.0f;
}
You seed the RNG from the CPU: generate an array of 32 random ints, pass it to the kernel, and copy it into randStates. Just be sure that the CPU uses a different algorithm to produce the seeds!
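
For example (my function and buffer names are hypothetical, and this assumes you launch warp-sized blocks so the 32-entry shared array maps one-to-one onto threads):

Code: Select all

#include <random>
#include <vector>

// Host side: one warp's worth of seeds per block, generated with a
// different algorithm (std::mt19937 here) as suggested above.
void uploadSeeds(unsigned int *d_seeds, int numBlocks)
{
	std::mt19937 mt(1234u);
	std::vector<unsigned int> seeds(32 * numBlocks);
	for (size_t i = 0; i < seeds.size(); ++i)
		seeds[i] = mt();
	cudaMemcpy(d_seeds, seeds.data(), seeds.size() * sizeof(unsigned int),
	           cudaMemcpyHostToDevice);
}

// Device side: each block copies its own 32 seeds into the shared
// state before the first call to randInt()/randFloat().
__global__ void traceKernel(const unsigned int *d_seeds)
{
	if (threadIdx.x < 32)
		randStates[threadIdx.x] = d_seeds[blockIdx.x * 32 + threadIdx.x];
	__syncthreads();

	float u = randFloat();
	// ... use u for sampling ...
}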

It's a variant of the Tausworthe RNG, optimized to partially share state between threads. This gives it a tiny seed footprint (32 bits per thread, which reduces cache/shared memory pressure) while retaining much higher quality than generators that keep an unshared 32 bits of state per thread.

There's some room for optimization: even just moving the shifts and offsets out of constant memory and into shared memory should help performance on Kepler (though not on Fermi or earlier). A downside on recent hardware is that this generator does a number of bit shifts. You could turn the left shifts into muls, which is a win on Kepler but not so much on other hardware (I don't think it's actually a loss anywhere, though). There's nothing you can do about the right shifts.
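
Concretely, the left-shift-to-mul change would look something like this (a sketch; mul1/mul3 are hypothetical tables holding the precomputed powers of two, and shift2/offset are the tables from above):

Code: Select all

// Precomputed 1u << shift values, so z << s becomes z * m.
__constant__ unsigned int mul1[4] = {64, 4, 8192, 8};       // 1u << {6, 2, 13, 3}
__constant__ unsigned int mul3[4] = {262144, 4, 128, 8192}; // 1u << {18, 2, 7, 13}

__device__ unsigned int TausStepMul(unsigned int &z)
{
	int index = threadIdx.x & 3;
	// Left shifts replaced by multiplies; the right shift has to stay.
	unsigned int b = ((z * mul1[index]) ^ z) >> shift2[index];
	return z = ((z & offset[index]) * mul3[index]) ^ b;
}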
