Can't understand Timo Aila's GPU traversal paper

Practical and theoretical implementation discussion.
shiqiu1105
Posts: 138
Joined: Sun May 27, 2012 4:42 pm

Can't understand Timo Aila's GPU traversal paper

Post by shiqiu1105 » Thu Feb 14, 2013 9:28 pm

I have been reading Alia's paper on the GPU ray travsal.

But don't quite understand it.

So basically, to do fast GPU ray tracing, we need to launch enough threads to fill the gpu,and use while-while loop?

Anything like QBVH he's using?

Can anyone who understand the paper give a short review :) ?

shiqiu1105
Posts: 138
Joined: Sun May 27, 2012 4:42 pm

Re: Can't understand Timo Aila's GPU traversal paper

Post by shiqiu1105 » Thu Feb 14, 2013 10:12 pm

What's packet trevsal? I thought that's on the CPU using SIMD

graphicsMan
Posts: 167
Joined: Mon Nov 28, 2011 7:28 pm

Re: Can't understand Timo Aila's GPU traversal paper

Post by graphicsMan » Thu Feb 14, 2013 10:38 pm

Probably you should re-read the paper a few times. Packet traversal is simply tracing a group of rays together at once. Even sans wide hardware ALU benefits, it should improve memory access, since the groups of rays traced together will be coherent. Traditionally, it's been tuned to CPU SIMD, but nothing requires that.

shiqiu1105
Posts: 138
Joined: Sun May 27, 2012 4:42 pm

Re: Can't understand Timo Aila's GPU traversal paper

Post by shiqiu1105 » Fri Feb 15, 2013 2:27 am

graphicsMan wrote:Probably you should re-read the paper a few times. Packet traversal is simply tracing a group of rays together at once. Even sans wide hardware ALU benefits, it should improve memory access, since the groups of rays traced together will be coherent. Traditionally, it's been tuned to CPU SIMD, but nothing requires that.
Okay. What does persistent threads mean? Launching just enough threads to fill the machine, how do you know how many threads can do this?

Does that mean we don't launch the kernel by calling trace<<<PixelX, PixelY>>>()? But then how?

shiqiu1105
Posts: 138
Joined: Sun May 27, 2012 4:42 pm

Re: Can't understand Timo Aila's GPU traversal paper

Post by shiqiu1105 » Fri Feb 15, 2013 2:33 am

I have also downloaded his code. He uses a lot of SOA pointers which makes it quite hard to read..

jbikker
Posts: 225
Joined: Mon Nov 28, 2011 8:18 am
Contact:

Re: Can't understand Timo Aila's GPU traversal paper

Post by jbikker » Fri Feb 15, 2013 8:38 am

shiqiu1105 wrote:
graphicsMan wrote:Probably you should re-read the paper a few times. Packet traversal is simply tracing a group of rays together at once. Even sans wide hardware ALU benefits, it should improve memory access, since the groups of rays traced together will be coherent. Traditionally, it's been tuned to CPU SIMD, but nothing requires that.
Okay. What does persistent threads mean? Launching just enough threads to fill the machine, how do you know how many threads can do this?

Does that mean we don't launch the kernel by calling trace<<<PixelX, PixelY>>>()? But then how?
Persistent threads: determine the number of SMs you have, multiply by 32 (for the number of threads in the warp). If you then launch a kernel with trace<<<SMCount * 32, dim(32,1,1)>>> each SM will be executing exactly 1 warp. That's typically not optimal, so you increase SMCount * 32 by some constant. 8 blocks per SM seems optimal for Kepler in many cases, but it depends on the number of registers that you need. Kepler has 256, so launching 8 blocks per SM gets you 32 registers for each warp. In practice, you either use the occupancy calculator to find the optimal number of blocks per thread, or you hand-tune it for best performance. Note that 2 is the minimum to use Fermi's and Kepler's hardware optimally, although I found that kernels that are heavily bandwidth-bound sometimes benefit from very small numbers of concurrent threads, to distribute cache misses optimally over the available SMs: for a box filter kernel, it seems even optimal to have 16 threads per block active (so: half-filled warps).

Then, you need special thread code: each thread remains active until all work has been done. This requires some atomics to fetch new jobs:

Code: Select all

do 
{
	volatile int& jobIdxBase = nextJobArray[threadIdx.y];
	if (threadIdx.x == 0) jobIdxBase = atomicAdd( &warpCounter, 32 );
	idx = jobIdxBase + threadIdx.x;
	if (idx >= totalJobs) return;
	// perform work for task 'idx'
} while (true);
So: thread 0 of each warp basically 'allocates' 32 jobs to work on: [jobIdxBase..jobIdxBase+31]. Individual threads in the warp get their job idx by calculating jobIdxBase + threadIdx.x.
Make sure the global memory variable 'warpCounter' is set to 0 before starting the kernel.

In practice, this approach is rather efficient, even for simple kernels.

- Jacco.

tomasdavid
Posts: 22
Joined: Wed Oct 10, 2012 12:41 pm

Re: Can't understand Timo Aila's GPU traversal paper

Post by tomasdavid » Fri Feb 15, 2013 8:44 am

As far as I remember, the approach does not use packets at all (each ray has its own stack), and the code for Fermi and Kepler does not use persistent threads anymore.

spectral
Posts: 382
Joined: Wed Nov 30, 2011 2:27 pm
Contact:

Re: Can't understand Timo Aila's GPU traversal paper

Post by spectral » Fri Feb 15, 2013 10:02 am

Hi,

1) This code is able to handle both a packets of rays or a ray at once.
2) Persistent thread is no more efficient today
3) Don't forget that this code is based on a "simple study", to keep the same performance than Timo you have to care about the number of registers you use
in your kernel. If there are too much used registers in your code then you will not have the same performance
4) SOA stuffs are very important ;-)
5) It is a BVH but not a QBVH (more suited for the CPU)... maybe "Dade" has more experience with his SQBVH ?

What don't you understand, read and re-read it... it is a simple BVH traversal... with a stack. Try to understand each step... then study the next one.
If you have a precise question, I can answer you...

Regards

shiqiu1105
Posts: 138
Joined: Sun May 27, 2012 4:42 pm

Re: Can't understand Timo Aila's GPU traversal paper

Post by shiqiu1105 » Fri Feb 15, 2013 2:18 pm

Thanks for you guys answer! They helped a lot :D

I do want to know, if persistent threads are not efficient nowadays, what's the efficient approach now then?

spectral
Posts: 382
Joined: Wed Nov 30, 2011 2:27 pm
Contact:

Re: Can't understand Timo Aila's GPU traversal paper

Post by spectral » Fri Feb 15, 2013 2:33 pm

The best approach developed by Timo does not use persistent thread ! Check the paper ;-)

Also, it depends of what you want to do... if you want to improve this code to support instances etc... by example... and maybe depend of the video card and the scene !
It is up to you to test... or check the results in the paper.

Post Reply