## Optimization strategies

Practical and theoretical implementation discussion.
spectral
Posts: 382
Joined: Wed Nov 30, 2011 2:27 pm
Contact:

### Optimization strategies

Hi there,

There is a long time I work on the GPU and have already play a lot with several optimizations strategies, but would like to discuss with you some optimization subjects.

Sorting / scanning

It seems that one one the main method to "optimize" the processing on the GPU is based on "sorting" (By using sort, scan, etc...) (It is why I have create the CLPP https://code.google.com/p/clpp/ library in fact)

It is something I have try in the past, by example:
1) Trace a set or rays
2) Sort all the rays based on intersection/no-intersection
3) Do the shading on the set of "intersection"

It allow to improve the 'thread' coherence, but in your experience :
a) We should sort a set of 'ids', and then the shader kernel only acces the sorted set of 'ids' datas. Something like this :

Code: Select all

uint pathId = pathsIds[ threadIdx.x ];
float3 normal = paths[pathId]->Ns;
The problem here is that acces to the paths are not coalescents !

b) Should we directly sort (and Copy/Swap) a complete structure in global memory ? This improve the data coalescence ?

Code: Select all

float3 normal = paths[threadIdx.x]->Ns;
Dividing in smaller kernels
In your experience, does dividing a kernel in several smaller kernels help in performance ?

In my experience, the problem is that:
a) often we need to save more data in global memory, then read it in the second kernel
b) controlling the kernels by the CPU request some times... and so here we can loose performance

One advantage is that kernels are smallers... and then... we avoid to freeze the GPU/display

How many threads / items ?
In my experience doing processing on the GPU require between 200.000 and 1.000.000 items (or more) to be processed at once (to be optimal),
but in rendering we have some choice to do :

1) We can launch one thread for each pixel, then the number of threads depends of the resolution. It seems not a good approach because for small resolution we will not have a lot thread and on bigger resolution we will use a lot of memory and can freeze the display it the kernels are big. On the other side it can improve coherence and has better visual interactivity !

2) We can have a fixed set of threads (Like in Megakernels Considered Harmful: Wavefront Path Tracing on GPUs https://mediatech.aalto.fi/~timo/). The problem there is that :
a) We have less coherence because each thread can correspond to a very different pixel !
b) We have to distribute correctly the threads for each pixel (Using by example a QMC 02-sequence). Imagine I use "path restart" then once a path is completed then I will continue to another pixel, but which one ? (The goal here is to avoid patterns and keep good visual interactivity)
c) Methods like "Randomized Coherent Sampling for Reducing the Perceptual Error of Rendered Images" are harder to implement !

Buckets
Does someone has experience with bucket rendering on the GPU... I have do some test on my side but have'nt see any great improvement

Thx for your advice & experience
Spectral
OMPF 2 global moderator

dbz
Posts: 46
Joined: Wed Jan 11, 2012 10:16 pm
Location: the Netherlands

### Re: Optimization strategies

It all sounds rather complex and time consuming what you have planned there. Moreover, what works well on a certain manufacturers gpu's, does not necessarily work well on other manufacturers gpu's. And optimizations for one generation of gpu's may turn out not to work well on newer generations of the same manufacturer. So will all the efforts you put into optimizations pay off in the long run?

Dade
Posts: 206
Joined: Fri Dec 02, 2011 8:00 am

### Re: Optimization strategies

spectral wrote: Sorting / scanning
[...]
a) We should sort a set of 'ids', and then the shader kernel only acces the sorted set of 'ids' datas.
I have tried this route but, at least in my tests, it wasn't effective (the cost of sorting the IDs was higher than the performance gained by having less GPU thread divergence, etc.).
spectral wrote: How many threads / items ?
In my experience doing processing on the GPU require between 200.000 and 1.000.000 items (or more) to be processed at once (to be optimal),
I use a set of 65K "path states" (i.e. items). I really never had the need to increase that value (but I can decrease that value, in on order to reduce the memory requires to stores all the states, without loosing too much performance).

spectral
Posts: 382
Joined: Wed Nov 30, 2011 2:27 pm
Contact:

### Re: Optimization strategies

Dade wrote: I have tried this route but, at least in my tests, it wasn't effective (the cost of sorting the IDs was higher than the performance gained by having less GPU thread divergence, etc.).
It was in my case ... but honestly in OpenCL creating good sorting libraries is harder than in CUDA (Also in CUDA it is also provided and well optimized). Another problem is that we have to create several algorithms... one for each kind of device

But notice that most of the sorting algorithms are efficient for "at least" 1.000.000 items... not less !!!
Dade wrote: I use a set of 65K "path states" (i.e. items). I really never had the need to increase that value (but I can decrease that value, in on order to reduce the memory requires to stores all the states, without loosing too much performance).
I currently use 200.000 path states... and as explained in the Timo Aila papers... increasing it increase the perf. but not a lot
Spectral
OMPF 2 global moderator

Zelcious
Posts: 42
Joined: Mon Jul 23, 2012 11:05 am

### Re: Optimization strategies

I currently use 16k items in parallell. I take a slight hit with a gtx 780 but it's worth it since I use a mlt like algorithm and it converges faster with fewer chains. I'm heavily compute bound, use no resorting and split too long traversals to reduce incoherence.

tomasdavid
Posts: 22
Joined: Wed Oct 10, 2012 12:41 pm

### Re: Optimization strategies

ad dividing into smaller kernels:
I now have in the review pipeline a paper that claims that single kernel solution is a good choice.
On 580 it is worse than Dietger's approach (see his HPG2011 paper), but on 680 it is comparable it (seems the L1 caching of global accesses that is missing on Kepler hurts a bit).
Otoh, if you have expensive shaders, Laine now has an aptly named paper: Megakernels Considered Harmful: Wavefront Path Tracing on GPUs.
(That said, I couldn't reproduce their results in my framework).

Dade
Posts: 206
Joined: Fri Dec 02, 2011 8:00 am

### Re: Optimization strategies

Zelcious wrote:I use a mlt like algorithm and it converges faster with fewer chains.
This is a serious problem with MLT and GPUs: my implementation of MLT on GPU (with 65K chains) looks pathetic when compared to same MLT code on CPU (with 12 chains, i.e. 12 threads) .

Zelcious
Posts: 42
Joined: Mon Jul 23, 2012 11:05 am

### Re: Optimization strategies

This is a serious problem with MLT and GPUs: my implementation of MLT on GPU (with 65K chains) looks pathetic when compared to same MLT code on CPU (with 12 chains, i.e. 12 threads) .
I hope multiple-try will remedy some of that. Plan to experiment with some higher order ensamble mcmc as well.
I have some other ideas but I agree it's a big problem.

Zelcious
Posts: 42
Joined: Mon Jul 23, 2012 11:05 am

### Re: Optimization strategies

Perhaps the solution would be to let the cpu do the mlt and the gpu just looking for seed paths. For really really hard to sample scenes like some of the ones I experiment with it could perhaps be a winner.

ingenious
Posts: 282
Joined: Mon Nov 28, 2011 11:11 pm
Location: London, UK
Contact:

### Re: Optimization strategies

Are you saying that many shorter chains is much worse than fewer longer chains? On what scenes? I find it a bit hard to believe that having a larger number of chains is always worse.
Click here. You'll thank me later.