I have been working on the GPU for a long time and have already experimented with several optimization strategies, but I would like to discuss a few optimization topics with you.
Sorting / scanning
It seems that one of the main ways to "optimize" processing on the GPU is based on "sorting" (using sort, scan, etc.). That is in fact why I created the CLPP library (https://code.google.com/p/clpp/).
It is something I have tried in the past, for example:
1) Trace a set of rays
2) Sort all the rays based on intersection/no-intersection
3) Do the shading on the "intersection" set only
This improves thread coherence, but in your experience:
a) Should we sort only a set of ids, and have the shading kernel access the data through the sorted ids? Something like this:
uint pathId = pathsIds[ threadIdx.x ];
float3 normal = paths[pathId]->Ns;
b) Or should we directly sort (copy/swap) the complete structures in global memory? Does this improve memory coalescing?
float3 normal = paths[threadIdx.x]->Ns;
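To make options a) and b) concrete, here is a minimal CPU-side sketch in plain C++, so it runs without a GPU. The `Path` layout, the `hit` flag and both function names are assumptions for illustration only, and `std::stable_partition` merely stands in for the sort/scan-based compaction that would run on the device.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical path record; the real structure is not shown in this post.
struct Path {
    float Ns[3];  // shading normal
    bool  hit;    // true if the ray intersected something
};

// Option a: reorder only the ids. The records stay where they are, so the
// shading pass reads paths[ids[i]] (an indirect gather).
// Returns the number of paths that hit something.
std::size_t partitionIds(const std::vector<Path>& paths,
                         std::vector<std::uint32_t>& ids) {
    ids.resize(paths.size());
    std::iota(ids.begin(), ids.end(), 0u);
    auto mid = std::stable_partition(
        ids.begin(), ids.end(),
        [&paths](std::uint32_t i) { return paths[i].hit; });
    return static_cast<std::size_t>(mid - ids.begin());
}

// Option b: move the whole records so that all hits are contiguous. The
// shading pass then reads paths[i] directly, with no indirection.
// Returns the number of paths that hit something.
std::size_t partitionPaths(std::vector<Path>& paths) {
    auto mid = std::stable_partition(
        paths.begin(), paths.end(),
        [](const Path& p) { return p.hit; });
    return static_cast<std::size_t>(mid - paths.begin());
}
```

In this sketch option a) moves only 4 bytes per path during the reorder but pays for a scattered read in the shading pass, while option b) moves the full record once and then reads contiguously. Which one wins presumably depends on the record size and on how often the sorted order is reused.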
In your experience, does dividing a kernel into several smaller kernels help performance?
In my experience, the problems are:
a) we often need to save more data to global memory and then read it back in the second kernel
b) controlling the kernels from the CPU takes some time, so we can lose performance there
One advantage is that the kernels are smaller, so we avoid freezing the GPU/display.
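As a rough way to put a number on point a), the sketch below estimates the extra global-memory traffic of one split, assuming each item's intermediate state is written once by the first kernel and read once by the second. The function name and the per-item state size are hypothetical.

```cpp
#include <cstdint>

// Extra global-memory traffic caused by splitting one kernel in two:
// one write of the intermediate state by the first kernel plus one read
// of the same state by the second kernel, for every item.
constexpr std::uint64_t extraTrafficBytes(std::uint64_t numItems,
                                          std::uint64_t stateBytesPerItem) {
    return 2u * numItems * stateBytesPerItem;
}
```

For example, with 1,000,000 items and an assumed 64-byte state per item, `extraTrafficBytes(1000000, 64)` gives 128,000,000 bytes of additional traffic per split.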
How many threads / items ?
In my experience, processing on the GPU requires between 200,000 and 1,000,000 items (or more) to be processed at once to be optimal,
but in rendering we have some choices to make:
1) We can launch one thread per pixel, so the number of threads depends on the resolution. This does not seem to be a good approach: at small resolutions we will not have many threads, and at larger resolutions we will use a lot of memory and can freeze the display if the kernels are big. On the other hand, it can improve coherence and gives better visual interactivity!
2) We can have a fixed set of threads (as in "Megakernels Considered Harmful: Wavefront Path Tracing on GPUs", https://mediatech.aalto.fi/~timo/). The problems there are:
a) We have less coherence because each thread can correspond to a very different pixel!
b) We have to distribute the threads correctly over the pixels (using, for example, a QMC (0,2)-sequence). Imagine I use "path restart": once a path is completed, the thread continues with another pixel, but which one? (The goal here is to avoid patterns and keep good visual interactivity.)
c) Methods like "Randomized Coherent Sampling for Reducing the Perceptual Error of Rendered Images" are harder to implement!
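For point b), one possible way to pick the next pixel after a path restart is to map a running sample index through a 2D (0,2)-sequence (van der Corput in x, the second Sobol dimension in y). The following is only a CPU-side C++ sketch of that idea: the `Pixel` struct and `pixelForSample` are hypothetical names, and on the GPU the sample index would presumably come from something like an atomic counter, which is not shown here.

```cpp
#include <cstdint>

struct Pixel { std::uint32_t x, y; };  // hypothetical helper type

// Van der Corput radical inverse in base 2: reverse the 32 bits of n.
// The result is a 32-bit fixed-point fraction in [0, 1).
inline std::uint32_t vanDerCorputBits(std::uint32_t n) {
    n = (n << 16) | (n >> 16);
    n = ((n & 0x00ff00ffu) << 8) | ((n & 0xff00ff00u) >> 8);
    n = ((n & 0x0f0f0f0fu) << 4) | ((n & 0xf0f0f0f0u) >> 4);
    n = ((n & 0x33333333u) << 2) | ((n & 0xccccccccu) >> 2);
    n = ((n & 0x55555555u) << 1) | ((n & 0xaaaaaaaau) >> 1);
    return n;
}

// Second Sobol dimension, also as a 32-bit fixed-point fraction in [0, 1).
inline std::uint32_t sobol2Bits(std::uint32_t n) {
    std::uint32_t result = 0;
    for (std::uint32_t v = 1u << 31; n != 0; n >>= 1, v ^= v >> 1) {
        if (n & 1u) result ^= v;
    }
    return result;
}

// Map the sampleIndex-th point of the (0,2)-sequence to a pixel.
// Multiplying the fixed-point fraction by the image size and keeping the
// upper 32 bits is floor(fraction * size), done in integer arithmetic.
inline Pixel pixelForSample(std::uint32_t sampleIndex,
                            std::uint32_t width, std::uint32_t height) {
    Pixel p;
    p.x = static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(vanDerCorputBits(sampleIndex)) * width) >> 32);
    p.y = static_cast<std::uint32_t>(
        (static_cast<std::uint64_t>(sobol2Bits(sampleIndex)) * height) >> 32);
    return p;
}
```

With power-of-two image dimensions, the first width × height indices visit every pixel exactly once while jumping around the image, so early frames cover the picture evenly instead of filling it scanline by scanline. For other resolutions the coverage is still well distributed but no longer exactly one sample per pixel.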
Buckets
Does anyone have experience with bucket rendering on the GPU? I have done some tests on my side but haven't seen any great improvement.
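For reference, by bucket rendering I mean something like the sketch below: the image is cut into fixed-size tiles and each tile is rendered on its own (for example one kernel launch per tile). This is a CPU-side C++ sketch only; the `Bucket` struct, `makeBuckets` and the tile size are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// One rectangular tile of the image (hypothetical helper type).
struct Bucket { std::uint32_t x, y, width, height; };

// Split the image into tiles of at most tileSize x tileSize pixels.
// Tiles on the right and bottom edges are clipped to the image bounds.
std::vector<Bucket> makeBuckets(std::uint32_t imageWidth,
                                std::uint32_t imageHeight,
                                std::uint32_t tileSize) {
    std::vector<Bucket> buckets;
    if (tileSize == 0) return buckets;  // nothing sensible to do
    for (std::uint32_t y = 0; y < imageHeight; y += tileSize) {
        for (std::uint32_t x = 0; x < imageWidth; x += tileSize) {
            buckets.push_back({x, y,
                               std::min(tileSize, imageWidth - x),
                               std::min(tileSize, imageHeight - y)});
        }
    }
    return buckets;
}
```

For a 1920x1080 image and an assumed tile size of 256, this yields 8 x 5 = 40 buckets of at most 65,536 pixels each, which is well below the 200,000 items mentioned above and may be one reason a naive one-launch-per-bucket scheme does not pay off.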

Thanks for your advice and experience.