Hybrid : CPU & GPU strategies

spectral
Posts: 382
Joined: Wed Nov 30, 2011 2:27 pm

Postby spectral » Fri Jan 17, 2014 9:21 am

Hi there,

I'm currently playing with a CPU & GPU hybrid rendering approach... the goal is to "plug" the current GPU intersection engine into an existing CPU renderer.
So, here is the performance I get:
Pure GPU : 180 MRPS
Hybrid : 12 MRPS

The GPU usage is:
Pure GPU : 97%
Hybrid : 37%

The hybrid approach is currently so slow that I think something must be wrong, because there is no real shading, no bounces, etc. I use simple AO shading, and rays always restart from the camera.

Some pseudo-code that shows how it works:

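// 8 render threads, each with its own GPU command queue
for (i = 0; i < 8; i++)
  _renderThreads[i].Start();

class RenderThread
{
  Run()
  {
    while (1)
    {
      GenerateCameraRays(rays, 200000);                    // 200,000 rays per batch

      gpuQueue[threadId]->SendToGPU(rays);                 // upload the ray batch
      gpuQueue[threadId]->ExecuteRaysIntersections(rays);  // run the intersection kernel
      gpuQueue[threadId]->WaitForGPUCompletion();          // wait until the GPU is done
      gpuQueue[threadId]->ReadGPUHits(hits);               // download the hits

      Shade(hits);                                         // simple AO shading on the CPU
    }
  }
}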


So, I'm looking to improve the speed... if someone has experience with this, any comment is welcome ;-)
Spectral
OMPF 2 global moderator

jbikker
Posts: 175
Joined: Mon Nov 28, 2011 8:18 am

Re: Hybrid : CPU & GPU strategies

Postby jbikker » Fri Jan 17, 2014 10:03 am

I suppose your GPU is spending too much time waiting for the transfers (and the CPU work). One thing you could do is to make the transfers asynchronous: have a double buffer for the rays (2x 200k), and fill one while the other is processed. In CUDA, the processing will suffer very little from all the copying going on in the background.
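
A minimal sketch of the double-buffering idea, written in plain C++ rather than CUDA so that it runs anywhere. `TraceOnGpu`, `GenerateCameraRays`, `Shade` and `RenderBatches` are hypothetical stand-ins, not the functions from the first post, and a `std::async` task plays the role of the asynchronous GPU work:

```cpp
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

struct Ray { float origin[3]; float dir[3]; };
struct Hit { float t; };

// Hypothetical stand-in for CPU-side ray generation.
void GenerateCameraRays(std::vector<Ray>& rays, std::size_t count) {
    rays.assign(count, Ray{{0.f, 0.f, 0.f}, {0.f, 0.f, 1.f}});
}

// Hypothetical stand-in for "upload, intersect, download" on the GPU.
std::vector<Hit> TraceOnGpu(const std::vector<Ray>& rays) {
    return std::vector<Hit>(rays.size(), Hit{1.0f});
}

// Hypothetical stand-in for shading; returns the number of hits shaded.
std::size_t Shade(const std::vector<Hit>& hits) { return hits.size(); }

std::size_t RenderBatches(std::size_t batchCount, std::size_t raysPerBatch) {
    std::size_t shaded = 0;
    if (batchCount == 0) return shaded;

    std::vector<Ray> front, back;               // the two ray buffers
    GenerateCameraRays(front, raysPerBatch);    // prime the first batch

    for (std::size_t b = 0; b < batchCount; ++b) {
        // Start tracing the front buffer in the background...
        std::future<std::vector<Hit>> pending =
            std::async(std::launch::async, [&front] { return TraceOnGpu(front); });

        // ...and fill the back buffer while that trace is in flight.
        if (b + 1 < batchCount) GenerateCameraRays(back, raysPerBatch);

        shaded += Shade(pending.get());         // blocks until the trace is done
        std::swap(front, back);                 // the filled buffer becomes the next batch
    }
    return shaded;
}
```

The front buffer is only read while the trace is in flight and the swap happens after `get()` returns, so the two buffers are never touched by both sides at once.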

macnihilist
Posts: 8
Joined: Thu Mar 22, 2012 9:36 am

Re: Hybrid : CPU & GPU strategies

Postby macnihilist » Fri Jan 17, 2014 11:24 am

I actually tried the same thing, because "C++ on CPU for complex rendering algorithms and shading and GPU for chewing through large intersection batches" sounded really nice.
Also, PCIe throughput seemed reasonable enough these days to try this if you're not after real-time.
In practice, it (at least my approach) didn't work so well.
There was some improvement from using the GPU as an additional 'coprocessor', but it was in no way using the GPU to its full potential.
I didn't use async transfers to the GPU as jbikker suggested, but at least the CPU was doing some parallel work while waiting for the results from the GPU.
Async or not, I found it very hard to keep the workload balanced; the devices were constantly waiting for each other.
Another (smaller) problem was that you have huge memory consumption with large batches (it's not only the rays/results but also the 'interim results' you have to save for each ray/path to let the integrator continue after the results are there).
Maybe I just wasn't putting enough thought into it or doing something stupid, but I decided to ditch the approach (I'm mainly after interactivity, not flexibility and complex scenes).
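
For a rough feel of that memory cost, here is a small sketch. The record layouts (`Ray`, `Hit`, `PathState`) are made up for illustration and are not taken from any renderer in this thread; the point is only that the per-path 'interim' state is paid on top of the rays and hits, for every batch in flight:

```cpp
#include <cstddef>

// Hypothetical record layouts, assuming 4-byte float and unsigned.
struct Ray       { float origin[3]; float dir[3]; float tMin; float tMax; };   // 32 bytes
struct Hit       { float t; float u; float v; unsigned primId; };              // 16 bytes
struct PathState { float throughput[3]; float radiance[3];
                   unsigned pixel; unsigned depth; unsigned rng[4]; };         // 48 bytes

// Host memory needed for one batch: rays, hits and the saved integrator state.
constexpr std::size_t BatchBytes(std::size_t raysPerBatch) {
    return raysPerBatch * (sizeof(Ray) + sizeof(Hit) + sizeof(PathState));
}

// Total when several batches are in flight at once (e.g. one per render thread).
constexpr std::size_t TotalBytes(std::size_t raysPerBatch, std::size_t batchesInFlight) {
    return BatchBytes(raysPerBatch) * batchesInFlight;
}
```

With these assumed layouts, the 200,000-ray batches and eight queues from the first post come to `TotalBytes(200000, 8)`, about 154 MB, and the figure grows linearly with both batch size and the number of batches in flight.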

What I'm trying now is writing the renderer (integrator/shader) as a C kernel and then compiling that to CUDA/OpenCL/ISPC.
Then there are intersection engines for each compilation target that allow the whole thing to run on a single device.
So it is similar to what Cycles does, though not in a megakernel style: renderer kernels and intersection kernels are separated.
You can then use multiple devices to render a single image, but the devices are more decoupled and can run more concurrently (it's actually almost the same as cluster rendering over the network).
This is all in the early stages and I don't have any reliable numbers, but it seems to work much better.
Of course, you can't use all the nice C++ features and existing code.
(Which is really sad, because most of this stuff is just a compilation problem in the end -- not having virtual, templates, and operator overloading can be quite annoying...)

Well, after reading this again I realize this little experience report probably won't help you much with your actual problem, but you said any comment was welcome. ;)

spectral
Posts: 382
Joined: Wed Nov 30, 2011 2:27 pm

Re: Hybrid : CPU & GPU strategies

Postby spectral » Fri Jan 17, 2014 1:06 pm

jbikker wrote:I suppose your GPU is spending too much time waiting for the transfers (and the CPU work). One thing you could do is to make the transfers asynchronous: have a double buffer for the rays (2x 200k), and fill one while the other is processed. In CUDA, the processing will suffer very little from all the copying going on in the background.


Hi Jacco,

That is what I expect too... but notice that in the current case I have 8 threads... and 8 "GPU command queues". So I should have a lot of parallel asynchronous send/receive/execute commands in the queues...

... in the end it should be the same as having several "sets" in the same thread... or did I forget something? :-P

Dietger
Posts: 50
Joined: Tue Nov 29, 2011 10:33 am

Re: Hybrid : CPU & GPU strategies

Postby Dietger » Fri Jan 17, 2014 2:52 pm

spectral wrote:That is what I expect too... but notice that in the current case I have 8 threads... and 8 "GPU command queues". So I should have a lot of parallel asynchronous send/receive/execute commands in the queues...
... in the end it should be the same as having several "sets" in the same thread... or did I forget something? :-P

Issuing memory transfers and kernel executions from different threads is not enough to achieve concurrent memory transfer and kernel execution. In CUDA, only memory transfers from pinned memory can overlap with kernel execution on the GPU, and only if both are issued on different (non-default) streams.
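
To see what is at stake, here is a rough timing model in plain C++ (it is not CUDA code and makes no API calls). It assumes identical batches and, for the overlapped case, a GPU that can upload, compute and download at the same time, which depends on the conditions above and on the hardware having a copy engine per direction. `BatchTimes`, `SerializedMs` and `OverlappedMs` are hypothetical names:

```cpp
#include <algorithm>
#include <cstddef>

// Per-batch stage durations in milliseconds (measured or guessed by the caller).
struct BatchTimes { double uploadMs; double kernelMs; double downloadMs; };

// No overlap: every batch runs upload -> kernel -> download back to back.
double SerializedMs(const BatchTimes& t, std::size_t batches) {
    return batches * (t.uploadMs + t.kernelMs + t.downloadMs);
}

// Ideal three-stage pipeline: the first batch pays for all three stages,
// every later batch only costs the duration of the slowest stage.
double OverlappedMs(const BatchTimes& t, std::size_t batches) {
    if (batches == 0) return 0.0;
    const double slowest = std::max({t.uploadMs, t.kernelMs, t.downloadMs});
    return (t.uploadMs + t.kernelMs + t.downloadMs) + (batches - 1) * slowest;
}
```

With made-up timings of 1 ms upload, 1 ms kernel and 0.5 ms download, eight batches take 20 ms serialized and 9.5 ms overlapped in this model. Real numbers will differ, since the model ignores launch overhead and CPU-side work.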

spectral
Posts: 382
Joined: Wed Nov 30, 2011 2:27 pm

Re: Hybrid : CPU & GPU strategies

Postby spectral » Fri Jan 17, 2014 3:06 pm

Thanks Dietger,

But I also have different sets of pinned memory!

Of course, each thread uses the same "command queue" for its memory transfers and its one kernel execution :-P

tarlack
Posts: 27
Joined: Mon Feb 10, 2014 7:48 am

Re: Hybrid : CPU & GPU strategies

Postby tarlack » Mon Feb 10, 2014 8:07 am

Hello,

Is your GPU workload linear with respect to the CPU workload? (That is, you have N rays on the CPU and k*N work on the GPU?) If so, unless k is gigantic, I don't think it will work, and I think your best bet is to find another formulation of your rendering algorithm that exhibits something like N tasks on the CPU -> N^a tasks on the GPU, with a > 1.
For complex scenes, CPU/GPU hybridization is not just a matter of compilation. GPU-only algorithms cannot nicely handle tens of gigabytes of textures, arbitrarily complex shader code, measured BRDFs, BTFs and so on while making rays bounce everywhere in the scene. Even clustering and on-device caches do not look like an obvious solution, because assuming coherence in ray space for global illumination after even a few (say, two or three) diffuse bounces seems quite dubious to me. But again, I'm talking about complex (large) scenes, not the Cornell box or a (set of) glasses on a table, which, although very complex to render, I do not consider complex scenes. But put these glasses in a complete Scottish pub, with all the tables, objects, mirrors, food, a measured wood BRDF, a BTF for the fabric of the barman's kilt and for the clothes of all the clients, tens of small lights, paintings, curved shiny chrome, plants, etc., and now we are talking about a (somewhat more) complex scene...

spectral
Posts: 382
Joined: Wed Nov 30, 2011 2:27 pm

Re: Hybrid : CPU & GPU strategies

Postby spectral » Mon Feb 10, 2014 8:35 am

Sure,

But here I'm just looking to "improve" the speed of an existing CPU renderer (only the intersection test) ... all I need to store is the geometry !
So, I agree with your view of GPU renderers, but notice that over time GPUs have more and more memory, more and more cores, etc., so it is just a question of time, and for some scenes GPUs already perform very well ;-)

tarlack
Posts: 27
Joined: Mon Feb 10, 2014 7:48 am

Re: Hybrid : CPU & GPU strategies

Postby tarlack » Mon Feb 10, 2014 8:43 am

Just for the intersection test, it is likely that the memory throughput will be your main problem while you stick to one ray produced by CPU->one GPU task, even with double buffering and async transfers. Could you find a way to build on CPU something that will be able to "explode" on GPU, even just for isect tests ? Something like computing the parameters of a "procedural ray production" system ?
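
One possible reading of such a "procedural ray production" system, sketched in plain C++: instead of uploading every ray, the CPU sends a small job descriptor and the rays are generated where they are consumed. `TileJob` and `ExpandTile` are hypothetical, the camera is assumed to be a pinhole at the origin looking down +z, and `ExpandTile` runs on the CPU here only as a stand-in for what would be a device-side kernel:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Ray { float origin[3]; float dir[3]; };

// Hypothetical compact job: image size, a pixel tile, and the camera opening.
struct TileJob {
    unsigned imageWidth, imageHeight;
    unsigned x0, y0, tileWidth, tileHeight;
    float tanHalfFov;   // tangent of half the vertical field of view
};

// Stand-in for the device-side expansion: one primary ray per pixel of the tile.
std::vector<Ray> ExpandTile(const TileJob& job) {
    std::vector<Ray> rays;
    rays.reserve(std::size_t(job.tileWidth) * job.tileHeight);
    const float aspect = float(job.imageWidth) / float(job.imageHeight);
    for (unsigned y = job.y0; y < job.y0 + job.tileHeight; ++y) {
        for (unsigned x = job.x0; x < job.x0 + job.tileWidth; ++x) {
            // Pixel centre mapped to [-1, 1], then scaled by the field of view.
            const float sx = (2.0f * (x + 0.5f) / job.imageWidth - 1.0f) * aspect * job.tanHalfFov;
            const float sy = (1.0f - 2.0f * (y + 0.5f) / job.imageHeight) * job.tanHalfFov;
            // Normalise the direction (sx, sy, 1).
            const float invLen = 1.0f / std::sqrt(sx * sx + sy * sy + 1.0f);
            rays.push_back(Ray{{0.f, 0.f, 0.f}, {sx * invLen, sy * invLen, invLen}});
        }
    }
    return rays;
}
```

The descriptor is a few dozen bytes no matter how many rays it expands into, so the upload no longer grows with the ray count. This only covers rays that can be described procedurally, such as camera rays; the hits still have to come back.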

Dade
Posts: 206
Joined: Fri Dec 02, 2011 8:00 am

Re: Hybrid : CPU & GPU strategies

Postby Dade » Mon Feb 10, 2014 9:22 am

tarlack wrote:Just for the intersection test, it is likely that the memory throughput will be your main problem while you stick to one ray produced by CPU->one GPU task, even with double buffering and async transfers. Could you find a way to build on CPU something that will be able to "explode" on GPU, even just for isect tests ? Something like computing the parameters of a "procedural ray production" system ?


Check: http://www.researchgate.net/publication ... 341ee2.pdf

However, in practice, hybrid rendering will never provide the kind of speedup you can achieve with a full OpenCL renderer (or CUDA, or whatever you are using).

