Can't pass virtual function objects to CUDA kernel is a pain

Practical and theoretical implementation discussion.
shiqiu1105
Posts: 138
Joined: Sun May 27, 2012 4:42 pm

Can't pass virtual function objects to CUDA kernel is a pain

Post by shiqiu1105 » Wed Apr 03, 2013 3:29 pm

In my CPU ray tracer, I used polymorphism intensively.

For example, I had a bunch of different light types, point light, area light, etc., all derived from a Light base class.
When rendering, all I need is to loop through an array of Light pointers.

Now, without this capability, I have to explicitly store an array for each of the light types, and in order to query all light sources I need to loop through several arrays.
This is really ugly and inelegant to me.

Another alternative is a big switch-case statement that picks a different query method for each light type, which is also ugly.
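For reference, the switch-based alternative might look roughly like the sketch below. It is plain C++ so it can be compiled on the host; the `Light` struct, the `LightType` tags, and `emittedPower` are hypothetical names made up for illustration, and the "power" formula is only a stand-in for a real light query. In CUDA the two functions would presumably be marked `__host__ __device__`.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical light record: one plain struct covers every light type,
// so a single array of these can be copied to the GPU byte for byte.
enum LightType { POINT_LIGHT, AREA_LIGHT };

struct Light {
    LightType type;
    float intensity;  // used by every type
    float area;       // used by AREA_LIGHT only
};

// One dispatch point replaces the virtual call.
// (Assumption: in CUDA this would carry __host__ __device__.)
inline float emittedPower(const Light& l) {
    switch (l.type) {
        case POINT_LIGHT: return l.intensity;
        case AREA_LIGHT:  return l.intensity * l.area;
    }
    assert(false && "unknown light type");
    return 0.0f;
}

// The render loop still walks a single array, as in the CPU version.
inline float totalPower(const Light* lights, std::size_t count) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) sum += emittedPower(lights[i]);
    return sum;
}
```

The switch is confined to one function, so the calling code keeps the "loop over one array" shape of the polymorphic version.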

I know that we can also construct objects locally in the kernel code and call virtual functions on them, but that doesn't appeal to me either, and I am afraid it will hurt performance.

So I am asking: is there a good solution for polymorphism in a CUDA kernel? For example, is there a smart way to use templates so that I can call different implementations of a method, just like in C++?

keldor314
Posts: 10
Joined: Tue Jan 10, 2012 6:56 pm

Re: Can't pass virtual function objects to CUDA kernel is a

Post by keldor314 » Sat Apr 06, 2013 5:37 am

A CPU will do polymorphism either through function pointers or through a big switch statement, depending on what the compiler thinks is fastest. Hence, you have to pay that cost there too.

It's probably worth noting that GPUs should be able to handle function pointers more efficiently than CPUs for micro-architectural reasons (if you're interested, I can explain), so be sure to give them a try.

Looping through a separate array for each light type may actually be fastest, both on CPU and GPU, assuming you don't mess up data locality in the process.
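A template can take some of the ugliness out of the array-per-type approach. The following is a minimal C++ sketch under assumed names (`PointLight`, `AreaLight`, `Scene`, `sumPower` are all hypothetical, and `power()` stands in for a real light query): each light type is a plain struct with the same member function name, and one function template generates a separate loop for each type.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical light types with no common base class and no virtual functions.
struct PointLight {
    float intensity;
    float power() const { return intensity; }
};

struct AreaLight {
    float intensity;
    float area;
    float power() const { return intensity * area; }
};

// Static polymorphism: the compiler instantiates one loop per light type,
// and the call to power() is resolved at compile time.
template <typename LightT>
float sumPower(const LightT* lights, std::size_t count) {
    assert(lights != nullptr || count == 0);
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i) sum += lights[i].power();
    return sum;
}

// Hypothetical scene holding one homogeneous array per light type.
struct Scene {
    const PointLight* points; std::size_t numPoints;
    const AreaLight*  areas;  std::size_t numAreas;
};

// The caller sees one query; the per-type loops are hidden in here.
inline float totalPower(const Scene& s) {
    return sumPower(s.points, s.numPoints) + sumPower(s.areas, s.numAreas);
}
```

Adding a light type means adding one array to `Scene` and one term to `totalPower`; the loop body itself is written only once.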

hobold
Posts: 56
Joined: Wed Dec 21, 2011 6:08 pm

Re: Can't pass virtual function objects to CUDA kernel is a

Post by hobold » Sat Apr 06, 2013 10:38 am

keldor314 wrote:It's probably worth noting that GPUs should be able to handle function pointers more efficiently than CPUs for micro-architectural reasons (if you're interested, I can explain), so be sure to give them a try.
Are you referring to the GPU's latency hiding strategies with this comment? Or something else?

graphicsMan
Posts: 167
Joined: Mon Nov 28, 2011 7:28 pm

Re: Can't pass virtual function objects to CUDA kernel is a

Post by graphicsMan » Sat Apr 06, 2013 6:34 pm

Correct me if I'm wrong, but the problem is not that polymorphic (virtual) functions won't work in CUDA, it's that if you allocate the object on the CPU side, you can't copy the pointer to the GPU the way you do with simple structs and use the polymorphic functions. This leaves you with two strategies: 1) manage polymorphism in a straight C fashion, or 2) have some kind of factory that can take CPU objects and explicitly make new GPU objects, which can then be used as normal. Again, please correct me if I'm wrong; it's been several years since I programmed in CUDA.
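The reason a host-constructed object fails is that it carries a hidden pointer to a host-side virtual table, which means nothing on the device. A rough C++ sketch of the host half of a conversion step follows; every name in it (`GpuLight`, `LightKind`, `toGpu`, `flatten`) is hypothetical. Each CPU object describes itself as a plain record with no virtual table pointer, and the resulting array is what would be copied to the GPU. On the device, a kernel could then either dispatch on `kind` (strategy 1) or construct real device-side objects from each record (strategy 2); that part is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain record with no vtable pointer; safe to copy to the device as raw bytes.
enum LightKind { KIND_POINT, KIND_AREA };

struct GpuLight {
    LightKind kind;
    float intensity;
    float area;  // meaningful for KIND_AREA only
};

// The host-side hierarchy keeps its virtual interface.
class Light {
public:
    virtual ~Light() {}
    virtual GpuLight toGpu() const = 0;  // each type describes itself as a record
};

class PointLight : public Light {
public:
    explicit PointLight(float i) : intensity_(i) {}
    GpuLight toGpu() const { GpuLight g = {KIND_POINT, intensity_, 0.0f}; return g; }
private:
    float intensity_;
};

class AreaLight : public Light {
public:
    AreaLight(float i, float a) : intensity_(i), area_(a) {}
    GpuLight toGpu() const { GpuLight g = {KIND_AREA, intensity_, area_}; return g; }
private:
    float intensity_;
    float area_;
};

// Flattens the polymorphic scene into one contiguous array of records.
inline std::vector<GpuLight> flatten(const std::vector<const Light*>& lights) {
    std::vector<GpuLight> out;
    out.reserve(lights.size());
    for (std::size_t i = 0; i < lights.size(); ++i) {
        assert(lights[i] != nullptr);
        out.push_back(lights[i]->toGpu());
    }
    return out;
}
```

The vector's contiguous storage could then be handed to an ordinary device copy, the same way simple structs are transferred.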

fursund
Posts: 11
Joined: Mon Nov 28, 2011 1:27 pm

Re: Can't pass virtual function objects to CUDA kernel is a

Post by fursund » Sat Apr 06, 2013 6:59 pm

You might be able to solve your problems with function pointers?

http://stackoverflow.com/questions/9000 ... n-pointers
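A hand-built function table is one way to read this suggestion. The sketch below is host-side C++ with hypothetical names (`Light`, `PowerFn`, `kPowerTable`); it only illustrates the indexing pattern. In CUDA the table would need to hold device addresses, which as far as I know means filling it in device code or copying the pointers from `__device__` variables, since host code cannot take the address of a `__device__` function directly.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical plain light record; `kind` indexes the function table below.
struct Light {
    int kind;
    float intensity;
    float area;
};

typedef float (*PowerFn)(const Light&);

inline float pointPower(const Light& l) { return l.intensity; }
inline float areaPower(const Light& l)  { return l.intensity * l.area; }

// Hand-built "vtable": one entry per light kind, in kind order.
static const PowerFn kPowerTable[] = { pointPower, areaPower };
static const int kNumKinds = sizeof(kPowerTable) / sizeof(kPowerTable[0]);

// Indirect call through the table replaces the virtual call.
inline float emittedPower(const Light& l) {
    assert(l.kind >= 0 && l.kind < kNumKinds);
    return kPowerTable[l.kind](l);
}
```

Compared with a switch, adding a light kind means adding a table entry instead of editing every dispatch site.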

keldor314
Posts: 10
Joined: Tue Jan 10, 2012 6:56 pm

Re: Can't pass virtual function objects to CUDA kernel is a

Post by keldor314 » Mon Apr 08, 2013 10:39 pm

hobold wrote:
keldor314 wrote:It's probably worth noting that GPUs should be able to handle function pointers more efficiently than CPUs for micro-architectural reasons (if you're interested, I can explain), so be sure to give them a try.
Are you referring to the GPU's latency hiding strategies with this comment? Or something else?
More or less - function pointers are a worst case for branch prediction, since they are heavily data dependent, and can go to any number of different addresses. Hence, you're very likely to get a branch mispredict.

So what happens on a mispredict? Basically, you have to flush all instructions in the pipeline, since these come from after the branch and are invalid if you got the branch wrong. This means something like 15 stages times maybe 5 superscalar ports = 75 instructions (actually, many CPUs are wider and have deeper pipelines). In addition, any instructions that were issued early by out-of-order execution will also be invalidated. Thus, the cost of a branch mispredict is a stall worth about 100 instructions, which is rather expensive.

Now, what about a GPU? A GPU will try to issue instructions from different threads every cycle, something like round robin among however many threads are resident on the processor, skipping over any threads that are stalled. This means that when a thread hits a branch, it is simply marked as stalled, and not issued from again until after the 20 or so cycles it takes to go through the pipeline and resolve the branch. Since it's aggressively cycling between threads, a given thread won't execute more than once every few cycles in most cases, meaning there's usually plenty of time to decode the instruction and determine whether it's a branch before issuing further instructions on that thread. This means that for a GPU, there will usually be no stall following an indirect branch (function pointer).

graphicsMan
Posts: 167
Joined: Mon Nov 28, 2011 7:28 pm

Re: Can't pass virtual function objects to CUDA kernel is a

Post by graphicsMan » Mon Apr 08, 2013 10:56 pm

Right, the badness will occur if all your threads are invoking calls on different function pointers :) Divergence is your enemy on GPU. If they all invoke the same virtual function (or function pointer), things are pretty good.

lucian
Posts: 4
Joined: Sun Mar 31, 2013 7:54 pm

Re: Can't pass virtual function objects to CUDA kernel is a

Post by lucian » Tue Apr 30, 2013 10:52 pm

Speaking of virtual functions, is there a more elegant way to implement polymorphism in an OpenGL compute shader than with switches/branching?
