AVX512 MBVH4 Traversal

Posted: Fri Sep 23, 2016 1:05 pm
by mpeterson
with intel's knl now available to everyone, people have started
asking for a native avx512 port of clpt (http://ompf2.com/viewtopic.php?f=3&t=2075).
knl seems to be the first accelerator from intel with some real power under the hood
(knf and knc were simply nonstarters). so i did an optimized implementation
of clpt for avx512 and was surprised by the outcome.
clpt is by far the fastest rt-kernel for cpus today but was never compared to gpus,
so i looked around for some numbers. not much to find! so i used the
medium numbers (viewpoint 2) from amd firerays 2 on a firepro w9100 and measured the test scenes
on cuda using the implementation from nvidia (http://www.nvidia.com/object/nvidia_res ... b_011.html)
optimized for the nv titan. to make it short: using coherent ray traversal, knl can render most of the scenes
i have around at a stable sub-1 ms into a 1024x1024 framebuffer.

rem: cuda and knl numbers are avg. values calculated out of a sequence of several thousand frames (scene fly-thru).
amd firerays is single shot.
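
For scale, the sub-millisecond figure can be turned into a ray rate with simple arithmetic. The sketch below assumes exactly one primary ray per pixel and no secondary rays, which the post does not state; the function names are made up for illustration.

```python
def rays_per_second(width, height, frame_ms, rays_per_pixel=1):
    """Ray throughput implied by a frame time (assumes a fixed ray count per pixel)."""
    rays = width * height * rays_per_pixel
    return rays * 1000.0 / frame_ms


def frame_ms_for_rate(width, height, rate, rays_per_pixel=1):
    """Inverse: frame time in ms needed to sustain `rate` rays per second."""
    rays = width * height * rays_per_pixel
    return rays * 1000.0 / rate


# 1024x1024 framebuffer rendered in 1 ms, one ray per pixel (assumption)
print(rays_per_second(1024, 1024, 1.0))  # 1048576000.0, about 1.05 G rays/s
```

Under that assumption, any frame time below 1 ms corresponds to more than roughly 1.05 billion primary rays per second.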

Image

Re: AVX512 MBVH4 Traversal

Posted: Mon Sep 26, 2016 8:29 am
by rtpt
An Nvidia GTX 1080 should be twice as fast as the
Titan. Can you please run your tests on current
hardware?

Re: AVX512 MBVH4 Traversal

Posted: Tue Sep 27, 2016 10:52 am
by jbikker
Could you also test divergent rays? Architectures that rely on caches (i.e. CPUs) seem to suffer greatly from divergent memory access, while architectures that hide latencies using many threads typically fare much better. I would be surprised to see the latest CPU-like device outperform the latest GPU device in that setting (in fact, I don't expect it to come even close).
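
One way to set up such a test is to feed the same traversal kernel two ray sets, one coherent and one divergent. The sketch below is only an illustration with hypothetical helper names: it builds pinhole-camera primary rays and uniformly random directions, then uses the mean dot product of consecutive directions as a crude coherence measure. It is not taken from clpt or any renderer mentioned in this thread.

```python
import math
import random


def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def primary_rays(width, height, fov_deg=60.0):
    """Coherent set: pinhole-camera directions, one per pixel, row-major."""
    half = math.tan(math.radians(fov_deg) / 2.0)
    dirs = []
    for y in range(height):
        for x in range(width):
            u = (2.0 * (x + 0.5) / width - 1.0) * half
            v = (1.0 - 2.0 * (y + 0.5) / height) * half
            dirs.append(normalize((u, v, -1.0)))
    return dirs


def random_rays(count, rng):
    """Divergent set: directions drawn uniformly on the unit sphere."""
    dirs = []
    for _ in range(count):
        z = rng.uniform(-1.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        r = math.sqrt(max(0.0, 1.0 - z * z))
        dirs.append((r * math.cos(phi), r * math.sin(phi), z))
    return dirs


def neighbour_coherence(dirs):
    """Mean dot product between consecutive directions (1.0 = identical)."""
    dots = [sum(a * b for a, b in zip(d0, d1)) for d0, d1 in zip(dirs, dirs[1:])]
    return sum(dots) / len(dots)


rng = random.Random(1)  # fixed seed so the divergent set is reproducible
coherent = neighbour_coherence(primary_rays(64, 64))
divergent = neighbour_coherence(random_rays(64 * 64, rng))
print(coherent, divergent)
```

The coherent set should score close to 1 and the random set close to 0; bounce rays from a real path tracer would fall somewhere in between, depending on the scene.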

Re: AVX512 MBVH4 Traversal

Posted: Wed Sep 28, 2016 1:10 am
by atlas
Price point between the devices is also a consideration; I'm not sure this is an apples-to-apples comparison. Power envelopes aside, how many GPUs can you buy for the price of a Knights Landing?

Getting over 2.5B rays/s on a CPU is exciting, though I agree we have to see the incoherent numbers.

Re: AVX512 MBVH4 Traversal

Posted: Wed Sep 28, 2016 2:07 pm
by MohamedSakr
great results, but as others said, in the divergent case the CPU will crawl (cache misses, waiting on memory...).

Re: AVX512 MBVH4 Traversal

Posted: Fri Sep 30, 2016 11:53 am
by mpeterson
yes, i would like to run the bench on the latest gpu generation, but the titan is all i have around. concerning the incoherent transport: yes, it will be a different story for sure. first of all, the implementation is not straightforward on avx512 (avx512 is pretty inflexible when it comes to random-access streaming/computation -> there is no fast way to shuffle single elements around, limited integer/int16 support etc.), so implementation time is pretty high (a clear disadvantage here). on the other hand: running our full-blown pt with the avx2 backend on knl, the performance is great. on average more than 3x compared to octane renderer on the titan (except for simple scenes).

Re: AVX512 MBVH4 Traversal

Posted: Fri Sep 30, 2016 12:00 pm
by MohamedSakr
does this test include texture access? like a standard interior scene full of textures (as the bottleneck is always memory).

Re: AVX512 MBVH4 Traversal

Posted: Tue Oct 04, 2016 10:25 am
by mpeterson
yes (nn sampling and bi-linear sampling). keep in mind that knl has 90 gb/s on pretty large mem and an extra 400 gb/s on 16 gb.
atm we are playing around with all the diff. mem. options. besides this, we are trying to run the pt as a special kind of "stand-alone app"
on knl without os noise. a lot of new stuff here to explore...
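
As a back-of-envelope check, the two bandwidth figures translate into a per-ray memory budget. The sketch assumes "gb/s" means gigabytes per second, uses the 2.5B rays/s rate quoted earlier in the thread, and treats both bandwidths as fully achievable, which real workloads will not reach; `bytes_per_ray` is a hypothetical helper.

```python
def bytes_per_ray(bandwidth_bytes_per_s, rays_per_s):
    """Average memory traffic each ray may generate before bandwidth saturates."""
    return bandwidth_bytes_per_s / rays_per_s


RATE = 2.5e9       # rays/s, figure quoted earlier in the thread
LARGE_MEM = 90e9   # bytes/s, the "pretty large mem" pool (assumed gigabytes)
FAST_MEM = 400e9   # bytes/s, the 16 GB pool (assumed gigabytes)

print(bytes_per_ray(LARGE_MEM, RATE))  # 36.0
print(bytes_per_ray(FAST_MEM, RATE))   # 160.0
```

So at that ray rate the large pool allows about 36 bytes of traffic per ray and the fast pool about 160 bytes, which is why the choice of memory mode matters for texture-heavy scenes.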

Re: AVX512 MBVH4 Traversal

Posted: Wed Oct 05, 2016 10:12 am
by MohamedSakr
mpeterson wrote: yes (nn sampling and bi-linear sampling). keep in mind that knl has 90 gb/s on pretty large mem and an extra 400 gb/s on 16 gb.
atm we are playing around with all the diff. mem. options. besides this, we are trying to run the pt as a special kind of "stand-alone app"
on knl without os noise. a lot of new stuff here to explore...
it would be interesting if you tested it on a production-ready renderer (like Cycles), as it is well known for its brute-force PT nature, and it uses embree (it has CPU/OpenCL/CUDA backends).