

Posted: Mon Nov 06, 2017 12:45 am
by Tahir007
Hi there,

I'd like to take this opportunity to advertise a tool that might be very useful for developing ray tracers or any other
compute-intensive application on the CPU. I am developing a Python package that allows you to utilize CPU SIMD instructions (SSE, AVX, AVX2, AVX-512, FMA).
Basically it is a JIT compiler that compiles simplified Python code to native x86 machine code. To make SIMD instructions usable I added
vector data types (float32x4, float32x8, float32x16, etc.), so you can easily do explicit vectorization. Before compilation I check which
instruction sets the CPU supports and select the best one. So if you want to achieve maximum performance, all you need
to do is use the widest vector types supported (float32x16, float64x8, int32x16) as much as possible, and all the magic happens automatically.
Even if your CPU only has SSE instructions you still benefit from using wide vector types because of memory locality.
This tool is still a WIP, because there is still lots of work to be done, but even at this stage it is very useful. I started developing
a path tracer just to show how this tool is used.
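To make the explicit-vectorization idea concrete, here is a rough pure-Python model of what a 4-lane vector type means: every arithmetic operation is applied to all lanes at once. This is only an illustration of the semantics, not SIMDy's actual implementation or API (the class name and methods below are made up); SIMDy would compile each lane-wise operation to a single SSE/AVX instruction.

```python
class Float64x4:
    """Toy model of a 4-lane double-precision vector (illustration only)."""

    def __init__(self, *lanes):
        # A single scalar argument is broadcast to all four lanes.
        self.lanes = list(lanes) if len(lanes) == 4 else [lanes[0]] * 4

    def __mul__(self, other):
        # Lane-wise multiply: one SIMD instruction's worth of work.
        return Float64x4(*[a * b for a, b in zip(self.lanes, other.lanes)])

    def __add__(self, other):
        # Lane-wise add.
        return Float64x4(*[a + b for a, b in zip(self.lanes, other.lanes)])

a = Float64x4(1.0, 2.0, 3.0, 4.0)
b = Float64x4(10.0)            # broadcast 10.0 to all lanes
c = a * b + Float64x4(0.5)     # two lane-wise operations
print(c.lanes)                 # [10.5, 20.5, 30.5, 40.5]
```

The point is that one expression describes four (or eight, or sixteen) parallel computations, which is exactly what a SIMD unit executes in a single instruction.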

Here is one trivial example (calculating pi with a Monte Carlo method):

Code: Select all

from multiprocessing import cpu_count
from simdy import (int64, float64, int64x4, float64x4,
                   random_float64x4, select, simdy_kernel)

@simdy_kernel
def calculate_pi(n_samples: int64) -> float64:
    inside = int64x4(0)
    for i in range(n_samples):
        # sample four points in [-1, 1] x [-1, 1] at once
        x = 2.0 * random_float64x4() - float64x4(1.0)
        y = 2.0 * random_float64x4() - float64x4(1.0)
        # add 1 in every lane whose point falls inside the unit circle
        inside += select(int64x4(1), int64x4(0), x * x + y * y < float64x4(1.0))

    nn = inside[0] + inside[1] + inside[2] + inside[3]
    result = 4.0 * float64(nn) / float64(n_samples * 4)
    return result

# the kernel runs on every core, giving one estimate per core
result = calculate_pi(int64(25_000_000))
print(sum(result) / cpu_count())
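For reference, the same estimator can be written in plain scalar Python, which is handy for checking the kernel's logic (this is a hypothetical reference version, not SIMDy code, and it will of course be much slower):

```python
import random

def calculate_pi_scalar(n_samples: int) -> float:
    # Scalar Monte Carlo estimator: count points that fall inside the
    # unit circle and scale by the area of the enclosing square.
    rng = random.Random(42)  # fixed seed so the result is reproducible
    inside = 0
    for _ in range(n_samples):
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / n_samples

print(calculate_pi_scalar(200_000))  # prints an estimate close to 3.14159
```

The SIMDy kernel above does exactly this, except it tests four points per loop iteration and keeps four partial counters that are summed at the end.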


Path Tracer -

My question for you guys is: what do you think about this tool?

When I transformed the BVH tree to BVH16 I got almost two times the performance with AVX-512 instructions. :)


Posted: Tue Nov 07, 2017 9:21 am
by mpeterson
sorry, but absolutely useless. the n+1 invocation of an "auto-vectorizer" ... and python ? what is it really good for ?


Posted: Tue Nov 07, 2017 9:46 am
by Tahir007
I don't get what you mean by
"the n+1 invocation of an "auto-vectorizer" ... and python ?"

The idea is that using just Python + SIMDy you get similar performance as if you programmed in C++.
In the project above I am developing a path tracer using just Python + SIMDy that easily outperforms C++ implementations.


Posted: Tue Nov 07, 2017 5:24 pm
by graphicsMan
Does it generate object files? IMO, this is pretty neat. It would be cool to write kernels using python, and then link those into C++ code.


Posted: Tue Nov 07, 2017 5:26 pm
by graphicsMan
NM, re-reading, it is clear that it doesn't generate object code. I think it's a nifty project, and probably a good way to learn stuff, but I think if you spend effort writing optimized C++, I'd be very surprised to see this perform similarly.


Posted: Tue Nov 07, 2017 8:55 pm
by Tahir007
Thanks for the positive opinions about the project. :)

Yes, you are right when you said you would be very surprised if this was as fast as optimized C++. I have been programming for about 15 years now, and on numerous occasions I tried to optimize some function with hand-written assembly code and the compiler always beat me, but I learned a lot in the process. Over the years I got better at assembly, but I still admit that C++ compilers generate better code than I do. When you turn to SIMD instructions, though, things suddenly change: the programmer is responsible for writing the SIMD intrinsics, so now I compete with other programmers and not with the compiler.
And also, because I am doing JIT compilation I have a lot more context to work with, because I know exactly what CPU you have. So in the end it's not clear which code will be faster; that's why I said you get similar performance as optimized C++. :)

Now I will show a simple example just to see exactly what is going on and how SIMDy works. The example below is trivial, but it shows one of the biggest advantages of SIMDy: it adapts to different instruction sets automatically. Depending on your CPU's capabilities, AVX-512, AVX2, AVX or SSE will be used for handling the float64x8 data type. The best thing here is that the programmer doesn't have to care which CPU it runs on; it just works. Even if your CPU
only has SSE instructions you still benefit from the float64x8 type because of memory locality. Hint: for best performance always use float64x8. :)
Here I explicitly set AVX-512 as the preferred instruction set, because the current default is AVX2; this will be fixed in the next version and the default will
be AVX-512.

Code: Select all

from simdy import Kernel, float64x8, ISet

source = """
a = b * c + float64x8(2.0)
"""

args = [('a', float64x8()), ('b', float64x8()), ('c', float64x8())]
# the default instruction set is currently AVX2, so request AVX-512 explicitly
k = Kernel(source, args=args, iset=ISet.AVX512)

# set values for the kernel's parameters
k.set_value('b', float64x8(2.0))
k.set_value('c', float64x8(3.0))

# you can of course inspect the generated assembly code if you want
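The selection logic behind "pick the best instruction set the CPU supports" can be sketched in a few lines. This is a hypothetical illustration of the dispatch idea, not SIMDy's internal code (the function and set names are made up):

```python
# Instruction sets ordered from widest/newest to oldest.
PREFERENCE = ["AVX512", "AVX2", "AVX", "SSE"]

def select_iset(supported):
    """Return the most capable instruction set out of those the CPU reports."""
    for iset in PREFERENCE:
        if iset in supported:
            return iset
    raise RuntimeError("no supported SIMD instruction set found")

print(select_iset({"SSE", "AVX", "AVX2"}))  # AVX2
```

A JIT compiler can run this check once at compile time on the actual host CPU, which is exactly the extra context an ahead-of-time C++ build does not have unless you ship multiple code paths.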

Yes, you can write kernels in Python and use them from C++, but in that case you must embed Python in your project and call it from there.
Communication between Python and C++ can go in both directions; people usually are not aware of this. :)
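For the other direction (Python calling native code) the standard library already covers the basics. Here is a small ctypes sketch that calls `cos` from the C math library; it assumes a Unix-like system where libm (or the running process's own symbols) can be loaded:

```python
import ctypes
import ctypes.util

# Locate the C math library; fall back to the symbols of the running
# process (on Linux, CPython itself is linked against libm).
path = ctypes.util.find_library("m")
libm = ctypes.CDLL(path) if path else ctypes.CDLL(None)

# Declare the C signature: double cos(double)
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

print(libm.cos(0.0))  # 1.0
```

Embedding in the other direction (C++ hosting the interpreter) goes through the CPython C API, but the ctypes route is usually the quicker way to mix the two.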