Monthly Archives: March 2014

Increasing performance with static polymorphism (and other neat tricks)

So I wrote a serialisation library in D. I initially wrote it to try and understand and see the benefits of the compile-time reflection available in the language. I based it off of the library I wrote in C++11, and as such my brain already had the design in its head. This design was in turn inspired by similar code a colleague had written at work. So when I wrote the code, I used dynamic polymorphism, which in D means classes.

That usage of the new operator bugged me though. In real code (such as here) the objects doing the serialisation were always actually short-lived. Enter function, do the job, exit. It seemed like a waste to allocate memory on the garbage-collected heap and I started thinking about transforming them into structs, which can live on the stack. Then it dawned on me: dynamic polymorphism wasn’t ever actually needed. It’s never the case that the code doesn’t know whether it wants to marshall or unmarshall a value. That decision is always made at compile-time so the the cost of the virtual functions and garbage collection was being paid for nothing. The other place that was using GC allocation was the serialiser itself, which was appending to a dynamic array. It was the simplest thing that would work, so that’s what I started out with. Of course, the same realisation could’ve happened whilst maintaing the C++ version and there also it would’ve been possible to convert, but it’s so much pain do it in C++ and so easy in D that it just happens naturally.

I converted the codebase to use structs and template functions instead, breaking backwards compatibility with the old (V0.4.x) version of Cerealed. In the process, I ended up discovering the Appender struct in std.array. This happened as a result of trying to do policy-based design so I could separate the algorithm (in this case, how to marshall) from the process of actually writing to an OutputRange (see std.range). I had also recently read about warp and the new ScopeBuffer in Phobos (the D standard library) and wanted to see how these new additions would affect performance. I wrote a small, not particularly well-written test program, which can be seen in this gist. I used both gdc and dmd. I left ldc out because its frontend (at least the package currently available on Arch Linux) is older and can’t compile the code, and I didn’t feel like making alterations just to see how well it would do.

The results are presented in the tables below. I left out standard deviations because they were too small for nearly every measurement, so the values are just averages of a few different runs. “Classes” is the original V0.4.1 Cerealed OOP code, “Structs” is the code with structs instead of classes but still using a dynamic array, “Appender” and “ScopeBuffer” use the Phobos structs mentioned above. Since ScopeBuffer isn’t part of my distribution of Phobos, I copied it to the project instead. That way it can be compiled by other people who, like me, don’t compile their own versions of the compiler and standard library. I did 25M loops for serialisation and 75M loops for deserialisation. I compiled using (g)dmd with options -O -release -inline -noboundscheck.

Cerealiser (seconds, lower is better) dmd gdc
Classes 22.8 16.1
Structs (dynamic array) 19.9 14.0
Appender 16.1 9.5
ScopeBuffer 4.6 4.5

 

Decerealiser (seconds, lower is better) dmd gdc
Classes 21.7 17.3
Structs 8.9 9.9

 

For unmarshalling, I only compared classes vs. structs. The reason is that unmarshalling didn’t allocate memory (it uses whatever slice is passed to it so allocation is the responsibility of the client code), so there wasn’t much I could do to improve performance. Even then, that slight alteration causes a dramatic reduction in the time spent deserialising, with dmd making it more than twice as fast. Inlining is great!

For marshalling, the results between the slowest and fastest version are nearly a factor of 4-5! ScopeBuffer made such a difference, despite me using, on purpose, a static array that was too small to hold the struct so it had to allocate. I tried with a larger array and there was no difference in performance.  Most of the  results shows gdc generating more efficient code for most cases. The real lesson here is that using the right algorithm for the job (in this case, ScopeBuffer), makes a much larger difference than everything else.

I’m really happy with the results. None of the new V0.5.0 and newer Cerealed code needs to use the garbage collector anymore and I even added a convenience function called cerealise to use ScopeBuffer. It works by passing in a lambda to act on the resulting byte array and is templated on the size of the static array, which has a default of 32 bytes.

I could go back and do something similar for the C++ version but… that would be a lot more work (moving everything into the headers alone would take quite some time), I don’t really have any projects that require super fast serialisation and these days I just want to hack on D anyway. I still want to finish the networking part of my game (in progress and won’t compile on Windows, but there are binaries for the old version on sourceforge), which means some more C++, but after that I’ll avoid it when I can. Unless the alternative is C, of course. I really dislike C.

All in all, it’s unlikely the bottleneck of any app using Cerealed will be the serialisation, but if it is… it just got a whole lot faster.

Tagged , , ,