EDIT: As pointed out in the Rust reddit thread, the Rust version can be modified to run faster due to a suble difference between
..(z+1). I’ve updated the measurements with rustc0 being the former and rustc1 being the latter. I’ve also had to change some of the conclusions.
You may have recently encountered and/or read this blog post criticising a possible C++20 implementation of Pythagorean triples using ranges. In it the author benchmarks different implemetations of the problem, comparing readability, compile times, run times and binary sizes. My main language these days is D, and given that D also has ranges (and right now, as opposed to a future version of the language), I almost immediately reached for my keyboard. By that time there were already some D and Rust versions floating about as a result of the reddit thread, so fortunately for lazy me “all” I had to next was to benchmark the lot of them. All the code for what I’m about to talk about can be found on github.
As fancy/readable/extensible as ranges/coroutines/etc. may be, let’s get a baseline by comparing the simplest possible implementation using for loops and printf. I copied the “simplest.cpp” example from the blog post then translated it to D and Rust. To make sure I didn’t make any mistakes in the manual translation, I compared the output to the canonical C++ implementation (output.txt in the github repo). It’s easy to run fast if the output is wrong, after all. For those not familiar with D, dmd is the reference compiler (compiles fast, generates sub-optimal code compared to modern backends) while ldc is the LLVM based D compiler (same frontend, uses LLVM for the backend). Using ldc also makes for a more fair comparison with clang and rustc due to all three using the same backend (albeit possibly different versions of it).
All times reported will be in milliseconds as reported by running commands with
time on my Arch Linux Dell XPS15. Compile times were measured by excluding the linker, the commands being
clang++ -c src.cpp,
dmd -c src.d,
ldc -c src.d,
rustc --emit=obj src.rs for each compiler (obviously for unoptimised builds). The original number of iterations was 100, but the runtimes were so short as to be practically useless for comparisons, so I increased that to 1000. The times presented are a result of running each command 10 times and taking the mininum, except for the clang unoptmised build of C++ ranges because, as a wise lady on the internet once said, ain’t nobody got time for that (you’ll see why soon). Optimised builds were done with
-O2 for clang and ldc,
-O -inline for dmd and
-C opt-level=2 for rustc. The compiler versions were clang 7.0.1, dmd 2.083.1, ldc 1.13.0 and rustc 1.31.1. The results for the simple, readable, non-extensible version of the problem (simple.cpp, simple.d, simple.rs in the repo):
Simple CT (ms) RT (ms) clang debug 59 599 clang release 69 154 dmd debug 11 369 dmd release 11 153 ldc debug 31 599 ldc release 38 153 rustc0 debug 100 8445 rustc0 release 103 349 rustc1 debug 6650 rustc1 release 217
C++ run times are the same as D when optimised, with compile times being a lot shorter for D. Rust stands out here for both being extremely slow without optimisations, compiling even slower than C++, and generating code that takes around 50% longer to run even with optimisations turned on! I didn’t expect that at all.
The simple implementation couples the generation of the triples with what’s meant to be done with them. One option not discussed in the original blog to solve that is to pass in a lambda or other callable to the code instead of hardcoding the printf. To me this is the simplest way to solve this, but according to one redditor there are composability issues that may arise. This might or might not be important depending on the application though. One can also compose functions in a pipeline and pass that in, so I’m not sure what the problem is. In any case, I wrote 3 implementations and timed them (lambda.cpp, lambda.d, lambda.rs):
Lambda CT (ms) RT (ms) clang debug 188 597 clang release 203 154 dmd debug 33 368 dmd release 37 154 ldc debug 59 580 ldc release 79 154 rustc0 debug 111 9252 rustc0 release 134 352 rustc1 debug 6811 rustc1 release 154
The first thing to notice is, except for Rust (which was slow anyway), compile times have risen by about a factor of 3: there’s a cost to being generic. Run times seem unaffected except that the unoptimised Rust build got slightly slower. I’m glad that my intuition seems to have been right on how to extend the original example: keep the for loops, pass a lambda, profit. Sure, the compile-times went up but in the context of a larger project this probably wouldn’t matter that much. Even for C++, 200ms is… ok. Performance-wise, it’s now a 3-way tie between the languages, and no runtime penalty compared to the non-generic version. Nice!
Next up, the code that motivated all of this: ranges. I didn’t write any of it: the D version was from this forum post, the C++ code is in the blog (using the “available now” ranges v3 library), and the Rust code was on reddit. Results:
Range CT (ms) RT (ms) clang debug 4198 126230 clang release 4436 294 dmd debug 90 12734 dmd release 106 7755 ldc debug 158 15579 ldc release 324 1045 rustc0 debug 128 11018 rustc0 release 180 422 rustc1 debug 8469 rustc1 release 168
I have to hand it to rustc – whatever you throw at it the compile times seem to be flat. After modifying the code as mentioned in the edit at the beginning, it’s now the fastest out of the 3!
dmd compile times are still very fast, but the generated code isn’t. ldc does better at optimising, but the runtimes are pretty bad for both of them. I wonder what changes would have to be made to the frontend to generate the same code as for loops.
C++? I only ran the debug version once. I’m just not going to wait for that long, and besides, whatever systematic errors are in the measurement process don’t really matter when it takes over 2 minutes. Compile times are abysmal, the only solace being that optimising the code only takes 5% longer time than to not bother at all.
In my opinion, none of the versions are readable. Rust at least manages to be nearly as fast as the previous two versions, with C++ taking twice as long as it used to for the same task. D didn’t fare well at all where performance is concerned. It’s possible to get the ldc optimised version down to ~700ms by eliminating bounds checking, but even then it wouldn’t seem to be worth the bother.
My conclusion? Don’t bother with the range versions. Just… don’t. Nigh unreadable code that compiles and runs slowly.
Finally, I tried the D generator version on reddit. There’s a Rust version on Hacker News as well, but it involves using rust nightly and enabling features, and I can’t be bothered figuring out how to do any of that. If any Rustacean wants to submit something that works with a build script, please open a PR on the repo.
Generator CT (ms) RT (ms) dmd debug 208 381 dmd release 222 220 ldc debug 261 603 ldc release 335 224
Compile times aren’t great (the worst so far for D), but the runtimes are quite acceptable. I had a sneaky suspicion that the runtimes here are slower due to startup code for std.concurrency so I increased N to 5000 and compared this version to the simplest possible one at the top of this post:
N=5k RT (ms) dmd relase simple 8819 dmd release generator 8875
As expected, the difference vanished.
Final conclusions: for me, passing a lambda is the way to go, or generators if you have them (C++ doesn’t have coroutines yet and the Rust version mentioned above needs features that are apparently only available in the nightly builds). I like ranges in both their C++ and D incarnations, but sometimes they just aren’t worth it. Lambdas/generators FTW!