Comparing Pythagorean triples in C++, D, and Rust

EDIT: As pointed out in the Rust reddit thread, the Rust version can be modified to run faster due to a subtle difference between ..=z and ..(z+1). I’ve updated the measurements, with rustc0 being the former and rustc1 the latter. I’ve also had to change some of the conclusions.
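For the curious: the two spellings iterate over exactly the same values, so the difference was purely in the code rustc generated for the inclusive bound at the time. A minimal sketch demonstrating the equivalence (function names here are illustrative, not from the repo):

```rust
// Count triples up to a fixed z using each range spelling.
// Both visit exactly the same values; `..=z` merely compiled to
// slower code because of how the inclusive bound was tracked.
fn count_inclusive(max_z: u32) -> usize {
    let mut n = 0;
    for z in 1..=max_z {
        for x in 1..=z {
            for y in x..=z {
                if x * x + y * y == z * z {
                    n += 1;
                }
            }
        }
    }
    n
}

fn count_half_open(max_z: u32) -> usize {
    let mut n = 0;
    for z in 1..(max_z + 1) {
        for x in 1..(z + 1) {
            for y in x..(z + 1) {
                if x * x + y * y == z * z {
                    n += 1;
                }
            }
        }
    }
    n
}

fn main() {
    assert_eq!(count_inclusive(100), count_half_open(100));
    println!("both spellings agree");
}
```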

You may have recently encountered and/or read this blog post criticising a possible C++20 implementation of Pythagorean triples using ranges. In it the author benchmarks different implementations of the problem, comparing readability, compile times, run times and binary sizes. My main language these days is D, and given that D also has ranges (and right now, as opposed to in a future version of the language), I almost immediately reached for my keyboard. By that time there were already some D and Rust versions floating about as a result of the reddit thread, so fortunately for lazy me “all” I had to do next was to benchmark the lot of them. All the code for what I’m about to talk about can be found on github.

As fancy/readable/extensible as ranges/coroutines/etc. may be, let’s get a baseline by comparing the simplest possible implementation using for loops and printf. I copied the “simplest.cpp” example from the blog post then translated it to D and Rust. To make sure I didn’t make any mistakes in the manual translation, I compared the output to the canonical C++ implementation (output.txt in the github repo). It’s easy to run fast if the output is wrong, after all. For those not familiar with D, dmd is the reference compiler (compiles fast, generates sub-optimal code compared to modern backends) while ldc is the LLVM based D compiler (same frontend, uses LLVM for the backend). Using ldc also makes for a more fair comparison with clang and rustc due to all three using the same backend (albeit possibly different versions of it).
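For reference, the straightforward Rust translation is along these lines (a sketch with illustrative names; the exact code is simple.rs in the repo):

```rust
// Brute force: walk z upwards, and for each z try every x <= y <= z.
// Collect the first `limit` Pythagorean triples in discovery order.
fn triples(limit: usize) -> Vec<(u32, u32, u32)> {
    let mut out = Vec::with_capacity(limit);
    for z in 1u32.. {
        for x in 1..=z {
            for y in x..=z {
                if x * x + y * y == z * z {
                    out.push((x, y, z));
                    if out.len() == limit {
                        return out;
                    }
                }
            }
        }
    }
    unreachable!()
}

fn main() {
    for (x, y, z) in triples(1000) {
        println!("{}, {}, {}", x, y, z);
    }
}
```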

All times reported will be in milliseconds as reported by running commands with time on my Arch Linux Dell XPS15. Compile times were measured by excluding the linker, the commands being clang++ -c src.cpp, dmd -c src.d, ldc -c src.d, rustc --emit=obj src.rs for each compiler (obviously for unoptimised builds). The original number of iterations was 100, but the runtimes were so short as to be practically useless for comparisons, so I increased that to 1000. The times presented are a result of running each command 10 times and taking the minimum, except for the clang unoptimised build of C++ ranges because, as a wise lady on the internet once said, ain’t nobody got time for that (you’ll see why soon). Optimised builds were done with -O2 for clang and ldc, -O -inline for dmd and -C opt-level=2 for rustc. The compiler versions were clang 7.0.1, dmd 2.083.1, ldc 1.13.0 and rustc 1.31.1. The results for the simple, readable, non-extensible version of the problem (simple.cpp, simple.d, simple.rs in the repo):

Simple          CT (ms)  RT (ms)

clang debug      59       599
clang release    69       154
dmd debug        11       369
dmd release      11       153
ldc debug        31       599
ldc release      38       153
rustc0 debug    100      8445
rustc0 release  103       349
rustc1 debug             6650
rustc1 release            217

C++ run times are the same as D when optimised, with compile times being a lot shorter for D. Rust stands out here for both being extremely slow without optimisations, compiling even slower than C++, and generating code that takes around 50% longer to run even with optimisations turned on! I didn’t expect that at all.

The simple implementation couples the generation of the triples with what’s meant to be done with them. One option to solve that, not discussed in the original blog post, is to pass a lambda or other callable into the code instead of hardcoding the printf. To me this is the simplest way to solve the problem, but according to one redditor there are composability issues that may arise. Whether that matters depends on the application, though. One can also compose functions in a pipeline and pass that in, so I’m not sure what the problem is. In any case, I wrote three implementations and timed them (lambda.cpp, lambda.d, lambda.rs):
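The idea, in Rust terms, is roughly this (an illustrative sketch, not the exact lambda.rs from the repo):

```rust
// Decouple generation from consumption: the caller decides what to
// do with each triple by passing a closure.
fn each_triple<F: FnMut(u32, u32, u32)>(limit: usize, mut f: F) {
    let mut count = 0;
    for z in 1u32.. {
        for x in 1..=z {
            for y in x..=z {
                if x * x + y * y == z * z {
                    f(x, y, z);
                    count += 1;
                    if count == limit {
                        return;
                    }
                }
            }
        }
    }
}

fn main() {
    // Same output as the hardcoded version, but printing is now
    // the caller's choice rather than baked into the loop.
    each_triple(1000, |x, y, z| println!("{}, {}, {}", x, y, z));
}
```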

Lambda          CT (ms)  RT (ms)

clang debug      188       597
clang release    203       154
dmd debug         33       368
dmd release       37       154
ldc debug         59       580
ldc release       79       154
rustc0 debug     111      9252
rustc0 release   134       352
rustc1 debug              6811
rustc1 release             154

The first thing to notice is that, except for Rust (which was slow anyway), compile times have risen by about a factor of three: there’s a cost to being generic. Run times seem unaffected, except that the unoptimised Rust build got slightly slower. I’m glad my intuition on how to extend the original example seems to have been right: keep the for loops, pass a lambda, profit. Sure, the compile times went up, but in the context of a larger project this probably wouldn’t matter much. Even for C++, 200ms is… ok. Performance-wise, it’s now a three-way tie between the languages, with no runtime penalty compared to the non-generic version. Nice!

Next up, the code that motivated all of this: ranges. I didn’t write any of it: the D version was from this forum post, the C++ code is in the blog (using the “available now” ranges v3 library), and the Rust code was on reddit. Results:
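For flavour, the Rust iterator-chain approach looks something like this (a sketch in the spirit of the reddit version, not a verbatim copy):

```rust
// Build the triples lazily as a single iterator pipeline, the
// Rust analogue of composing ranges.
fn triples() -> impl Iterator<Item = (u32, u32, u32)> {
    (1u32..).flat_map(|z| {
        (1..=z).flat_map(move |x| {
            (x..=z)
                .filter(move |&y| x * x + y * y == z * z)
                .map(move |y| (x, y, z))
        })
    })
}

fn main() {
    // `take` bounds the otherwise infinite stream.
    for (x, y, z) in triples().take(1000) {
        println!("{}, {}, {}", x, y, z);
    }
}
```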

Range           CT (ms)   RT (ms)

clang debug     4198     126230
clang release   4436        294
dmd debug         90      12734
dmd release      106       7755
ldc debug        158      15579
ldc release      324       1045
rustc0 debug     128      11018
rustc0 release   180        422
rustc1 debug               8469
rustc1 release              168

I have to hand it to rustc: whatever you throw at it, the compile times seem to be flat. After modifying the code as mentioned in the edit at the beginning, it’s now the fastest of the three!

dmd compile times are still very fast, but the generated code isn’t. ldc does better at optimising, but the runtimes are pretty bad for both of them. I wonder what changes would have to be made to the frontend to generate the same code as for loops.

C++? I only ran the debug version once. I’m just not going to wait that long, and besides, whatever systematic errors are in the measurement process don’t really matter when it takes over 2 minutes. Compile times are abysmal, the only solace being that optimising the code takes only about 5% longer than not bothering at all.

In my opinion, none of the versions are readable. Rust at least manages to be nearly as fast as the previous two versions, with C++ taking twice as long as it used to for the same task. D didn’t fare well at all where performance is concerned. It’s possible to get the ldc optimised version down to ~700ms by eliminating bounds checking, but even then it wouldn’t seem to be worth the bother.

My conclusion? Don’t bother with the range versions. Just… don’t. Nigh unreadable code that compiles and runs slowly.

Finally, I tried the D generator version on reddit. There’s a Rust version on Hacker News as well, but it involves using Rust nightly and enabling features, and I can’t be bothered figuring out how to do any of that. If any Rustacean wants to submit something that works with a build script, please open a PR on the repo.

Generator      CT (ms)  RT (ms)

dmd debug      208      381
dmd release    222      220
ldc debug      261      603
ldc release    335      224

Compile times aren’t great (the worst so far for D), but the runtimes are quite acceptable. I had a sneaking suspicion that the runtimes here were slower due to startup code for std.concurrency, so I increased N to 5000 and compared this version to the simplest possible one at the top of this post:

N=5k                    RT (ms)
dmd release simple       8819
dmd release generator   8875

As expected, the difference vanished.

Final conclusions: for me, passing a lambda is the way to go, or generators if you have them (C++ doesn’t have coroutines yet, and the Rust version mentioned above needs features that are apparently only available in the nightly builds). I like ranges in both their C++ and D incarnations, but sometimes they just aren’t worth it. Lambdas/generators FTW!


11 thoughts on “Comparing Pythagorean triples in C++, D, and Rust”

  1. […] Comparing Pythagorean triples in C++, D, and Rust 5 by atilaneves | 0 comments on Hacker News. […]

  7. Pierre says:

    I am not sure what you did. simple.c time is about 99.9% in the printf function, i.e. in a library provided with the OS. the first 1000 pythagorean triples can be found in half a million cycles, i.e. less than 300 microseconds at 2GHz. In the same order of overheads, running a command 10 times does only measures the time it takes to load a binary from a hard drive, which is again immensely larger than the problem being measured (about 2 milliseconds to load the small binary). I am afraid you did not compare the compiler, but the time it takes to load the binaries with and without debug information.

    • atilaneves says:

      > I am not sure what you did.

      There’s an entire blog post above your comment explaining exactly what it is that I did, with a link to the code.

      > simple.c

      I assume you mean simple.cpp.

      > is about 99.9% in the printf function

Not according to perf, it’s not: I just ran it and it spends 0.32% of its time in printf. I commented out the printf and ran it again: nearly no difference in runtime.

      > the first 1000 pythagorean triples can be found in half a million cycles, i.e. less than 300 microseconds at 2GHz

      I don’t know where you got these numbers from. In any case, they’re demonstrably wrong: it takes ~130ms on my machine.

      > In the same order of overheads, running a command 10 times does only measures the time it takes to load a binary from a hard drive, which is again immensely larger than the problem being measured (about 2 milliseconds to load the small binary).

      You have the numbers right there in the blog post. The ~2ms it takes to run a C binary that does nothing is drowned out by the actual runtimes.

      > I am afraid you did not compare the compiler,

      Correct, I did not “compare the compiler”. I compared run and compile times for equivalent code emitted from different compilers.

      > but the time it takes to load the binaries with and without debug information.

      What debug information? The command line arguments used to compile are right there in the blog post. There’s not one instance of `-g` in sight.

      • Pierre says:

I apologize, you are right. I confused myself with a compiler implementing strength reduction based on arithmetic series, geometric series and linear recurrences, replacing loop variants with equivalent operations (this requires breaking the integer overflow assumptions and constraints in the C standards), and with ms units (millisec and microsec).

What I did is related to ranges: in the simple.c code, every second iteration of the inner loop will never produce a result. This is provable by checking the parity (the least significant bit) of x, y, and z.

        This is the original code:

        for (y = x; x*x + y*y <= z*z; y += 1)

The following is equivalent and logically twice as fast:

        for (y = x + (z & 1); x*x + y*y <= z*z; y += 2)

        (let the program run long enough, the first wrong result after integer overflow will be different)

Another optimization shows that the inner loop does nothing unless y*y == z*z - x*x.
Checking the 3 least significant bits required for z*z - x*x to be a perfect square reduces the possible range of "useful" values for x.

There is at most 1 value of y which matches the constraint. Mathematically, it happens when the loop variant y == integer_square_root(z*z - x*x). (Again, this is not true modulo 2^32, which is required by C standards on integer overflows.) But formally, the inner loop is gone….
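The parity claim is easy to check empirically. A quick Rust sketch (illustrative, not from the thread): since x*x + y*y == z*z forces y to have the same parity as x + z, stepping y by 2 from x + (z & 1) visits every solution the original loop finds.

```rust
// All solutions y in x..=z for a given (z, x), via the original loop.
fn ys_brute(z: u32, x: u32) -> Vec<u32> {
    (x..=z).filter(|&y| x * x + y * y == z * z).collect()
}

// Same, but starting at x + (z & 1) and stepping by 2, per the
// parity argument above.
fn ys_parity(z: u32, x: u32) -> Vec<u32> {
    ((x + (z & 1))..=z)
        .step_by(2)
        .filter(|&y| x * x + y * y == z * z)
        .collect()
}

fn main() {
    // Exhaustively compare the two inner loops for small z.
    for z in 1u32..=200 {
        for x in 1..=z {
            assert_eq!(ys_brute(z, x), ys_parity(z, x));
        }
    }
    println!("parity-stepped loop agrees with the original up to z = 200");
}
```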
