So I wrote an MQTT broker once (twice really, but I never really finished the second version). It’s now my go-to way of learning a new computer language. Once Rust finally makes it to version 1.0, I’ll write an MQTT broker in it as well. It’s a problem I know well now, it tackles the hairy problems of networking and concurrency, and it’s small enough to not become a huge time sink. So that’s how I decided to try and learn Haskell.
With the immense paradigm shift that came with learning Haskell came a much longer development time. I’m ok with that, especially given that the code is so much shorter. This time I decided to BDD the whole thing, using Cucumber to drive the acceptance tests and writing unit tests for the rest.
After much toiling and trying to wrap my head around monads (I finally understand them! It only took months of reading multiple different blog posts and using them…), I got a version that compiled and passed all tests. A working MQTT broker! So let’s run Jeff’s benchmark on it to see how it compares with the implementations in other languages and… oops.
The program hangs. I try again with 10 messages and 2 connections. It still hangs. How can this be? I did my due diligence, didn’t I? Aren’t Haskell programs just supposed to work if they compile? I mean, I had to jump through hoops to write a function to convert character literals into Word8 values so I could write my tests! After a lot of “printf” debugging (the Haskell debugging experience is not ideal) I find what’s causing the bug, and as always, it seems obvious in hindsight.
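For readers wondering what a character-to-Word8 helper involves, it can be tiny. The sketch below is only an illustration: `charsToBytes` and `topicBytes` are hypothetical names, not necessarily what the broker's tests use, and the code assumes ASCII-only input.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical helper: turn an ASCII string into the bytes it would be on the wire.
-- Assumption: every character has a code point below 256; fromIntegral wraps otherwise.
charsToBytes :: String -> [Word8]
charsToBytes = map (fromIntegral . ord)

-- With it, test data can be written as readable text instead of raw numbers.
topicBytes :: [Word8]
topicBytes = charsToBytes "a/b"
```

Something like this keeps the expected bytes in a test legible, which matters once packets contain topic names and payloads.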
Networking is dirty, ugly and variable. Not the kind of thing that lends itself well to being unit-tested. So I cheated a bit. The part that dealt with client connections and whether or not the server should disconnect went into the main loop and wasn’t unit tested at all. I felt bad about it at the time but let it go on one of those “it’s ok, I know what I’m doing” feelings. And as usual, I break my own rules at my own peril.
I had an acceptance test for it, and it passed. It just wasn’t comprehensive enough. The TCP traffic has to be just so to trigger the bug, and it’s actually hard to craft packets that look like real-world usage scenarios. Who wants to have a list of bytes 512 bytes long in their test code, much less several of them?
My conclusion about the error of my ways? Not enough unit-testing. Too much code outside the nice, warm and fuzzy pure core. The feeling I shook off about the dirty networking code not being unit tested? I’m never doing that again. The fix is going to be relatively simple: purify as much of the code as I can so it’s trivially unit-testable and have the “real” code be a thin wrapper over the pure code.
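To make the "pure core, thin wrapper" idea concrete, here is a minimal sketch, not the broker's actual code. It assumes the awkward part is that TCP hands over bytes in arbitrary chunks, so a read may contain half a packet or several at once. The names `remainingLength`, `splitPackets`, `connectionLoop`, `readChunk` and `handlePacket` are all made up for this example. The only MQTT-specific knowledge used is the fixed header layout: one type/flags byte followed by a variable-length "remaining length" field.

```haskell
import Data.Bits (testBit, (.&.))
import Data.Word (Word8)

-- Decode the MQTT "remaining length" field (the bytes after the first header byte).
-- Returns (length, bytes used by the field), or Nothing if the field is incomplete.
-- Simplification: a real broker should also reject fields longer than 4 bytes.
remainingLength :: [Word8] -> Maybe (Int, Int)
remainingLength = go 0 1 1
  where
    go :: Int -> Int -> Int -> [Word8] -> Maybe (Int, Int)
    go _ _ _ [] = Nothing
    go acc mult used (b : bs)
      | testBit b 7 = go acc' (mult * 128) (used + 1) bs  -- continuation bit set
      | otherwise   = Just (acc', used)
      where
        acc' = acc + fromIntegral (b .&. 0x7F) * mult

-- Pure core: split buffered bytes into complete packets plus the unfinished tail.
splitPackets :: [Word8] -> ([[Word8]], [Word8])
splitPackets bytes =
  case bytes of
    (_ : rest)
      | Just (len, used) <- remainingLength rest
      , let total = 1 + used + len
      , length bytes >= total ->
          let (packet, leftover) = splitAt total bytes
              (more, final)      = splitPackets leftover
          in (packet : more, final)
    _ -> ([], bytes)  -- not enough bytes for a whole packet yet

-- Thin impure shell. 'readChunk' and 'handlePacket' are placeholders for the
-- real socket read and broker logic; an empty chunk is treated as "closed".
connectionLoop :: IO [Word8] -> ([Word8] -> IO ()) -> [Word8] -> IO ()
connectionLoop readChunk handlePacket buffered = do
  chunk <- readChunk
  if null chunk
    then return ()
    else do
      let (packets, leftover) = splitPackets (buffered ++ chunk)
      mapM_ handlePacket packets
      connectionLoop readChunk handlePacket leftover
```

With this split, the "traffic has to be just so" cases become ordinary unit tests: feed `splitPackets` a PINGREQ (`[0xC0, 0x00]`) followed by the first few bytes of a PUBLISH and check that one packet comes out and the rest stays buffered. No sockets are needed, and the IO loop is small enough that there is little left in it to get wrong.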
The worst part is that this was already my belief about how to develop robust software. I just didn’t follow my own advice.