It works on my machine

Apr 01, 2024 · 8 mins read

“But it worked on my machine!” usually points to programmer error or some quirk of configuration working behind the scenes to foil what would otherwise be smoothly executing software. In the overwhelming majority of cases, buggy software is the result of poor design, incorrect configuration, or bad communication (source: personal experience); however, there are some once-in-a-blue-moon events that can also cause a lot of frustration. I’ll describe a few that I find interesting; there are many others not listed here, and I’m sure there are many more I will encounter at some point in my career. The point of this post isn’t to provide potential scapegoats for badly implemented/documented/deployed code. Instead, I hope it’s a reminder that

  • Hardware and software are inextricably linked, and designers must be aware of how software is impacted by the hardware it runs on.
  • Even well-designed code can have unexpected failures. Plan for them.

Now, some examples of things to think about when writing and debugging:

Bit Flips & Checksums

Bit flips can be caused by cosmic rays, and scientists at NASA and SpaceX spend a lot of time trying to reduce the impact of cosmic rays on electronics in space, where cosmic rays like to hang out. Earth’s atmosphere does a good job protecting our devices from most of them (the rays... not the scientists), but some still make it through and end up impacting your game of Super Mario 64.
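To make “bit flip” concrete, here’s a toy C++ sketch of my own (assuming 32-bit IEEE-754 floats, and not modeled on any real incident): flipping a single bit in a float’s exponent turns 100 into several million.

```cpp
// Toy illustration of a single-event upset: flip one bit of a float's
// exponent and the value changes by orders of magnitude.
// Assumes 32-bit IEEE-754 floats; not modeled on any specific incident.
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    float altitude = 100.0f;

    uint32_t bits;
    std::memcpy(&bits, &altitude, sizeof bits);  // view the float's raw bits
    bits ^= (1u << 27);                          // "cosmic ray" flips one exponent bit
    std::memcpy(&altitude, &bits, sizeof bits);

    std::printf("after one bit flip: %g\n", altitude);  // ~6.55 million, not 100
}
```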

Cosmic rays aren’t the only source of bit flips: overheating components, power surges, deteriorating connections, and many other things cause them too. Cosmic rays just get the most bad press because they’re the hardest to control. My point is: bit flips happen more often than you’d think, which is why computers have long been designed with protections against them, such as parity bits, checksums, and cyclic redundancy checks (CRCs). But although these methods lower the risk of undetected data corruption during transmission, they don’t eliminate it. The original proposal for the TCP checksum states:

“Even if the checksum appears good on a message which has been received, the message may still contain an undetected error. The probability of this is bounded by \(2^{-C}\) where \(C\) is the number of checksum bits.” [1]

Since the TCP checksum is \(16\) bits long, the probability that a corrupted segment slips past it undetected is bounded by \(2^{-16}\), roughly one in 65,536. If you’re a datacenter handling billions of packets regularly, then you will eventually receive TCP packets that appear valid but actually aren’t. Why isn’t there mass chaos? Because although the TCP checksum is pretty weak, the Ethernet CRC is considerably stronger, and data has to pass both checks to be treated as valid. However, there is a weak point in the system: network switches recalculate the Ethernet CRC rather than reusing it. So if a switch corrupts the packet (due to a cosmic ray, faulty hardware, etc.) and the result still satisfies the (weak) TCP checksum, that packet will be wrapped back up into an Ethernet frame with a nice, valid CRC computed by the corrupting switch. Clearly, this scenario is not very likely (though, if you’ve seen the quality of the Junos OS from Juniper Networks... maybe not that unlikely). But it can still happen, and if you design a system with the expectation that all data flowing through it will be valid, your system will eventually fail.
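To see why the TCP checksum counts as weak, here is a rough sketch (my own illustration, not production code) of the 16-bit ones’-complement “Internet checksum” arithmetic described in RFC 1071. Because the sum is commutative, swapping two 16-bit words in transit produces the exact same checksum, so that kind of corruption sails right through:

```cpp
// Sketch of the 16-bit ones'-complement Internet checksum (RFC 1071) used by
// TCP, showing one of its blind spots: reordering 16-bit words doesn't change
// the result, so that corruption goes undetected.
#include <cstdint>
#include <cstdio>
#include <vector>

uint16_t internet_checksum(const std::vector<uint16_t>& words) {
    uint32_t sum = 0;
    for (uint16_t w : words) {
        sum += w;
        sum = (sum & 0xFFFF) + (sum >> 16);  // fold the carry back into the low 16 bits
    }
    return static_cast<uint16_t>(~sum);      // final ones' complement
}

int main() {
    std::vector<uint16_t> original {0x1234, 0xABCD, 0x0042};
    std::vector<uint16_t> corrupted{0xABCD, 0x1234, 0x0042};  // two words swapped in transit

    std::printf("original:  0x%04X\n", internet_checksum(original));
    std::printf("corrupted: 0x%04X\n", internet_checksum(corrupted));  // identical checksum
}
```

(The real TCP checksum also covers a pseudo-header, but the arithmetic, and the weakness, are the same.)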

False Sharing

Parallel programming is powerful because it allows multiple threads to work concurrently on different tasks. Yet even if two threads are working on logically separate tasks, they must share the same hardware, and that introduces contention. The issue occurs when two threads, running on two different cores, write to two different variables that are stored close together in memory. Normally, when two threads operate on completely different tasks, we wouldn’t expect them to impact one another. But our programs aren’t running in the idealized abstraction we thought up when writing the code; they’re running on hardware.

The issue is as follows: multiple cores exist in a device, and they all agree to cut memory up into small chunks (cache lines) that each core can cache. Loading data from memory is slow, so if some data is used repeatedly, caching saves the cores a lot of idle time waiting for loads from memory. But wait! What if multiple cores are attempting to modify the same piece of data? We need a system to ensure that the modifications made by one core get communicated to the others, so data changes don’t get overwritten. Cache coherence is that system: when a core modifies a cache line, the line is flagged as modified, so that other cores planning to use that data reload the updated value. This allows multicore code to run correctly while still benefiting from caches.

However, if you aren’t careful about how your data is laid out, two threads using different data that happens to sit physically close together in memory can end up invalidating each other’s cache lines, even though they never touch the same variable. This is false sharing, and it can result in a massive drop in performance. Suppose you have two threads, each on a different core. Thread A modifies thread_A_var and thread B modifies thread_B_var, which are unknowingly stored on the same cache line. Every time thread A modifies thread_A_var, it flags that cache line as modified, meaning thread B’s core will need to reload that cache line, wasting time. Then thread B modifies thread_B_var, which marks the line invalid for thread A’s core, which has to waste time reloading it... on and on. While this may not be a typical “bug” in the sense that your output is incorrect, it will needlessly drop your performance. It is also difficult to catch once it’s in the code, so keeping it in mind while writing parallel code is good practice. If you’re using C++, std::hardware_destructive_interference_size is intended to help you align your data on different cache lines, avoiding false sharing. Use it if you’re developing parallel code!
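As a minimal sketch of the fix (assuming C++17, and falling back to a 64-byte line when the library constant isn’t available), here’s how you might pad two counters onto separate cache lines; the names mirror the hypothetical thread_A_var/thread_B_var above:

```cpp
// Keeping each thread's counter on its own cache line so the two threads
// below don't repeatedly invalidate each other's cached data (false sharing).
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <new>      // std::hardware_destructive_interference_size
#include <thread>

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64;  // assumption: common cache-line size
#endif

struct Counters {
    // Without these alignas specifiers, both counters would likely land on
    // the same cache line and every write would ping-pong it between cores.
    alignas(kCacheLine) std::atomic<long> thread_A_var{0};
    alignas(kCacheLine) std::atomic<long> thread_B_var{0};
};

int main() {
    Counters c;
    auto work = [](std::atomic<long>& v) {
        for (long i = 0; i < 50'000'000; ++i)
            v.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread a(work, std::ref(c.thread_A_var));
    std::thread b(work, std::ref(c.thread_B_var));
    a.join();
    b.join();
    std::printf("%ld %ld\n", c.thread_A_var.load(), c.thread_B_var.load());
}
```

Remove the alignas specifiers and both counters will usually share a cache line; on most machines the same two threads then run noticeably slower, even though the program’s output is identical.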

Garbage Collection & Memory Allocation

Not much to say here, as these are pretty self-explanatory. But if your program randomly slows down or crashes, it might be wise to monitor when garbage collection and memory allocation occur. Even if these aren’t causing your issue, another data point always helps when debugging. Unless, of course, the issue is caused by your garbage-collection monitoring statistics being logged.
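In C++ there’s no garbage collector to watch, but you can at least see when heap allocations happen. A quick-and-dirty sketch of my own (not thread-safe, but fine for a debugging session) is to override the global operator new/delete and count what goes through:

```cpp
// Counting global heap allocations by overriding operator new/delete.
// Note: the counters below are not thread-safe; this is a debugging sketch.
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <new>

static std::size_t g_alloc_count = 0;
static std::size_t g_alloc_bytes = 0;

void* operator new(std::size_t size) {
    ++g_alloc_count;
    g_alloc_bytes += size;
    if (void* p = std::malloc(size)) return p;
    throw std::bad_alloc{};
}

void operator delete(void* p) noexcept { std::free(p); }
void operator delete(void* p, std::size_t) noexcept { std::free(p); }

int main() {
    auto* data = new int[1024];  // routed through the override above
    delete[] data;
    std::printf("allocations: %zu, bytes: %zu\n", g_alloc_count, g_alloc_bytes);
}
```

In a garbage-collected language, the equivalent move is to turn on the runtime’s GC logging and correlate pause times with the slowdowns you’re seeing.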

Noisy Next-Door Neighbors

Just like people in thin-walled apartments, computers can be adversely affected by disruptive neighbors. Cloud providers (or at least their marketing teams) try to abstract away the fact that your program runs on physical hardware, and unless you pay extra for a dedicated host, you will likely have neighbors on the same machine. So what happens if they’re noisy, i.e., greedily maxing out the compute and network bandwidth of the host? Unfortunately, there isn’t much you can do other than hope that the well-paid AWS/Google/Azure engineers have implemented scheduling that is fair to tenants sharing hardware. If noisy neighbors are truly a concern, you can use a dedicated host or dedicated instance (AWS) to ensure that no one else is throttling your program’s ability to execute; I think the more sensible approach is a well-designed implementation that monitors the metrics that matter to you and auto-scales when performance is subpar.

Hardware Failures, Chaos, and Monkeys

My intention in writing about these issues is to push the idea that software is fundamentally connected to hardware. As obvious as that is, it often gets forgotten in the software engineering process. Hardware can and will fail; it can and will corrupt data, slow down programs, and violate the assumptions that software engineers take for granted. I subscribe to Occam’s razor, so when something in my code doesn’t work, it’s usually my fault as the programmer. But it’s still necessary to ensure that hardware issues don’t derail otherwise solid implementations.

The coolest technical/cultural approach to remedying the disconnect between programmers and the hardware their programs run on was, in my opinion, developed by Netflix. Chaos Monkey is a tool created to randomly terminate instances in production environments. As insane as that sounds, it forced Netflix engineers to architect software that was resilient to failures. And that is how software should be designed: if you anticipate hardware failures, data corruption, and cosmic rays, then you design systems that can tolerate them.