My Five Worst Bugs: Lessons Learned In System Design
by Adam Tornhill, August 2018
My two decades in the software industry have been rewarding, interesting, and fun. However, there has also been plenty of frustration, along with horrible mistakes and many sleepless nights. This month I'm celebrating 21 years as a professional developer (meaning: I started to get paid for what I used to do for free), so let's take this opportunity to revisit some epic failures from the past.
Some of the failures were brought to the world by yours truly. In other cases, I didn't introduce the bug myself but got the wonderful opportunity to track it down in production, often with a fairly upset customer next to me. The reason I'd like to share these stories is that each of these bugs provided a learning experience that influenced the way I design software today. Let the horror show begin.
1. Discovering Edit and Continue
My first significant program was part of my engineering thesis back in 1997. A friend and I developed what we would now call an Integrated Development Environment (IDE) for Motorola's HC-11 emulator. In three months we developed a text editor with syntax highlighting, an assembler, integration with the HC-11 emulator, and some basic debug functionality. Given our programming skills at that time, I'm still surprised we pulled it off; it was more grit than skill. But we did, and both of us got our degrees.
Soon after I had completed my engineering degree, I was asked to continue the development of the IDE. Basically, the company that now owned the code we had written needed to extend the debugging capabilities. I had just started my first programming job, but figured I could take on another job in my spare time as well. I also agreed to a fixed fee of 5,000 Swedish kronor, which is roughly 600 dollars (you see, I wasn't that business-minded back then).
One of the requested features was the ability to manipulate the CPU registers during a debugging session. However, the way it was explained to me was that we needed a feature to modify the memory on the device. A slight difference.
So based on my flawed understanding of the desired feature, I went ahead and implemented a view that listed the machine code on the device with decompiled assembly next to it. From this view you could modify any part of the source code, press a button, and have your modified code transferred to a running system on the device. With this feature I could hit a breakpoint in the debugger, change the code on the fly, and continue running my modified version. Wonderful. Except that this was not the feature the end users wanted; They wanted to tweak CPU registers, which the program still couldn't do.
I spent roughly three weeks implementing the wrong feature, while the desired requirement would have been a relatively straightforward task I could have completed in a day or two. The project suddenly became hectic, the deadline loomed on the horizon, and I remember that it was quite stressful to complete the actual features I got paid for. More often than not I had to code deep into the night. This felt even worse as it was an avoidable situation.
Lessons Learned
In the software industry, we'd like to think that our requirements change all the time. However, I've found that that's rarely the case. What changes is our *understanding* of the requirements. These days I make sure to have regular conversations not only with the stakeholders and the direct customer, but also, as much as possible, with the end-users of the product I'm working on. A deep understanding of the domain that we write code for does wonders for system design: it helps us both design the right thing and simplify our solutions based on domain knowledge.
On a related note, and much later during my lost years in Java, C#, and C++ lands, I truly missed being able to re-write code in a running system. Not only does it simplify debugging; It also allows for a different, interactive programming experience that helps us maintain a flow and guides exploration. This eventually led me to languages like Lisp and Smalltalk where your code is one with the running system, but that's a story for a different article. Let's get back to my mistakes instead.
2. The Year 2000 Fix that wasn't
Those of you who were around in the late 1990s probably remember the panic around the year 2000 and how our computer systems were supposed to break down. Younger readers might want to compare the zeitgeist to today's worries about an impending AI doom where computer systems achieve consciousness and take over the world in a few clock cycles. The same apocalyptic feeling was in the air twenty years ago. Only back then, the problem was solvable.
The basic problem with the year 2k bug was that plenty of systems hadn't reserved enough space to represent dates with more than two digits for the year. Hence, after the turn of the millennium, a system would think it was year 1900 again and, for reasons that are still not that clear to me, this would mean the end of civilization as we knew it.
At this time I worked on a large control system and my coding skills were still, well, somewhat inadequate. In this system, most of the code didn't care about years or dates and wouldn't be affected by the year 2k bug. But there was this one external interface where a technician could connect a terminal and inspect log files and other diagnostics. The technician could search the logs by specifying a range of dates. And of course, the API expressed the year using only two digits, and my initial code review showed that a technician wouldn't be able to search the logs when the dates stretched across the millennium. Not good.
I jumped on the task to fix the bug. Since that company had systems all over the world, we couldn't just update the API and expand the date field; It would break hundreds of systems that integrated with ours. So I decided to be smart. That was my first mistake.
The date was expressed as a serialized tm struct from the C standard library. Of particular interest was the tm_year field in the tm struct, which I could see held the current year as a two-digit value (e.g. 98 for the year 1998). The de-serialization code simply unpacked it and prepended the string 19 to the year so that 98 got expressed as the string 1998 internally. That meant the rest of the system used four-digit years, which would work fine.
Looking at the code I decided to add a conditional check. If the year was less than 38 (yes, my inexperienced younger self really planted another time bomb -- horrible), we could conclude that we were on the other side of the millennium, so that 01 would mean the year 2001 internally. Higher numbers meant we were looking for things in the 1990s. Perfect. I checked in the code and started to test it. By stubbing out the API I could emulate a technician searching for logs over different date spans. It all worked fine and I wrapped up the project by spending weeks on documentation, updating manuals, and taking part in the meetings that were part of the rather complex release process of that time. After that we shipped the new release to our customers, which back then meant physical distribution on CD-ROMs.
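For concreteness, here's a rough reconstruction of that "smart" logic. The original code is long gone, so the function name and details are mine, but the essence was this:

#include <cstdio>
#include <string>

// A reconstruction of the "smart" fix, not the original code: unpack the
// year from the serialized tm struct (assumed to hold a two-digit value)
// and prepend the century.
std::string deserialize_year(int tm_year_field) {
    char digits[8];
    std::snprintf(digits, sizeof digits, "%02d", tm_year_field);
    // Less than 38: we must be on the other side of the millennium, so
    // prefix "20"; otherwise prefix "19".
    const char* century = (tm_year_field < 38) ? "20" : "19";
    return std::string(century) + digits;
}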
All this happened in 1998. A couple of months later I worked on another task and had to implement some date and time related functionality. Skimming through the man pages of the standard library functions I suddenly froze. There was this dreadful comment next to the tm_year field. The comment stated that tm_year holds the number of years since 1900. With my year 2k fix fresh in my head I realized that I had made a terrible mistake. I had assumed that the year 2000 would be expressed as 00 in the interface while, in reality, it would arrive as 100 and the system would interpret that as the year 19100, which isn't even remotely correct. The code was already shipped and installed around the world. I started to realize that my mistake would cost the company a lot of money as we had to do a new, unplanned, and expensive release. I thought I would lose my job for this glaring incompetence if someone found out. What should I do?
After some soul searching I decided to tell my manager and assume full responsibility. This is one of the few times I've been scared in my career, which had barely started at this point and could end right now. Thankfully, my manager turned out to be quite understanding. We talked about what it would take to correct the bug, and how we should verify it to ensure the system really did work this time. Eventually a new release went out and the system survived the new millennium. And so did my job.
Lessons Learned
Of all the bugs I've been guilty of, this one's the worst. It was early in my programming career, and one of the things that surprises me when I look back is that I got to do all of this on my own; No one reviewed my code or discussed the proposed solution with me. Worse, there was no independent test or verification. Today I make a point of having an independent verification of any significant feature. That simple rule helps to catch most of the flawed assumptions we might have held as we wrote the code.
At that time, however, my first takeaway was a strong note to self to RTFM. But I do think the problem goes deeper than that. In today's large-scale systems we've come to rely extensively on third-party software fueled by an ever-growing open-source ecosystem. We don't really have time to learn all the details of those APIs, and the APIs might of course change between releases and invalidate previous assumptions. Hence, I've learned to invest in contract tests for any API that I consume. That is, I write a set of tests that capture how we intend to use a particular API and run them as part of the test suite for my own applications. Should our assumptions break in a future version of the API, the contract tests will notify us.
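As a sketch of what such a contract test could look like, here is one that pins down the very assumption that burned me. The test name is mine, and a plain assert stands in for a real test framework:

#include <cassert>
#include <ctime>

// A minimal contract test: tm_year counts years since 1900, so the year
// 2000 must round-trip as 100, not as 0.
void contract_tm_year_counts_from_1900() {
    std::tm date = {};
    date.tm_year  = 100;   // the contract: 100 means the year 2000
    date.tm_mon   = 0;     // January
    date.tm_mday  = 1;
    date.tm_isdst = -1;    // let mktime figure out daylight saving time
    const std::time_t t = std::mktime(&date);
    assert(t != static_cast<std::time_t>(-1));

    const std::tm* round_trip = std::localtime(&t);
    assert(round_trip != nullptr);
    assert(round_trip->tm_year == 100);   // still "years since 1900"
}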
I also use this story as a warning on the perils of stubs and mocks; It doesn't matter if you manage to achieve 100% code coverage through your tests if those tests are based on stubbed out functionality whose behavior differs from the real system. I know, because that was exactly the case with my year 2k bug. After that, I learned to emphasize integration tests as bugs tend to breed in the inter-relationships between different sub-systems, particularly when those sub-systems are implemented by different teams and organizations.
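To make that trap concrete, here is a compact illustration with hypothetical names, since the original test code is long gone. The stub mirrors my own misunderstanding instead of the real library's behavior, so full coverage proves nothing:

// Hypothetical sketch: a stubbed date source that encodes the very
// assumption under test.
struct date_source {
    virtual ~date_source() = default;
    virtual int year_field() const = 0;   // what the real API puts in tm_year
};

struct stubbed_date_source : date_source {
    int year;
    explicit stubbed_date_source(int y) : year(y) {}
    int year_field() const override { return year; }
};

// In my tests, the year 2000 was stubbed as 0 -- but the real tm struct
// reports it as 100 (years since 1900), so century logic that passed
// every stubbed test still failed against the real system.
stubbed_date_source y2k_as_i_imagined_it(0);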
Finally, I learned that smart solutions are rarely that smart. Lots of defects come from edge cases that we fail to handle properly in our code. And smart solutions tend to introduce edge cases. Thus, when I'm tempted to do something smart, I make it a habit to double-check my proposed solution with my colleagues.
3. A Core too Far
After two glaring mistakes of my own, it feels good -- in fact, really good -- to take a step back and look at a case where I got to play the hero for once. This all happened when I worked for a company that developed training and simulation systems. One of the products was an Air Traffic Control (ATC) training system. It was a cool product with panoramic 3D views and several elaborate simulation features. I hadn't contributed much code myself, but got called in on a disaster.
One customer had planned a major training event and, in the days before that event, the ATC system suddenly started to crash multiple times a day. This was surprising as the system had been in use for quite some time without any major issues. And now we had 24 hours to identify the problem and patch the system; a whole staff of air traffic people were waiting for their training, and a cancellation of the event would be expensive. Really expensive.
Once I got to the site I interviewed the technical staff. The staff also managed to produce some crash dumps that let me locate the part of the code that failed. Looking at the code I couldn't find anything odd. So I continued my investigation by looking at the calling context of the code. Hmm, looks like the failing code executes in a separate, short-lived thread. Interesting. I got an idea.
Since I knew that the installed software hadn't been updated for months, I started to suspect that something had changed in its environment. And yes, just before the problems started, the customer had upgraded their servers to handle the increased load during the planned training event. Their old single core machines were now replaced by shiny, blazingly fast servers with 4 cores each. This meant that with the new servers, the code suddenly got to experience its first rush of true parallelism. It failed that test miserably.
You see, there had been a latent race condition (basically a shared data structure updated from separate threads), but the old servers didn't context-switch fast enough for that to matter. With more CPU cores, the offending thread suddenly got to live its own life on a dedicated core and the latent race condition turned into a failure that crashed the system.
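The general shape of the bug, sketched here with hypothetical names rather than the actual simulator code, was something like this:

#include <string>
#include <thread>
#include <vector>

// A short-lived worker thread updates a shared container while another
// thread reads it, with no synchronization -- the latent race condition.
std::vector<std::string> shared_events;

void record_event(const std::string& event) {
    // Fire-and-forget worker: writes to the shared vector without a lock.
    std::thread worker([event] {
        shared_events.push_back(event);   // data race with any concurrent reader
    });
    worker.detach();
}

void render_frame() {
    // The reading thread may iterate while the worker reallocates the
    // vector's storage -- undefined behavior. On a single core the
    // scheduling tends to hide it; on multi-core servers it surfaces
    // as corruption or a crash.
    for (const std::string& e : shared_events) {
        (void)e;   // ...use the event in the 3D view...
    }
}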
Once diagnosed, the bug was easy to fix; Instead of patching it with locks and mutexes -- the programming equivalent of Russian roulette -- I just removed the additional thread and let everything execute serially. Interestingly, as I profiled the resulting code, there wasn't even a decline in performance; The overhead of creating the short-lived thread dominated over any potential savings in parallel execution.
Lessons Learned
The environment is part of your system too. Don't just look at the code when debugging those unexplainable bugs, but consider the changes that might have happened in the hardware as well. This is particularly relevant today, with the increasing popularity of distributed systems (often called microservices) and their reliance on networks and virtualization.
This experience also led me to look at different programming languages and paradigms. If the future was supposed to be increasingly parallel with tens and maybe hundreds of CPU cores, then traditional multi-threading is unlikely to be the answer. I had just experienced first-hand how brittle threads are, and that it's virtually impossible to test correctness into a multi-threaded program; You see, that race condition had been around for years and had passed many hours of system testing undiscovered.
The next major system I worked on was built using Erlang, mostly as a response to what I perceived as the dead-end of multi-threading combined with mutable state. But that experience is a story for another article. Let's move ahead and look at the most expensive bug I ever tracked down.
4. How to Crash a Metro Station: a Guided Tour
I spent several years working on safety-critical systems for the railway industry. One of the first projects I joined was a major system upgrade for an Asian metropolis. Millions of people commuted in that city and lots of them relied upon the metro lines that our system controlled.
The system had been installed on site for a couple of months when the first major problem got reported. It turned out that the train management system -- the UI where you request routes for trains to run -- sometimes lost its connection to the safety systems that controlled signals and points. Without any connection, no trains could run and the signals along the tracks turned red. The failure required the system to be re-started, which meant another 5-10 minutes of downtime with the consequent delays on the train lines. That's no good.
We had struggled for weeks to track down the bug. We hadn't even been able to reproduce it until, one day, a colleague called from the lab and reported that the exact same thing had happened there. As I inspected the log files, everything looked correct. Not a single sign of trouble. However, when I compared the logs to the ones we had received from the real site in Asia, I noted what could be a pattern: prior to the connection troubles, the system had received a signal telling it to calibrate its system time. Now I had something specific to guide my debugging.
So far we had focused our efforts on the newly developed code responsible for maintaining the connection with the train management system. We had reviewed it, tested it, and couldn't find any mistakes. For all we could tell it was correct.
However, that connection relied upon several timers. One such timer controlled a keep-alive message sent periodically to the train management system. As I inspected the timer code, which was old low-level C code that hadn't been modified for years, I found an interesting problem. It turned out that the system timers were stored in a sorted linked list, and those timers had to be re-calculated if the system time changed. I soon noted that there was a race condition where a timer set in close proximity to a change in the system time could get lost. In this case, it meant that the timer triggering the keep-alive message would simply disappear: the message would never be sent, and the application code would wait forever for a timer signal that would never arrive. This issue was extra nasty and hard to track down as it was highly unlikely to happen. But when it did, it broke the system.
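A simplified sketch of that hazard -- hypothetical names, and in C++ rather than the original low-level C -- could look like this: the calibration recalculates the timer list via a copy, so a timer armed in that window is silently dropped.

#include <algorithm>
#include <cstdint>
#include <list>

// Hypothetical sketch of the lost-timer hazard, not the original code.
struct timer {
    std::int64_t deadline;   // absolute expiry time
    int id;
};

std::list<timer> active_timers;   // kept sorted by deadline

void add_timer(std::int64_t deadline, int id) {
    auto pos = std::find_if(active_timers.begin(), active_timers.end(),
                            [deadline](const timer& t) { return t.deadline > deadline; });
    active_timers.insert(pos, timer{deadline, id});
}

void on_time_calibration(std::int64_t adjustment) {
    // The deadlines are absolute, so they must be recalculated when the
    // system time is adjusted. BUG: the recalculation works on a copy. A
    // timer added concurrently -- say, the keep-alive timer armed just as
    // the calibration signal arrives -- exists only in the original list,
    // and the assignment below throws it away.
    std::list<timer> recalculated = active_timers;
    for (timer& t : recalculated) {
        t.deadline += adjustment;
    }
    active_timers = recalculated;   // lost update: concurrent additions vanish
}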
Lessons Learned
This is a different kind of bug than the ones we've discussed so far. In this case we had a disconnect -- both in time and in software architecture -- between the root cause of the failure and its manifestation at a later time in a different part of the system. This led us down a false path where we focused our debugging on the part of the codebase exhibiting the symptoms.
This story provides an argument for what is now hyped as testing in production. Testing in production doesn't mean we skip unit/integration/end-to-end tests. Rather, it means that our systems are so complex and depend on so many parameters that we cannot verify our code in isolation from the rest of the world. As such, testing in production is a complement. To pull it off, we need to have good diagnostics, monitoring, and resilience, which are all aspects that have to be built into the very core of the system architecture. Ultimately, it was the diagnostics that saved us by pointing to a pattern in the events received before the failure.
The correction was just a few lines of code -- amongst hundreds of thousands of others -- but the difference was between a product that delivered passengers to their destination and an expensive show stopper. Small things like these might have a huge impact by making a system's behavior unpredictable. I've often heard debates on whether programming is art, science, or engineering. After fixing this bug I was convinced that there's only one field capable of describing large-scale software systems: chaos theory.
5. Virtually Correct
The stories I've shared so far date 15-20 years back in time. Spending that long in the software industry is frustrating at times; We're an industry that seems destined to ignore the lessons of the past, and we keep repeating preventable mistakes over and over again. However, some things have definitely gotten better. Some notable improvements are shorter release cycles, iterative development, and a much needed focus on automating the repeatable and mechanical aspects of software verification.
In one of my last projects as a consultant before I left to found Empear, my own product company, I got to work with a team that had invested heavily in test automation. We had tests on all levels and those tests complemented each other nicely. We also had a set of long-term tests that exercised the codebase for hours to ensure we didn't have any hidden stability problems. Unfortunately, those tests frequently broke, and the team soon got used to arriving in the morning to a red build indicating that the nightly long-term tests had failed.
This sad state of the build had been a constant companion for months when, suddenly, some of our internal users started to complain about what looked like a real problem. The software we designed targeted a hand-held device, and now users complained that the device crashed if left with the power on overnight. One of our testers jumped on the task and started to measure the memory consumption of the device. He soon noted that the code seemed to have a memory leak. Another colleague followed up with a detailed inspection of the memory allocation profile, and narrowed the problem down to some C++ code looking like this:
class some_specific_event : public base_event {
public:
    some_specific_event(const event_source& s) {
        m = new some_type();   // allocate some memory...
    }

    virtual ~some_specific_event() {
        delete m;              // ...and free the allocated memory here
    }

    // ..other stuff..

private:
    some_type* m;
};
How could that code leak memory? Sure, C++ has manual memory management, but this class had a proper destructor that clearly contained the correct code for deleting the allocated memory when the object was destroyed.
Now, I hadn't coded any C++ for years at this point, but I have spent more time reading the C++ standard than I'd like to admit. So I immediately suspected a particularly nasty pitfall and asked to see the base class. It looked something like this:
class base_event {
public:
    ~base_event();
    // ..other declarations..
};
And there it was: the objects were deleted through a pointer of the base class type. In C++, this means that the destructor of the base class -- not the derived type -- specifies the deletion semantics; If the base class destructor isn't virtual, the resulting behavior is undefined, in C++ speak. In practice, it meant that the derived class's destructor wasn't invoked, which meant the delete m statement never got executed -- and there was our memory leak.
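The fix, shown here as a minimal sketch of the corrected base class, is a single keyword:

class base_event {
public:
    virtual ~base_event();   // virtual: deleting a derived object through a
                             // base_event* now invokes the derived destructor
    // ..other declarations..
};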
Fixing this bug was as easy as adding the virtual keyword to the base class's destructor and recompiling the application. Suddenly, the stability tests that had glowed red for weeks started to pass. The problem was solved, but it had also raised a number of important questions about both the technology and the people side of code, so let's look at the lessons learned.
Lessons Learned
First of all, there was actually an automated long-term test designed to catch potential memory leaks. Unfortunately, that failing test was erroneously blamed on an unstable build pipeline. The real issue, however, is late feedback; Long feedback loops break an organization. I would even claim that for automated tests, late feedback is as good as no feedback. The problem with long-running tests is that they open a time window in which multiple programmers get to commit code. When such a test then fails, it's unclear which changes caused the failure, as multiple independent changes have been made to the system since the previous night's build.
This problem was exacerbated since the organization employed a collective code ownership model for its 20+ developers. The principle also stretched to supporting code like automated integration tests and the build pipeline. So when a test failed, it was everyone's responsibility to track it down. In practice this meant that no one really took the initiative to investigate and correct the potential problems.
I've come across collective code ownership several times in my career, and we currently apply that principle to parts of the CodeScene development. I also know some great developers who claim collective code ownership works well for their organizations. However, what all those success stories -- including my own -- have in common is that they occur in small, tight-knit teams of 4-5 developers; I've yet to see an example where collective code ownership works well at scale. My scepticism isn't specific to software, but comes from my background in social psychology and group theory. As soon as we scale collective code ownership to a larger group, we open up for social biases like diffusion of responsibility and social loafing, which I described in Software Design X-Rays. These biases are hard to keep in check.
Instead, introducing code ownership can be both a strong motivator and a coordination mechanism, and it serves as an important component when scaling organizations. And, of course, I don't mean "ownership" in the sense that this is "my" code and no one else is allowed to touch it. Rather, I'm referring to a sense of personal responsibility for the quality and future direction of a particular module or sub-system. Collective ownership at scale doesn't have a strong track record in the real world, so why would code be different?
Finally, I just have to comment on the bug itself. I frequently claim that surprise is one of the most expensive things you can put into a software architecture. That claim extends to programming language design as well, and C++ is notorious for the amount of surprise it can deliver. Sure, I understand that C++ was designed in a time much different from today. C++ was designed to let you write object-oriented code with the same run-time performance as its C equivalent. The consequence is that C++ is optimized for a machine to execute as opposed to minimizing programmer errors.
This bug is a perfect example of this trade-off: back in the 1980s, making destructors virtual by default would have come at a cost of, say, one or two additional clock cycles and a few bytes of memory, as the compiled code would have to go through an extra level of indirection. On today's hardware, performance typically depends on other factors, and micro-optimizations like non-virtual destructors rarely matter. But as C++ programmers we're condemned to decisions made in a now obsolete historical context. Nothing's really free.
Uncovering a Common Theme
Revisiting these memories of failures past has been painful. Each one of them cost me lost sleep and, consequently, a caffeine-induced headache. My idea was to recall the stories as I remember them without sparing myself (yes, I'm still embarrassed over the design that led to the year 2k bug). The reason for this transparency is that I think there's a common theme that transcends any personal lessons I might have learned from repeated failures: the worst errors occurred through an interaction between the code and a surrounding system. Sometimes that surrounding system is an API, at other times it might be the operating system or hardware, and sometimes the external system is the other people we communicate with to capture the requirements. That is, the worst system failures occurred through interactions with something we don't necessarily control.
On a personal note, I don't think we will ever be able to eliminate these kinds of bugs, no matter how much we automate, how clean our code is, or how many tests we have. The destiny of our system is rarely in our own hands. Rather, we should plan for failure. And when things go wrong, we need to be able to respond fast. Most of these bugs were simple to fix, although tracking them down was time consuming and painful. Good system design -- on all levels -- is critical to the response times that really matter.
About Adam Tornhill
Adam Tornhill is a programmer who combines degrees in engineering and psychology. He's the founder of Empear, where he designs the CodeScene tool for software analysis. He's also the author of Software Design X-Rays: Fix Technical Debt with Behavioral Code Analysis, Your Code as a Crime Scene, Lisp for the Web, and Patterns in C, as well as a public speaker. Adam's other interests include modern history, music, and martial arts.