What Disasters Can Teach Us About Good System Design

The Chernobyl disaster didn’t happen simply because of the Soviet Union’s top-down regime. The Challenger disaster didn’t happen simply because of the O-ring seal failure. The second wave of the Coronavirus didn’t take over India simply due to a lack of adequate healthcare in the country. A marketing strategy doesn’t fail simply because you didn’t include animal photos. These so-called “root causes” are just a few among many contributing factors, and they are definitely not the main ones.


On 26 April 1986, a group of nuclear engineers at a nuclear power plant wanted to simulate an electrical power outage in order to develop a safety procedure. They wanted to test whether the reactor’s cooling water circulation could be maintained until the back-up electrical generators could provide power. The goal was to make sure the reactor wouldn’t have to be completely shut down during a power outage.

Three such tests had been conducted in the previous four years, but they had failed to provide a solution. This fourth attempt didn’t happen as scheduled. An unplanned 10-hour delay meant that a completely unprepared operating shift ended up handling the job.

The reactor power had to be decreased as part of the test, which meant disabling the safety system. But the power unexpectedly dropped to near-zero during the process. The operators managed to partially restore the power, but this put the reactor in an unstable condition.

Since the safety system was off, the risk wasn’t evident. Moreover, the operating instructions didn’t cover this scenario, so the operators proceeded with the planned test.

Upon completion of the test, when they triggered the reactor to shut down, a combination of unstable conditions and reactor design flaws caused an uncontrolled nuclear chain reaction. The rest, as they say, was the Chernobyl nuclear disaster.

On 28 March 1979, seven years earlier and 7,600 km away, a group of nuclear power plant technicians accidentally got a bubble stuck in a sensor during a routine cleaning procedure. This stopped the coolant pumps from circulating water. There were auxiliary pumps to handle this situation, but they had been shut off for maintenance (in violation of the plant’s operating procedure).

Even though the reactor detected a problem and went into emergency shutdown mode, there was no water circulation and the heat in the system had nowhere to go. The pressure relief valve—whose job was to let out the excess pressure—suffered a mechanical failure and got stuck open, thereby permitting coolant water to escape from the system.

This wasn’t detected by the technicians because the control room indicator light showed whether the valve’s closing mechanism was powered on or off, not whether the valve itself was open or closed. The open valve slowly leaked away the water needed to cool the reactor, and there was no extra water coming in.
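
To see why this was a design flaw rather than simple bad luck, here is a minimal sketch of the difference between reporting what a valve was told to do and what it actually did. The names and the model are hypothetical, not the plant’s actual instrumentation logic.

```python
class ReliefValve:
    """Toy model of a relief valve, for illustration only."""

    def __init__(self):
        self.commanded_closed = False  # what the control system asked for
        self.actually_closed = False   # the physical position of the valve

    def command_close(self):
        self.commanded_closed = True
        # Failure mode: the close command is sent, but the valve sticks open.
        # We model that by deliberately not updating the physical position.


valve = ReliefValve()
valve.command_close()

# The indicator light was wired to the command signal...
print("Indicator light:", "CLOSED" if valve.commanded_closed else "OPEN")  # CLOSED
# ...while coolant kept escaping through the physically open valve.
print("Actual valve:   ", "CLOSED" if valve.actually_closed else "OPEN")   # OPEN
```

A safer design reports the sensed position of the valve itself, not the state of the mechanism that drives it.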

There was, however, one coolant temperature sensor spiking in the control room. But the crew’s training for abnormal incidents directed them to refer to other sensors, not this one. This chain of events caused the reactor to overheat for 11 hours, leading to a partial meltdown of the reactor and a subsequent radiation leak at Three Mile Island in the US.

When we do postmortems of both big and small failures, we suffer heavily from hindsight bias. Informed by Chaos Theory, we assume that avoiding the “root cause” of the disaster—the technicians being unprepared, or the bubble getting stuck in the sensor—could have avoided the meltdown. What if they hadn’t done that? What if they had been more careful? This assumption is wrong.

In theory, yes, a butterfly flapping its wings in Brazil can set off a tornado in Texas, but that doesn’t mean it will. There aren’t as many tornadoes as there are butterfly flaps in real life—this bit is clear. By that logic, any random event can set off a series of events that leads to some other event. But this knowledge doesn’t help us in any practical way.

In real life, the human body, the environment, a society, an organisation, a business strategy, or a space shuttle are all examples of complex systems. Errors are normal in a complex system. They happen every day. Bubbles get stuck, coolant systems fail, technicians ignore procedures. But there’s enough redundancy and margin for error in place to make sure such errors don’t collapse the whole system.

Catastrophic failures happen only when enough of these small errors line up, compounding through a chain reaction. Catastrophic failures—whenever they occur—happen regardless of any single “root cause”.
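
A rough way to see the “lining up” is to treat each safeguard as a layer of defence. The numbers below are purely illustrative and assume the layers fail independently, which real systems rarely guarantee; the point is only that a disaster needs several holes to align at once.

```python
# Illustrative arithmetic: four independent safeguards, each of which
# misses a given error 5% of the time.
p_layer_miss = 0.05
layers = 4

# A catastrophic failure requires every layer to miss at the same time.
p_catastrophe = p_layer_miss ** layers

print(f"Chance one layer misses the error: {p_layer_miss:.0%}")    # 5%
print(f"Chance all {layers} layers miss it: {p_catastrophe:.6%}")  # 0.000625%
```

Small errors are routine; it is the rare alignment of several misses that produces the disaster.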

By that logic, the Challenger disaster didn’t happen simply because of the O-ring seal failure. The second wave of the Coronavirus didn’t take over India simply due to a lack of adequate healthcare in the country. A marketing strategy doesn’t fail simply because you didn’t include animal photos.

These so-called “root causes” are just a few among many contributing factors, and they are definitely not the main ones. If you look closely, there are 500 other things that were erroneous but didn’t contribute to the disaster. In a complex system—where there are multiple moving parts—there is no single “root cause”. In other words, avoiding any one of them wouldn’t have been enough to avoid the disaster. The disaster might have been delayed, but not prevented.

Catastrophic failures are usually due to a combination of bad system design, human error, human-computer interaction flaws, poor training, miscommunication, bad process, and countless other major and minor contributing factors. There’s seldom one major factor which has a cascading effect.

The same is true for success as well. A company doesn’t become successful just because it has a great CEO. You don’t have a healthy relationship just because you are a nice person. You don’t win a competition just because you worked hard. There are multiple other factors—some of which are outside your control.

Therefore, the goal isn’t to make completely error-free systems. It’s impossible to make them completely free of error anyway. It’s hubris to even attempt that—be it a personal fitness regime, the product development process in a startup, or the service design of an airport. What you should strive for is to make them error-proof: systems that work regardless of errors.

A good system helps you stay healthy regardless of occasional drinking and partying. A good system helps you get more done regardless of procrastination. A good system helps you ship the MVP regardless of developmental hiccups.
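
In software, “works regardless of errors” usually looks like redundancy and graceful degradation rather than an attempt to make every call perfect. The sketch below is a generic pattern with hypothetical data sources, not a specific framework’s API.

```python
import logging

def fetch_from_primary() -> dict:
    # Hypothetical primary source; assume it can fail at any time.
    raise ConnectionError("primary store unreachable")

def fetch_from_replica() -> dict:
    # Hypothetical read replica used as a fallback.
    return {"status": "ok", "source": "replica"}

CACHED_DEFAULT = {"status": "stale", "source": "cache"}

def fetch_settings() -> dict:
    """Return settings from the best source still working.

    The goal isn't zero errors; it's that no single error
    takes the whole feature down.
    """
    for source in (fetch_from_primary, fetch_from_replica):
        try:
            return source()
        except Exception:
            logging.warning("%s failed, trying the next source", source.__name__)
    return CACHED_DEFAULT  # degrade gracefully instead of crashing

print(fetch_settings())  # {'status': 'ok', 'source': 'replica'}
```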

Every day, between one and five cells in your body turn cancerous. But your immune system is efficient enough to capture and kill them. Think of that. A couple of dozen times a week, well over a thousand times a year, you get the most dreaded disease of our age, and each time your body self-corrects.

The body works regardless of errors. So should the systems we design.