Software Engineering 🔩: Fault Tolerance

3 min read · Jan 9, 2024

As a software engineer, it is important to understand fault tolerance and build it into your software systems.

Fault tolerance is a way to ensure that unexpected terminal faults/interruptions encountered by your software do not irreparably disrupt your system’s proper functioning.

There are several layers of your software system to investigate for potential fault vectors so that the appropriate recovery mechanism can be initiated.

For now, let us set aside the actual business logic driving those software systems. What fault vectors can arise from the system itself? Let’s create a scenario:

Assume we have created a web server. We want this web server to handle long-running jobs (1h+) that must run to completion. These jobs are triggered by a GET request to the server.

The fault vector here is “handle long-running jobs (1h+)”.

When we fire this request, it triggers the job. Now what if the server shuts down while the job is still running?

Two scenarios can cause this:
1. You deployed a new version through your CI pipeline and the server was restarted as a result.
2. An unexpected error occurred in the system, such as an OutOfMemory error or even a power outage, and your server shut down (I know a power outage is not likely in the cloud).

These are the faults that our fault vector led us to. Because the original requirement is for the jobs to run for a long period, you have to factor interruptions into the equation.

Since the requirement is for the job to run to completion, the way to tolerate these faults will be:
1. Restart the server as soon as possible whenever a fatal error happens. These days most cloud providers offer automatic restarts out of the box, so you are in good hands.

2. Restart the job as soon as the server gets restarted.

“But the server just got restarted? How do I get the job and then restart it?”

You need to store that state. My approach is to keep a data store that records each job before it runs: the metadata of the job and any other parameters the job needs to run.
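Here is a minimal sketch of recording a job before it runs, assuming a SQLite table as the data store. The table layout and the names `open_job_store`, `record_job`, `mark_done` and `unfinished_jobs` are hypothetical and only for illustration; any durable store would work.

```python
import json
import sqlite3
import uuid


def open_job_store(path=":memory:"):
    """Open (or create) the jobs table.

    Pass a file path in practice so the records survive a restart;
    ":memory:" is only convenient for trying the sketch out.
    """
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id TEXT PRIMARY KEY, kind TEXT NOT NULL, "
        "params TEXT NOT NULL, status TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def record_job(conn, kind, params):
    """Persist the job's metadata BEFORE any work starts and return its id."""
    job_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO jobs (id, kind, params, status) VALUES (?, ?, ?, 'pending')",
        (job_id, kind, json.dumps(params)),
    )
    conn.commit()
    return job_id


def mark_done(conn, job_id):
    """Flag a finished job so it is not picked up again later."""
    conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    conn.commit()


def unfinished_jobs(conn):
    """Return (id, kind, params) for every job not marked done, oldest first."""
    rows = conn.execute(
        "SELECT id, kind, params FROM jobs WHERE status != 'done' ORDER BY rowid"
    ).fetchall()
    return [(job_id, kind, json.loads(params)) for job_id, kind, params in rows]
```

The request handler would call `record_job` first and only then start the work, so the record already exists if the server dies at any point during the run.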

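Next, you create a startup routine for your server. This routine should check the data store for metadata of any job that has not finished and trigger the corresponding job using that metadata. Once a job runs to completion, remove its record or mark it as done so it is not triggered again on the next restart.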
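A startup routine along these lines could look like the sketch below. It assumes a SQLite `jobs` table with `id`, `kind`, `params` (JSON) and `status` columns, and a hypothetical `handlers` mapping from job kind to the function that runs it; none of these names come from a specific framework.

```python
import json
import sqlite3

# Hypothetical table layout for the job records.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS jobs ("
    "id TEXT PRIMARY KEY, kind TEXT NOT NULL, "
    "params TEXT NOT NULL, status TEXT NOT NULL)"
)


def resume_unfinished_jobs(conn, handlers):
    """Startup routine: re-run every job whose record is not marked done.

    handlers maps a job kind to the function that runs that kind of job.
    Returns the ids of the jobs that were re-run.
    """
    conn.execute(SCHEMA)
    rows = conn.execute(
        "SELECT id, kind, params FROM jobs WHERE status != 'done' ORDER BY rowid"
    ).fetchall()
    resumed = []
    for job_id, kind, params in rows:
        # Rebuild the call from the stored metadata and run the job again.
        handlers[kind](**json.loads(params))
        # Mark the job done only after it finishes, so a crash during
        # the run leaves the record in place for the next startup.
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        conn.commit()
        resumed.append(job_id)
    return resumed
```

The sketch runs the jobs inline to keep it short. In a real server you would probably hand them to a background worker so that startup is not blocked for an hour.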

And that will help tolerate the fault exposed by that vector.

Now what if the vector is adjusted to “long-running jobs (1h+) with expensive/non-repeatable steps”? This is a different constraint. Just restarting the job isn’t enough; you need resumability built into how your jobs are run.

To achieve this, store the result of each non-repeatable step in a data store as soon as the step completes. The startup routine should then provide this data to the job when triggering it, so the job can skip the steps that already finished.
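One way to sketch this resumability is a small step runner that checkpoints each step's result. This version assumes a JSON file as the data store and JSON-serializable step results; `run_resumable` and the step names are hypothetical.

```python
import json
import os


def run_resumable(steps, checkpoint_path):
    """Run (name, func) steps in order, skipping steps already checkpointed.

    Each func receives the dict of results so far and must return a
    JSON-serializable value.
    """
    results = {}
    if os.path.exists(checkpoint_path):
        # A previous run was interrupted: load what it already produced.
        with open(checkpoint_path) as f:
            results = json.load(f)

    for name, func in steps:
        if name in results:
            continue  # Finished before the interruption; reuse the stored result.
        results[name] = func(results)
        # Write to a temp file and rename it, so a crash mid-write
        # cannot leave a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(results, f)
        os.replace(tmp_path, checkpoint_path)
    return results
```

For example, with steps `[("charge", charge_card), ("email", send_receipt)]`, a crash during `email` means the next run loads the stored `charge` result and goes straight to `email`. Note that a crash after a step's side effect but before its checkpoint is written would still repeat that step, so truly non-repeatable steps may also need to be idempotent on the receiving side.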

That will help tolerate the fault exposed by the new fault vector and make your app more resilient to failure.

This kind of recovery-oriented thinking pattern is what will help build a fault-tolerant software system.

Big tech companies that care about reachability and uptime dedicate a lot of effort to building fault-tolerant systems because, overall, it delivers a better experience to their users.

Clap for this post if you liked it! 🚀

Written by Benjamin Chibuzor-Orie
