Save Friday Night Using Chaos Engineering
What if you are getting ready to leave early on a Friday evening and a notification on your phone tells you that one of your production services is down and clients can no longer use the system? You guessed it: you are staying late (at the office or at home) to fix it. How quickly you fix it depends on many factors: your understanding of the system, its architecture, the infrastructure, the networking, and so on.
It’s not just you who suffers. If your application is mission-critical, every single minute counts and the outage can have a huge business impact. To avoid such situations, you can adopt Chaos Engineering in your project.
What is Chaos Engineering?
Chaos Engineering is the practice of deliberately injecting faults into your system to check its resiliency. It helps you understand your system better and answers questions like:
- What if a third-party API starts responding slowly?
- What if your database becomes unreachable?
- What if a whole region goes down?
- What if a downstream Lambda times out?
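As a tiny illustration, the first two what-ifs can be simulated directly in application code with a fault-injection wrapper. This is a minimal sketch, not any particular tool's API; the failure rate, latency bound, and function names are illustrative assumptions:

```python
import random
import time

def inject_chaos(failure_rate=0.1, max_latency_s=0.2):
    """Decorator that randomly delays or fails a call, simulating a slow
    third-party API or an unreachable dependency (rates are illustrative)."""
    def wrap(fn):
        def inner(*args, **kwargs):
            if random.random() < failure_rate:
                # Simulate "database unreachable" / "region down"
                raise ConnectionError("chaos: dependency unreachable")
            # Simulate "third-party API responding slowly"
            time.sleep(random.uniform(0, max_latency_s))
            return fn(*args, **kwargs)
        return inner
    return wrap

@inject_chaos(failure_rate=0.3, max_latency_s=0.05)
def fetch_exchange_rate():
    # A hypothetical downstream call wrapped in chaos.
    return 83.2
```

Calling `fetch_exchange_rate()` now sometimes fails or lags, which forces you to answer the what-if questions with code (retries, timeouts, fallbacks) rather than with hope.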
Once you complete the development of a system, many parts (business logic, databases, containers, etc.) must work in harmony to enable the business flow; if any part fails, the chances of an outage increase. You may have read about the Facebook outage of Oct 4, 2021: it was down for about six hours due to configuration changes on its backbone routers. Many businesses rely on its APIs, so just think of the business impact.
Your application might not be as big as Facebook, Amazon, Flipkart, or Uber, but given the distributed nature of applications on the cloud, your system is bound to fail.
Everything breaks, we plan on it.
Chaos Engineering is no longer just a concept; every major company has started implementing it in their products and projects. Test your system before it tests you and you end up in the news.
Quality Testing vs Chaos Engineering
But you already have a QA team that ensures no bugs slip into production. That’s true, but the QA team generally checks the business use cases and performs some load testing. They are usually not aware of the number of services running on the cloud, the storage the system uses, API Gateway concepts, Kubernetes, and so on. The QA team performs usability checks, whereas Chaos Engineering performs reliability checks. A separate SRE (Site Reliability Engineering) team makes sure the system doesn’t crash, and it is this team that practises Chaos Engineering. In a small project, developers can play the role of the SRE team.
Some common myths about Chaos Engineering:
- Your system must be as big as Netflix or Uber before Chaos Engineering is worthwhile
- You should run your first chaos experiment in production
- You should start by breaking everything down
It is true that to gain real confidence in your system you eventually need to inject faults (chaos) into production. Users use the production system, not pre-prod (staging); outages happen in production, not in pre-prod; the business suffers losses due to faults in production, not in pre-prod. There is little point in only ever testing a system that real users never touch.
But still, Chaos Engineering is not:
- Breaking down the system for its own sake
- Creating bad surprises
- A way to blame others (“the third-party service didn’t work, so my application couldn’t respond”)

Chaos Engineering is:
- Defining the steady state (the normal behaviour of the system; but be prepared for peaks like Flipkart’s Big Billion Days or Amazon’s Great Indian Festival)
- A vaccine for your system (the way we inject a weakened virus into our body to build immunity)
- Achieving confidence and resilience
The best way is to start small and grow strong. Let’s say your application has 10 services. Bring down one service and check the impact; understand the fault, then fix it. Then bring down two services, and so on. Because you are playing with production, some of your users will be impacted, and depending on the type of application there is a chance of monetary loss or damage to brand reputation.
To illustrate this, consider NAB (National Australia Bank). After migrating the public-facing areas of nab.com.au to AWS, NAB deployed Chaos Monkey, the tool open-sourced by Netflix, on that production environment. The tool runs 24/7, 365 days a year, killing servers at random; it could be killing one as you read this article. This builds confidence in a system through which tens of billions of dollars flow every day. When you have such a resilient system, your engineers can sleep well at night.
When you fail the system this often, you can fix it quickly when a real outage takes place. There is a famous quote from Toyota:
Because we solve real problems all the time in our daily work, when a crisis hits, it’s just a matter of degree.
Chaos Experiment Guidelines
Understand the System End-to-End
You need sound knowledge of the entire system, including its past outages.
Gain Organizational Agreement
The idea of bringing even a portion of the production environment down may not sit well with the product owners or the leadership team. So you need to convince them first of the long-term benefits of Chaos Engineering, for example by showing them how your UAT or staging environment reacts to injected faults.
Create a Hypothesis
Ask yourself what-if questions. For example:
- What if I inject an average latency of 200 ms into every Lambda in my architecture?
- What if my MongoDB collection becomes unreachable?
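A hypothesis is most useful when it is measurable. As a minimal sketch (the percentile, budget, and sample numbers are assumptions for illustration), the first question above could be phrased as “even with 200 ms injected into every call, user-facing p95 latency stays inside the budget”:

```python
def p95_ms(latencies):
    """95th-percentile latency of a sample (nearest-rank method)."""
    s = sorted(latencies)
    return s[int(0.95 * (len(s) - 1))]

def hypothesis_holds(latencies, budget_ms=500):
    """The hypothesis under test: p95 latency stays within the agreed
    budget even after chaos is injected (budget is an assumption)."""
    return p95_ms(latencies) <= budget_ms

baseline = [120, 130, 150, 180, 200] * 20      # illustrative samples
with_chaos = [l + 200 for l in baseline]       # inject 200 ms everywhere
print(hypothesis_holds(baseline), hypothesis_holds(with_chaos))  # True True
```

If `hypothesis_holds(with_chaos)` came back `False`, the experiment would have taught you something concrete: the 200 ms injection pushes you past your budget, so you need retries with tighter timeouts, caching, or a larger budget.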
While the chaos experiment is running, you need observability tools (e.g. dashboards) to watch how your system recovers from the failures. Without observability tools, you cannot unearth any knowledge from the experiment.
Prepare Your Experiment
As said before, start with a small blast radius: inject failure into a portion you can easily control, and don’t attack the entire system. While running the experiment, you need a stop button somewhere: know in advance when to halt the experiment, and have a rollback plan in place. The aim of the experiment is not to surprise anyone, so communicate the plan to developers, SREs, and, where possible, customers.
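The “stop button” idea can be sketched as a small experiment loop. This is a hedged illustration, not a real tool’s interface; the callback names, error budget, and round count are assumptions:

```python
def run_experiment(inject, observe_error_rate, rollback,
                   error_budget=0.02, rounds=3):
    """Minimal chaos-experiment loop with a built-in stop button:
    after each injection, check the observed error rate and roll back
    immediately if it exceeds the agreed budget (all names/thresholds
    here are illustrative assumptions)."""
    for n in range(1, rounds + 1):
        inject(n)                      # widen the blast radius each round
        if observe_error_rate() > error_budget:
            rollback()                 # the stop button: abort and restore
            return f"aborted at round {n}"
    rollback()                         # always leave the system clean
    return "completed"
```

In a real setup, `inject` would call your chaos tool, `observe_error_rate` would query your dashboards or metrics store, and `rollback` would undo the fault; the point is that the abort condition and rollback are decided before the experiment starts, not during the panic.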
Run Chaos Experiment
Once you have everything in place, you can start.
Measure & Learn
By this point you will have learnt how exactly your system behaves when such a failure happens. You will also have learnt how you respond to issues as a team:
- How to get alerted
- How to see the impact
- How to respond
- How/when to notify users
Halt & Fix
Your system is now broken, and you have the opportunity to learn and fix the errors before a real outage. In this process, don’t blame others; take ownership yourself.
There are plenty of options for injecting chaos into your system, but if your application is hosted on Azure or AWS, check out Azure Chaos Studio or AWS Fault Injection Simulator. There is also a chaos-as-a-service option from Gremlin; I tried it myself and had a good experience, so I recommend checking it out. One of Gremlin’s co-founders used to work at Netflix, where his job was to ensure Netflix’s systems were always up.
As the above-mentioned solutions are paid, you can also try open-source tools. After reading a good number of case studies, I think you can consider LitmusChaos and Chaos Mesh. If your systems are based on Kubernetes, which I presume they will be, these tools fit well in that ecosystem.
What if I am using shared infrastructure?
Well, this is a common problem in Chaos Engineering: it is not at all easy to run such tests in a shared environment. Let’s say you have one server running 5 different applications. If you run a chaos experiment against one of them, the other applications might be starved of resources. This is commonly known as the noisy-neighbour problem.
How to run a chaos experiment on a Database?
You should not touch real data, but you can bring down the database or fill up the connection pool so that no further requests are accepted.
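Connection-pool exhaustion is easy to picture with a toy pool. This sketch uses a semaphore as a stand-in for a real driver’s pool; the pool size and timeout are illustrative assumptions:

```python
import threading

class Pool:
    """Toy connection pool, just to illustrate the experiment
    (size and timeout values are illustrative)."""
    def __init__(self, size):
        self._sem = threading.BoundedSemaphore(size)

    def acquire(self, timeout=0.05):
        # A real driver would hand back a connection; we just track slots.
        if not self._sem.acquire(timeout=timeout):
            raise TimeoutError("pool exhausted")

    def release(self):
        self._sem.release()

pool = Pool(size=3)
for _ in range(3):        # chaos step: drain every connection and hold it
    pool.acquire()
# At this point a normal request fails fast instead of serving the user,
# which is exactly the failure mode the experiment wants to surface.
```

The experiment then asks: does the application queue gracefully, shed load, or fall over? You learn this without writing a single byte to the database.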
Is it just for micro-service based architectures?
No. You can apply chaos engineering in almost any area: IoT (break one device and see if there is a cascading effect on other devices), Artificial Intelligence (a model’s reliability is typically lower than a human’s, so stress-testing helps you improve it), Human Augmentation (think of a human wearing an exoskeleton to lift a 300 kg weight; what if the exoskeleton stops functioning?), or Cybersecurity (what if your IAM is compromised?).
Can I inject chaos in application logic?
Yes. Let’s say you fetch user details from a Redis cache. What if Redis is down? Your code should be able to fall back to the main database and fetch the details from there.
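The fallback path can be sketched in a few lines. The cache and database clients below are hypothetical stand-ins with dict-like `get` methods, not a real Redis client:

```python
def get_user(user_id, cache, db):
    """Read-through with fallback: try the cache first, and on any cache
    failure fall back to the primary database (cache/db are hypothetical
    stand-ins, each exposing a dict-like .get method)."""
    try:
        user = cache.get(user_id)
        if user is not None:
            return user
    except ConnectionError:
        pass  # cache is down: a chaos experiment, or a real outage
    return db.get(user_id)

class DownCache:
    """Simulated Redis outage for the experiment."""
    def get(self, key):
        raise ConnectionError("redis unreachable")

db = {"u1": {"name": "Asha"}}
print(get_user("u1", DownCache(), db))  # served from the database
```

Injecting chaos at the application layer means swapping in something like `DownCache` (or blocking the Redis port) and verifying that users still get answers, just a little slower.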
What are the main causes of an outage?
Honestly, it can be anything, but if we narrow down the possibilities, you can consider the four Fs below.
- Fires (component failures like VM, switch, router, firewall, any library)
- Floods (physical servers down, regional failures, entire data centre)
- Fools (engineers who don’t follow the rules; they go to production and make ad-hoc changes)
- Fat-fingers (engineers who follow the rules but make a typo; think of adding extra zeros)
What are the common outcomes of Chaos Engineering?
Increased availability, a lower mean time to resolution (MTTR), a lower mean time to detection (MTTD), fewer bugs shipped to production, and fewer outages. Teams that frequently run chaos experiments are more likely to achieve more than 99.9% availability.
Can I have automated Chaos Engineering?
Yes, you can integrate chaos experiments into your CI/CD workflow so they run automatically with every release.
What matters in the software industry is how fast you deliver features and how long your application is available to users. Downtime is costly, and you need to meet your SLAs. Get started with chaos engineering today, because it will test both the assumptions you are aware of and the ones you have taken for granted. It is an opportunity to learn the system’s behaviour and to train and prepare the team for whatever conditions might arise. So treat reliability as a project, not just a practice.
The world is changing very fast. Big will not beat small anymore. It will be the fast beating the slow.
— Rupert Murdoch