(This is a crosspost from an article I wrote @ https://medium.com/jumbo-tech-campus)
Ideally our frontend solution is a unicorn that eats rainbows and poops butterflies. In reality it’s often a piece of software reliant on other pieces of software that all have the potential to break for whatever reason, relaying that problem to our customers.
We tend to be feature driven and primarily look at the happy flow of our software. This has everything to do with the Pareto Principle:
It was … discovered that in general the 80% of a certain piece of software can be written in 20% of the total allocated time. Conversely, the hardest 20% of the code takes 80% of the time.
And at the same time:
… Microsoft noted that by fixing the top 20% of the most-reported bugs, 80% of the related errors and crashes in a given system would be eliminated.
Now, that’s not to say the 20% in the first statement fully overlaps with the 20% in the second, but there is at least a correlation. The 20% that ‘needs to be done right’ takes 80% of our time, and is prone to shortcuts.
One of the most common shortcuts is skipping the design for failure. We tend to forget about what should happen when our solution doesn’t work as intended. That’s a problem because a full outage, for example, costs actual money (lost conversions) and can seriously damage your public image and retention rate.
Failure can manifest itself in many ways, and for many reasons. That is why you need to be able to pull the plug at a macro level. In the case of a webshop, you might want to implement three levels: green, orange and red.
Green: no criticality level is in effect. All features function the way they should (each implementing a circuit breaker, as described in the next chapter).
Orange: all critical functionalities (like finding and showing items, adding items to a basket and placing orders) are operational as expected. For all other, non-mission-critical functionalities we actively trigger their circuit breakers (again, more on that in the next chapter). It’s important to inform your customers that you are running your shop with reduced functionality.
Once you go orange, you’ll be able to process orders, make money and have the ‘best’ experience given the circumstances, whilst reducing strain on the backend so you can fix what’s broken or run major updates.
When you go ‘red’, you basically disable ALL backend traffic. This essentially means you’ll have to serve a static website without interactive functionalities.
Red is the first and easiest criticality mode to implement. To set it up:
- pick a cloud provider other than the one you host your operations on (you’ll want this when, for example, you accidentally ran a Terraform change that deleted vital pieces of your operation)
- schedule a daily recursive wget crawl of your homepage that writes its output to a storage bucket
- make sure the wget sends a custom request header when it crawls, so the site can recognise the crawl
- do a first pass over your solution and make interactive elements conditional on that header, so they are hidden in the crawled copy (no add-to-basket CTAs, no basket at all, and so on)
- show a banner at the top of your page to inform the user whenever the header is provided
- adjust your load balancer to route all external traffic to the static bucket, depending on the chosen criticality mode. Each response should carry a no-cache header, which allows you to switch back quickly when needed.
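The crawl and routing steps above can be sketched as follows. This is a minimal illustration, not a finished setup: the header `X-Static-Snapshot`, the upstream names `static-bucket` and `live-site`, and the functions `build_crawl_command` and `route` are all made up for this example. Uploading the crawl output to a bucket depends on your provider and is left out.

```python
from enum import Enum


class Mode(Enum):
    GREEN = "green"
    ORANGE = "orange"
    RED = "red"


# Hypothetical header; the site would check for it to hide interactive
# elements and show the "reduced functionality" banner.
SNAPSHOT_HEADER = "X-Static-Snapshot: 1"


def build_crawl_command(homepage: str, output_dir: str) -> list[str]:
    """Return the argv for a recursive wget crawl that identifies itself."""
    return [
        "wget",
        "--recursive",
        "--page-requisites",  # also fetch CSS, JS and images
        "--convert-links",    # rewrite links so the copy works standalone
        "--no-verbose",
        f"--header={SNAPSHOT_HEADER}",
        f"--directory-prefix={output_dir}",
        homepage,
    ]


def route(mode: Mode) -> tuple[str, dict[str, str]]:
    """Pick an upstream (names are assumptions) and extra response headers."""
    if mode is Mode.RED:
        # no-cache keeps clients from holding on to the static copy
        # after the site has gone back to orange or green.
        return "static-bucket", {"Cache-Control": "no-cache"}
    return "live-site", {}
```

A daily scheduler (cron or similar) could pass the result of `build_crawl_command("https://shop.example/", "/tmp/snapshot")` to `subprocess.run` and then sync the directory to the bucket.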
Going red lets customers still see your products and postpone their order. Since your site progressively enhances once you go orange or even green, the conversion loss will be minimised and the perceived quality of the digital solution will stay relatively high.
The Circuit Breaker Design Pattern comes down to this:
If you know a backend is under pressure, making more connections to it or waiting for it makes no sense.
So instead of making more and more connections that cannot be resolved, bringing down your entire stack (because of timeouts and dog-piling), you tell the user up front that the functionality is not available as expected.
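To make the pattern concrete, here is a minimal sketch of a circuit breaker. It is one possible shape, not a reference implementation: the class name, the failure threshold, the cool-down period and the injectable clock are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Stops calling a backend after repeated failures, for a cool-down period."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before retrying
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self._opened_at is None:
            return False
        # Once the cool-down has elapsed, let a trial call through.
        return self._clock() - self._opened_at < self.reset_after

    def call(self, action: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            # Backend is known to be struggling: do not even try.
            return fallback()
        try:
            result = action()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            return fallback()
        # A successful call closes the breaker again.
        self._failures = 0
        self._opened_at = None
        return result
```

The `fallback` is where the agreed-upon degraded behaviour lives, for example an empty list of recommendations plus a notice to the user.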
This comes with some implications:
As a developer, you should make your stakeholders aware of the components your solution is relying upon and how they can fail.
As a business, you should work out, together with your developers, how to ensure the best behaviour when that situation occurs.
This circuit breaker should evidently be tested and verified in an automated fashion to ensure the longevity of the solution. That inherently means you’ll have to be able to trigger the circuit breaker yourself. A good way to model this into your landscape is to incorporate the circuit breaker in a feature-flagging system.
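A minimal sketch of what "triggering the circuit breaker yourself" could look like. The `forced_open` dictionary stands in for whatever feature-flagging system you use, and the function `guarded` and the feature name `recommendations` are hypothetical. Failure counting is left out to keep the focus on the manual trigger.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")

# Hypothetical in-memory flag store; in practice this would be read from
# the feature-flagging system.
forced_open: Dict[str, bool] = {"recommendations": False}


def guarded(feature: str, action: Callable[[], T], fallback: Callable[[], T]) -> T:
    """Run action unless the feature's breaker is forced open or the call fails."""
    if forced_open.get(feature, False):
        # Breaker tripped on purpose: skip the backend entirely.
        return fallback()
    try:
        return action()
    except Exception:
        return fallback()
```

An automated test can flip the flag, render the page and assert that the fallback shows up without the backend being called.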
Feature flagging is a way to enable or disable features for certain groups, or percentages of traffic on your website. You can go wild with user segmentation, A/B testing and such, but for the purpose of this article I’d like to highlight a setup that you can implement relatively easily and that mirrors criticality mode.
- Each feature should be listed
- At which criticality level should this feature be shown? (Green = only when fully operational, Orange = mission-critical feature, Red = can be crawled and has no interactive functionality (or is stripped when crawled))
- What is the target state of the feature? (Green = feature is operating nominally, Orange = trigger circuit breaker, Red = remove feature from external traffic)
This setup allows you:
- to hide features from external traffic, while still testing them on production with a specific target audience
- to reduce stress on a particular backend function, showing reduced functionality to a user while not degrading the rest of the digital solution
- to manage which features are shown in which criticality state
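The two questions per feature can be captured in a small registry. The sketch below is one interpretation of the setup described above, not a prescribed design: the names `Level`, `Feature` and `resolve`, the `internal_user` parameter and the three outcomes `"on"`, `"fallback"` and `"off"` are assumptions made for this example.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    GREEN = 0
    ORANGE = 1
    RED = 2


@dataclass(frozen=True)
class Feature:
    name: str
    shown_up_to: Level   # most severe site mode in which the feature is still shown
    target_state: Level  # GREEN = nominal, ORANGE = trip breaker, RED = hide externally


def resolve(feature: Feature, site_mode: Level, internal_user: bool = False) -> str:
    """Return "on", "fallback" or "off" for one feature and one visitor."""
    if site_mode > feature.shown_up_to:
        # Orange trips the breakers of non-critical features; red removes them.
        return "fallback" if site_mode == Level.ORANGE else "off"
    if feature.target_state == Level.RED:
        # Hidden from external traffic, still testable by a target audience.
        return "on" if internal_user else "off"
    if feature.target_state == Level.ORANGE:
        return "fallback"
    return "on"
```

For example, a `Feature("reviews", Level.GREEN, Level.GREEN)` resolves to `"fallback"` while the site runs in orange mode, and a feature with target state red stays `"off"` for everyone except internal users.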