Criticality Mode and Circuit Breakers

(This is a crosspost from an article I wrote @ https://medium.com/jumbo-tech-campus)

Ideally, our frontend solution is a unicorn that eats rainbows and poops butterflies. In reality it’s often a piece of software that relies on other pieces of software, all of which can break for whatever reason and pass that problem on to our customers.

We tend to be feature-driven and primarily look at the happy flow of our software. This has everything to do with the Pareto Principle:

It was … discovered that in general 80% of a certain piece of software can be written in 20% of the total allocated time. Conversely, the hardest 20% of the code takes 80% of the time.

And at the same time:

… Microsoft noted that by fixing the top 20% of the most-reported bugs, 80% of the related errors and crashes in a given system would be eliminated.

Now, it’s not said that the 20% from the former statement fully overlaps with the 20% from the latter, but there’s at least a correlation here. The 20% that ‘needs to be done right’ takes 80% of our time, and is prone to shortcuts.

One of these shortcuts is often skipping the design for failure. We tend to forget about what should happen when our solution doesn’t work as intended. That’s a problem, because a full outage, for example, costs actual money (conversions) and can become a major detractor for your public image and retention rate.

Criticality Mode

Failure can manifest itself in many ways, and for many reasons. That means you’ll need to be able to pull the plug on a macro level. In the case of a webshop, you might want to implement three levels:

Criticality Green

No general level of criticality assigned. All features function the way they should (implementing circuit breakers, as described in the next chapter).

Criticality Orange

All critical functionalities (like finding and showing items, adding items to a basket and placing orders) are operational as expected; for all other (non-mission-critical) functionalities we actively trigger their circuit breakers (again, more on that in the next chapter). It’s important that you inform your customers that you are running your shop with reduced functionality.

Once you go orange, you’ll be able to process orders, make money and have the ‘best’ experience given the circumstances, whilst reducing strain on the backend so you can fix what’s broken or run major updates.

Criticality Red

When you go ‘red’, you basically disable ALL backend traffic. This essentially means you’ll have to serve a static website without interactive functionalities.

This is the first and easiest implementation of criticality mode. What you’ll need to put in place in order to do this:

  • pick a cloud provider other than the one you host your operations on (you’ll want this when, for example, you accidentally ran a Terraform run that deleted vital pieces of your operation)
  • schedule a daily recursive wget of your homepage that writes its output to a storage bucket
  • make sure the wget sends a header when it crawls, like x-criticality-mode: red
  • do a first pass over your solution and condition interactive logic not to show when that header is present (no add-to-basket CTAs, no basket at all, stuff like that)
  • condition a banner on top of your page to inform the user when the header is provided (see the sketch after this list)
  • adjust your load balancer to route all external traffic towards the static bucket, based on the preferred criticality mode; each response should be fitted with a no-cache header, which allows you to quickly come back from this mode when needed
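A minimal sketch of what the frontend/SSR side of this could look like, assuming an Express-style server; the CRITICALITY_MODE environment variable, the renderHomePage helper and the exact markup are hypothetical and only illustrate how the x-criticality-mode header and the no-cache header come together:

```typescript
import express, { NextFunction, Request, Response } from "express";

type CriticalityMode = "green" | "orange" | "red";

// Hypothetical source of truth; in practice this would come from your
// feature-flagging system or an operations-controlled environment variable.
const CURRENT_MODE: CriticalityMode =
  (process.env.CRITICALITY_MODE as CriticalityMode) ?? "green";

const app = express();

app.use((req: Request, res: Response, next: NextFunction) => {
  // When the scheduled wget crawl announces itself (x-criticality-mode: red),
  // render the static, non-interactive variant so that is what lands in the bucket.
  const crawlMode = req.header("x-criticality-mode") as CriticalityMode | undefined;
  res.locals.criticality = crawlMode ?? CURRENT_MODE;
  // Never cache these responses, so you can come back from this mode quickly.
  res.setHeader("Cache-Control", "no-cache");
  next();
});

// Hypothetical renderer: hides interactive elements and shows the banner
// whenever we are not running at full functionality.
function renderHomePage(mode: CriticalityMode): string {
  const banner =
    mode === "green" ? "" : '<div class="banner">We are running with reduced functionality.</div>';
  const addToBasket =
    mode === "green" ? '<button class="add-to-basket">Add to basket</button>' : "";
  return `${banner}<main>…product listing…</main>${addToBasket}`;
}

app.get("/", (_req: Request, res: Response) => {
  res.send(renderHomePage(res.locals.criticality));
});

app.listen(3000);
```

The crawl then snapshots exactly this reduced variant into the bucket, and the load balancer decides whether live traffic sees the dynamic site or the static copy.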

Going red enables customers to still see your products and postpone their order. Since your site basically progressively enhances once you go back to orange or even green, the conversion loss will be minimised and the perceived quality of the digital solution will remain relatively high.

Circuit Breakers

The Circuit Breaker Design Pattern comes down to this:

If you know a backend is under pressure, making more connections to it or waiting for it makes no sense.

So instead of making more and more connections that cannot be resolved, bringing down your entire stack (because of timeouts and dog-piling), you inform the user upfront that the functionality isn’t working as expected.
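A minimal sketch of the pattern, assuming a generic async call and an illustrative failure threshold and reset window; a production breaker would typically also model an explicit half-open state and emit metrics:

```typescript
// Minimal circuit breaker: stop calling a struggling backend and serve a
// fallback instead. Thresholds and timings below are illustrative assumptions.
class CircuitBreaker<T> {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly request: () => Promise<T>,
    private readonly fallback: () => T,
    private readonly maxFailures = 5,
    private readonly resetAfterMs = 30_000,
  ) {}

  async call(): Promise<T> {
    if (this.failures >= this.maxFailures) {
      if (Date.now() - this.openedAt < this.resetAfterMs) {
        // Open: don't add load to a backend that is already in trouble.
        return this.fallback();
      }
      // Half-open: allow a single trial request through.
      this.failures = this.maxFailures - 1;
    }
    try {
      const result = await this.request();
      this.failures = 0; // success closes the breaker again
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.maxFailures) {
        this.openedAt = Date.now();
      }
      return this.fallback();
    }
  }
}

// Usage: protect a (hypothetical) recommendations call and degrade gracefully
// by letting the UI hide the widget instead of waiting for a timeout.
const recommendations = new CircuitBreaker<unknown[]>(
  () => fetch("/api/recommendations").then((r) => r.json()),
  () => [],
);
```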

This comes with some implications:

As a developer, you should make your stakeholders aware of the components your solution is relying upon and how they can fail.

As the business, you should figure out a way, together with your developers, to ensure the best behaviour when that situation occurs.

This circuit breaker should evidently be tested and verified in an automated fashion to ensure the longevity of the solution. That inherently means you’ll have to be able to trigger the circuit breaker yourself. A good way to model this in your landscape is to incorporate the circuit breaker in a feature-flagging system.

Feature flagging is a way to enable or disable features for certain groups, or percentages of traffic, on your website. You can go wild with user segmentation and A/B testing and such, but for the goal of this article I’d like to highlight a setup that you can implement relatively easily and that resembles criticality mode:

  • Each feature should be listed.
  • At which criticality level should this feature be shown? (Green = only when fully operational, Orange = mission-critical feature, Red = can be crawled and has no interactive functionality, or is stripped when crawled.)
  • What is the target state of the feature? (Green = feature is operating nominally, Orange = trigger its circuit breaker, Red = remove the feature from external traffic.) A sketch of such a registry follows this list.
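A sketch of what that registry could look like, assuming a plain in-code list; the feature names, field names and the ordering logic are illustrative and would normally live in your feature-flagging tool:

```typescript
type Criticality = "green" | "orange" | "red";

interface FeatureFlag {
  name: string;
  showAt: Criticality;      // green = only when fully operational, orange = mission critical, red = static/crawlable
  targetState: Criticality; // green = nominal, orange = trip its circuit breaker, red = remove from external traffic
}

// Hypothetical feature list for a webshop.
const features: FeatureFlag[] = [
  { name: "product-search",  showAt: "orange", targetState: "green" },
  { name: "add-to-basket",   showAt: "orange", targetState: "green" },
  { name: "recommendations", showAt: "green",  targetState: "green" },
];

function featureState(flag: FeatureFlag, mode: Criticality): "nominal" | "degraded" | "hidden" {
  // Per-feature overrides win: red removes the feature, orange trips its breaker.
  if (flag.targetState === "red") return "hidden";
  if (flag.targetState === "orange") return "degraded";
  // Otherwise the shop-wide criticality mode decides whether the feature may run.
  const order: Criticality[] = ["red", "orange", "green"];
  return order.indexOf(mode) >= order.indexOf(flag.showAt) ? "nominal" : "degraded";
}
```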

This setup allows you:

  • to hide features from external traffic, but still test them on production for a specific target audience
  • to reduce stress on a particular backend function, showing reduced functionality to a user while not degrading the rest of the digital solution
  • to manage which features are shown in which criticality state

Scale fast effect (Conway’s law)

(This is a crosspost from an article I wrote @ https://medium.com/jumbo-tech-campus)

At Jumbo, we’ve started to scale fast within a short amount of time. E-Commerce is not something we do on the side anymore. It’s part of our core. It’s who we are.

We’ve started scaling the digital landscape hard. And like children, when you grow fast, you might experience growing pains every now and then. In this article I’ll walk through the chronological sequence of things that happen when you scale your development effort fast.

When you know you are in a fast-scaling organisation and feel lost every now and then, this article might give you perspective on where you are and which steps will follow on the way to absolution ;-).

Where it starts

Given that this article is about scaling fast, the origin of your digital adventure is somewhere along these lines:

  • It’s a completely new endeavour
  • The potential is clear, but there was simply not enough money
  • The potential was unclear, so it wasn’t a priority

Whatever the cause might be, the scale is small. This inherently means that you’ll have a small number of people, steering a small number of developers (internal or external), working on a small number of products.

Digital is part of who we are

At some point you manage to prove that this is what the company should do. This is what will push your company into the next era. But in order to get more features and attract a bigger audience (or become able to serve your current audience), your company will need to make a bet. They will need to invest real money over a period of time before they get a return on that investment.

The first thing that will happen is that the company starts hiring people who can help them move forward. The issue is that, in the current market climate, the ratio of developers to jobs is in favour of the developer. Meaning that in order to attract the best developers, you’ll have to compete with the best tech campuses out there.

This is a struggle, because you’ll have to adjust your expectations. Unfortunately, allocating a lot of money doesn’t inherently give you all the intrinsically motivated people you hope for.

One way to scale quickly here is to hire external companies as partners to start setting up your new organisational structure.

You’ve managed to internalise some development teams. With these teams, you become able to create a culture that attracts the people you are searching for. The scale feels big(ger). You’ll still have one stream of business demand, but you’ll have multiple teams working on features. These features still flow into one application, but life is good, for a while.

Features features features

You now have the workforce to work on many things at the same time. This means that your business becomes able to put themselves close to the fire, to make sure you are building the things that pay the bills. Their POs will take a place in your teams and you’ll set up a process that helps in prioritising the demand. The moment you create these teams is also the moment Conway’s law starts to bite you. It states:

Organisations which design systems … are constrained to produce designs which are copies of the communication structures of these organisations

Lots of new features will be implemented. You’ll learn as a business that it’s sometimes best to apply validated learning. Set smaller goals, define how to measure them, validate the success, continue your path or improve and adjust course.

Your tech department, however, will learn that not all progress is measurable in terms of pageviews, turnover, performance and similar metrics. Some effort is made out of ideology.

And Jumbo is big on that. We believe. We believe in Service with a smile. We believe in being every day low price. We believe in a pleasant shopping experience. We believe in a winning mentality, a positive attitude and an ability to overcome whatever it is you need to overcome.

Unfortunately, pushing all these features usually leads to a drain in performance, lots of bugs, dissatisfaction with working on the product and lots of trouble keeping the boat afloat. This is a hard, but good, spot to be in. You’ve proved that there is a huge demand for the course you’ve charted; you’re just not yet able to cope with that demand.

Ownership

Because you are working on one application, it becomes increasingly hard to take ownership of your product. You’ll see product owners focussing on the new features they want to get in, but not on the quality of the product. They’ve essentially become Problem Owners.

It’s a logical thing though. Since all functions are entangled in this one monolith, it’s impossible for them to take ownership even if they wanted to. What we need is to break the monolith into pieces that can be owned, so that business can adopt them and take ownership of them.

What you need here is a push from development as well as from business to unlock business capabilities. In order to determine a business capability you have to ask yourself: if I had a business and I spent money on this, what would it enable me to do? Concrete examples would be:

  • the ability to process payments
  • the ability to send push messages to my customers
  • the ability to know where an order physically is right now

Each ability is atomically defined. This inherently means that when I develop the functionality ‘send a message 15 minutes before the order reaches the customer’, it unlocks building blocks (business capabilities) that my other processes also benefit from. The more capabilities you unlock, the easier it becomes to combine them and service future business demands.

Development should dissect new incoming projects into business capabilities. This takes maturity, in the sense that you will have to understand the challenges from the perspective of your customer rather than from your own technical perspective. Each piece of the puzzle has to be allocated to a business domain and serviced accordingly.

Business should start looking at their feature requests a bit differently as well. Their opportunity-versus-cost analysis should deepen a bit, taking the unlocking of capabilities into account.

Let’s say we have three epics:

  • 8 value, 8 cost : Notify people whenever their basket offers expire
  • 5 value, 5 cost : Send a message when we are 15 minutes away with the order
  • 3 value, 3 cost : Show the average delivery time for the current order

None of these are ‘low-hanging fruit’, you’d say. But what if I told you that:

  • if the 5/5 effort has been made whilst unlocking its true business capabilities (know where the order is, format a personalised message and send push messages to customers),
  • the 8/8 becomes an 8/3 (because we can already send push messages and personalise them),
  • and the 3/3 becomes a 3/1 (because we already collect metrics on delivery times)?

What we evidently miss is a factor that multiplies the opportunity value of an epic by the capabilities it unlocks.
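To make the cost side of the example concrete, a toy sketch of the capability discount: an epic’s remaining cost is only the cost of the capabilities nobody has built yet. The capability names and per-capability costs are illustrative assumptions, chosen so the totals mirror the 8/8 epic above:

```typescript
// Toy model: an epic's remaining cost is the cost of the capabilities
// it needs that nobody has built yet. Names and numbers are illustrative.
interface Epic {
  name: string;
  value: number;
  needs: Record<string, number>; // capability -> cost to build it from scratch
}

function remainingCost(epic: Epic, unlocked: Set<string>): number {
  return Object.entries(epic.needs)
    .filter(([capability]) => !unlocked.has(capability))
    .reduce((sum, [, cost]) => sum + cost, 0);
}

const notifyBasketExpiry: Epic = {
  name: "Notify people whenever their basket offers expire",
  value: 8,
  needs: { "push-messages": 3, "personalised-messages": 2, "basket-expiry-tracking": 3 },
};

// After delivering "send a message when we are 15 minutes away with the order",
// push messages and personalised messages are already unlocked.
const unlocked = new Set(["order-location", "personalised-messages", "push-messages"]);

console.log(remainingCost(notifyBasketExpiry, unlocked)); // 3 instead of the original 8
```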

You’ll get valuable things cheaper, and the entire room starts knowing about the capabilities, which enables you to allocate the capabilities to their respective owners.

You now have a service-oriented architecture that people feel (and are) responsible for. New features won’t be accepted if they impact one of these services in a negative way. Bugs will be hunted, performance will be top of mind. A project manager will have to talk to product owners to be able to integrate new functionality into their systems. A natural guard has been created.

Dev and Ops

If you create it, and you are responsible, you should run it. If you can’t run it yourself, you can’t be held fully responsible.

You should prevent a blame culture at all times. If someone can be blamed (rightfully or not), it sets a negative context. It poisons the atmosphere. It’s a constant excuse to underperform. And it might not be evident, but some people take real comfort in this situation; it gives them power and personal validation to be the hero when trouble arises. Therefore they might not be inclined to actually solve the problem at hand.

If you want to be the best, you should be in control. So if you need to roll out an update now, nobody should stand between you and the deployment. If you temporarily need more resources, you should be able to pull them in. Of course you need to operate within boundaries, but for the simple stuff you should be empowered to ‘do it yourself’. And when something breaks, you should be held responsible. Only then do you have the speed (agility) to improve continuously. Someone else cannot be held accountable for the software you wrote. And even more relevant: you won’t release bad code, because it’s your head that rolls when it goes wrong.

DevOps doesn’t mean you do everything yourself, though; Ops becomes facilitating rather than steering. Steering Ops makes a lot of sense as long as your landscape primarily revolves around external applications, but when you build your own application it becomes a major blocker unless you can make the transition to facilitating the teams.

The good thing is that you, like no other, know how to prevent these problems. Becoming DevOps means taking care of quality assurance within the design of your code and deployment as well.

Conclusion

Whenever you find yourself lost in a transition due to scaling up rapidly, know that you just haven’t reached the operational optimum yet, and that the turmoil you experience is needed to get to the next stage of maturity. Conway’s law is not a law because it has to be followed; it became a law because it describes a cascade of logical steps that will happen when you are in a certain situation.

If you can identify with one of the steps in this article, find peace in the fact that, with the willingness of all involved, your situation will eventually resolve into a well-oiled machine. Transitions just take some time.