Once you reach a certain size as a company, you will typically have a fairly varied software portfolio. Some systems would be legacy systems, others more recent ones. Some systems form the bedrock of your business, other systems mostly support back office work.
As we make changes in these systems, we increase their Software Entropy. As stated in the seminal paper, even against the best efforts of the most architecturally conscious organisations, most systems tend to end up as a Big Ball of Mud. A lot of software strategy is essentially about managing this entropy.
I daresay, software strategy comes a little bit later in the organisation’s life cycle, as initially it needs to succeed as a viable business. This makes total sense.
So let’s assume we are a successful business and own a set of systems which power this business. We are ready to form a strategy to deal with the entropy of these systems. Most typically we start with a system that is of most importance to the business. We choose an approach to deal with the entropy of this system and it seems to work.
Often, having seen some success with the approach, we declare victory and rollout the same approach across all our systems. These attempts often end up in failure and at huge costs. I personally am guilty of going down this route.
Turns out, this is so common that there is a law about this. It’s called Ashby’s Law also known as The Law of Requisite Variety.
Or put in a more approachable way
“In order to deal properly with the diversity of problems the world throws at you, you need to have a repertoire of responses which are (at least) as nuanced as the problems you face.”
Essentially, we need a variety of strategies to manage different systems in our software portfolio. Depending on the various attributes of each system, we can then apply a strategy to manage their respective entropy.
Let’s look at some common strategies that one can apply to each system in your software portfolio.
Do nothing
This is a valid strategy when the domain that you are trying to explore is new to you. This applies to new systems being built in new domains that your business is navigating. It’s suited for exploration.
This is also a commonly adopted strategy by people when the software has not been tended to for a long time.Then it becomes more of a by-product of inertia.
Refactoring
Once you have an understanding of the domain, before the system gets out of your hand, you need to make sure that the software now represents the domain and the problem you are solving. This activity needs to be done continuously and with discipline.
In practice, I have seen refactoring often fail. This happens because of a variety of reasons. If you let your system grow too much, if the person or the people doing the refactoring are not using the right abstractions, if the abstractions are not communicated well in the team, if the tools you have to protect your refactoring from getting smudged all of this plays a part into this.
Rewrites
Rewrites, while being extremely desired by engineers, are an extremely risky choice. As Joel Spolsky wrote in his essay, engineers love this because it’s harder to read code than it is to write it.
A not exhaustive list when they make sense are:
When the abstractions that you originally started designing the system with, no longer hold. While not strictly a rewrite, the evolution of Stripe’s API is a great example of this. The Stripe API initially started out with the idea that payments are finalised instantly as their primary use case was credit cards. Finalisation means that the user has sufficient confidence is guaranteed. However with most payment methods, including crypto, finalisation takes a while. Stripe’s API was then redesigned with this abstraction in mind.
The other case in which a rewrite makes sense is if you are going for a paradigm shift. Uber rewrote their app to fundamentally utilise functional/reactive patterns and redesign their UI to allow multiple product teams to work together. Personally, I worked on a project at work, where to keep a handle on latency of our user funnel, we rewrote the backend so as to not make any database queries. All the requisite data was stored locally on each box and any updates propagated through kafka streams. The astute reader will notice this is just CQRS in practice.
However rewrites are extremely risky. At best they slow you down massively during the rewrite, at worst they can be threatening to your very existence. The current system has acquired its calluses the hard way and those calluses serve a purpose. Translating that into a new system always takes longer than you thought. Maintaining backwards compatibility is hard. You might end up replacing your system with an over-engineered monstrosity - which Fred Brooks in the Mythical Man-Month called the Second System Syndrome.
You have been warned.
Reclaim
For the longest time, I thought that you only had the 3 choices that I mentioned above. However, I recently came upon this article by Will Larson. Will suggests translating your beliefs about your system into the desired properties and behaviours. Then one can try to implement validations and assertions of these desired properties and behaviours and eventually you would be able to reason about your software.
Another approach of reclaiming systems which have regressed into incomprehensible monstrosities is to break them apart into logical services. Then we implement SLOs and SLIs, clean up the APIs and iteratively try to reduce the entropy in each individual system. Sounds a lot like Will’s idea, doesn’t it?
This strategy makes most sense for systems which, while chock full of important business logic, have fallen into a state of disrepair due to lack of attention. This can also be a very valid strategy before embarking on a full rewrite. Once you can reason about the system slightly better, you can choose how best to rewrite it and which parts.
Replace
I think the best software to have in your portfolio is one whose maintenance you are not responsible for. You could probably use X as a service especially when X is not one of your core capabilities i.e payments, communications platform etc. If it doesn't bring you competitive advantage, you probably should not be building it.
If you have already built it, chances are this system is more likely to decay, for the simple reason that it’s not your priority. Replace it with a commoditised service. Increasingly, there are also offerings like Retool which let you quickly build internal tools. This is a topic I want to explore more in the future.
These are the strategies I have commonly seen applied. While the top 3 are quite common, I think Reclaim and Replace are an incredibly useful lens with which to look at your system.
Humans have a tendency to crave one size fits all strategies. Perhaps we deal with complexity by pretending that the complexity does not exist. We end up fighting against Ashby’s Law and losing.
Let’s remember, the world is complex.
And Variety beats Variety.