On complexity, software and resiliency
Hello, and welcome to my newsletter. I’m Karim, and every 2 weeks I tackle questions or problems I’ve witnessed in startups, from the very early stages up to the late growth stages. Much of my startup experience has been in leading engineering organizations, but I cover topics outside of engineering as well.
Send me your questions or suggested topics and in return, I’ll try and answer them through a post to this newsletter.
Complex systems: a very brief introduction
I recently finished reading Complexity: The Emerging Science at the Edge of Order and Chaos, a book I highly recommend. The book was my first foray into the world of complex systems and provided a good overview of the topic and of the Santa Fe Institute - an independent, nonprofit theoretical research institute located in Santa Fe and dedicated to the study of complex systems. As I was reading the book, I couldn’t help but think about the properties of complex systems and their prevalence in software development, at least for large software projects.
Complex systems, not to be confused with complicated ones, are systems that possess the following properties. They are non-linear, meaning that a small change to the system’s input can result in a disproportionate change to the system’s output. They are emergent, meaning that the properties the whole system exhibits differ from those shown by the individual parts comprising the system. Thus, a complex system displays collective behavior that emerges from the interactions between its parts. The other main properties are adaptation and feedback loops, which imply that these systems change and adapt to their environment.
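To make non-linearity a little more tangible, here is a minimal sketch using the logistic map, a textbook toy model of non-linear behavior. It is my stand-in for illustration and has nothing to do with software teams specifically. The growth parameter r = 3.9, the starting values and the number of steps are all assumptions picked for the demonstration.

```python
def logistic_map(x0: float, r: float = 3.9, steps: int = 60) -> list[float]:
    """Iterate x -> r * x * (1 - x) and return the whole trajectory."""
    trajectory = [x0]
    for _ in range(steps):
        x = trajectory[-1]
        trajectory.append(r * x * (1.0 - x))
    return trajectory


# Two inputs that differ by one part in a million (assumed values).
a = logistic_map(0.200000)
b = logistic_map(0.200001)

# Distance between the two trajectories at every step.
gaps = [abs(x - y) for x, y in zip(a, b)]
print(f"gap at start: {gaps[0]:.6f}")
print(f"largest gap within {len(gaps) - 1} steps: {max(gaps):.3f}")
```

The two runs start out practically identical, yet within a few dozen steps they bear no resemblance to each other. That is the flavor of "small change in, disproportionate change out" described above.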
Complex systems are very common; in fact, we live in one. Our climate is an example of a complex system. So is the entire human body, and even a single human cell. Cities are complex systems and so is the universe. It’s worth noting that there is a difference between complicated and complex systems. A car engine is a complicated system, but it is not complex. It is composed of many parts that behave according to some specification. These parts operate in concert to provide the functionality of an engine. There are no emergent properties to be found in a car’s engine. Nor does it adapt to its environment and, thankfully, it doesn’t exhibit non-linear properties.
Software development: A complex system?
I believe, mostly based on my experience, that the process of building fairly large software is an example of a complex system. In my opinion, software development exhibits many of the properties of complex systems: emergence, non-linearity and adaptation.
If you’ve ever worked on a software team, you might have witnessed how the team’s interactions are unique. A software team’s characteristics aren’t the average of the individuals comprising the team; rather, they are a unique amalgam of these individuals’ characteristics. Software teams, most notably OSS projects, are typically autonomous and self-managed. More importantly, the process of building software is, at its heart, based on social interactions. Individuals working together on a software project have to find the best way for them to interact. Examples of these interactions include knowledge sharing, the team’s structure and hierarchy (if it has one), coding guidelines and more. In short, the team’s behavior, principles and characteristics emerge from the interactions between the individuals.
Every software project I have worked on has been subject to changing requirements and assumptions. This requires software teams to adapt to their changing environment. These changes can also be immensely disruptive. What at first appears to be a simple change can in fact result in a significant amount of work or, even worse, introduce catastrophic failures. This is yet another characteristic of complex systems: non-linearity.
Ok, so what?
Maybe software development is indeed a complex system, maybe it’s not. I’m not here to make this argument or to provide irrefutable evidence to convince you that it is. However, like any system, complex or not, you probably want your software development “system” to be resilient.
“resilience determines the persistence of relationships within a system and is a measure of the ability of these systems to absorb changes of state variables, driving variables, and parameters, and still persist.” C. S. Holling,
Resilience is basically the ability of a system to absorb a shock and quickly bounce back to a functional, equilibrium state. Oftentimes, the shocks that can bring a seemingly indestructible system down are initially very minor, even inconsequential. Consider the collapse of the Soviet Union, or the collapse of several Arab regimes during the Arab Spring that began in late 2010.
Chaos engineering: Making software teams resilient
I can think of two main shocks to the software development process: changing requirements and personnel departures. I’ll be ignoring the former in this article and focusing on the latter, which, arguably, can have a larger negative impact and is one which I think is overlooked.
You can never truly plan for employee departures. They are typically abrupt, especially if the employee leaves of her own volition. Employees can leave for many reasons: moving, taking time off, a new job and many more. Whatever the reason, an employee departure represents a shock to your software development system. The shock, or impact, that a departure induces can take many forms. Perhaps the departing employee is an expert in one or more subsystems of the software and her departure will result in a significant knowledge gap. This expertise is not limited to knowledge about various modules or subsystems of the software. It can also include other skills. The departing employee could be an expert debugger, your go-to person for complex and hard-to-reproduce bugs. She could also be an excellent system designer whom your teams rely on for system design and architecture. Obviously, when an employee leaves, her team has to pick up her work, which can delay the release of the feature the team is working on.
Regardless of the skill that person possesses, each departure can result in a shock to your software development “system”. This shock can be limited to the team the individual is working on or, for more senior engineers, the impact could be much wider.
One way to minimize the impact of shocks, like the ones induced by departing employees, is to introduce them artificially. You know that you will be losing people in any given year, so perhaps practicing for these departures will better position you and your teams to handle them when they actually do happen. No, I am not suggesting random firings!
This, in principle, is what a recent article by Dan Lebrero is trying to address. Dan introduces the concept of a Lucky Lotto, shown below. Note that Dan later modified the rules. I encourage you to read his article in full.
Welcome to Akvo’s Lucky Lotto!
Starting last week of September, we are going to start running our own Akvo’s Lucky Lotto.
All of you will have a chance to win, and your team to enjoy the results of your disappearance.
Rules:
1- Every Monday a random person will win the Lucky Lotto.
2- The winner will work on some side project.
3- The winner will be completely unavailable to colleagues and to the rest of Akvo for the week.
4- Everybody, including product managers, gets one ticket every week, even if you don’t want it.
5- Every time that rule 3 must be broken, the winner must make a note (I will share some doc to do this).
Copied with permission
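For readers who think in code, the rules above can be sketched in a few lines. This is only my reading of the rules, not Dan’s implementation: the class name, method names and participant names are hypothetical, and a shared doc is replaced here by an in-memory log.

```python
import random
from dataclasses import dataclass, field


@dataclass
class LuckyLotto:
    """Hypothetical model of the weekly draw; all names are illustrative."""
    participants: list[str]
    rng: random.Random = field(default_factory=random.Random)
    # winner -> notes recorded whenever rule 3 had to be broken (rule 5)
    interruptions: dict[str, list[str]] = field(default_factory=dict)

    def draw(self) -> str:
        """Rules 1 and 4: everyone holds one ticket, one random winner per week."""
        if not self.participants:
            raise ValueError("the lotto needs at least one participant")
        winner = self.rng.choice(self.participants)
        self.interruptions.setdefault(winner, [])
        return winner

    def log_interruption(self, winner: str, note: str) -> None:
        """Rule 5: note every time the winner had to be pulled back in."""
        self.interruptions.setdefault(winner, []).append(note)


# Example week with made-up names.
lotto = LuckyLotto(["Alice", "Bob", "Chandra"])
winner = lotto.draw()
lotto.log_interruption(winner, "Team needed help with the release script")
print(winner, lotto.interruptions[winner])
```

The interesting output is not the winner but the interruption log. Each note marks a spot where the team could not cope without that person.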
A reasonable assumption to make, if such a process were implemented, is that the disruption it initially causes would be almost identical to that of the real event. Put differently, if Bob wins the Lucky Lotto of the week, then the impact on his team is almost the same as if Bob had actually left the company. However, in time, and as more of these shocks are introduced to the system, the impact should diminish. Teams will adapt, learn and become resilient.
I’d imagine that the relationship between disruption to the team, resiliency and the number of shocks induced could look like the graph below. Initially the disruption is very strong and the resiliency builds up slowly. However, with more practice, the disruption starts to diminish and the resiliency starts to ramp up. Both eventually reach a plateau, or asymptote, at which a departure causes little disruption and further shocks add little resiliency.
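One way to put numbers on that intuition is a toy model. To be clear, everything here is an assumption on my part: I picked logistic (S-shaped) curves because they start flat, change quickly in the middle and then level off, and the rate and midpoint values are made up rather than measured.

```python
import math


def _logistic(x: float, rate: float, midpoint: float) -> float:
    """Standard S-curve rising from 0 to 1 around the midpoint."""
    return 1.0 / (1.0 + math.exp(-rate * (x - midpoint)))


def resiliency(shocks: int, rate: float = 0.6, midpoint: float = 6.0) -> float:
    """Assumed form: slow build-up, then a ramp, then a plateau near 1."""
    return _logistic(shocks, rate, midpoint)


def disruption(shocks: int, rate: float = 0.6, midpoint: float = 6.0) -> float:
    """Assumed form: the mirror image, strong at first and fading toward 0."""
    return 1.0 - _logistic(shocks, rate, midpoint)


# Print both curves for a growing number of practiced shocks.
for n in range(0, 21, 4):
    print(f"shocks={n:2d}  disruption={disruption(n):.2f}  "
          f"resiliency={resiliency(n):.2f}")
```

Tying disruption to resiliency as exact mirror images is the simplest choice, not a claim about real teams. In practice the two curves could easily have different rates and midpoints.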
Did any of this work?
I obviously have no data to back this up. In fact, I haven’t even tried to run a lotto system like the one Dan introduced. Dan does share a few results, quoted below.
Three months running the Lucky Lotto showed several instances of a bus factor of one, and gave the teams the opportunity to step up, learn and cover for the missing person’s skills.
As an example, our one and only Android developer won the Lotto the same week that the team was going to fix some major performance issue on the communications between the app and the server. It was a great learning experience for the team.
For the Lucky Lotto winner, it was a very enjoyable week, to either learn something new (Kubernetes, backend development, our deployment pipeline, Cypress, Clojure, …), work on those long desired dev improvements that we never had time for, or to do something different from the usual churn.
These days were a great mirror into where I actually spend my time and if that is the best way to handle the tasks.
One of our Product Managers
In addition to the knowledge sharing, we got some cross-pollination and broader-team building as some winners decided to work with the other product team during their Lotto week.
Copied with permission
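The "bus factor of one" cases Dan mentions are the kind of thing you could also look for up front, if you keep even a rough map of who can cover which skill. The sketch below assumes such a map exists. The function name, the skills and the people are all hypothetical, loosely modeled on the lone Android developer in Dan’s example.

```python
def single_points_of_failure(coverage: dict[str, set[str]]) -> dict[str, str]:
    """Return each skill covered by exactly one person, mapped to that person."""
    return {
        skill: next(iter(people))
        for skill, people in coverage.items()
        if len(people) == 1
    }


# Hypothetical team data, for illustration only.
coverage = {
    "android": {"Bob"},
    "backend": {"Alice", "Bob"},
    "deployment": {"Alice", "Chandra"},
}

at_risk = single_points_of_failure(coverage)
print(at_risk)
```

A map like this is only as good as the team’s self-assessment, which is arguably why an exercise like the Lucky Lotto surfaces gaps that a spreadsheet would miss.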
I don’t know what the long-term impacts of a process like the Lucky Lotto are. Perhaps it does indeed result in more resilient teams. I truly hope it does.
I do know that, as an industry, we spend far more time trying to make our software resilient than we spend on the much-needed resiliency of our processes and teams. We introduce random failures to our software. We bring databases down. We pull disks out of servers. We reboot servers at random. We do all of this to observe how our software behaves in response to failures or shocks to the system.
Perhaps it’s time we focused on making our individuals, teams and processes more resilient too. Lastly, if you’ve tried anything remotely similar to this lotto approach, or other methods, I would love to hear from you.