Deliberately Seeking Chaos

Improving System Resilience by Enabling Chaos

Introduction

Accidental discoveries take many forms, and not all of them lead to scientific breakthroughs on the scale of penicillin or Post-it notes. Yesterday I “accidentally discovered” that my furnace was leaking, and my “breakthrough” today was that my home warranty policy was useless. By nature, an accidental discovery is not the result of careful planning or a series of steps leading toward a profound insight; rather, it arises unexpectedly, often from the convergence of several factors. Considerations that may increase the odds of a profound accidental discovery include expert knowledge in a domain, serendipitous exposure to another domain of practice, collaborative insights that help connect the dots, and random chance leading to an observable and unexpected outcome. In this post I will focus on one accidental discovery that is particularly relevant to my professional field, Chaos Engineering, including a summary of the forces that facilitated this discovery or could have impeded its development.

Chaos in the Network

Earlier this year the network that my team is responsible for suffered an outage that had a substantial impact on our customers, leaving thousands unable to make or receive phone calls for several hours. A network element outage is not unusual in our industry, which depends upon highly complex systems working flawlessly in concert year after year. To shield customers from failures, we design our systems to be redundant in several ways: within the software, through duplicated hardware, and across geographically separated network core locations. We further limit customer impact by performing upgrades and maintenance on these systems only during late night and early morning hours, when usage of the network is at its lowest. These hardware, software, and process safeguards are all aimed at reducing the likelihood that failures in our systems will be noticed by customers.

Jim Gray noted in a paper nearly forty years ago that even well-managed systems fail often enough to affect users, and that the solution was to design systems that allow individual elements to fail without impacting the overall service (Gray, 1986). This level of availability remains challenging to achieve even after decades of advancement, in part because the failure modes of complex interconnected systems are difficult to predict or plan for. In the network outage described above, several independent failures contributed to the resulting customer impact: a software bug, an unrelated firmware incompatibility, and a configuration problem dating back to the original installation of the system. All of the engineering best practices and maintenance processes in place within my team failed to protect our customers from this cascading sequence of issues. As Gray notes, “the key to high-availability is tolerating operations and software faults” (1986).
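Gray’s idea of tolerating individual element failures can be illustrated with a toy failover chain. This is a minimal sketch, not our production design; the handler names and exception types are illustrative assumptions:

```python
def call_with_failover(handlers):
    """Try each redundant handler in turn; the call succeeds as long as
    at least one element in the redundancy chain is healthy."""
    for handler in handlers:
        try:
            return handler()
        except ConnectionError:
            continue  # this element failed; fail over to the next one
    raise RuntimeError("all redundant elements failed")

def faulty():
    # Stand-in for a failed network element
    raise ConnectionError("element unreachable")

def healthy():
    # Stand-in for a redundant element that absorbs the traffic
    return "call connected"

# The service survives the first element's failure because a redundant
# element handles the call instead.
result = call_with_failover([faulty, healthy])
print(result)  # call connected
```

The point is that the customer-visible outcome depends on the chain as a whole, not on any single element staying up, which is exactly the property that cascading, correlated failures undermine.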

Learning from Chaos

In 2008, after critical outages that impacted their customer experience, Netflix began working to replace their on-premises server equipment with the goal of improving reliability (Thelin, 2021). In working through the root cause analysis of those failures, the team developed a deep understanding of the system’s complexities. This proved valuable after the system was moved to a cloud environment, which solved the immediate server failures but added several new layers of complexity. From observing and managing the challenges of their on-premises system, the Netflix team made the accidental discovery that intermittently and randomly deactivating system elements pressure-tests the reliability of the entire system rather than isolated individual components. This led to the creation of a tool called Chaos Monkey, released as open source software in 2012, “which randomly selects virtual machine instances that host … production services and terminates them” (Basiri et al., 2016). Chaos Monkey provides a valuable means of assessing the robustness and resilience of complex systems, especially their handling of the unforeseen errors typical of large-scale products. It has since evolved into a more capable system that Netflix calls Chaos Kong, which tests much larger failure scenarios and has helped Netflix avoid region-wide service outages despite cascading failures in their Amazon Web Services hosted infrastructure (Netflix Technology, 2015).
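The core mechanism quoted from Basiri et al. can be sketched in a few lines. This is a simplification of the idea, not Netflix’s implementation; the fleet names and the `terminate` callback are hypothetical:

```python
import random

def chaos_monkey(instances, terminate, rng=random):
    """Randomly pick one running instance and terminate it, mimicking the
    unplanned failures Chaos Monkey injects into production."""
    if not instances:
        return None
    victim = rng.choice(sorted(instances))  # sort for a stable candidate list
    terminate(victim)
    return victim

# Hypothetical fleet; in Netflix's case these would be cloud VM instances.
fleet = {"api-1", "api-2", "api-3"}
victim = chaos_monkey(fleet, terminate=fleet.discard)
print(f"terminated {victim}, {len(fleet)} instances still serving")
```

Running something like this continuously forces every service to prove, every day, that it can lose any single instance without customers noticing.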

Deliberately Injecting Chaos

Chaos Engineering is a stress-testing methodology that involves intentionally disrupting systems to identify vulnerabilities and enhance resilience. It is “the discipline of experimenting on a distributed system in order to build confidence in its capability to withstand turbulent conditions in production” (Basiri et al., 2016). The concept of injecting chaos into complex systems, in the form of randomly triggered failures and faults in individual elements and processes, gave rise to this methodology, which continues to influence how systems are built and tested. Several principles have since been developed around how Chaos Engineering should be applied, including starting with a deep understanding of normal system behavior, applying random events that match the issues that occur in the real world, and running the tests continuously and on production systems (Principles of Chaos Engineering, 2019). The philosophy of injecting chaos into a system continues to evolve; more recently it has been applied in tandem with digital twins to retain the benefits of the practice while expanding the test surface to include business processes and practices, and to limit the blast radius of leaked failures so that customers are not impacted (Poltronieri et al., 2021).

Factors at Play

Several key factors supported the accidental discovery of Chaos Engineering. First were the technological forces arising from the critical server failures the Netflix team experienced as they transitioned to a cloud-based architecture; these were the immediate driver for their reflection on the tension between system reliability and increasing system complexity. Second was the organizational force reflected in Netflix’s culture of innovation and experimentation, which created an environment that allowed engineers to take the risk of purposefully injecting errors into their systems. At the same time, several factors could have impeded its development. Financial considerations in many organizations limit the time and investment applied to unproven solutions, especially ideas that create the risk of service disruption. Structural limitations associated with the infrastructure required to deliver fault-tolerant services, including redundant hardware, software, systems, and locations, could also have acted as a limiting factor.

Conclusion

Accidental discoveries have a rich history in science, technology, and pop culture, often driven by a combination of unexpected events and unforeseen opportunities. Chaos Engineering, born of Netflix’s unlucky encounter with server failure, is one such accidental discovery that applies to the professional world I work in. It has fundamentally changed how organizations stress-test large-scale, distributed systems: by assuming that failures will occur and testing for them in live systems, teams can understand and build resilience despite increasing complexity. Netflix’s journey highlights how a critical outage in 2008 not only underscored the need for more resilient systems but also led to the realization that continuous testing through controlled failures was the key to gaining a deep understanding of their system.

References

Basiri, A., Behnam, N., De Rooij, R., Hochstein, L., Kosewski, L., Reynolds, J., & Rosenthal, C. (2016). Chaos engineering. IEEE Software, 33(3), 35–41.

Gray, J. (1986). Why do computers stop and what can be done about it? Symposium on Reliability in Distributed Software and Database Systems, 3–12.

Netflix Technology. (2015, September 25). Chaos Engineering Upgraded. Medium. https://netflixtechblog.com/chaos-engineering-upgraded-878d341f15fa

Poltronieri, F., Tortonesi, M., & Stefanelli, C. (2021). Chaostwin: A chaos engineering and digital twin approach for the design of resilient IT services. 2021 17th International Conference on Network and Service Management (CNSM), 234–238.

Principles of chaos engineering. (2019, March). https://principlesofchaos.org/

Thelin, R. (2021, April 23). Bytesize: 2 accidental discoveries that changed programming. Educative. https://www.educative.io/blog/bytesize-accidental-discoveries-in-programming
