Deliberately Seeking Chaos
Improving System Resilience by Enabling Chaos
Introduction
Accidental discoveries take many forms, and not all of them
lead to scientific breakthroughs on the scale of penicillin or Post-it notes.
Yesterday I “accidentally discovered” that my furnace was leaking, and my
“breakthrough” today was that my home warranty policy was useless. By nature,
an accidental discovery is not the result of careful planning or a series of
steps leading toward a profound insight; rather, it arises unexpectedly and
often from the interaction of several factors. Considerations that may increase the
odds of a profound accidental discovery include expert knowledge in a domain,
serendipitous exposure to another domain of practice, collaborative insights
that help to connect the dots, and random chance leading to an
observable and unexpected outcome. In this post I will focus on one accidental
discovery that is particularly relevant to my professional field – Chaos
Engineering – including a summary of the forces that facilitated this discovery
or could have impeded its development.
Chaos in the Network
Jim Gray noted in a
paper nearly forty years ago that even well-managed systems fail often enough
to affect users, and that the solution is to design systems that allow
individual elements to fail without impacting the overall service (Gray, 1986). This level of availability remains
challenging to achieve even after decades of advancement, in part because the
failure modes of complex interconnected systems are difficult to predict or
plan for. In a recent network outage my team experienced, several independent
failures contributed to the resulting customer impact, including a software
bug, an unrelated firmware incompatibility, and an additional unrelated
configuration problem dating back to the original installation of the system.
All of the engineering best practices and maintenance processes in
place within my team failed to protect our customers from this cascading
sequence of issues. As Gray notes, “the key to high-availability is tolerating
operations and software faults” (1986).
Learning from Chaos
Beginning in 2008,
after experiencing critical outages that damaged the customer experience,
Netflix began working to replace its on-premises server equipment with the
goal of improving reliability (Thelin, 2021). After working through the root-cause
analysis of those system failures, the team realized they had attained a deep
understanding of the system's complexities. This proved beneficial after the
system was moved to a cloud environment, which solved the immediate server
failures but added multiple new layers of system complexity. From observing and
managing the challenges of their on-premises system, the Netflix team made the
accidental discovery of the benefits of intermittently and randomly
deactivating system elements as a way of pressure-testing the reliability of
the entire system rather than isolated individual components. This led to the
creation of a tool called Chaos Monkey, released as open-source software in
2012, “which randomly selects virtual machine instances that host … production
services and terminates them” (Basiri et al., 2016). Chaos Monkey serves as a valuable means of
assessing the robustness and resilience of complex systems, especially their handling
of the unforeseen errors typical of large-scale products. Chaos Monkey has since
evolved into a more ambitious system that Netflix calls Chaos Kong, which enables
testing of much larger failures and has helped Netflix avoid
region-wide outages of its service despite cascading failures in its Amazon
Web Services-hosted system (Netflix Technology, 2015).
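Conceptually, Chaos Monkey's core behavior – picking a random production instance and terminating it – can be sketched in a few lines. The snippet below is a toy simulation, not Netflix's implementation: the instance names and the `chaos_round` helper are hypothetical, and no real infrastructure is touched.

```python
import random

def pick_victim(instances, rng=None):
    """Randomly select one instance ID to terminate, Chaos Monkey style.

    Returns None when there is nothing left to terminate."""
    rng = rng or random.Random()
    if not instances:
        return None
    return rng.choice(instances)

def chaos_round(cluster, rng=None):
    """Simulate one chaos round: 'terminate' a random instance and
    return the victim along with the surviving cluster."""
    victim = pick_victim(list(cluster), rng)
    survivors = [i for i in cluster if i != victim]
    return victim, survivors

# Example with a hypothetical three-node web tier; a seeded RNG makes
# the run repeatable for demonstration purposes.
victim, survivors = chaos_round(["web-1", "web-2", "web-3"], random.Random(42))
```

The value of the exercise is not the termination itself but the follow-up question: did the service keep working with only `survivors` available?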
Deliberately Injecting Chaos
Chaos Engineering
is a stress-testing methodology that involves intentionally disrupting systems
to identify vulnerabilities and enhance their resilience. It is “the discipline
of experimenting on a distributed system in order to build confidence in its
capability to withstand turbulent conditions in production” (Basiri et al., 2016). The concept of injecting chaos into
complex systems, in the form of randomly triggering failures and faults in
individual elements and processes, led to the development of this
methodology, which continues to influence how systems are built and tested.
Several principles have since been developed for how Chaos
Engineering should be applied to systems, including starting with a deep
understanding of normal system behavior, applying random events that match the
issues that occur in the real world, and running the tests continuously and on
production systems (Principles of Chaos Engineering,
2019). The philosophy of injecting chaos into a system continues to
evolve; more recently it has been applied in tandem with the technology of digital
twins, retaining the benefits of the practice while expanding the test surface
to include business processes and practices and limiting the blast radius of
leaked failures to avoid impacting customers (Poltronieri et al., 2021).
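Those principles can be compressed into a minimal experiment loop: verify the steady-state hypothesis, inject a fault, verify again, and always roll back to contain the blast radius. The following is a hedged Python sketch using toy stand-ins; the `measure`, `inject_fault`, and `rollback` callables and the error-rate threshold are illustrative assumptions, not part of any cited tool.

```python
def steady_state_ok(error_rate, threshold=0.01):
    """Steady-state hypothesis: the observed error rate stays below a
    threshold (an assumed example metric)."""
    return error_rate < threshold

def run_experiment(measure, inject_fault, rollback):
    """Minimal chaos-experiment loop: check steady state, inject a
    fault, re-check, and always roll back to limit the blast radius."""
    if not steady_state_ok(measure()):
        return "aborted: system not in steady state"
    inject_fault()
    try:
        passed = steady_state_ok(measure())
    finally:
        rollback()  # contain the blast radius even if measurement fails
    return "passed" if passed else "failed: resilience gap found"

# Toy stand-ins for a real system (hypothetical values):
state = {"error_rate": 0.001}
result = run_experiment(
    measure=lambda: state["error_rate"],
    inject_fault=lambda: state.update(error_rate=0.002),  # system absorbs the fault
    rollback=lambda: state.update(error_rate=0.001),
)
# result == "passed"
```

A "failed" result is still a success for the experiment: it surfaces a resilience gap before customers find it.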
Factors at Play
Several key factors
supported the accidental discovery of Chaos Engineering. First were
technological forces resulting from the critical server failures that the
Netflix team experienced as they transitioned to a cloud-based architecture.
This was the immediate driver for their reflection on the tension between
system reliability and increasing system complexity. Second was the
organizational force reflected in Netflix’s culture of innovation and
experimentation, which created the environment that allowed the engineers to
take the risk of purposefully injecting errors into their systems. At the same
time, several factors could have impeded its development. Financial
considerations in many organizations limit the amount of time and investment
that is applied to the development of unproven solutions, especially ideas that
create the risk of service disruption. Structural limitations associated with
the infrastructure required to deliver fault tolerant services, including
redundant hardware, software, systems, and locations, could have also acted as
a limiting factor.
Conclusion
Accidental discoveries have a rich history in science,
technology, and pop culture, often driven by the combination of unexpected
events and unforeseen opportunities. The concept of Chaos Engineering, born out
of Netflix's unlucky encounter with server failure, is one example of such an
accidental discovery that applies to the professional world that I work in.
Chaos Engineering has fundamentally changed how organizations stress-test
large-scale, distributed systems: it assumes that failures will occur and
tests for them in live systems to understand and build resilience despite
increasing complexity. Netflix's journey highlights how a single critical error
in 2008 not only emphasized the need for more resilient systems but also led to
the realization that continuous testing through controlled failures was the key
to gaining a deep understanding of their system.
References
Basiri, A., Behnam, N., De Rooij, R., Hochstein, L., Kosewski, L., Reynolds, J., & Rosenthal, C. (2016). Chaos engineering. IEEE Software, 33(3), 35–41.
Gray, J. (1986). Why do computers stop and what can be done about it? Symposium on Reliability in Distributed Software and Database Systems, 3–12.
Netflix Technology. (2015, September 25). Chaos engineering upgraded. Medium. https://netflixtechblog.com/chaos-engineering-upgraded-878d341f15fa
Poltronieri, F., Tortonesi, M., & Stefanelli, C. (2021). ChaosTwin: A chaos engineering and digital twin approach for the design of resilient IT services. 2021 17th International Conference on Network and Service Management (CNSM), 234–238.
Principles of chaos engineering. (2019, March). https://principlesofchaos.org/
Thelin, R. (2021, April 23). Bytesize: 2 accidental discoveries that changed programming. Educative. https://www.educative.io/blog/bytesize-accidental-discoveries-in-programming