The What and Why of Chaos Engineering
Netflix alumnus and serial entrepreneur Casey Rosenthal recently hosted a lunch-and-learn on chaos engineering at the Skillz San Francisco office. His work on chaos engineering at Netflix has established him as a leading voice in the field. Rosenthal is currently the CEO and cofounder of Verica. As an executive manager and senior architect, he leads teams that tackle big data, architect solutions to difficult problems, and train others to do the same. In this post, we explore the principles and benefits of chaos engineering.
Modern software systems are complex, and they face ever-growing demand for both reliability and new features. Chaos engineering, a relatively new discipline, helps address that complexity and those demands. It can be thought of as the facilitation of experiments to uncover systemic weaknesses. It provides real-world insight into how services behave in production and surfaces problems that result from the interaction of the numerous components in a distributed system.
Chaos Engineering Defined
Chaos engineering seeks to identify the “chaos” inherent in a complex system. Rather than reactively fixing bugs and addressing failures when they occur, the chaos approach runs proactive, hypothesis-driven experiments to uncover systemic vulnerabilities before they affect customers. It improves a system’s ability to keep operating and to provide a positive experience for end users even when issues arise.
Scaling complex systems is hard, and many misconceptions and myths surround it. Rosenthal listed a few of them, including the common assumption that you can make your system more robust by removing the people who cause accidents. Accidents, however, are rarely, if ever, caused by a single person. While some team members may be involved in more accidents than others, this is more likely due to resource constraints, communication breakdowns, or inadequate training than to an individual’s shortcomings. Moreover, removing a team member who is vital to the system’s health often leaves the system worse off. Rather than blaming individuals, chaos engineering focuses on proactively testing the system to support growth and stability.
The Principles of Chaos Engineering
Chaos engineering brings all system stakeholders together to intentionally “break” areas of the system through well-planned experiments, with the goal of mitigating the impact of failures. According to Rosenthal’s principles of chaos, you should:
- Define steady state behavior to build a hypothesis. Look for measurable output that combines customer experience and operational metrics (as opposed to focusing on internal attributes of the system). This will represent the system’s steady state. Then, build a hypothesis around the aspects of the system you believe are resilient.
- Simulate real-world events. Introduce variables that mirror realistic events like servers dying, malformed responses, and spikes in traffic. Anything that might disrupt steady state is a viable variable to experiment with.
- Carry out experiments in production. This shows in real time how your system responds to change and how failures and outages affect customers.
- Run experiments continuously. Automating chaos experiments saves time and allows teams to focus on building out new features. It also deepens engineers’ knowledge of the system, so they can detect and tackle issues faster with little consumer impact.
- Minimize customer impact. When running chaos experiments, it’s the engineer’s responsibility to limit the blast radius and ensure customers aren’t greatly impacted.
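To make these principles concrete, here is a minimal sketch of a single experiment in Python. It is an illustration only, not Rosenthal's or Netflix's tooling: the names `run_experiment`, `resilient_service`, and `fragile_service` are hypothetical, the "service" is a simulated function rather than real production traffic, and the thresholds are arbitrary assumptions. The steady-state metric is the request success rate, the fault is a simulated dependency outage, and the blast radius is capped both by the share of requests exposed and by an early abort.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    control_rate: float      # steady-state metric without the fault
    experiment_rate: float   # same metric with the fault injected
    exposed_requests: int    # how many requests actually saw the fault
    aborted: bool            # True if the safety stop fired
    hypothesis_held: bool    # True if steady state survived the fault


def run_experiment(
    serve: Callable[[int, bool], bool],
    total_requests: int = 1000,
    blast_radius: float = 0.05,   # assumed share of traffic exposed to the fault
    tolerance: float = 0.02,      # assumed allowed drop in success rate
    abort_below: float = 0.80,    # assumed success rate that triggers an abort
    min_sample: int = 10,         # requests to observe before judging an abort
) -> ExperimentResult:
    group_size = max(1, round(total_requests * blast_radius))

    # Control group: measure steady state with no fault injected.
    control_ok = sum(serve(i, False) for i in range(group_size))
    control_rate = control_ok / group_size

    # Experiment group: inject the fault, but stop early if users are hurt.
    experiment_ok = 0
    exposed = 0
    aborted = False
    for i in range(group_size, 2 * group_size):
        experiment_ok += serve(i, True)
        exposed += 1
        if exposed >= min_sample and experiment_ok / exposed < abort_below:
            aborted = True
            break

    experiment_rate = experiment_ok / exposed
    # Hypothesis: the fault does not move the steady-state metric beyond tolerance.
    hypothesis_held = (not aborted) and (control_rate - experiment_rate <= tolerance)
    return ExperimentResult(
        control_rate, experiment_rate, exposed, aborted, hypothesis_held
    )


def resilient_service(request_id: int, dependency_down: bool) -> bool:
    # Simulated service that falls back to a cached response during the outage.
    return True


def fragile_service(request_id: int, dependency_down: bool) -> bool:
    # Simulated service with no fallback: every request fails during the outage.
    return not dependency_down


if __name__ == "__main__":
    print(run_experiment(resilient_service))
    print(run_experiment(fragile_service))
```

With the simulated services above, the resilient one keeps its steady state across the whole experiment group, while the fragile one trips the abort after the first `min_sample` requests, so only a handful of requests ever see the fault. A real experiment would replace the simulated functions with live traffic and real metrics, and would run on a schedule rather than once.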
As systems grow in complexity, it’s important to incorporate continuous verification (CV) into the continuous integration and delivery (CI/CD) process. “Organizations do not have the time or resources to validate that the internal machinations of the system work as intended, so instead they verify that the output of the system is in line with expectations,” Rosenthal says of CV.
Benefits of Chaos Engineering
Chaos engineering helps uncover the unknowns in a complex system and address them in a controlled environment. It’s a proactive strategy that lets teams plan for potential problems, rather than scrambling to react to a failure in production at 10 p.m. on a Friday when most users are engaged.
Additionally, controlled chaos experiments help balance three important tradeoffs in engineering: economics, workload, and system safety. Ask yourself how to allocate money and resources, how your team is doing, and how much your servers can handle. By testing the system proactively within the boundaries of the team’s economic means and resources, it’s easier to maintain a healthy employee workload while ensuring the safety of the system.
A successful chaos approach empowers cultural growth across multiple teams. All stakeholders rally behind a common focus on avoiding system failures and building confidence in the future, resulting in happier, more efficient, and more engaged teams.
As Skillz strives to make mobile competition accessible for every player worldwide, we’re focused on building confidence in our increasingly dynamic microservices system to deliver the best user experience possible. With over 30 million users on the platform, reliability is paramount. Understanding our strengths and weaknesses allows us to proactively prepare as we scale and successfully tackle failures before they damage the experience of our customers.