The Road to Chaos … as a Service

Author: Siddon Tang (Chief Engineer at PingCAP)

In the world of distributed computing, workload clusters can fail unpredictably at any time and any location. This has long been a challenge for enterprises with a massive web presence that must support a large volume of real-time customer transactions, such as streaming content services, e-commerce companies and financial services firms. But as the pandemic accelerated digital transformation and drove customers online, many smaller organizations now also rely on distributed computing deployments.

To better identify vulnerabilities and improve resilience in these environments, Netflix pioneered the concept of chaos engineering in the early 2010s. Chaos engineering is a process for testing a highly distributed computing platform’s ability to withstand random disruptions, with the goal of improving its reliability and resilience.

While traditional fault testing already existed at the time and served a similar purpose, it is limited to targeting specific points in the system that are anticipated to be vulnerable. Fault testing doesn’t allow you to assess and uncover hidden weaknesses, which often cause the most devastating failures. On the other hand, chaos engineering enables you to discover those “unknown unknowns.” By accounting for randomness and the unexpected, chaos engineering deepens your knowledge of the system being tested and unearths new information.
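The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the component names, fault names, and both functions are hypothetical and do not come from any particular chaos engineering tool. Fault testing replays a fixed list of anticipated weak points, while the chaos plan samples at random from every component and fault combination.

```python
import random

# Hypothetical component and fault names, for illustration only.
COMPONENTS = ["api-gateway", "auth", "database", "cache", "queue"]
FAULTS = ["kill-process", "network-delay", "disk-full"]

# Fault testing: a fixed list of points someone anticipated would be fragile.
ANTICIPATED = [("database", "disk-full"), ("cache", "kill-process")]


def fault_test_plan():
    """Return only the anticipated (component, fault) pairs."""
    return list(ANTICIPATED)


def chaos_plan(rounds, seed=None):
    """Return `rounds` randomly chosen (component, fault) pairs.

    Every combination is a candidate, including ones nobody thought
    to test. Passing a seed makes a run reproducible.
    """
    rng = random.Random(seed)
    return [(rng.choice(COMPONENTS), rng.choice(FAULTS)) for _ in range(rounds)]
```

Seeding the generator is a deliberate choice in this sketch: a random experiment that exposes a weakness is far more useful if it can be replayed exactly.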

Think about the significant Facebook outage in early October 2021 (which also affected Instagram, WhatsApp, and Oculus). The platform went offline for six hours, the company’s worst outage since 2019. Ultimately, Facebook determined the issue was caused by “configuration changes on the backbone routers that coordinate network traffic between our data centers.” Conditions like this are difficult to anticipate, but a good chaos engineering platform can simulate rare events like these and help businesses prepare for them.

Chaos engineering has been adopted by other major web companies since it was introduced, but it hasn’t caught on with organizations running sub-hyperscale deployments, which often lack the resources to adopt it. That will change within the next couple of years as a new phase of chaos engineering begins to emerge: chaos-as-a-service.

Like many other XaaS movements, chaos-as-a-service will be a democratizing force that gives many more organizations access to a valuable but complex technology. And in this case, that democratization is badly needed. Chaos-as-a-service will ultimately enable enterprises that aren’t running at the scale of Netflix or Facebook to leverage chaos engineering.

Chaos-as-a-service will democratize chaos engineering

Chaos-as-a-service will provide a quick, simple method for organizations of all sizes to run chaos experiments and test their systems’ resiliency. Matt Fornaciari, co-founder of Gremlin, outlined the essentials, explaining that chaos-as-a-service should provide users “intuitive UI, customer support, out-of-the-box integrations, and everything else you need to get experimenting in a matter of minutes.”

Those are the basics. Digging a little deeper, to achieve sufficient ease of use, chaos-as-a-service must deliver the following four things:

A unified console for management, where engineers can edit the configuration and create chaos experiments.
Visualized metrics that allow engineers to see an experiment’s status.
Operations to pause or archive experiments.
Simple interaction, allowing engineers to easily drag and drop the objects to orchestrate their experiments.
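To make the first three items concrete, here is a minimal Python sketch of what a console's data model might look like. Everything in it is an assumption made for illustration: the `Console` and `Experiment` classes, the three statuses, and the method names are hypothetical and do not describe the API of Chaos Mesh, Gremlin, or any other product.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    ARCHIVED = "archived"


@dataclass
class Experiment:
    """A single chaos experiment (hypothetical model)."""
    name: str
    config: dict = field(default_factory=dict)
    status: Status = Status.RUNNING

    def pause(self):
        if self.status is not Status.RUNNING:
            raise ValueError(f"cannot pause an experiment that is {self.status.value}")
        self.status = Status.PAUSED

    def resume(self):
        if self.status is not Status.PAUSED:
            raise ValueError(f"cannot resume an experiment that is {self.status.value}")
        self.status = Status.RUNNING

    def archive(self):
        # Archiving is terminal in this sketch: nothing leaves ARCHIVED.
        if self.status is Status.ARCHIVED:
            raise ValueError("experiment is already archived")
        self.status = Status.ARCHIVED


class Console:
    """One place to create, edit, and inspect experiments."""

    def __init__(self):
        self._experiments = {}

    def create(self, name, **config):
        if name in self._experiments:
            raise KeyError(f"experiment {name!r} already exists")
        experiment = Experiment(name, dict(config))
        self._experiments[name] = experiment
        return experiment

    def get(self, name):
        return self._experiments[name]

    def edit_config(self, name, **changes):
        self._experiments[name].config.update(changes)

    def status_summary(self):
        """Count experiments per status: the raw data behind a status chart."""
        summary = {status.value: 0 for status in Status}
        for experiment in self._experiments.values():
            summary[experiment.status.value] += 1
        return summary
```

A front end could render `status_summary()` as the visualized metrics and map drag-and-drop actions onto `create`, `pause`, and `archive`. The sketch rejects invalid transitions, such as pausing an archived experiment, so that the console never shows a state the experiment cannot actually be in.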

In short, chaos engineering platforms need simpler management, expanded visibility, and smoother operational capabilities. Current chaos engineering offerings fall short in these categories and lack the features detailed above. Thus, there are no true chaos-as-a-service platforms out there today; however, work is being done to make it a reality.

Chaos Mesh: moving toward chaos-as-a-service

Chaos Mesh is an open source, cloud native chaos engineering platform that orchestrates chaos experiments on Kubernetes environments. The platform features all-around fault injection methods for complex, distributed systems, covering faults in pods, networks, file systems, operating systems, JVM applications, and even the kernel. It was designed to be easy to use, to scale well, and to cover a wide variety of failure types. Thanks to increasing support from a diverse community of users, Chaos Mesh joined the Cloud Native Computing Foundation (CNCF) as a sandbox project.

With its open source background and focus on simplicity, Chaos Mesh was designed to be broadly accessible. The technology has been adopted by both smaller and larger enterprises (including Azure, which employs Chaos Mesh to allow users to inject faults into Azure Kubernetes Service clusters). Chaos Mesh’s ease of use has positioned it well to support some chaos-as-a-service functionality in the next two to three years. During that time, the community will work on making some key improvements:

Better usability: Some Chaos Mesh features are complicated to use. For example, when you apply a chaos experiment, you often have to manually check whether the experiment has started.
Beyond Kubernetes: Chaos Mesh is employed mostly for Kubernetes environments. Though Chaos Mesh supports running chaos experiments on physical machines, the features are quite limited, and command-line usage is not user-friendly.
Increased customization: The platform doesn’t currently allow plugins. To apply a customized chaos experiment, you have to alter the source code. In addition, Chaos Mesh currently supports only Golang.

Like other chaos engineering platforms, Chaos Mesh needs some work before it can offer chaos-as-a-service. But the technology is uniquely suited for chaos-as-a-service and already delivers the fundamentals.

What is next for chaos-as-a-service?

Led by open source efforts like Chaos Mesh, chaos-as-a-service will bring chaos engineering to the masses. It will take another two years or so to get there, but the wait will be worth it.

By leveraging chaos-as-a-service, organizations of all sizes will be equipped to improve the resiliency and reliability of their distributed computing deployments. That will prove essential in an era of widespread digital transformation.

This article was first published on The New Stack.
