Chaos engineering in serverless environments is more useful than you’d think

by Jeremy

Chaos engineering has gained a lot of traction over the last few years as it has spread from its origins at Netflix to more and more companies across the industry. Many development teams use it to prevent downtime: they try to break their systems on purpose so they can fix weaknesses before those weaknesses cause problems down the line. Given the resilient nature of serverless computing, which is backed by cloud providers’ uptime and availability agreements, it might seem that chaos engineering is one testing method that wouldn’t be practical in serverless. But Emrah Samdan, vice president of product for Thundra, believes that serverless computing and chaos engineering go well together.

Because the cloud vendor guarantees availability and scalability, the goal of chaos engineering in serverless environments is not necessarily to bring down the system but to find application-level failures caused by a lack of memory or time. “The purpose of chaos experiments is not to take the whole software down but to learn from failures by injecting small, controllable failures,” Samdan said.

Some of the most common examples of chaos engineering in serverless that Samdan sees are injecting latency into serverless functions to check that timeouts work correctly and injecting failures into third-party connections.
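The two experiments Samdan describes can be sketched in a few lines of Python. This is only an illustrative sketch, not Thundra's tooling or any vendor's API: the chaos decorator, the InjectedFailure exception, call_payment_api and handler are all hypothetical names invented for this example. The wrapper adds a fixed delay to every call and, with a configurable probability, raises an error in place of the third-party response.

```python
import random
import time
from functools import wraps


class InjectedFailure(Exception):
    """Raised by the chaos wrapper to simulate a failing dependency."""


def chaos(latency_s=0.0, failure_rate=0.0, rng=random.random, sleep=time.sleep):
    """Wrap a function so that each call is delayed and may fail.

    latency_s: seconds of artificial delay added before every call.
    failure_rate: probability in [0, 1] that the call raises InjectedFailure.
    rng and sleep are injectable so experiments can be made deterministic.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # injected latency
            if rng() < failure_rate:
                raise InjectedFailure(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@chaos(latency_s=0.2, failure_rate=0.1)
def call_payment_api(order_id):
    # Stand-in for a real third-party call (hypothetical).
    return {"order_id": order_id, "status": "paid"}


def handler(event, context=None):
    """Hypothetical function entry point with a fallback path."""
    try:
        return call_payment_api(event["order_id"])
    except InjectedFailure:
        # The experiment checks that this degraded path behaves sensibly.
        return {"order_id": event["order_id"], "status": "retry_later"}
```

Raising latency_s above the function's configured timeout is one way to check that the timeout actually fires, while raising failure_rate exercises the fallback path. In keeping with the idea of small, controllable failures, both values would normally start low.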

Samdan noted that defining the steady state is an essential first step of chaos engineering, and one that is often overlooked. “People just want to break things, but the first step is actually to understand how they work, what are the ups and downs of the system, what are the limits, how resilient is your system already,” he said.

He believes that determining this baseline is even more critical in serverless environments, because what counts as normal for serverless can be very different from what is normal in other systems. For example, both latency and the number of executions matter in serverless, which is less true of other systems.
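As a rough illustration of what recording such a baseline could look like, the sketch below summarizes a set of function durations into an execution count plus median and 95th-percentile latency, then checks a later run against it. The SteadyState class, the function names and the 20 percent tolerance are assumptions made for this example, not something Samdan or Thundra prescribes.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    invocations: int  # number of executions observed
    p50_ms: float     # median duration in milliseconds
    p95_ms: float     # 95th-percentile duration in milliseconds


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a sorted, non-empty list (pct is an int)."""
    # Integer ceiling division avoids floating-point rounding surprises.
    rank = max(1, -(-pct * len(sorted_values) // 100))
    return sorted_values[rank - 1]


def summarize(durations_ms):
    """Build a steady-state summary from raw durations."""
    ordered = sorted(durations_ms)
    return SteadyState(
        invocations=len(ordered),
        p50_ms=percentile(ordered, 50),
        p95_ms=percentile(ordered, 95),
    )


def within_baseline(baseline, observed, tolerance=0.2):
    """True if observed p95 latency stays within tolerance of the baseline."""
    return observed.p95_ms <= baseline.p95_ms * (1 + tolerance)
```

A team might capture the baseline during normal traffic, run an experiment, summarize the new durations the same way and compare the two. Execution counts are kept alongside latency because the article singles out both as relevant in serverless.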
