Episode 6
Oliver Leaver-Smith - On how "just a monitoring change" took down the entire site and resilience engineering - #5
Software Misadventures · Ronak Nathani, Austin Ouyang, Guang Yang
February 19, 20211h 1m
Audio is streamed directly from the publisher (traffic.libsyn.com) as published in their RSS feed. Play Podcasts does not host this file. Rights-holders can request removal through the copyright & takedown page.
Show Notes
Oliver Leaver-Smith, better known as Ols, is a Senior Devops Engineer at Sky Betting and Gaming. In this episode, we discuss how a seemingly simple monitoring change ended up taking down the entire site. We also talk about chaos and resilience engineering. We discuss how the team at Sky Betting and Gaming conducts fire drills (chaos engineering exercises) where they not only test the resiliency of their software systems but also their people systems. We walk through a recent example of a fire drill, how they have evolved over the past few years and the lessons learned in the process.