Incident Reproduction with Tammy Butow
Episode 1256

Software Engineering Daily · softwareengineeringdaily.com

October 16, 2019 · 1h 4m

Show Notes

Databases go offline. Services fail to scale up. Deployment errors can cause an application backend to get DDoS’d.

An event that prevents your company from operating as expected is known as an incident. Software teams respond to an incident by issuing a fix. Sometimes that fix returns the software to its ideal state. Other times the software remains in a degraded state, and it takes further fixes to restore it to where it should be.

One way that a software team can learn from an incident is through incident reproduction. When an incident is turned into a reproducible system, it becomes a predictable training exercise rather than a surprising and painful outage.

Tammy Butow is an engineer with Gremlin, a company that makes chaos engineering software. Chaos engineering is the process of creating controlled experiments that simulate outages. Tammy joins the show to discuss common incident types, and how those can be made reproducible for training exercises.