84: Trust But Canary: Configuration Safety at Scale
Episode 84

Meta Tech Podcast

April 2, 2026 · 37m 8s

Show Notes

Have you ever wondered how Meta makes config rollouts safe at scale? In this episode, Pascal sits down with Ishwari and Joe to discuss Meta's approach to propagating changes across services in seconds, and why that speed increases the need for strong safeguards. Catch the episode to learn about canarying and progressive rollouts, the health checks and monitoring signals used to catch regressions early, and how incident reviews focus on improving systems rather than blaming people. We also hear how data and early AI/ML are slashing alert noise and speeding up bisecting when something goes wrong.

Got feedback? Send it to us on Threads (https://threads.net/@metatechpod) or Instagram (https://instagram.com/metatechpod), and don't forget to follow our host Pascal (https://mastodon.social/@passy, https://threads.net/@passy_). Fancy working with us? Check out https://www.metacareers.com/.

Timestamps

  • Intro 0:06

  • Introduction and Overview of Configuration Changes 2:31

  • Understanding Configurations in Distributed Systems 4:02

  • Meta's Configuration Management Systems 6:43

  • Safeguards and Incident Prevention 9:22

  • Deployment Mechanisms: Canary and Progressive Rollouts 12:06

  • Challenges in Configuration Consumption 14:39

  • Health Checks and Incident Response 17:13

  • Mitigation Strategies for Configuration Issues 19:18

  • Balancing Developer Velocity and Configuration Safety 21:09

  • Data-Driven Improvements in Incident Management 22:12

  • Leveraging AI for Change Detection 26:05

  • Challenges in Deployment and Testing 28:21

  • Reinventing Change Safety Strategies 30:24

  • War Stories: Learning from Past Incidents 32:59

  • Outro 36:10