Infrastructure Monitoring with Mark Carter
Episode 874

Software Engineering Daily · softwareengineeringdaily.com

August 14, 2018 · 51m 57s

Show Notes

At Google, the job of a site reliability engineer involves building tools to automate infrastructure operations. If a server crashes, there is automation in place to create a new server. If a service starts to receive heavy traffic, there is automation in place to scale up the instances of that service.

In order to create an automated response to an infrastructure problem, a site reliability engineer needs insights into that infrastructure. Every service needs tools around monitoring, alerting, debugging, and distributed tracing.
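As a rough illustration of how monitoring data can drive the automated responses described above (replacing a crashed server, scaling up under load), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Instance` type, the `plan_actions` function, and the per-instance traffic target are hypothetical and are not part of Google's internal tooling or of Stackdriver.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical snapshot of one server, as a monitoring system might report it.
    name: str
    healthy: bool
    requests_per_second: float


def plan_actions(instances, target_rps_per_instance=100.0):
    """Decide what the automation should do after one monitoring pass.

    Returns a list of (action, detail) tuples. The target of 100 requests
    per second per instance is an arbitrary illustrative default.
    """
    actions = []

    # Crashed servers: plan one replacement for each unhealthy instance.
    for inst in instances:
        if not inst.healthy:
            actions.append(("replace", inst.name))

    # Heavy traffic: size the fleet so that each instance stays at or
    # below the target load, and add instances if the fleet is too small.
    total_rps = sum(i.requests_per_second for i in instances if i.healthy)
    desired = max(1, math.ceil(total_rps / target_rps_per_instance))
    if desired > len(instances):
        actions.append(("scale_up", desired - len(instances)))

    return actions


fleet = [
    Instance("web-1", healthy=True, requests_per_second=250.0),
    Instance("web-2", healthy=False, requests_per_second=0.0),
]
print(plan_actions(fleet))
```

With this sample fleet, the sketch plans a replacement for `web-2` and one additional instance. The point is only that the decision depends entirely on the health and load signals, which is why the monitoring has to exist before the automation can.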

One benefit of working at a large company like Google is that an engineer building a new product gets this kind of tooling by default. By contrast, if I am hacking on a project at home, I have to set up all kinds of tools to help me diagnose and resolve problems. Setting up this tooling takes time and requires expertise.

Stackdriver is a set of tools and instrumentation that allows developers to monitor, debug, and inspect infrastructure. Stackdriver is based on the internal observability tools built for Google. Mark Carter is a group product manager at Google, and he joins the show to discuss site reliability engineering and the creation of Stackdriver.