Episode 1341

Great Expectations: Data Pipeline Testing with Abe Gong

Software Engineering Daily · softwareengineeringdaily.com

February 17, 2020 · 1h 4m


Show Notes

A data pipeline is a series of steps that takes large data sets and creates usable results from them. At the beginning of a data pipeline, a data set might be pulled from a database, a distributed file system, or a Kafka topic. Throughout a data pipeline, different data sets are joined, filtered, and statistically analyzed.
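As a rough illustration of those steps, here is a minimal sketch of a toy pipeline in plain Python. The in-memory lists stand in for data pulled from a database or Kafka topic, and every name in it (`users`, `orders`, `join_orders_to_users`, `filter_by_country`, `average_amount`) is a hypothetical example, not something from the episode.

```python
from statistics import mean

# Hypothetical in-memory stand-ins for data pulled from a database or Kafka topic.
users = [
    {"user_id": 1, "country": "US"},
    {"user_id": 2, "country": "DE"},
]
orders = [
    {"user_id": 1, "amount": 30.0},
    {"user_id": 1, "amount": 50.0},
    {"user_id": 2, "amount": 20.0},
    {"user_id": 3, "amount": 99.0},  # no matching user, so the join drops it
]


def join_orders_to_users(orders, users):
    """Join step: attach user fields to each order that has a known user."""
    by_id = {u["user_id"]: u for u in users}
    return [{**o, **by_id[o["user_id"]]} for o in orders if o["user_id"] in by_id]


def filter_by_country(rows, country):
    """Filter step: keep only rows for one country."""
    return [r for r in rows if r["country"] == country]


def average_amount(rows):
    """Statistical step: mean order amount over the remaining rows."""
    return mean(r["amount"] for r in rows)


joined = join_orders_to_users(orders, users)
us_rows = filter_by_country(joined, "US")
us_average = average_amount(us_rows)
```

Note that the join silently drops the order for `user_id` 3. Nothing in the pipeline reports it, which is the kind of quiet data loss that motivates testing each stage.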

At the end of a data pipeline, data might be put into a data warehouse or Apache Spark for ad-hoc analysis and data science. At this point, the end-user of the data set expects that data to be clean and accurate. But what guarantees do we have that it is correct?

Abe Gong is the creator of Great Expectations, a system for data pipeline testing. In Great Expectations, the developer creates tests called “expectations”, which verify certain characteristics of the data set at different phases in a data pipeline. This helps ensure that the end result of a multi-stage data pipeline is correct.
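The idea of an expectation can be sketched in plain Python as a function that checks one property of a data set and reports whether it held. This is only an illustration under stated assumptions: the helpers below (`expect_not_null`, `expect_between`, `validate_stage`) are hypothetical and defined here, and they are not calls into the Great Expectations library or a description of its API.

```python
def expect_not_null(rows, column):
    """Hypothetical expectation: every row has a non-null value in `column`."""
    unexpected = [r for r in rows if r.get(column) is None]
    return {"check": f"{column} not null", "success": not unexpected,
            "unexpected_count": len(unexpected)}


def expect_between(rows, column, min_value, max_value):
    """Hypothetical expectation: non-null values in `column` fall in a range."""
    unexpected = [
        r for r in rows
        if r.get(column) is not None
        and not (min_value <= r[column] <= max_value)
    ]
    return {"check": f"{column} in [{min_value}, {max_value}]",
            "success": not unexpected, "unexpected_count": len(unexpected)}


def validate_stage(stage_name, rows, expectations):
    """Run every expectation on one stage's output; raise if any fails."""
    results = [expect(rows) for expect in expectations]
    failed = [r for r in results if not r["success"]]
    if failed:
        raise ValueError(f"stage {stage_name!r} failed: {failed}")
    return results


# Expectations for a hypothetical "orders" stage of a pipeline.
order_expectations = [
    lambda rows: expect_not_null(rows, "user_id"),
    lambda rows: expect_between(rows, "amount", 0, 10_000),
]

clean_rows = [{"user_id": 1, "amount": 30.0}, {"user_id": 2, "amount": 20.0}]
results = validate_stage("orders", clean_rows, order_expectations)
```

Calling `validate_stage` after each phase of a pipeline means a bad batch stops at the stage that produced it, rather than surfacing later as a wrong number in a report.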

Abe joins the show to discuss the architecture of a data pipeline and the use cases of Great Expectations.
