Pachyderm: Data Pipelines with Joe Doliner
Episode 1059


Software Engineering Daily · softwareengineeringdaily.com

February 11, 2019 · 1h 5m


Show Notes

Data infrastructure is advancing beyond the days of Hadoop MapReduce, single-node databases, and nightly reporting.

Companies are adopting modern data warehouses, streaming data systems, and cloud-specific data tools like BigQuery. Every company with a large amount of data wants to aggregate that data into a data lake and make it available to developers. All of this data can power machine learning models, which can potentially improve any area of a company where historical data exists.

“Data pipeline” is a term used to describe the process of preparing data, building machine learning models, deploying those models, and tracking the results of those models.

Pachyderm is a company and open source project that is focused on deployment, management, and scalability of data pipelines. Pachyderm allows developers to version data, track the state of data sets, backtest machine learning models, and collaborate on data. It also tackles the very hard problem of machine learning auditability.
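To make the idea concrete, here is a minimal sketch of what a Pachyderm pipeline specification looks like, following the shape of the project's published examples; the repo, image, and script names are illustrative. The spec tells Pachyderm to run a containerized transform over each file in an input data repository, and Pachyderm versions both the input and output data so results can be traced back to the exact inputs that produced them:

```json
{
  "pipeline": {
    "name": "edges"
  },
  "transform": {
    "image": "pachyderm/opencv",
    "cmd": ["python3", "/edges.py"]
  },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  }
}
```

The `glob` pattern controls how the input repo is split into units of work, which is how Pachyderm parallelizes a pipeline across its data.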

Joe Doliner is the CEO of Pachyderm and joins the show to discuss his experience building Pachyderm over the last five years. Data infrastructure has changed a lot in five years, and the world has moved in a direction that has benefitted Pachyderm, with more infrastructure moving to containers and more data teams advancing beyond a world of just Hadoop MapReduce.

In today’s show, Joe talks about modern infrastructure, data provenance, and the long-term vision of Pachyderm.