171: Machine Learning Pipelines Are Still Data Pipelines with Sandy Ryza of Dagster

This week on The Data Stack Show, Eric and Kostas chat with Sandy Ryza, Lead Engineer at Dagster. Sandy shares insights on data cleaning, data engineering processes, and the need for better tools. He introduces Dagster, an orchestrator built around assets such as tables, datasets, and machine learning models, and contrasts it with traditional workflow systems. He also explains Dagster’s integration with dbt and explores the changing dynamics in data roles, the impact of modern tooling, the potential for increased creativity in the field, and more.

The Data Stack Show

January 3, 2024 · 55m 50s

Show Notes

Highlights from this week’s conversation include:

  • The role of an orchestrator in the lifecycle of data (1:34)
  • Relevance of orchestration in data pipelines (2:45)
  • Changes around data ops and MLOps (3:37)
  • Data Cleaning (11:42)
  • Overview of Dagster (13:50)
  • Assets vs Tasks in Data Pipeline (19:15)
  • Building a Data Pipeline with Dagster (25:40)
  • Difference between Data Asset and Materialized Dataset (28:28)
  • Defining Lineage and Data Assets in Dagster (29:32)
  • The boundaries of software and organizational structures (37:25)
  • The benefits of a unified orchestration framework (39:56)
  • Orchestration in the development phase (45:29)
  • The emergence of the analytics engineer role (51:53)
  • Fluidity in data pipeline and infrastructure roles (52:40)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we’ll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

