
Episode 191
Focusing on Data Science & Less on Engineering and Dependencies
The Real Python Podcast · Real Python
February 9, 2024 · 1h 1m
Show Notes
<p>How do you manage the dependencies of a large-scale data science project? How do you migrate that project from a laptop to cloud infrastructure or utilize GPUs and multiple instances in parallel? This week on the show, Savin Goyal returns to discuss the updates to the open-source framework Metaflow.</p>
<p>Savin briefly describes the Metaflow platform and its goal of reducing engineering overhead for data scientists and programmers. We discuss how the platform captures snapshots of a project as you work, allowing you to go back in time or share the state of your project with another team member.</p>
<p>We dig into the complicated process of managing dependencies for machine learning and data science projects. Savin describes how the required external libraries can be specified within a flow with the new <code>@pypi</code> or <code>@conda</code> decorators. This allows a project to scale from a local machine to the cloud or multiple instances with all dependencies included.</p>
<p>He talks about starting a new company, Outerbounds, with co-workers from Netflix. Their vision is to continue building the Metaflow open-source platform and to offer customers scalable, enterprise-grade infrastructure.</p>
<p>This week’s episode is brought to you by Intel.</p>
<div class="alert alert-primary" role="alert">
<p><strong>Course Spotlight:</strong> <a href="https://realpython.com/courses/packaging-with-pyproject-toml/">Everyday Project Packaging With <code>pyproject.toml</code></a> </p>
<p>In this Code Conversation video course, you’ll learn how to package your everyday projects with <code>pyproject.toml</code>. Playing on the same team as the import system means you can call your project from anywhere, ensure consistent imports, and have one file that’ll work for many build systems.</p>
</div>
<p>Topics:</p>
<ul>
<li>00:00:00 – Introduction </li>
<li>00:02:25 – Update on Metaflow </li>
<li>00:04:13 – What is Outerbounds? </li>
<li>00:07:26 – An ML platform to serve data scientists' needs </li>
<li>00:13:02 – Dependency reproducibility via <code>@conda</code> and <code>@pypi</code> decorators</li>
<li>00:26:18 – Sponsor: Intel </li>
<li>00:27:10 – Storing lock files along with snapshots </li>
<li>00:29:17 – Working alongside code and dependency management systems</li>
<li>00:34:03 – Scaling a project from laptop to the cloud </li>
<li>00:40:13 – Video Course Spotlight </li>
<li>00:41:41 – Getting visibility on processes </li>
<li>00:47:23 – Adjusting your project due to GPU availability </li>
<li>00:52:27 – Example of jumping back into a project one year later </li>
<li>00:55:54 – What are you excited about in the world of Python? </li>
<li>00:57:39 – What do you want to learn next? </li>
<li>00:59:35 – How can people follow your work online? </li>
<li>01:00:19 – Thanks and goodbye </li>
</ul>
<p>Show Links:</p>
<ul>
<li><a href="https://metaflow.org/">Metaflow - a framework for real-life ML, AI, and data science</a></li>
<li><a href="https://outerbounds.com/">Infrastructure for ML, AI, and Data Science - Outerbounds</a></li>
<li><a href="https://www.youtube.com/watch?v=KGpg8jwAda4">Human-Friendly, Production-Ready Data Science with Metaflow - Savin Goyal | SciPy 2022 - YouTube</a></li>
<li><a href="https://realpython.com/podcasts/rpp/61/">Episode #61: Scaling Data Science and Machine Learning Infrastructure Like Netflix – The Real Python Podcast</a></li>
<li><a href="https://outerbounds.com/blog/pypi-announcement/">New in Metaflow: The Long-Awaited <code>@pypi</code> Decorator - Outerbounds</a></li>
<li><a href="https://docs.metaflow.org/scaling/dependencies">Managing Dependencies - Metaflow Docs</a></li>
<li><a href="https://outerbounds.com/blog/secure-ml-secure-software-dependencies/">Secure ML with Secure Software Dependencies - Outerbounds</a></li>
<li><a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">Directed acyclic graph (DAG) - Wikipedia article</a></li>
<li><a href="https://docs.metaflow.org/metaflow/visualizing-results">Visualizing Results - Metaflow Docs</a></li>
<li><a href="https://outerbounds.com/blog/seamless-data-pipelines-airflow-metaflow/">Seamless Data and ML Pipelines with Airflow and Metaflow - Outerbounds</a></li>
<li><a href="https://realpython.com/podcasts/rpp/142/">Episode #142: Orchestrating Large and Small Projects With Apache Airflow – The Real Python Podcast</a></li>
<li><a href="https://twitter.com/SavinGoyal">Savin (@SavinGoyal) - X</a></li>
<li><a href="https://www.linkedin.com/in/savingoyal/">Savin Goyal - LinkedIn</a></li>
<li><a href="https://outerbounds.com/blog/">Building the ML-driven future - Outerbounds Blog</a></li>
</ul>
<p>Level up your Python skills with our expert-led courses:</p>
<ul>
<li><a href="https://realpython.com/courses/packaging-with-pyproject-toml/">Everyday Project Packaging With pyproject.toml</a></li>
<li><a href="https://realpython.com/courses/data-pandas-concat-and-merge/">Combining Data in pandas With concat() and merge()</a></li>
<li><a href="https://realpython.com/courses/python-histograms/">Histogram Plotting in Python: NumPy, Matplotlib, Pandas &amp; Seaborn</a></li>
</ul>
<p><a rel="payment" href="https://realpython.com/join">Support the podcast &amp; join our community of Pythonistas</a></p>