Focusing on Data Science & Less on Engineering and Dependencies
Episode 191

The Real Python Podcast · Real Python

February 9, 2024 · 1h 1m

Show Notes

<p>How do you manage the dependencies of a large-scale data science project? How do you migrate that project from a laptop to cloud infrastructure or use GPUs and multiple instances in parallel? This week on the show, Savin Goyal returns to discuss the updates to the open-source framework Metaflow.</p>
<p>Savin briefly describes the Metaflow platform and its goal of reducing engineering overhead for data scientists and programmers. We discuss how the platform captures snapshots of a project as you work, allowing you to go back in time or share the state of your project with another team member.</p>
<p>We dig into the complicated process of managing dependencies for machine learning and data science projects. Savin describes how the required external libraries can be specified within a flow with the new <code>@pypi</code> or <code>@conda</code> decorators. This allows a project to scale from a local machine to the cloud or multiple instances with all dependencies included.</p>
<p>He talks about starting a new company, Outerbounds, with co-workers from Netflix. Their vision is to continue to build the Metaflow open-source platform and offer customers scalable enterprise-grade infrastructure.</p>
<p>This week&rsquo;s episode is brought to you by Intel.</p>
<div class="alert alert-primary" role="alert">
<p><strong>Course Spotlight:</strong> <a href="https://realpython.com/courses/packaging-with-pyproject-toml/">Everyday Project Packaging With <code>pyproject.toml</code></a></p>
<p>In this Code Conversation video course, you&rsquo;ll learn how to package your everyday projects with <code>pyproject.toml</code>. Playing on the same team as the import system means you can call your project from anywhere, ensure consistent imports, and have one file that&rsquo;ll work for many build systems.</p>
</div>
<p>Topics:</p>
<ul>
<li>00:00:00 &ndash; Introduction</li>
<li>00:02:25 &ndash; Update on Metaflow</li>
<li>00:04:13 &ndash; What is Outerbounds?</li>
<li>00:07:26 &ndash; An ML platform to serve data scientists&rsquo; needs</li>
<li>00:13:02 &ndash; Dependency reproducibility via <code>@conda</code> and <code>@pypi</code> decorators</li>
<li>00:26:18 &ndash; Sponsor: Intel</li>
<li>00:27:10 &ndash; Storing lock files along with snapshots</li>
<li>00:29:17 &ndash; Working alongside code and dependency management systems</li>
<li>00:34:03 &ndash; Scaling a project from laptop to the cloud</li>
<li>00:40:13 &ndash; Video Course Spotlight</li>
<li>00:41:41 &ndash; Getting visibility on processes</li>
<li>00:47:23 &ndash; Adjusting your project due to GPU availability</li>
<li>00:52:27 &ndash; Example of jumping back into a project one year later</li>
<li>00:55:54 &ndash; What are you excited about in the world of Python?</li>
<li>00:57:39 &ndash; What do you want to learn next?</li>
<li>00:59:35 &ndash; How can people follow your work online?</li>
<li>01:00:19 &ndash; Thanks and goodbye</li>
</ul>
<p>Show Links:</p>
<ul>
<li><a href="https://metaflow.org/">Metaflow - a framework for real-life ML, AI, and data science</a></li>
<li><a href="https://outerbounds.com/">Infrastructure for ML, AI, and Data Science - Outerbounds</a></li>
<li><a href="https://www.youtube.com/watch?v=KGpg8jwAda4">Human-Friendly, Production-Ready Data Science with Metaflow- Savin Goyal | SciPy 2022 - YouTube</a></li>
<li><a href="https://realpython.com/podcasts/rpp/61/">Episode #61: Scaling Data Science and Machine Learning Infrastructure Like Netflix – The Real Python Podcast</a></li>
<li><a href="https://outerbounds.com/blog/pypi-announcement/">New in Metaflow: The Long-Awaited <code>@pypi</code> Decorator - Outerbounds</a></li>
<li><a href="https://docs.metaflow.org/scaling/dependencies">Managing Dependencies - Metaflow Docs</a></li>
<li><a href="https://outerbounds.com/blog/secure-ml-secure-software-dependencies/">Secure ML with Secure Software Dependencies - Outerbounds</a></li>
<li><a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">Directed acyclic graph (DAG) - Wikipedia article</a></li>
<li><a href="https://docs.metaflow.org/metaflow/visualizing-results">Visualizing Results - Metaflow Docs</a></li>
<li><a href="https://outerbounds.com/blog/seamless-data-pipelines-airflow-metaflow/">Seamless Data and ML Pipelines with Airflow and Metaflow - Outerbounds</a></li>
<li><a href="https://realpython.com/podcasts/rpp/142/">Episode #142: Orchestrating Large and Small Projects With Apache Airflow – The Real Python Podcast</a></li>
<li><a href="https://twitter.com/SavinGoyal">Savin (@SavinGoyal) - X</a></li>
<li><a href="https://www.linkedin.com/in/savingoyal/">Savin Goyal - LinkedIn</a></li>
<li><a href="https://outerbounds.com/blog/">Building the ML-driven future - Outerbounds Blog</a></li>
</ul>
<p>Level up your Python skills with our expert-led courses:</p>
<ul>
<li><a href="https://realpython.com/courses/packaging-with-pyproject-toml/">Everyday Project Packaging With pyproject.toml</a></li>
<li><a href="https://realpython.com/courses/data-pandas-concat-and-merge/">Combining Data in pandas With concat() and merge()</a></li>
<li><a href="https://realpython.com/courses/python-histograms/">Histogram Plotting in Python: NumPy, Matplotlib, Pandas &amp; Seaborn</a></li>
</ul>
<p><a rel="payment" href="https://realpython.com/join">Support the podcast &amp; join our community of Pythonistas</a></p>