
Episode 142
Orchestrating Large and Small Projects With Apache Airflow
The Real Python Podcast · Real Python
January 27, 2023 · 54m 24s
Show Notes
<p>Have you worked on a project that needed an orchestration tool? How do you define the workflow of an entire data pipeline or a messaging system with Python? This week on the show, Calvin Hendryx-Parker is back to talk about using Apache Airflow and orchestrating Python projects.</p>
<p>Calvin is the co-founder and CTO of Six Feet Up and a Python Web Conference co-organizer. He’s recently been working on a massive project that requires thousands of jobs involving transferring and transforming data. Through his research into orchestration systems, he found Apache Airflow. </p>
<p>Airflow is an open-source tool to define, schedule, and monitor workflows. The platform is pure Python and integrates with a wide variety of services. We discuss how workflows are defined by creating directed acyclic graphs (DAGs).</p>
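<p>To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard library. It is not Airflow's API, and the task names (<code>extract</code>, <code>transform</code>, <code>load</code>, <code>notify</code>) and the <code>run_order</code> helper are hypothetical, chosen for illustration. It shows how declaring dependencies between tasks yields a valid execution order, and why the graph has to be acyclic.</p>

```python
from graphlib import TopologicalSorter

# Hypothetical task names for illustration; they are not from the episode.
# Each key maps a task to the set of tasks that must finish before it runs.
workflow = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_order(dag):
    """Return the tasks in an order that respects every dependency."""
    # TopologicalSorter raises graphlib.CycleError when the graph contains
    # a cycle, which is why a workflow graph must be acyclic.
    return list(TopologicalSorter(dag).static_order())


print(run_order(workflow))
```

<p>An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this kind of dependency graph. The sketch only covers the ordering step.</p>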
<p>Calvin talks about how a recent project outgrew the system and how his team built a clever solution using Python. We also discuss the upcoming Python Web Conference and what virtual attendees can expect.</p>
<div class="alert alert-primary" role="alert">
<p><strong>Course Spotlight:</strong> <a href="https://realpython.com/courses/python-basics-oop/">Python Basics: Object-Oriented Programming</a> </p>
<p>In this video course, you’ll get to know OOP, or object-oriented programming. You’ll learn how to create a class, use classes to create new objects, and instantiate classes with attributes.</p>
</div>
<p>Topics:</p>
<ul>
<li>00:00:00 – Introduction</li>
<li>00:02:24 – Describing the large data pipeline</li>
<li>00:04:38 – What format was the data in?</li>
<li>00:06:04 – Was the format of the data changed for storage?</li>
<li>00:09:34 – Data engineering and describing sources and targets</li>
<li>00:11:29 – Apache Airflow orchestration and hitting limitations</li>
<li>00:18:12 – Sponsor: CData Software</li>
<li>00:18:54 – DAG: Directed acyclic graphs</li>
<li>00:22:29 – Streaming data and other tool choices</li>
<li>00:25:38 – Overcoming DAG Factory limitations</li>
<li>00:31:49 – Another industry example for Airflow</li>
<li>00:34:24 – Finding solutions as a consultancy</li>
<li>00:35:12 – Is there a minimum-size project for Airflow?</li>
<li>00:37:37 – Django under the hood</li>
<li>00:38:31 – Video Course Spotlight</li>
<li>00:39:58 – The Python Web Conference 2023</li>
<li>00:44:24 – Do you have any upcoming conference talks?</li>
<li>00:45:53 – How can people follow your work online?</li>
<li>00:46:52 – IndyPy talk by Mariatta Wijaya</li>
<li>00:48:01 – What are you excited about in the world of Python?</li>
<li>00:51:45 – What do you want to learn next?</li>
<li>00:53:22 – Thanks and goodbye</li>
</ul>
<p>Show Links:</p>
<ul>
<li><a href="https://airflow.apache.org/docs/">Apache Airflow - Documentation</a></li>
<li><a href="https://sixfeetup.com/blog/too-big-for-dag-factories">Too Big for DAG Factories? — Six Feet Up</a></li>
<li><a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">Directed acyclic graph - Wikipedia</a></li>
<li><a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html">DAGs — Airflow Documentation</a></li>
<li><a href="https://docs.astronomer.io/learn/dynamically-generating-dags">Dynamically generating DAGs in Airflow - Astronomer Documentation</a></li>
<li><a href="https://www.databricks.com/">Data Lakehouse Architecture and AI Company - Databricks</a></li>
<li><a href="https://realpython.com/podcasts/rpp/10/">Episode #10: Python Job Hunting in a Pandemic – The Real Python Podcast</a></li>
<li><a href="https://realpython.com/podcasts/rpp/124/">Episode #124: Exploring Recursion in Python With Al Sweigart – The Real Python Podcast</a></li>
<li><a href="https://inventwithpython.com/recursion/">The Recursive Book of Recursion</a></li>
<li><a href="https://realpython.com/podcasts/rpp/61/">Episode #61: Scaling Data Science and Machine Learning Infrastructure Like Netflix – The Real Python Podcast</a></li>
<li><a href="https://indypy.org/#">IndyPy — Indiana Python User Group</a></li>
<li><a href="https://www.youtube.com/watch?v=zEIPTg22OYE&list=PLt4L3V8wVnF6JgEz7BLuRIZSS6Qsx_AFn">Contributing to Python - Mariatta Wijaya - Python Core Developer - YouTube</a></li>
<li><a href="https://www.home-assistant.io/">Home Assistant</a></li>
<li><a href="https://www.arturia.com/products/hardware-synths/microfreak/details">Arturia - MicroFreak</a></li>
<li><a href="https://www.arturia.com/products/software-instruments/pigments/overview">Arturia - Pigments</a></li>
<li><a href="https://fosstodon.org/@calvinhp">CalvinHP (@calvinhp@fosstodon.org) - Fosstodon</a></li>
<li><a href="https://twitter.com/calvinhp">calvinhp - Twitter</a></li>
<li><a href="https://sixfeetup.com/blog">Six Feet Up - Blog</a></li>
<li><a href="https://2023.pythonwebconf.com/">Python Web Conference 2023</a></li>
</ul>
<p>Level up your Python skills with our expert-led courses:</p>
<ul>
<li><a href="https://realpython.com/courses/data-cleaning-with-pandas-and-numpy/">Data Cleaning With pandas and NumPy</a></li>
<li><a href="https://realpython.com/courses/python-basics-oop/">Python Basics: Object-Oriented Programming</a></li>
<li><a href="https://realpython.com/courses/intro-object-oriented-programming-oop-python/">A Conceptual Primer on OOP in Python</a></li>
</ul>
<p><a rel="payment" href="https://realpython.com/join">Support the podcast &amp; join our community of Pythonistas</a></p>