What Is Data Engineering and Researching 10 Million Jupyter Notebooks
Episode 42

The Real Python Podcast · Real Python

January 8, 2021 · 55m 41s

Show Notes

<p>Are you familiar with the role data engineers play in the modern landscape of data science and Python? Data engineering is a sub-discipline that focuses on the transportation, transformation, and storage of data. This week on the show, David Amos is back, and he&rsquo;s brought another batch of PyCoder&rsquo;s Weekly articles and projects.</p> <p>Along with the Real Python article on data engineering, we talk about a project where researchers downloaded 10 million Jupyter notebooks from GitHub to gather insights about the current state of data science technology.</p> <p>We also discuss an article about validating data in Python with the Cerberus package. That led us to a conversation about the Advent of Code coding challenges.</p> <p>We also cover several other articles and projects from the Python community, including building a chess engine, a visual guide to NumPy, a free and open-source alternative to SAP, a library for working with STL files and 3D objects, and whether Python is really a bottleneck.</p> <div class="alert alert-primary" role="alert"> <p><strong>Course Spotlight:</strong> <a href="https://realpython.com/courses/django-rest-framework/">Building HTTP APIs With Django REST Framework</a> </p> <p>This course will get you ready to build with Django REST Framework. 
The Django REST framework (DRF) is a toolkit built on top of the Django web framework that reduces the amount of code you need to write to create REST interfaces.</p> </div> <p>Topics:</p> <ul> <li>00:00:00 &ndash; Introduction</li> <li>00:01:51 &ndash; What Is Data Engineering and Is It Right for You?</li> <li>00:12:07 &ndash; Building My Own Chess Engine</li> <li>00:17:52 &ndash; We Downloaded 10,000,000 Jupyter Notebooks From Github: This Is What We Learned</li> <li>00:28:12 &ndash; Video Course Spotlight</li> <li>00:29:20 &ndash; Is Python Really a Bottleneck?</li> <li>00:34:01 &ndash; Validating Data in Python With Cerberus</li> <li>00:39:04 &ndash; NumPy Illustrated: The Visual Guide to NumPy</li> <li>00:42:54 &ndash; erpnext: Free and Open Source Alternative to SAP</li> <li>00:48:49 &ndash; numpy-stl: Library for Working With STL Files and 3D Objects</li> <li>00:54:54 &ndash; Thanks and goodbye</li> </ul> <p>Show Links:</p> <p><a href="https://realpython.com/python-data-engineer/">What Is Data Engineering and Is It Right for You?</a> — In this article, you&rsquo;ll get an overview of the discipline of data engineering. You&rsquo;ll learn what is and isn&rsquo;t part of a data engineer&rsquo;s job, who data engineers work with, and why data engineers play a crucial role in many industries.</p> <p><a href="https://healeycodes.com/building-my-own-chess-engine/">Building My Own Chess Engine</a> — Writing your own chess engine is a great way to explore computational complexity and combinatorial aspects of programming. Not to mention it&rsquo;s pretty fun! 
Follow along with this reflection on how one coder created his own Chess engine from scratch.</p> <p><a href="https://blog.jetbrains.com/datalore/2020/12/17/we-downloaded-10-000-000-jupyter-notebooks-from-github-this-is-what-we-learned/">We Downloaded 10,000,000 Jupyter Notebooks From Github: This Is What We Learned</a> — The JetBrains Datalore team downloaded ten million Jupyter Notebooks and analyzed them to determine things like which languages were the most popular, what kinds of content are in notebook cells, and how consistently notebooks can be reproduced. It&rsquo;s a fascinating look into trends in data science technology!</p> <p><a href="https://towardsdatascience.com/is-python-really-a-bottleneck-786d063e2921">Is Python Really a Bottleneck?</a> — Python is slow. From one perspective, that is. But what are the true bottlenecks in the data engineering/data processing space, and how does Python compare to other technologies when those factors are considered?</p> <p><a href="https://hector.dev/2020/12/29/validating-data-in-python-with-cerberus.html">Validating Data in Python With Cerberus</a> — Thanks to an Advent of Code challenge, author Hector Castro was exposed to the Cerberus Python package for data validation. Get a quick introduction to Cerberus and see Hector&rsquo;s solution to an Advent of Code challenge in this quick-yet-informative read.</p> <p><a href="https://medium.com/better-programming/numpy-illustrated-the-visual-guide-to-numpy-3b1d4976de1d">NumPy Illustrated: The Visual Guide to NumPy</a> — This illustrated guide to NumPy is a great way to learn NumPy or brush up on the package. 
Full of great visual aids, this tutorial covers all the basics and more!</p> <p>Projects:</p> <ul> <li><a href="https://github.com/frappe/erpnext">erpnext: Free and Open Source Alternative to SAP</a></li> <li><a href="https://github.com/WoLpH/numpy-stl">numpy-stl: Library for Working With STL Files and 3D Objects</a></li> </ul> <p>Additional Links:</p> <ul> <li><a href="https://davidepstein.com/the-range/">Range - Why Generalists Triumph In a Specialized World: David Epstein</a></li> <li><a href="https://en.wikipedia.org/wiki/Shannon_number#Shannon's_calculation">Shannon number: Wikipedia article</a></li> <li><a href="https://twitter.com/sharifshameem/status/1344246374737399808">Apple&rsquo;s open source chess engine minimum response times: Twitter thread</a></li> <li><a href="https://adventofcode.com/2020/about">Advent of Code</a></li> <li><a href="https://github.com/pyeve/cerberus">cerberus: Lightweight and Extensible Data Validation Library for Python</a></li> <li><a href="https://en.wikipedia.org/wiki/Cerberus">Cerberus - Greek Mythology: Wikipedia article</a></li> <li><a href="http://jalammar.github.io/visual-numpy/">A Visual Intro to NumPy and Data Representation</a></li> <li><a href="https://micronote.tech/2020/12/Generating-STL-Models-with-Python/">Generating STL Models With Python</a></li> </ul> <p>Level up your Python skills with our expert-led courses:</p> <ul> <li><a href="https://realpython.com/courses/django-portfolio-project/">Getting Started With Django: Building a Portfolio App</a></li> <li><a href="https://realpython.com/courses/django-rest-framework/">Building HTTP APIs With Django REST Framework</a></li> <li><a href="https://realpython.com/courses/python-histograms/">Histogram Plotting in Python: NumPy, Matplotlib, Pandas &amp; Seaborn</a></li> </ul> <p><a rel="payment" href="https://realpython.com/join">Support the podcast &amp; join our community of Pythonistas</a></p>