
Dask: Scalable Python with Matthew Rocklin
Software Engineering Daily · softwareengineeringdaily.com
Audio is streamed directly from the publisher (traffic.megaphone.fm) as published in their RSS feed. Play Podcasts does not host this file. Rights-holders can request removal through the copyright & takedown page.
Show Notes
Python is the most widely used language for data science, and there are several libraries that are commonly used by Python data scientists including Numpy, Pandas, and scikit-learn. These libraries improve the user experience of a Python data scientist by giving them access to high level APIs.
Data science is often performed over huge datasets, and the data structures that are instantiated with those datasets need to be spread across multiple machines. To manage large distributed datasets, a library such as scikit-learn can use a system called Dask. Dask allows the instantiation of data structures such as a Dask dataframe or a Dask array.
Matthew Rocklin is the creator of Dask. He joins the show to talk about distributed computing with Dask, its use cases, and the Python ecosystem. He also provides a detailed comparison between Dask and Spark, which is also used for distributed data science.