Data Engineering Podcast

513 episodes — Page 5 of 11

Ep 312 Interactive Exploratory Data Analysis On Petabyte Scale Data Sets With Arkouda

Summary

Exploratory data analysis works best when the feedback loop is fast and iterative. This is easy to achieve when you are working on small datasets, but as they scale up beyond what can fit on a single machine those short iterations quickly become long and tedious. The Arkouda project is a Python interface built on top of the Chapel compiler to bring back those interactive speeds for exploratory analysis on horizontally scalable compute that parallelizes operations on large volumes of data. In this episode David Bader explains how the framework operates, the algorithms that are built into it to support complex analyses, and how you can start using it today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!

Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!

RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free, or just get the free t-shirt for being a listener of the Data Engineering Podcast, at dataengineeringpodcast.com/rudder.

Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.

Your host is Tobias Macey and today I’m interviewing David Bader about Arkouda, a horizontally scalable parallel compute library for exploratory data analysis in Python.

Interview

Introduction

How did you get involved in the area of data management?

Can you describe what Arkouda is and the story behind it?

What are the main goals of the project? How does it address those goals?

Who is the primary audience for Arkouda?

What are some of the main points of friction that engineers and scientists encounter while conducting exploratory data analysis (EDA)? What kinds of behaviors are they engaging in during these exploration cycles?

When data scientists run up against the limitations of their tools and environments how does that impact the work of data engineers/data platform owners?

There have been a number of libraries/frameworks/

Ep 311 What "Data Lineage Done Right" Looks Like And How They're Doing It At Manta

Ep 309 Writing The Book That Offers A Single Reference For The Fundamentals Of Data Engineering

Ep 310 Re-Bundling The Data Stack With Data Orchestration And Software Defined Assets Using Dagster

Ep 307 Making The Total Cost Of Ownership For External Data Manageable With Crux

Ep 308 Joe Reis Flips The Script And Interviews Tobias Macey About The Data Engineering Podcast

Ep 306 Charting the Path of Riskified's Data Platform Journey

Ep 305 Maintain Your Data Engineers' Sanity By Embracing Automation

Ep 304 Be Confident In Your Data Integration By Quickly Validating Matching Records With data-diff

Ep 303 The View From The Lakehouse Of Architectural Patterns For Your Data Platform

Ep 302 Strategies And Tactics For A Successful Master Data Management Implementation

Ep 301 Bring Geospatial Analytics Across Disparate Datasets Into Your Toolkit With The Unfolded Platform

Ep 299 Level Up Your Data Platform With Active Metadata

Ep 300 Combining The Simplicity Of Spreadsheets With The Power Of Modern Data Infrastructure At Canvas

Ep 297 Hire And Scale Your Data Team With Intention

Ep 298 Discover And De-Clutter Your Unstructured Data With Aparavi

Ep 296 Simplify Data Security For Sensitive Information With The Skyflow Data Privacy Vault

Ep 295 Bringing The Modern Data Stack To Everyone With Y42

Ep 294 A Multipurpose Database For Transactions And Analytics To Simplify Your Data Architecture With Singlestore

Ep 293 Data Cloud Cost Optimization With Bluesky Data

Ep 292 Unlocking The Value Of Data Across The Organization Through User Friendly Data Tools With Prophecy

Ep 291 Cloud Native Data Orchestration For Machine Learning And Data Engineering With Flyte

Ep 289 Designing And Deploying IoT Analytics For Industrial Applications At Vopak

Ep 290 Insights And Advice On Building A Data Lake Platform From Someone Who Learned The Hard Way

Ep 287 Exploring The Insights And Impact Of Dan Delorey's Distinguished Career In Data

Ep 288 Scaling Analysis of Connected Data And Modeling Complex Relationships With The TigerGraph Graph Database

Ep 286 Leading The Charge For The ELT Data Integration Pattern For Cloud Data Warehouses At Matillion

Ep 285 Evolving And Scaling The Data Platform at Yotpo

Ep 284 Operational Analytics At Speed With Minimal Busy Work Using Incorta

Ep 283 Gain Visibility Into Your Entire Machine Learning System Using Data Logging With WhyLogs

Ep 282 Connecting To The Next Frontier Of Computing With Quantum Networks

Ep 281 What Does It Really Mean To Do MLOps And What Is The Data Engineer's Role?

Ep 280 DataOps As A Service For Your Data Integration Workflows With Rivery

Ep 279 Synthetic Data As A Service For Simplifying Privacy Engineering With Gretel

Ep 278 Accelerate Development Of Enterprise Analytics With The Coalesce Visual Workflow Builder

Ep 277 Repeatable Patterns For Designing Data Platforms And When To Customize Them

Ep 276 Eliminate The Bottlenecks In Your Key/Value Storage With SpeeDB

Ep 275 Building A Data Governance Bridge Between Cloud And Datacenters For The Enterprise At Privacera

Ep 274 Exploring Incident Management Strategies For Data Teams

Ep 273 Accelerate Your Embedded Analytics With Apache Pinot

Ep 272 Accelerating Adoption Of The Modern Data Stack At 5X Data

Ep 271 Taking A Multidimensional Approach To Data Observability At Acceldata

Ep 270 Move Your Database To The Data And Speed Up Your Analytics With DuckDB

Ep 269 Developer Friendly Application Persistence That Is Fast And Scalable With HarperDB

Ep 268 Manage Your Unstructured Data Assets Across Cloud And Hybrid Environments With Komprise

Ep 267 Reflections On Designing A Data Platform From Scratch

Ep 266 Build Your Python Data Processing Your Way And Run It Anywhere With Fugue

Ep 265 Understanding The Immune System With Data At ImmunAI

Ep 264 Bring Your Code To Your Streaming And Static Data Without Effort With The Deephaven Real Time Query Engine

Ep 263 Build Your Own End To End Customer Data Platform With Rudderstack