Data Engineering Podcast

513 episodes — Page 3 of 11

Ep 412: Data Sharing Across Business And Platform Boundaries

Summary

Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management.
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Your host is Tobias Macey and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing.

Interview

- Introduction
- How did you get involved in the area of data management?
- Can you start by giving some context and scope of what we mean by "data sharing" for the purposes of this conversation?
- What is the current state of the ecosystem for data sharing protocols/practices/platforms?
- What are some of the main challenges/shortcomings that teams/organizations experience with these options?
- What are the technical capabilities that need to be present for an effective data sharing solution?
- How does that change as a function of the type of data? (e.g. tabular, image, etc.)
- What are the requirements around governance and auditability of data access that need to be addressed when sharing data?
- What are the typical boundaries along which data access requires special consideration for how the sharing is managed?
- Many data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform?
- What are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing?
- When is Bobsled the wrong choice?
- What do you have planned for the future of data sharing?

Contact Info

- LinkedIn

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email [email protected] with your story.

Links

- Bobsled
- OLAP == OnLine Analytical Processing
- Cassandra (Podcast Episode)
- Neo4J
- FTP == File Transfer Protocol
- S3 Access Points
- Snowflake Sharing
- BigQuery Sharing
- Databricks Delta Sharing
- DuckDB (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Ep 412: Data Sharing Across Business And Platform Boundaries

Ep 411: Tackling Real Time Streaming Data With SQL Using RisingWave

Ep 410: Build A Data Lake For Your Security Logs With Scanner

Ep 409: Modern Customer Data Platform Principles

Ep 408: Pushing The Limits Of Scalability And User Experience For Data Processing With Jignesh Patel

Ep 407: Designing Data Platforms For Fintech Companies

Ep 406: Troubleshooting Kafka In Production

Ep 405: Adding An Easy Mode For The Modern Data Stack With 5X

Ep 404: Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack

Ep 403: Designing Data Transfer Systems That Scale

Ep 402: Addressing The Challenges Of Component Integration In Data Platform Architectures

Ep 401: Unlocking Your dbt Projects With Practical Advice For Practitioners

Ep 400: Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine

Ep 399: Shining Some Light In The Black Box Of PostgreSQL Performance

Ep 398: Surveying The Market Of Database Products

Ep 397: Defining A Strategy For Your Data Products

Ep 396: Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable

Ep 395: Using Data To Illuminate The Intentionally Opaque Insurance Industry

Ep 394: Building ETL Pipelines With Generative AI

Ep 393: Powering Vector Search With Real Time And Incremental Vector Indexes

Ep 392: Building Linked Data Products With JSON-LD

Ep 391: An Overview Of The State Of Data Orchestration In An Increasingly Complex Data Ecosystem

Ep 390: Eliminate The Overhead In Your Data Integration With The Open Source dlt Library

Ep 389: Building An Internal Database As A Service Platform At Cloudflare

Ep 388: Harnessing Generative AI For Creating Educational Content With Illumidesk

Ep 387: Unpacking The Seven Principles Of Modern Data Pipelines

Ep 386: Quantifying The Return On Investment For Your Data Team

Ep 385: Strategies For A Successful Data Platform Migration

Ep 384: Build Real Time Applications With Operational Simplicity Using Dozer

Ep 383: Datapreneurs - How Today's Business Leaders Are Using Data To Define The Future

Ep 382: Reduce Friction In Your Business Analytics Through Entity Centric Data Modeling

Ep 381: How Data Engineering Teams Power Machine Learning With Feature Platforms

Ep 380: Seamless SQL And Python Transformations For Data Engineers And Analysts With SQLMesh

Ep 379: How Column-Aware Development Tooling Yields Better Data Models

Ep 378: Build Better Tests For Your dbt Projects With Datafold And data-diff

Ep 377: Reduce The Overhead In Your Pipelines With Agile Data Engine's DataOps Service

Ep 376: A Roadmap To Bootstrapping The Data Team At Your Startup

Ep 375: Keep Your Data Lake Fresh With Real Time Streams Using Estuary

Ep 374: What Happens When The Abstractions Leak On Your Data

Ep 373: Use Consistent And Up To Date Customer Profiles To Power Your Business With Segment Unify

Ep 372: Realtime Data Applications Made Easier With Meroxa

Ep 371: Building Self Serve Business Intelligence With AI And Semantic Modeling At Zenlytic

Ep 370: An Exploration Of The Composable Customer Data Platform

Ep 369: Mapping The Data Infrastructure Landscape As A Venture Capitalist

Ep 368: Unlocking The Potential Of Streaming Data Applications Without The Operational Headache At Grainite

Ep 367: Aligning Data Security With Business Productivity To Deploy Analytics Safely And At Speed

Ep 366: Use Your Data Warehouse To Power Your Product Analytics With NetSpring

Ep 365: Exploring The Nuances Of Building An Intentional Data Culture

Ep 364: Building A Data Mesh Platform At PayPal

Ep 363: The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse