Data Engineering Podcast

Ep 112: Building The Materialize Engine For Interactive Streaming Analytics In SQL

Summary

Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!

You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.

Your host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by describing what Materialize is and the problems that you are aiming to solve with it?
What was your motivation for creating it?
What use cases does Materialize enable?
What are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize?
How does it fit into the broader ecosystem of data tools and platforms?
What are some of the use cases that Materialize is uniquely able to support?
How is Materialize architected and how has the design evolved since you first began working on it?
Materialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided?
What are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems?
In the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize?
A majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. What are some of the existing or planned additions for Materialize?
Can you talk through the lifecycle of data as it flows from the source database and through the Materialize engine?
What are the considerations and constraints on maintaining the full history of the source data within Materialize?
For someone who wants to use Materialize, what is involved in getting it set up and integrated with their data sources?
What is the workflow for defining and maintaining a set of views?
What are some of the complexities that users might face in ensuring the ongoing functionality of those views?
For someone who is unfamiliar with the semantics of streaming SQL, what are some of the conceptual shifts that they should be aware of?
The Materialize product is currently pre-release. What are the remaining steps before launching it?
What do you have planned for the future of the product and company?

Contact Info

frankmcsherry on GitHub
@frankmcsherry on Twitter
Blog

Parting Question

From your perspective, what

Ep 112: Building The Materialize Engine For Interactive Streaming Analytics In SQL

Ep 111: Solving Data Lineage Tracking And Data Discovery At WeWork

Ep 110: SnowflakeDB: The Data Warehouse Built For The Cloud

Ep 109: Organizing And Empowering Data Engineers At Citadel

Ep 108: Building A Real Time Event Data Warehouse For Sentry

Ep 107: Escaping Analysis Paralysis For Your Data Platform With Data Virtualization

Ep 106: Designing For Data Protection

Ep 105: Automating Your Production Dataflows On Spark

Ep 104: Build Maintainable And Testable Data Applications With Dagster

Ep 103: Data Orchestration For Hybrid Cloud Analytics

Ep 102: Keeping Your Data Warehouse In Order With DataForm

Ep 101: Fast Analytics On Semi-Structured And Structured Data In The Cloud

Ep 100: Ship Faster With An Opinionated Data Pipeline Framework

Ep 99: Open Source Object Storage For All Of Your Data

Ep 98: Navigating Boundless Data Streams With The Swim Kernel

Ep 97: Building A Reliable And Performant Router For Observability Data

Ep 96: Building A Community For Data Professionals at Data Council

Ep 95: Building Tools And Platforms For Data Analytics

Ep 94: A High Performance Platform For The Full Big Data Lifecycle

Ep 93: Digging Into Data Replication At Fivetran

Ep 92: Solving Data Discovery At Lyft

Ep 91: Simplifying Data Integration Through Eventual Connectivity

Ep 90: Straining Your Data Lake Through A Data Mesh

Ep 89: Data Labeling That You Can Feel Good About With CloudFactory

Ep 88: Scale Your Analytics On The Clickhouse Data Warehouse

Ep 87: Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection

Ep 86: The Workflow Engine For Data Engineers And Data Scientists

Ep 85: Maintaining Your Data Lake At Scale With Spark

Ep 84: Managing The Machine Learning Lifecycle

Ep 83: Evolving An ETL Pipeline For Better Productivity

Ep 82: Data Lineage For Your Pipelines

Ep 81: Build Your Data Analytics Like An Engineer With DBT

Ep 80: Using FoundationDB As The Bedrock For Your Distributed Systems

Ep 79: Running Your Database On Kubernetes With KubeDB

Ep 78: Unpacking Fauna: A Global Scale Cloud Native Database

Ep 77: Index Your Big Data With Pilosa For Faster Analytics

Ep 76: Serverless Data Pipelines On DataCoral

Ep 75: Why Analytics Projects Fail And What To Do About It

Ep 74: Building An Enterprise Data Fabric At CluedIn

Ep 73: A DataOps vs DevOps Cookoff In The Data Kitchen

Ep 72: Customer Analytics At Scale With Segment

Ep 71: Deep Learning For Data Engineers

Ep 70: Speed Up Your Analytics With The Alluxio Distributed Storage System

Ep 69: Machine Learning In The Enterprise

Ep 68: Cleaning And Curating Open Data For Archaeology

Ep 67: Managing Database Access Control For Teams With strongDM

Ep 66: Building Enterprise Big Data Systems At LEGO

Ep 65: TimescaleDB: The Timeseries Database Built For SQL And Scale

Ep 64: Performing Fast Data Analytics Using Apache Kudu

Ep 63: Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck