Linear Digressions

312 episodes — Page 4 of 7

The Case for Learned Index Structures, Part 1: B-Trees

Jeff Dean and his collaborators at Google are turning the machine learning world upside down (again) with a recent paper about how machine learning models can be used as surprisingly effective substitutes for classic data structures. In this first part of a two-part series, we'll go through a data structure called b-trees. The structural form of b-trees make them efficient for searching, but if you squint at a b-tree and look at it a little bit sideways then the search functionality starts to look a little bit like a regression model--hence the relevance of machine learning models. If this sounds kinda weird, or we lost you at b-tree, don't worry--lots more details in the episode itself.

Jan 22, 201818 min

Challenges with Using Machine Learning to Classify Chest X-Rays

Another installment in our "machine learning might not be a silver bullet for solving medical problems" series. This week, we have a high-profile blog post that has been making the rounds for the last few weeks, in which a neural network trained to visually recognize various diseases in chest x-rays is called into question by a radiologist with machine learning expertise. As it seemingly always does, it comes down to the dataset that's used for training--medical records assume a lot of context that may or may not be available to the algorithm, so it's tough to make something that actually helps (in this case) predict disease that wasn't already diagnosed.

Jan 15, 201818 min

The Fourier Transform

The Fourier transform is one of the handiest tools in signal processing for dealing with periodic time series data. Using a Fourier transform, you can break apart a complex periodic function into a bunch of sine and cosine waves, and figure out what the amplitude, frequency and offset of those component waves are. It's a really handy way of re-expressing periodic data--you'll never look at a time series graph the same way again.

Jan 8, 201815 min

Statistics of Beer

What better way to kick off a new year than with an episode on the statistics of brewing beer?

Jan 2, 201815 min

Re - Release: Random Kanye

We have a throwback episode for you today as we take the week off to enjoy the holidays. This week: what happens when you have a markov chain that generates mashup Kanye West lyrics with Bible verses? Exactly what you think.

Dec 24, 20179 min

Debiasing Word Embeddings

When we covered the Word2Vec algorithm for embedding words, we mentioned parenthetically that the word embeddings it produces can sometimes be a little bit less than ideal--in particular, gender bias from our society can creep into the embeddings and give results that are sexist. For example, occupational words like "doctor" and "nurse" are more highly aligned with "man" or "woman," which can create problems because these word embeddings are used in algorithms that help people find information or make decisions. However, a group of researchers has released a new paper detailing ways to de-bias the embeddings, so we retain gender info that's not particularly problematic (for example, "king" vs. "queen") while correcting bias.

Dec 18, 201718 min

The Kernel Trick and Support Vector Machines

Picking up after last week's episode about maximal margin classifiers, this week we'll go into the kernel trick and how that (combined with maximal margin algorithms) gives us the much-vaunted support vector machine.

Dec 11, 201717 min

Maximal Margin Classifiers

Maximal margin classifiers are a way of thinking about supervised learning entirely in terms of the decision boundary between two classes, and defining that boundary in a way that maximizes the distance from any given point to the boundary. It's a neat way to think about statistical learning and a prerequisite for understanding support vector machines, which we'll cover next week--stay tuned!

Dec 4, 201714 min

Re - Release: The Cocktail Party Problem

Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!

Nov 27, 201713 min

Clustering with DBSCAN

DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It's pretty nifty: with just two parameters, you can specify "dense" regions in your data, and grow those regions out organically to find clusters. In particular, it can fit irregularly-shaped clusters, and it can also identify outlier points that don't belong to any of the clusters. Pretty cool!

Nov 20, 201716 min

The Kaggle Survey on Data Science

Want to know what's going on in data science these days? There's no better way than to analyze a survey with over 16,000 responses that recently released by Kaggle. Kaggle asked practicing and aspiring data scientists about themselves, their tools, how they find jobs, what they find challenging about their jobs, and many other questions. Then Kaggle released an interactive summary of the data, as well as the anonymized dataset itself, to help data scientists understand the trends in the data. In this episode, we'll go through some of the survey toplines that we found most interesting and counterintuitive.

Nov 13, 201725 min

Machine Learning: The High Interest Credit Card of Technical Debt

This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the idea of technical debt, which are inefficiencies that crop up in the code when you're trying to go fast. You take shortcuts, hard-code variable values, skimp on the documentation, and generally write not-that-great code in order to get something done quickly, and then end up paying for it later on. This is technical debt, and it's particularly easy to accrue with machine learning workflows. That's the premise of this episode's paper.

Nov 6, 201722 min

Improving Upon a First-Draft Data Science Analysis

There are a lot of good resources out there for getting started with data science and machine learning, where you can walk through starting with a dataset and ending up with a model and set of predictions. Think something like the homework for your favorite machine learning class, or your most recent online machine learning competition. However, if you've ever tried to maintain a machine learning workflow (as opposed to building it from scratch), you know that taking a simple modeling script and turning it into clean, well-structured and maintainable software is way harder than most people give it credit for. That said, if you're a professional data scientist (or want to be one), this is one of the most important skills you can develop. In this episode, we'll walk through a workshop Katie is giving at the Open Data Science Conference in San Francisco in November 2017, which covers building a machine learning workflow that's more maintainable than a simple script. If you'll be at ODSC, come say hi, and if you're not, here's a sneak preview!

Oct 30, 201715 min

Survey Raking

It's quite common for survey respondents not to be representative of the larger population from which they are drawn. But if you're a researcher, you need to study the larger population using data from your survey respondents, so what should you do? Reweighting the survey data, so that things like demographic distributions look similar between the survey and general populations, is a standard technique and in this episode we'll talk about survey raking, a way to calculate survey weights when there are several distributions of interest that need to be matched.

Oct 23, 201717 min

Happy Hacktoberfest

It's the middle of October, so you've already made two pull requests to open source repos, right? If you have no idea what we're talking about, spend the next 20 minutes or so with us talking about the importance of open source software and how you can get involved. You can even get a free t-shirt! Hacktoberfest main page: https://hacktoberfest.digitalocean.com/#details

Oct 16, 201715 min

Re - Release: Kalman Runners

In honor of the Chicago marathon this weekend (and due in large part to Katie recovering from running in it...) we have a re-release of an episode about Kalman filters, which is part algorithm part elaborate metaphor for figuring out, if you're running a race but don't have a watch, how fast you're going. Katie's Chicago race report: miles 1-13: light ankle pain, lovely cool weather, the most fun EVAR miles 13-17: no more ankle pain but quads start getting tight, it's a little more effort miles 17-20: oof, really tight legs but still plenty of gas in then tank. miles 20-23: it's warmer out now, legs hurt a lot but running through Pilsen and Chinatown is too fun to notice mile 24: ugh cramp everything hurts miles 25-26.2: awesome crowd support, really tired and loving every second Final time: 3:54:35

Oct 9, 201717 min

Neural Net Dropout

Neural networks are complex models with many parameters and can be prone to overfitting. There's a surprisingly simple way to guard against this: randomly destroy connections between hidden units, also known as dropout. It seems counterintuitive that undermining the structural integrity of the neural net makes it robust against overfitting, but in the world of neural nets, weirdness is just how things go sometimes. Relevant links: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf

Oct 2, 201718 min

Disciplined Data Science

As data science matures as a field, it's becoming clearer what attributes a data science team needs to have to elevate their work to the next level. Most of our episodes are about the cool work being done by other people, but this one summarizes some thinking Katie's been doing herself around how to guide data science teams toward more mature, effective practices. We'll go through five key characteristics of great data science teams, which we collectively refer to as "disciplined data science," and why they matter.

Sep 25, 201729 min

Hurricane Forecasting

It's been a busy hurricane season in the Southeastern United States, with millions of people making life-or-death decisions based on the forecasts around where the hurricanes will hit and with what intensity. In this episode we'll deconstruct those models, talking about the different types of models, the theory behind them, and how they've evolved through the years.

Sep 18, 201727 min

Finding Spy Planes with Machine Learning

There are law enforcement surveillance aircraft circling over the United States every day, and in this episode, we'll talk about how some folks at BuzzFeed used public data and machine learning to find them. The fun thing here, in our opinion, is the blend of intrigue (spy planes!) with tech journalism and a heavy dash of publicly available and reproducible analysis code so that you (yes, you!) can see exactly how BuzzFeed identifies the surveillance planes.

Sep 11, 201718 min

Data Provenance

Software engineers are familiar with the idea of versioning code, so you can go back later and revive a past state of the system. For data scientists who might want to reconstruct past models, though, it's not just about keeping the modeling code. It's also about saving a version of the data that made the model. There are a lot of other benefits to keeping track of datasets, so in this episode we'll talk about data lineage or data provenance.

Sep 4, 201722 min

Adversarial Examples

Even as we rely more and more on machine learning algorithms to help with everyday decision-making, we're learning more and more about how they're frighteningly easy to fool sometimes. Today we have a roundup of a few successful efforts to create robust adversarial examples, including what it means for an adversarial example to be robust and what this might mean for machine learning in the future.

Aug 28, 201716 min

Jupyter Notebooks

This week's episode is just in time for JupyterCon in NYC, August 22-25... Jupyter notebooks are probably familiar to a lot of data nerds out there as a great open-source tool for exploring data, doing quick visualizations, and packaging code snippets with explanations for sharing your work with others. If you're not a data person, or you are but you haven't tried out Jupyter notebooks yet, here's your nudge to go give them a try. In this episode we'll go back to the old days, before notebooks, and talk about all the ways that data scientists like to work that wasn't particularly well-suited to the command line + text editor setup, and talk about how notebooks have evolved over their lifetime to become even more powerful and well-suited to the data scientist's workflow.

Aug 21, 201715 min

Curing Cancer with Machine Learning is Super Hard

Today, a dispatch on what can go wrong when machine learning hype outpaces reality: a high-profile partnership between IBM Watson and MD Anderson Cancer Center has recently hit the rocks as it turns out to be tougher than expected to cure cancer with artificial intelligence. There are enough conflicting accounts in the media to make it tough to say exactly went wrong, but it's a good chance to remind ourselves that even in a post-AI world, hard problems remain hard.

Aug 14, 201719 min

KL Divergence

Kullback Leibler divergence, or KL divergence, is a measure of information loss when you try to approximate one distribution with another distribution. It comes to us originally from information theory, but today underpins other, more machine-learning-focused algorithms like t-SNE. And boy oh boy can it be tough to explain. But we're trying our hardest in this episode!

Aug 7, 201725 min

Sabermetrics

It's moneyball time! SABR (the Society for American Baseball Research) is the world's largest organization of statistics-minded baseball enthusiasts, who are constantly applying the craft of scientific analysis to trying to figure out who are the best baseball teams and players. It can be hard to objectively measure sports greatness, but baseball has a data-rich history and plenty of nerdy fans interested in analyzing that data. In this episode we'll dissect a few of the metrics from standard baseball and compare them to related metrics from Sabermetrics, so you can nerd out more effectively at your next baseball game.

Jul 31, 201725 min

What Data Scientists Can Learn from Software Engineers

We're back again with friend of the pod Walt, former software engineer extraordinaire and current data scientist extraordinaire, to talk about some best practices from software engineering that are ready to jump the fence over to data science. If last week's episode was for software engineers who are interested in becoming more like data scientists, then this week's episode is for data scientists who are looking to improve their game with best practices from software engineering.

Jul 24, 201723 min

Software Engineering to Data Science

Data scientists and software engineers often work side by side, building out and scaling technical products and services that are data-heavy but also require a lot of software engineering to build and maintain. In this episode, we'll chat with a Friend of the Pod named Walt, who started out as a software engineer but works as a data scientist now. We'll talk about that transition from software engineering to data science, and what special capabilities software engineers have that data scientists might benefit from knowing about (and vice versa).

Jul 17, 201719 min

Re-Release: Fighting Cholera with Data, 1854

This episode was first released in November 2014. In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera. When a cholera outbreak in London killed scores of people, a doctor named John Snow used it as a chance to study whether the cause might be very small organisms that were spreading through the water supply (the prevailing theory at the time was miasma, or “bad air”). By tracing the geography of all the deaths from the outbreak, Snow was practicing elementary data science--and stumbled upon one of history’s most famous outliers. In this episode, we’ll tell you more about this single data point, a case of cholera that cracked the case wide open for Snow and provided critical validation for the germ theory of disease.

Jul 10, 201712 min

Re-Release: Data Mining Enron

This episode was first release in February 2015. In 2000, Enron was one of the largest and companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad apples had been cooking the books, and billions of dollars and thousands of jobs disappeared. In the aftermath, surprisingly, one of the greatest datasets in all of machine learning was born--the Enron emails corpus. Hundreds of thousands of emails amongst top executives were made public; there's no realistic chance any dataset like this will ever be made public again. But the dataset that was released has gone on to immortality, serving as the basis for a huge variety of advances in machine learning and other fields.

Jul 2, 201732 min

Factorization Machines

What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.

Jun 26, 201719 min

Anscombe's Quartet

Anscombe's Quartet is a set of four datasets that have the same mean, variance and correlation but look very different. It's easy to think that having a good set of summary statistics (like mean, variance and correlation) can tell you everything important about a dataset, or at least enough to know if two datasets are extremely similar or extremely different, but Anscombe's Quartet will always be standing behind you, laughing at how silly that idea is. Anscombe's Quartet was devised in 1973 as an example of how summary statistics can be misleading, but today we can even do one better: the Datasaurus Dozen is a set of twelve datasets, all extremely visually distinct, that have the same summary stats as a source dataset that, there's no other way to put this, looks like a dinosaur. It's an example of how datasets can be generated to look like almost anything while still preserving arbitrary summary statistics. In other words, Anscombe's Quartets can be generated at-will and we all should be reminded to visualize our data (not just compute summary statistics) if we want to claim to really understand it.

Jun 19, 201715 min

Traffic Metering Algorithms

Originally release June 2016 This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't get overloaded with cars and clog up. If you're someone who listens to podcasts while commuting, and especially if your area has on-ramp metering, you'll never look at highway access control the same way again (yeah, we know this is super nerdy; it's also super awesome).

Jun 12, 201718 min

Page Rank

The year: 1998. The size of the web: 150 million pages. The problem: information retrieval. How do you find the "best" web pages to return in response to a query? A graduate student named Larry Page had an idea for how it could be done better and created a search engine as a research project. That search engine was called Google.

Jun 5, 201719 min

Fractional Dimensions

We chat about fractional dimensions, and what the actual heck those are.

May 29, 201720 min

Things You Learn When Building Models for Big Data

As more and more data gets collected seemingly every day, and data scientists use that data for modeling, the technical limits associated with machine learning on big datasets keep getting pushed back. This week is a first-hand case study in using scikit-learn (a popular python machine learning library) on multi-terabyte datasets, which is something that Katie does a lot for her day job at Civis Analytics. There are a lot of considerations for doing something like this--cloud computing, artful use of parallelization, considerations of model complexity, and the computational demands of training vs. prediction, to name just a few.

May 22, 201721 min

How to Find New Things to Learn

If you're anything like us, you a) always are curious to learn more about data science and machine learning and stuff, and b) are usually overwhelmed by how much content is out there (not all of it very digestible). We hope this podcast is a part of the solution for you, but if you're looking to go farther (who isn't?) then we have a few new resources that are presenting high-quality content in a fresh, accessible way. Boring old PDFs full of inscrutable math notation, your days are numbered!

May 15, 201717 min

Federated Learning

As machine learning makes its way into more and more mobile devices, an interesting question presents itself: how can we have an algorithm learn from training data that's being supplied as users interact with the algorithm? In other words, how do we do machine learning when the training dataset is distributed across many devices, imbalanced, and the usage associated with any one user needs to be obscured somewhat to protect the privacy of that user? Enter Federated Learning, a set of related algorithms from Google that are designed to help out in exactly this scenario. If you've used keyboard shortcuts or autocomplete on an Android phone, chances are you've encountered Federated Learning even if you didn't know it.

May 8, 201714 min

Word2Vec

Word2Vec is probably the go-to algorithm for vectorizing text data these days. Which makes sense, because it is wicked cool. Word2Vec has it all: neural networks, skip-grams and bag-of-words implementations, a multiclass classifier that gets swapped out for a binary classifier, made-up dummy words, and a model that isn't actually used to predict anything (usually). And all that's before we get to the part about how Word2Vec allows you to do algebra with text. Seriously, this stuff is cool.

May 1, 201717 min

Feature Processing for Text Analytics

It seems like every day there's more and more machine learning problems that involve learning on text data, but text itself makes for fairly lousy inputs to machine learning algorithms. That's why there are text vectorization algorithms, which re-format text data so it's ready for using for machine learning. In this episode, we'll go over some of the most common and useful ways to preprocess text data for machine learning.

Apr 24, 201717 min

Education Analytics

This week we'll hop into the rapidly developing industry around predictive analytics for education. For many of the students who eventually drop out, data science is showing that there might be early warning signs that the student is in trouble--we'll talk about what some of those signs are, and then dig into the meatier questions around discrimination, who owns a student's data, and correlation vs. causation. Spoiler: we have more questions than we have answers on this one. Bonus appearance from Maeby the dog, who isn't a data scientist but does like to steal food off the counter.

Apr 17, 201721 min

A Technical Deep Dive on Stanley, the First Self-Driving Car

In our follow-up episode to last week's introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems for getting a car to drive itself 140 miles across the desert. Lidar? You betcha! Drive-by-wire? Of course! Probabilistic terrain reconstruction? Absolutely! All this and more this week on Linear Digressions.

Apr 10, 201740 min

An Introduction to Stanley, the First Self-Driving Car

In October 2005, 23 cars lined up in the desert for a 140 mile race. Not one of those cars had a driver. This was the DARPA grand challenge to see if anyone could build an autonomous vehicle capable of navigating a desert route (and if so, whose car could do it the fastest); the winning car, Stanley, now sits in the Smithsonian Museum in Washington DC as arguably the world's first real self-driving car. In this episode (part one of a two-parter), we'll revisit the DARPA grand challenge from 2005 and the rules and constraints of what it took for Stanley to win the competition. Next week, we'll do a deep dive into Stanley's control systems and overall operation and what the key systems were that allowed Stanley to win the race.

Apr 3, 201713 min

Feature Importance

Figuring out what features actually matter in a model is harder to figure out than you might first guess. When a human makes a decision, you can just ask them--why did you do that? But with machine learning models, not so much. That's why we wanted to talk a bit about both regularization (again) and also other ways that you can figure out which models have the biggest impact on the predictions of your model.

Mar 27, 201720 min

Space Codes!

It's hard to get information to and from Mars. Mars is very far away, and expensive to get to, and the bandwidth for passing messages with Earth is not huge. The messages you do pass have to traverse millions of miles, which provides ample opportunity for the message to get corrupted or scrambled. How, then, can you encode messages so that errors can be detected and corrected? How does the decoding process allow you to actually find and correct the errors? In this episode, we'll talk about three pieces of the process (Reed-Solomon codes, convolutional codes, and Viterbi decoding) that allow the scientists at NASA to talk to our rovers on Mars.

Mar 20, 201723 min

Finding (and Studying) Wikipedia Trolls

You may be shocked to hear this, but sometimes, people on the internet can be mean. For some of us this is just a minor annoyance, but if you're a maintainer or contributor of a large project like Wikipedia, abusive users can be a huge problem. Fighting the problem starts with understanding it, and understanding it starts with measuring it; the thing is, for a huge website like Wikipedia, there can be millions of edits and comments where abuse might happen, so measurement isn't a simple task. That's where machine learning comes in: by building an "abuse classifier," and pointing it at the Wikipedia edit corpus, researchers at Jigsaw and the Wikimedia foundation are for the first time able to estimate abuse rates and curate a dataset of abusive incidents. Then those researchers, and others, can use that dataset to study the pathologies and effects of Wikipedia trolls.

Mar 13, 201715 min

A Sprint Through What's New in Neural Networks

Advances in neural networks are moving fast enough that, even though it seems like we talk about them all the time around here, it also always seems like we're barely keeping up. So this week we have another installment in our "neural nets: they so smart!" series, talking about three topics. And all the topics this week were listener suggestions, too!

Mar 6, 201716 min

Stein's Paradox

When you're estimating something about some object that's a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should you estimate it: use measurements of the individual, or get some extra information from the group? The James-Stein estimator tells you how to combine individual and group information make predictions that, taken over the whole group, are more accurate than if you treated each individual, well, individually.

Feb 27, 201727 min

Empirical Bayes

Say you're looking to use some Bayesian methods to estimate parameters of a system. You've got the normalization figured out, and the likelihood, but the prior... what should you use for a prior? Empirical Bayes has an elegant answer: look to your previous experience, and use past measurements as a starting point in your prior. Scratching your head about some of those terms, and why they matter? Lucky for you, you're standing in front of a podcast episode that unpacks all of this.

Feb 20, 201718 min

Endogenous Variables and Measuring Protest Effectiveness

Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers? It's a tricky question to answer, since usually we need randomly distributed treatments (e.g. big protests) to understand causality, but there's no reason to believe that big protests are actually randomly distributed. In other words, protest size is endogenous to legislative response, and understanding cause and effect is very challenging. So, what to do? Well, at least in the case of studying Tea Party protest effectiveness, researchers have used rainfall, of all things, to understand the impact of a big protest. In other words, rainfall is the instrumental variable in this analysis that cracks the scientific case open. What does rainfall have to do with protests? Do protests actually matter? What do we mean when we talk about endogenous and instrumental variables? We wouldn't be very good podcasters if we answered all those questions here--you gotta listen to this episode to find out.

Feb 13, 201716 min

« Prev 1 2 345 6 7 Next »