Linear Digressions

310 episodes

Text Analysis on the State Of The Union

First up in this episode: a crash course in natural language processing, and important steps if you want to use machine learning techniques on text data. Then we'll take that NLP know-how and talk about a really cool analysis of State of the Union text, which analyzes the topics and word choices of every President from Washington to Obama. Relevant link: https://civisanalytics.com/blog/data-science/2016/01/15/data-science-on-state-of-the-union-addresses/

Feb 26, 2016 · 22 min

Paradigms in Artificial Intelligence

Artificial intelligence includes a number of different strategies for how to make machines more intelligent, and often more human-like, in their ability to learn and solve problems. An ambitious group of researchers is working right now to classify all the approaches to AI, perhaps as a first step toward unifying these approaches and moving closer to strong AI. In this episode, we'll touch on some of the most provocative work in many different subfields of artificial intelligence, and their strengths and weaknesses. Relevant links: https://www.technologyreview.com/s/544606/can-this-man-make-aimore-human/ https://www.youtube.com/watch?v=B8J4uefCQMc http://venturebeat.com/2013/11/29/sentient-code-an-inside-look-at-stephen-wolframs-utterly-new-insanely-ambitious-computational-paradigm/ http://www.slate.com/articles/technology/bitwise/2014/03/stephen_wolfram_s_new_programming_language_can_he_make_the_world_computable.html

Feb 22, 2016 · 17 min

Survival Analysis

Survival analysis is all about studying how long until an event occurs--it's used in marketing to study how long a customer stays with a service, in epidemiology to estimate the duration of survival of a patient with some illness, and in social science to understand how the characteristics of a war inform how long the war goes on. This episode talks about the special challenges associated with survival analysis, and the tools that (data) scientists use to answer all kinds of duration-related questions.

Feb 19, 2016 · 15 min

Gravitational Waves

All aboard the gravitational waves bandwagon--with the first direct observation of gravitational waves announced this week, Katie's dusting off her physics PhD for a very special gravity-related episode. Discussed in this episode: what are gravitational waves, how are they detected, and what does this announcement mean for future studies of the universe. Relevant links: http://www.nytimes.com/2016/02/12/science/ligo-gravitational-waves-black-holes-einstein.html https://www.ligo.caltech.edu/news/ligo20160211

Feb 15, 2016 · 20 min

The Turing Test

Let's imagine a future in which a truly intelligent computer program exists. How would it convince us (humanity) that it was intelligent? Alan Turing's answer to this question, proposed over 60 years ago, is that the program could convince a human conversational partner that it, the computer, was in fact a human. 60 years later, the Turing Test endures as a gold standard of artificial intelligence. It hasn't been beaten, either--yet. Relevant links: https://en.wikipedia.org/wiki/Turing_test http://commonsensereasoning.org/winograd.html http://consumerist.com/2015/09/29/its-not-just-you-robots-are-also-bad-at-assembling-ikea-furniture/

Feb 12, 2016 · 15 min

Item Response Theory: how smart ARE you?

Psychometrics is all about measuring the psychological characteristics of people; for example, scholastic aptitude. How is this done? Tests, of course! But there's a chicken-and-egg problem here: you need to know both how hard a test is, and how smart the test-taker is, in order to get the results you want. How to solve this problem, one equation with two unknowns? Item response theory--the data science behind tests such as the GRE. Relevant links: https://en.wikipedia.org/wiki/Item_response_theory

Feb 8, 2016 · 11 min
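
To make the "one equation with two unknowns" in the item response theory episode concrete, here is a minimal sketch of the one-parameter logistic (Rasch) model, one of the simplest item response models. The episode description does not say which model is discussed, so the choice of model and the names `p_correct`, `ability`, and `difficulty` are assumptions for illustration only.

```python
import math

def p_correct(ability, difficulty):
    """Rasch (one-parameter logistic) model.

    Returns the probability that a test-taker with the given ability
    answers an item of the given difficulty correctly. Both values live
    on the same arbitrary scale; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical values: an average test-taker facing three items
for difficulty in (-1.0, 0.0, 1.0):
    print(difficulty, round(p_correct(0.0, difficulty), 3))
```

In practice neither ability nor difficulty is known up front; both are estimated together from many people's responses to many items, which is how the chicken-and-egg problem gets resolved.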

Go!

As you may have heard, a computer beat a world-class human player in Go last week. As recently as a year ago the prediction was that it would take a decade to get to this point, yet here we are, in 2016. We'll talk about the history and strategy of game-playing computer programs, and what makes Google's AlphaGo so special. Relevant link: http://googleresearch.blogspot.com/2016/01/alphago-mastering-ancient-game-of-go.html

Feb 5, 2016 · 19 min

Great Social Networks in History

The Medici were one of the great ruling families of Europe during the Renaissance. How did they come to rule? Not power, or money, or armies, but through the strength of their social network. And speaking of great historical social networks, analysis of the network of letter-writing during the Enlightenment is helping humanities scholars track the dispersion of great ideas across the world during that time, from Voltaire to Benjamin Franklin and everyone in between. Relevant links: https://www2.bc.edu/~jonescq/mb851/Mar12/PadgettAnsell_AJS_1993.pdf http://republicofletters.stanford.edu/index.html

Feb 1, 2016 · 12 min

How Much to Pay a Spy (and a lil' more auctions)

A few small encores on auction theory, and then--how can you value a piece of information before you know what it is? Decision theory has some pointers. Some highly relevant information if you are trying to figure out how much to pay a spy. Relevant links: https://tuecontheoryofnetworks.wordpress.com/2013/02/25/the-origin-of-the-dutch-auction/ http://www.nowozin.net/sebastian/blog/the-fair-price-to-pay-a-spy-an-introduction-to-the-value-of-information.html

Jan 29, 2016 · 16 min

Sold! Auctions (Part 2)

The Google ads auction is a special kind of auction, one you might not know as well as the famous English auction (which we talked about in the last episode). But if it's what Google uses to sell billions of dollars of ad space in real time, you know it must be pretty cool. Relevant links: https://en.wikipedia.org/wiki/English_auction http://people.ischool.berkeley.edu/~hal/Papers/2006/position.pdf http://www.benedelman.org/publications/gsp-060801.pdf

Jan 25, 2016 · 17 min

Going Once, Going Twice: Auctions (Part 1)

The Google AdWords algorithm is (famously) an auction system for allocating a massive amount of online ad space in real time--with that fascinating use case in mind, this episode is part one in a two-part series all about auctions. We dive into the theory of auctions, and what makes a "good" auction. Relevant links: https://en.wikipedia.org/wiki/English_auction http://people.ischool.berkeley.edu/~hal/Papers/2006/position.pdf http://www.benedelman.org/publications/gsp-060801.pdf

Jan 22, 2016 · 12 min

Chernoff Faces and Minard Maps

A data visualization extravaganza in this episode, as we discuss Chernoff faces (you: "faces? huh?" us: "oh just you wait") and the greatest data visualization of all time, or at least the Napoleonic era. Relevant links: http://lya.fciencias.unam.mx/rfuentes/faces-chernoff.pdf https://en.wikipedia.org/wiki/Charles_Joseph_Minard

Jan 18, 2016 · 15 min

t-SNE: Reduce Your Dimensions, Keep Your Clusters

Ever tried to visualize a cluster of data points in 40 dimensions? Or even 4, for that matter? We prefer to stick to 2, or maybe 3 if we're feeling well-caffeinated. The t-SNE algorithm is one of the best tools on the market for doing dimensionality reduction when you have clustering in mind. Relevant links: https://www.youtube.com/watch?v=RJVL80Gg3lA

Jan 15, 2016 · 16 min

The [Expletive Deleted] Problem

The town of [expletive deleted], England, is responsible for the clbuttic [expletive deleted] problem. This week on Linear Digressions: we try really hard not to swear too much. Related links: https://en.wikipedia.org/wiki/Scunthorpe_problem https://www.washingtonpost.com/news/worldviews/wp/2016/01/05/where-is-russia-actually-mordor-in-the-world-of-google-translate/

Jan 11, 2016 · 9 min

Unlabeled Supervised Learning--whaaa?

In order to do supervised learning, you need a labeled training dataset. Or do you...? Relevant links: http://www.cs.columbia.edu/~dplewis/candidacy/goldman00enhancing.pdf

Jan 8, 2016 · 12 min

Hacking Neural Nets

Machine learning: it can be fooled, just like you or me. Here's one of our favorite examples, a study into hacking neural networks. Relevant links: http://arxiv.org/pdf/1412.1897v4.pdf

Jan 5, 2016 · 15 min

Zipf's Law

Zipf's law is related to the statistics of how word usage is distributed. As it turns out, this is also strikingly reminiscent of how income is distributed, and populations of cities, and bug reports in software, as well as tons of other phenomena that we all interact with every day. Relevant links: http://economix.blogs.nytimes.com/2010/04/20/a-tale-of-many-cities/ http://arxiv.org/pdf/cond-mat/0412004.pdf https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/

Dec 31, 2015 · 11 min
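
For the Zipf's Law episode, a rough way to look at how word usage is distributed is to rank words by frequency. The sketch below assumes plain whitespace tokenization and a made-up function name, `rank_frequency`. Under Zipf's law, frequency falls off roughly in proportion to 1/rank, so rank times frequency stays roughly constant across the top-ranked words of a large text.

```python
from collections import Counter

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(text.lower().split())
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]

# Tiny illustrative input; a real check needs a large body of text
sample = "the cat and the dog and the bird"
for rank, word, count in rank_frequency(sample):
    print(rank, word, count, rank * count)
```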

Indie Announcement

We've gone indie! Which shouldn't change anything about the podcast that you know and love, but we're super excited to keep bringing you Linear Digressions as a fully independent podcast. Some links mentioned in the show: https://twitter.com/lindigressions https://twitter.com/benjaffe https://twitter.com/multiarmbandit https://soundcloud.com/linear-digressions http://lineardigressions.com/

Dec 30, 2015 · 1 min

Portrait Beauty

It's Da Vinci meets Skynet: what makes a portrait beautiful, according to a machine learning algorithm. Snap a selfie and give us a listen.

Dec 27, 2015 · 11 min

The Cocktail Party Problem

Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!

Dec 18, 2015 · 12 min

A Criminally Short Introduction to Semi Supervised Learning

Because there are more interesting problems than there are labeled datasets, semi-supervised learning provides a framework for getting feedback from the environment as a proxy for labels of what's "correct." Of all the machine learning methodologies, it might also be the closest to how humans usually learn--we go through the world, getting (noisy) feedback on the choices we make and learn from the outcomes of our actions.

Dec 4, 2015 · 9 min

Thresholdout: Down with Overfitting

Overfitting to your training data can be avoided by evaluating your machine learning algorithm on a holdout test dataset, but what about overfitting to the test data? Turns out it can be done, easily, and you have to be very careful to avoid it. But an algorithm from the field of privacy research shows promise for keeping your test data safe from accidental overfitting.

Nov 27, 2015 · 15 min

The State of Data Science

How many data scientists are there, where do they live, where do they work, what kind of tools do they use, and how do they describe themselves? RJMetrics wanted to know the answers to these questions, so they decided to find out and share their analysis with the world. In this very special interview episode, we welcome Tristan Handy, VP of Marketing at RJMetrics, who will talk about "The State of Data Science Report."

Nov 10, 2015 · 15 min

Data Science for Making the World a Better Place

There's a good chance that great data science is going on close to you, and that it's going toward making your city, state, country, and planet a better place. Not all the data science questions being tackled out there are about finding the sleekest new algorithm or billion-dollar company idea--there's a whole world of social data science that just wants to make the world a better place to live in.

Nov 6, 2015 · 9 min

Kalman Runners

The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If you've ever run a marathon, or been a nuclear missile, you probably know all about these challenges already. By the way, we neglected to mention in the episode: Katie's marathon time was 3:54:27!

Oct 29, 2015 · 14 min
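
As a companion to the Kalman Runners episode, here is a minimal one-dimensional sketch of the filter's predict/update loop. It assumes the simplest possible dynamics (the true value is roughly constant between measurements); the function name `kalman_1d` and all parameter names and values are illustrative assumptions, not taken from the episode.

```python
def kalman_1d(measurements, meas_var, process_var, init_est=0.0, init_var=1.0):
    """Scalar Kalman filter for a roughly constant quantity.

    meas_var:    variance of the measurement noise
    process_var: how much the true value may drift between steps
    Returns the filtered estimate after each measurement.
    """
    est, var = init_est, init_var
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but uncertainty grows
        var += process_var
        # Update: blend prediction and measurement, weighted by uncertainty
        gain = var / (var + meas_var)
        est += gain * (z - est)
        var *= 1.0 - gain
        estimates.append(est)
    return estimates

# Hypothetical noisy readings of a runner's pace in minutes per mile
paces = [9.1, 8.7, 9.4, 8.9, 9.0, 8.8]
print(kalman_1d(paces, meas_var=0.1, process_var=0.01, init_est=9.0))
```

The gain is the key design point: when the measurement noise is large compared with the current uncertainty, each new reading nudges the estimate only slightly.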

Neural Net Inception

When you sleep, the neural pathways in your brain take the "white noise" of your resting brain, mix in your experiences and imagination, and the result is dreams (that is a highly unscientific explanation, but you get the idea). What happens when neural nets are put through the same process? Train a neural net to recognize pictures, and then send through an image of white noise, and it will start to see some weird (but cool!) stuff.

Oct 23, 2015 · 15 min

Benford's Law

Sometimes numbers are... weird. Benford's Law is a favorite example of this for us--it's a law that governs the distribution of the first digit in certain types of numbers. As it turns out, if you're looking up the length of a river, the population of a country, the price of a stock... not all first digits are created equal.

Oct 16, 2015 · 17 min
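
The first-digit distribution behind the Benford's Law episode is short enough to write down: the expected share of numbers starting with digit d is log10(1 + 1/d). The sketch below compares that formula with the leading digits of the first 100 powers of two, a data set chosen here purely for illustration; the function names are assumptions.

```python
import math

def benford_probability(digit):
    """Expected share of numbers whose first digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def leading_digit(n):
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

# Illustrative data set: the first 100 powers of two
observed = [leading_digit(2 ** k) for k in range(100)]
for d in range(1, 10):
    share = observed.count(d) / len(observed)
    print(d, round(benford_probability(d), 3), share)
```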

Guinness

Not to oversell it, but Student's t-test has got to have the most interesting history of any statistical test. Which is saying a lot, right? Add some boozy statistical trivia to your arsenal in this episode.

Oct 7, 2015 · 14 min

PFun with P Values

Doing some science, and want to know if you might have found something? Or maybe you've just accomplished the scientific equivalent of going fishing and reeling in an old boot? Frequentist p-values can help you distinguish between "eh" and "oooh interesting". Also, there's a lot of physics in this episode, nerds.

Sep 2, 2015 · 17 min

Watson

This machine learning algorithm beat the human champions at Jeopardy. What is... Watson?

Aug 25, 2015 · 15 min

Bayesian Psychics

Come get a little "out there" with us this week, as we use a meta-study of extrasensory perception (or ESP, often used in the same sentence as "psychics") to chat about Bayesian vs. frequentist statistics.

Aug 18, 2015 · 11 min

Troll Detection

Ever found yourself wasting time reading online comments from trolls? Of course you have; we've all been there (it's 4 AM but I can't turn off the computer and go to sleep--someone on the internet is WRONG!). Now there's a way to use machine learning to automatically detect trolls, and minimize the impact when they try to derail online conversations.

Aug 7, 2015 · 12 min

Yiddish Translation

Imagine a language that is mostly spoken rather than written, contains many words in other languages, and has relatively little written overlap with English. Now imagine writing a machine-learning-based translation system that can convert that language to English. That's the problem that confronted researchers when they set out to automatically translate between Yiddish and English; the tricks they used help us understand a lot about machine translation.

Aug 3, 2015 · 12 min

Modeling Particles in Atomic Bombs

In a fun historical journey, Katie and Ben explore the history of the Manhattan Project, discuss the difficulties in modeling particle movement in atomic bombs with only punch-card computers and ingenuity, and eventually come to present-day uses of the Metropolis-Hastings algorithm... mentioning Solitaire along the way.

Jul 6, 2015 · 15 min

Random Number Generation

Let's talk about randomness! Although randomness is pervasive throughout the natural world, it's surprisingly difficult to generate random numbers. And when your numbers look random but actually aren't, that can have interesting consequences for the security of systems, and the accuracy of models and research. In this episode, Katie and Ben talk about randomness, its place in machine learning and computation in general, along with some random digressions of their own.

Jun 19, 2015 · 10 min
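
To illustrate the Random Number Generation episode's point about numbers that look random but aren't, here is a sketch of a linear congruential generator, one classic pseudo-random scheme. The episode description does not say which generators are covered, so this is an assumed example. The default constants are a commonly cited textbook parameter set, and the generator is not suitable for anything security-related.

```python
from itertools import islice

def lcg(seed, a=1664525, c=1013904223, m=2 ** 32):
    """Linear congruential generator yielding floats in [0, 1).

    Each value is a fixed function of the previous state, so the whole
    sequence is determined by the seed.
    """
    state = seed
    while True:
        state = (a * state + c) % m
        yield state / m

# The same seed always reproduces the same "random" numbers
print(list(islice(lcg(42), 3)))
print(list(islice(lcg(42), 3)))
```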

Electoral Insights (Part 2)

Following up on our last episode about how experiments can be performed in political science, now we explore a high-profile case of an experiment gone wrong. An extremely high-profile paper that was published in 2014, about how talking to people can convince them to change their minds on topics like abortion and gay marriage, has been exposed as the likely product of a fraudulently produced dataset. We’ll talk about a cool data science tool called the Kolmogorov-Smirnov test, which a pair of graduate students used to reverse-engineer the likely way that the fraudulent data was generated. But a bigger question still remains—what does this whole episode tell us about fraud and oversight in science?

Jun 9, 2015 · 21 min
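
The Kolmogorov-Smirnov test mentioned in Electoral Insights (Part 2) compares two samples by finding the largest gap between their empirical cumulative distribution functions. Below is a minimal sketch of the two-sample statistic only (no p-value); the function name and the toy data are assumptions, and this is not a reconstruction of the graduate students' actual analysis.

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Largest vertical distance between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_sample, x):
        # Fraction of the sample that is <= x
        return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

    # The gap can only change at observed data points
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Hypothetical survey scores from two waves of a study
wave_1 = [48, 52, 55, 60, 61, 67, 70]
wave_2 = [49, 51, 56, 59, 62, 66, 71]
print(ks_statistic(wave_1, wave_2))
```

A value near 0 means the two samples are distributed almost identically; a value near 1 means they barely overlap.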

Electoral Insights (Part 1)

The first of our two-parter discussing the recent electoral data fraud case. The results of the study in question were covered widely, including by This American Life (who later had to issue a retraction). Data science for election research involves studying voters, who are people, and people are tricky to study—every one of them is different, and the same treatment can have different effects on different voters. But with randomized controlled trials, small variations from person to person can even out when you look at a larger group. With the advent of randomized experiments in elections a few decades ago, a whole new door was opened for studying the most effective ways to campaign.

Jun 5, 2015 · 9 min

Falsifying Data

In the first of a few episodes on fraud in election research, we’ll take a look at a case study from a previous Presidential election, where polling results were faked. What are some telltale signs that data fraud might be present in a dataset? We’ll explore that in this episode.

Jun 1, 2015 · 17 min

Reporter Bot

There’s a big difference between a table of numbers or statistics, and the underlying story that a human might tell about how those numbers were generated. Think about a baseball game—the game stats and a newspaper story are describing the same thing, but one is a good input for a machine learning algorithm and the other is a good story to read over your morning coffee. Data science and machine learning are starting to bridge this gap, taking the raw data on things like baseball games, financial scenarios, etc. and automatically writing human-readable stories that are increasingly indistinguishable from what a human would write. In this episode, we’ll talk about some examples of auto-generated content—you’ll be amazed at how sophisticated some of these reporter-bots can be. By the way, this summary was written by a human. (Or was it?)

May 20, 2015 · 11 min

Careers in Data Science

Let’s talk money. As a “hot” career right now, data science can pay pretty well. But for an individual person matched with a specific job or industry, how much should someone expect to make? Since Katie was on the job market lately, this was something she’s been researching, and it turns out that data science itself (in particular linear regressions) has some answers. In this episode, we go through a survey of hundreds of data scientists, who report on their job duties, industry, skills, education, location, etc. along with their salaries, and then talk about how this data was fed into a linear regression so that you (yes, you!) can use the patterns in the data to know what kind of salary any particular kind of data scientist might expect.

May 16, 2015 · 16 min

That's "Dr Katie" to You

Katie successfully defended her thesis! We celebrate her return, and talk a bit about what getting a PhD in Physics is like.

May 14, 2015 · 3 min

Neural Nets (Part 2)

In the last episode, we zipped through neural nets and got a quick idea of how they work and why they can be so powerful. Here’s the real payoff of that work: In this episode, we’ll talk about a brand-new pair of results, one from Stanford and one from Google, that use neural nets to perform automated picture captioning. One neural net does the object and relationship recognition of the image, a second neural net handles the natural language processing required to express that in an English sentence, and when you put them together you get an automated captioning tool. Two heads are better than one indeed...

May 11, 2015 · 10 min

Neural Nets (Part 1)

There is no known learning algorithm that is more flexible and powerful than the human brain. That's quite inspirational, if you think about it--to level up machine learning, maybe we should be going back to biology and letting millions of years of evolution guide the structure of our algorithms. This is the idea behind neural nets, which mock up the structure of the brain and are some of the most studied and powerful algorithms out there. In this episode, we’ll lay out the building blocks of the neural net (called neurons, naturally) and the networks that are built out of them. We’ll also explore the results that neural nets get when used to do object recognition in photographs.

May 1, 2015 · 9 min

Inferring Authorship (Part 2)

Now that we’re up to speed on the classic author ID problem (who wrote the unsigned Federalist Papers?), we move on to a couple more contemporary examples. First, J.K. Rowling was famously outed using computational linguistics (and Twitter) when she wrote a book under the pseudonym Robert Galbraith. Second, we’ll talk about a mystery that still endures--who is Satoshi Nakamoto? Satoshi is the mysterious person (or people) behind an extremely lucrative cryptocurrency (aka internet money) called Bitcoin; no one knows who he, she or they are, but we have plenty of writing samples in the form of whitepapers and Bitcoin forum posts. We’ll discuss some attempts to link Satoshi Nakamoto with a cryptocurrency expert and computer scientist named Nick Szabo; the links are tantalizing, but not a smoking gun. “Who is Satoshi” remains an example of attempted author identification where the threads are tangled, the conclusions inconclusive and the stakes high.

Apr 28, 2015 · 14 min

Inferring Authorship (Part 1)

This episode is inspired by one of our projects for Intro to Machine Learning: given a writing sample, can you use machine learning to identify who wrote it? Turns out that the answer is yes: a person’s writing style is as distinctive as their vocal inflection or their gait when they walk. By tracing the vocabulary used in a given piece, and comparing the word choices to the word choices in writing samples where we know the author, it can be surprisingly clear who is the more likely author of a given piece of text. We’ll use a seminal paper from the 1960s as our example here, where the Naive Bayes algorithm was used to determine whether Alexander Hamilton or James Madison was the more likely author of a number of anonymous Federalist Papers.

Apr 16, 2015 · 8 min
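
The word-choice comparison described in Inferring Authorship (Part 1) can be sketched as a tiny Naive Bayes classifier over word counts. This is a toy illustration under stated assumptions (whitespace tokenization, add-one smoothing, equal prior for each author, invented sample texts), not the method of the paper the episode discusses.

```python
import math
from collections import Counter

def train(samples_by_author):
    """Count word usage per author. Input: {author: [text, ...]}."""
    counts, vocab = {}, set()
    for author, texts in samples_by_author.items():
        words = Counter(w for t in texts for w in t.lower().split())
        counts[author] = words
        vocab.update(words)
    return counts, vocab

def most_likely_author(text, counts, vocab):
    """Pick the author whose word frequencies best explain the text."""
    scores = {}
    for author, words in counts.items():
        total = sum(words.values())
        # Sum of log-probabilities with add-one smoothing; words never
        # seen in training are ignored
        scores[author] = sum(
            math.log((words[w] + 1) / (total + len(vocab)))
            for w in text.lower().split()
            if w in vocab
        )
    return max(scores, key=scores.get)

# Invented writing samples for two hypothetical authors
counts, vocab = train({
    "Author A": ["upon the whole upon reflection"],
    "Author B": ["whilst the whole whilst considering"],
})
print(most_likely_author("upon the matter", counts, vocab))
```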

Statistical Mistakes and the Challenger Disaster

After the Challenger exploded in 1986, killing all 7 astronauts aboard, an investigation into the cause was immediately launched. In the cold temperatures the night before the launch, the o-rings that seal off the fuel tanks from the rocket boosters became inflexible, so they did not seal properly, which led to the fuel tank explosion. NASA knew that there could be o-ring problems, but performed the analysis of their data incorrectly and ended up massively underestimating the risk associated with the cold temperatures. In this episode, we'll unpack the mistakes they made. We'll talk about how they excluded data points that they thought were irrelevant but which actually were critical to recognizing a fatal pattern.

Apr 6, 2015 · 13 min

Genetics and Um Detection (HMM Part 2)

In part two of our series on Hidden Markov Models (HMMs), we talk to Katie and special guest Francesco about more useful and novel applications of HMMs. We revisit Katie's "Um Detector," and hear about how HMMs are used in genetics research.

Mar 25, 2015 · 14 min

Introducing Hidden Markov Models (HMM Part 1)

Wikipedia says, "A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states." What does that even mean? In part one of a special two-parter on HMMs, Katie, Ben, and special guest Francesco explain the basics of HMMs, and some simple applications of them in the real world. This episode sets the stage for part two, where we explore the use of HMMs in Modern Genetics, and possibly Katie's "Um Detector."

Mar 24, 2015 · 14 min

Monte Carlo For Physicists

This is another physics-centered podcast, about an ML-backed particle identification tool that we use to figure out what kind of particle caused a particular blob in the detector. But in this case, as in many cases, it looks hard at the outset to use ML because we don't have labeled training data. Monte Carlo to the rescue! Monte Carlo (MC) is fake data that we generate for ourselves, usually following certain sets of rules (often a Markov chain; in physics we generate MC according to the laws of physics as we understand them) and since you generated the event, you "know" what the correct label is. Of course, it's a lot of work to validate your MC, but the payoff is that then you can use Machine Learning where you never could before.

Mar 12, 2015 · 8 min

Random Kanye

Ever feel like you could randomly assemble words from a certain vocabulary and make semi-coherent Kanye West lyrics? Or technical documentation, imitations of local newscasters, your politically outspoken uncle, etc.? Wonder no more, there's a way to do this exact type of thing: it's called a Markov Chain, and it's probably the most powerful way to generate made-up data that you can then use for fun and profit. The idea behind a Markov Chain is that you probabilistically generate a sequence of steps, numbers, words, etc. where each next step/number/word depends only on the previous one, which makes it fast and efficient to computationally generate. Usually Markov Chains are used for serious academic uses, but this ain't one of them: here they're used to randomly generate rap lyrics based on Kanye West lyrics.

Mar 4, 2015 · 8 min
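
The word-by-word generation described in the Random Kanye episode can be sketched in a few lines: record which words follow each word in some source text, then walk that table, picking each next word at random based only on the current one. The function names and the placeholder text are assumptions; any lyrics or other text could be substituted.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, rng=random):
    """Walk the chain; each next word depends only on the current word."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)

# Placeholder source text standing in for a real lyrics corpus
source = "the beat goes on and the beat goes up and the crowd goes on"
print(generate(build_chain(source), start="the", length=8))
```

Because followers are stored with repeats, a word that often follows another is proportionally more likely to be picked, which is what makes the output sound like its source.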