Linear Digressions
310 episodes — Page 6 of 7
Text Analysis on the State Of The Union
First up in this episode: a crash course in natural language processing, and important steps if you want to use machine learning techniques on text data. Then we'll take that NLP know-how and talk about a really cool analysis of State of the Union text, which analyzes the topics and word choices of every President from Washington to Obama. Relevant link: https://civisanalytics.com/blog/data-science/2016/01/15/data-science-on-state-of-the-union-addresses/
Paradigms in Artificial Intelligence
Artificial intelligence includes a number of different strategies for how to make machines more intelligent, and often more human-like, in their ability to learn and solve problems. An ambitious group of researchers is working right now to classify all the approaches to AI, perhaps as a first step toward unifying these approaches and move closer to strong AI. In this episode, we'll touch on some of the most provocative work in many different subfields of artificial intelligence, and their strengths and weaknesses. Relevant links: https://www.technologyreview.com/s/544606/can-this-man-make-aimore-human/ https://www.youtube.com/watch?v=B8J4uefCQMc http://venturebeat.com/2013/11/29/sentient-code-an-inside-look-at-stephen-wolframs-utterly-new-insanely-ambitious-computational-paradigm/ http://www.slate.com/articles/technology/bitwise/2014/03/stephen_wolfram_s_new_programming_language_can_he_make_the_world_computable.html
Survival Analysis
Survival analysis is all about studying how long until an event occurs--it's used in marketing to study how long a customer stays with a service, in epidemiology to estimate the duration of survival of a patient with some illness, and in social science to understand how the characteristics of a war inform how long the war goes on. This episode talks about the special challenges associated with survival analysis, and the tools that (data) scientists use to answer all kinds of duration-related questions.
Gravitational Waves
All aboard the gravitational waves bandwagon--with the first direct observation of gravitational waves announced this week, Katie's dusting off her physics PhD for a very special gravity-related episode. Discussed in this episode: what are gravitational waves, how are they detected, and what does this announcement mean for future studies of the universe. Relevant links: http://www.nytimes.com/2016/02/12/science/ligo-gravitational-waves-black-holes-einstein.html https://www.ligo.caltech.edu/news/ligo20160211
The Turing Test
Let's imagine a future in which a truly intelligent computer program exists. How would it convince us (humanity) that it was intelligent? Alan Turing's answer to this question, proposed over 60 years ago, is that the program could convince a human conversational partner that it, the computer, was in fact a human. 60 years later, the Turing Test endures as a gold standard of artificial intelligence. It hasn't been beaten, either--yet. Relevant links: https://en.wikipedia.org/wiki/Turing_test http://commonsensereasoning.org/winograd.html http://consumerist.com/2015/09/29/its-not-just-you-robots-are-also-bad-at-assembling-ikea-furniture/
Item Response Theory: how smart ARE you?
Psychometrics is all about measuring the psychological characteristics of people; for example, scholastic aptitude. How is this done? Tests, of course! But there's a chicken-and-egg problem here: you need to know both how hard a test is, and how smart the test-taker is, in order to get the results you want. How to solve this problem, one equation with two unknowns? Item response theory--the data science behind such tests and the GRE. Relevant links: https://en.wikipedia.org/wiki/Item_response_theory
Go!
As you may have heard, a computer beat a world-class human player in Go last week. As recently as a year ago the prediction was that it would take a decade to get to this point, yet here we are, in 2016. We'll talk about the history and strategy of game-playing computer programs, and what makes Google's AlphaGo so special. Relevant link: http://googleresearch.blogspot.com/2016/01/alphago-mastering-ancient-game-of-go.html
Great Social Networks in History
The Medici were one of the great ruling families of Europe during the Renaissance. How did they come to rule? Not power, or money, or armies, but through the strength of their social network. And speaking of great historical social networks, analysis of the network of letter-writing during the Enlightenment is helping humanities scholars track the dispersion of great ideas across the world during that time, from Voltaire to Benjamin Franklin and everyone in between. Relevant links: https://www2.bc.edu/~jonescq/mb851/Mar12/PadgettAnsell_AJS_1993.pdf http://republicofletters.stanford.edu/index.html
How Much to Pay a Spy (and a lil' more auctions)
A few small encores on auction theory, and then--how can you value a piece of information before you know what it is? Decision theory has some pointers. Some highly relevant information if you are trying to figure out how much to pay a spy. Relevant links: https://tuecontheoryofnetworks.wordpress.com/2013/02/25/the-origin-of-the-dutch-auction/ http://www.nowozin.net/sebastian/blog/the-fair-price-to-pay-a-spy-an-introduction-to-the-value-of-information.html
Sold! Auctions (Part 2)
The Google ads auction is a special kind of auction, one you might not know as well as the famous English auction (which we talked about in the last episode). But if it's what Google uses to sell billions of dollars of ad space in real time, you know it must be pretty cool. Relevant links: https://en.wikipedia.org/wiki/English_auction http://people.ischool.berkeley.edu/~hal/Papers/2006/position.pdf http://www.benedelman.org/publications/gsp-060801.pdf
Going Once, Going Twice: Auctions (Part 1)
The Google AdWords algorithm is (famously) an auction system for allocating a massive amount of online ad space in real time--with that fascinating use case in mind, this episode is part one in a two-part series all about auctions. We dive into the theory of auctions, and what makes a "good" auction. Relevant links: https://en.wikipedia.org/wiki/English_auction http://people.ischool.berkeley.edu/~hal/Papers/2006/position.pdf http://www.benedelman.org/publications/gsp-060801.pdf
Chernoff Faces and Minard Maps
A data visualization extravaganza in this episode, as we discuss Chernoff faces (you: "faces? huh?" us: "oh just you wait") and the greatest data visualization of all time, or at least the Napoleonic era. Relevant links: http://lya.fciencias.unam.mx/rfuentes/faces-chernoff.pdf https://en.wikipedia.org/wiki/Charles_Joseph_Minard
t-SNE: Reduce Your Dimensions, Keep Your Clusters
Ever tried to visualize a cluster of data points in 40 dimensions? Or even 4, for that matter? We prefer to stick to 2, or maybe 3 if we're feeling well-caffeinated. The t-SNE algorithm is one of the best tools on the market for doing dimensionality reduction when you have clustering in mind. Relevant links: https://www.youtube.com/watch?v=RJVL80Gg3lA
The [Expletive Deleted] Problem
The town of [expletive deleted], England, is responsible for the clbuttic [expletive deleted] problem. This week on Linear Digressions: we try really hard not to swear too much. Related links: https://en.wikipedia.org/wiki/Scunthorpe_problem https://www.washingtonpost.com/news/worldviews/wp/2016/01/05/where-is-russia-actually-mordor-in-the-world-of-google-translate/
Unlabeled Supervised Learning--whaaa?
In order to do supervised learning, you need a labeled training dataset. Or do you...? Relevant links: http://www.cs.columbia.edu/~dplewis/candidacy/goldman00enhancing.pdf
Hacking Neural Nets
Machine learning: it can be fooled, just like you or me. Here's one of our favorite examples, a study into hacking neural networks. Relevant links: http://arxiv.org/pdf/1412.1897v4.pdf
Zipf's Law
Zipf's law is related to the statistics of how word usage is distributed. As it turns out, this is also strikingly reminiscent of how income is distributed, and populations of cities, and bug reports in software, as well as tons of other phenomena that we all interact with every day. Relevant links: http://economix.blogs.nytimes.com/2010/04/20/a-tale-of-many-cities/ http://arxiv.org/pdf/cond-mat/0412004.pdf https://terrytao.wordpress.com/2009/07/03/benfords-law-zipfs-law-and-the-pareto-distribution/
Indie Announcement
We've gone indie! Which shouldn't change anything about the podcast that you know and love, but we're super excited to keep bringing you Linear Digressions as a fully independent podcast. Some links mentioned in the show: https://twitter.com/lindigressions https://twitter.com/benjaffe https://twitter.com/multiarmbandit https://soundcloud.com/linear-digressions http://lineardigressions.com/
Portrait Beauty
It's Da Vinci meets Skynet: what makes a portrait beautiful, according to a machine learning algorithm. Snap a selfie and give us a listen.
The Cocktail Party Problem
Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!
A Criminally Short Introduction to Semi Supervised Learning
Because there are more interesting problems than there are labeled datasets, semi-supervised learning provides a framework for getting feedback from the environment as a proxy for labels of what's "correct." Of all the machine learning methodologies, it might also be the closest to how humans usually learn--we go through the world, getting (noisy) feedback on the choices we make and learn from the outcomes of our actions.
Thresholdout: Down with Overfitting
Overfitting to your training data can be avoided by evaluating your machine learning algorithm on a holdout test dataset, but what about overfitting to the test data? Turns out it can be done, easily, and you have to be very careful to avoid it. But an algorithm from the field of privacy research shows promise for keeping your test data safe from accidental overfitting
The State of Data Science
How many data scientists are there, where do they live, where do they work, what kind of tools do they use, and how do they describe themselves? RJMetrics wanted to know the answers to these questions, so they decided to find out and share their analysis with the world. In this very special interview episode, we welcome Tristan Handy, VP of Marketing at RJMetrics, who will talk about "The State of Data Science Report."
Data Science for Making the World a Better Place
There's a good chance that great data science is going on close to you, and that it's going toward making your city, state, country, and planet a better place. Not all the data science questions being tackled out there are about finding the sleekest new algorithm or billion-dollar company idea--there's a whole world of social data science that just wants to make the world a better place to live in.
Kalman Runners
The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If you've ever run a marathon, or been a nuclear missile, you probably know all about these challenges already. By the way, we neglected to mention in the episode: Katie's marathon time was 3:54:27!
Neural Net Inception
When you sleep, the neural pathways in your brain take the "white noise" of your resting brain, mix in your experiences and imagination, and the result is dreams (that is a highly unscientific explanation, but you get the idea). What happens when neural nets are put through the same process? Train a neural net to recognize pictures, and then send through an image of white noise, and it will start to see some weird (but cool!) stuff.
Benford's Law
Sometimes numbers are... weird. Benford's Law is a favorite example of this for us--it's a law that governs the distribution of the first digit in certain types of numbers. As it turns out, if you're looking up the length of a river, the population of a country, the price of a stock... not all first digits are created equal.
Guinness
Not to oversell it, but the student's t-test has got to have the most interesting history of any statistical test. Which is saying a lot, right? Add some boozy statistical trivia to your arsenal in this epsiode.
PFun with P Values
Doing some science, and want to know if you might have found something? Or maybe you've just accomplished the scientific equivalent of going fishing and reeling in an old boot? Frequentist p-values can help you distinguish between "eh" and "oooh interesting". Also, there's a lot of physics in this episode, nerds.
Watson
This machine learning algorithm beat the human champions at Jeopardy. What is... Watson?
Bayesian Psychics
Come get a little "out there" with us this week, as we use a meta-study of extrasensory perception (or ESP, often used in the same sentence as "psychics") to chat about Bayesian vs. frequentist statistics.
Troll Detection
Ever found yourself wasting time reading online comments from trolls? Of course you have; we've all been there (it's 4 AM but I can't turn off the computer and go to sleep--someone on the internet is WRONG!). Now there's a way to use machine learning to automatically detect trolls, and minimize the impact when they try to derail online conversations.
Yiddish Translation
Imagine a language that is mostly spoken rather than written, contains many words in other languages, and has relatively little written overlap with English. Now imagine writing a machine-learning-based translation system that can convert that language to English. That's the problem that confronted researchers when they set out to automatically translate between Yiddish and English; the tricks they used help us understand a lot about machine translation.
Modeling Particles in Atomic Bombs
In a fun historical journey, Katie and Ben explore the history of the Manhattan Project, discuss the difficulties in modeling particle movement in atomic bombs with only punch-card computers and ingenuity, and eventually come to present-day uses of the Metropolis-Hastings algorithm... mentioning Solitaire along the way.
Random Number Generation
Let's talk about randomness! Although randomness is pervasive throughout the natural world, it's surprisingly difficult to generate random numbers. And even if your numbers look random (but actually aren't), it can have interesting consequences on the security of systems, and the accuracy of models and research. In this episode, Katie and Ben talk about randomness, its place in machine learning and computation in general, along with some random digressions of their own.
Electoral Insights (Part 2)
Following up on our last episode about how experiments can be performed in political science, now we explore a high-profile case of an experiment gone wrong. An extremely high-profile paper that was published in 2014, about how talking to people can convince them to change their minds on topics like abortion and gay marriage, has been exposed as the likely product of a fraudulently produced dataset. We’ll talk about a cool data science tool called the Kolmogorov-Smirnov test, which a pair of graduate students used to reverse-engineer the likely way that the fraudulent data was generated. But a bigger question still remains—what does this whole episode tell us about fraud and oversight in science?
Electoral Insights (Part 1)
The first of our two-parter discussing the recent electoral data fraud case. The results of the study in question were covered widely, including by This American Life (who later had to issue a retraction). Data science for election research involves studying voters, who are people, and people are tricky to study—every one of them is different, and the same treatment can have different effects on different voters. But with randomized controlled trials, small variations from person to person can even out when you look at a larger group. With the advent of randomized experiments in elections a few decades ago, a whole new door was opened for studying the most effective ways to campaign.
Falsifying Data
In the first of a few episodes on fraud in election research, we’ll take a look at a case study from a previous Presidential election, where polling results were faked. What are some telltale signs that data fraud might be present in a dataset? We’ll explore that in this episode.
Reporter Bot
There’s a big difference between a table of numbers or statistics, and the underlying story that a human might tell about how those numbers were generated. Think about a baseball game—the game stats and a newspaper story are describing the same thing, but one is a good input for a machine learning algorithm and the other is a good story to read over your morning coffee. Data science and machine learning are starting to bridge this gap, taking the raw data on things like baseball games, financial scenarios, etc. and automatically writing human-readable stories that are increasingly indistinguishable from what a human would write. In this episode, we’ll talk about some examples of auto-generated content—you’ll be amazed at how sophisticated some of these reporter-bots can be. By the way, this summary was written by a human. (Or was it?)
Careers in Data Science
Let’s talk money. As a “hot” career right now, data science can pay pretty well. But for an individual person matched with a specific job or industry, how much should someone expect to make? Since Katie was on the job market lately, this was something she’s been researching, and it turns out that data science itself (in particular linear regressions) has some answers. In this episode, we go through a survey of hundreds of data scientists, who report on their job duties, industry, skills, education, location, etc. along with their salaries, and then talk about how this data was fed into a linear regression so that you (yes, you!) can use the patterns in the data to know what kind of salary any particular kind of data scientist might expect.
That's "Dr Katie" to You
Katie successfully defended her thesis! We celebrate her return, and talk a bit about what getting a PhD in Physics is like.
Neural Nets (Part 2)
In the last episode, we zipped through neural nets and got a quick idea of how they work and why they can be so powerful. Here’s the real payoff of that work: In this episode, we’ll talk about a brand-new pair of results, one from Stanford and one from Google, that use neural nets to perform automated picture captioning. One neural net does the object and relationship recognition of the image, a second neural net handles the natural language processing required to express that in an English sentence, and when you put them together you get an automated captioning tool. Two heads are better than one indeed...
Neural Nets (Part 1)
There is no known learning algorithm that is more flexible and powerful than the human brain. That's quite inspirational, if you think about it--to level up machine learning, maybe we should be going back to biology and letting millions of year of evolution guide the structure of our algorithms. This is the idea behind neural nets, which mock up the structure of the brain and are some of the most studied and powerful algorithms out there. In this episode, we’ll lay out the building blocks of the neural net (called neurons, naturally) and the networks that are built out of them. We’ll also explore the results that neural nets get when used to do object recognition in photographs.
Inferring Authorship (Part 2)
Now that we’re up to speed on the classic author ID problem (who wrote the unsigned Federalist Papers?), we move onto a couple more contemporary examples. First, J.K. Rowling was famously outed using computational linguistics (and Twitter) when she wrote a book under the pseudonym Robert Galbraith. Second, we’ll talk about a mystery that still endures--who is Satoshi Nakamoto? Satoshi is the mysterious person (or people) behind an extremely lucrative cryptocurrency (aka internet money) called Bitcoin; no one knows who he, she or they are, but we have plenty of writing samples in the form of whitepapers and Bitcoin forum posts. We’ll discuss some attempts to link Satoshi Nakamoto with a cryptocurrency expert and computer scientist named Nick Szabo; the links are tantalizing, but not a smoking gun. “Who is Satoshi” remains an example of attempted author identification where the threads are tangled, the conclusions inconclusive and the stakes high.
Inferring Authorship (Part 1)
This episode is inspired by one of our projects for Intro to Machine Learning: given a writing sample, can you use machine learning to identify who wrote it? Turns out that the answer is yes, a person’s writing style is as distinctive as their vocal inflection or their gait when they walk. By tracing the vocabulary used in a given piece, and comparing the word choices to the word choices in writing samples where we know the author, it can be surprisingly clear who is the more likely author of a given piece of text. We’ll use a seminal paper from the 1960’s as our example here, where the Naive Bayes algorithm was used to determine whether Alexander Hamilton or James Madison was the more likely author of a number of anonymous Federalist Papers.
Statistical Mistakes and the Challenger Disaster
After the Challenger exploded in 1986, killing all 7 astronauts aboard, an investigation into the cause was immediately launched. In the cold temperatures the night before the launch, the o-rings that seal off the fuel tanks from the rocket boosters became inflexible, so they did not seal properly, which led to the fuel tank explosion. NASA knew that there could be o-ring problems, but performed the analysis of their data incorrectly and ended up massively underestimating the risk associated with the cold temperatures. In this episode, we'll unpack the mistakes they made. We'll talk about how they excluded data points that they thought were irrelevant but which actually were critical to recognizing a fatal pattern.
Genetics and Um Detection (HMM Part 2)
In part two of our series on Hidden Markov Models (HMMs), we talk to Katie and special guest Francesco about more useful and novel applications of HMMs. We revisit Katie's "Um Detector," and hear about how HMMs are used in genetics research.
Introducing Hidden Markov Models (HMM Part 1)
Wikipedia says, "A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states." What does that even mean? In part one of a special two-parter on HMMs, Katie, Ben, and special guest Francesco explain the basics of HMMs, and some simple applications of them in the real world. This episode sets the stage for part two, where we explore the use of HMMs in Modern Genetics, and possibly Katie's "Um Detector."
Monte Carlo For Physicists
This is another physics-centered podcast, about an ML-backed particle identification tool that we use to figure out what kind of particle caused a particular blob in the detector. But in this case, as in many cases, it looks hard at the outset to use ML because we don't have labeled training data. Monte Carlo to the rescue! Monte Carlo (MC) is fake data that we generate for ourselves, usually following certain sets of rules (often a Markov chain; in physics we generate MC according to the laws of physics as we understand them) and since you generated the event, you "know" what the correct label is. Of course, it's a lot of work to validate your MC, but the payoff is that then you can use Machine Learning where you never could before.
Random Kanye
Ever feel like you could randomly assemble words from a certain vocabulary and make semi-coherent Kanye West lyrics? Or technical documentation, imitations of local newscasters, your politically outspoken uncle, etc.? Wonder no more, there's a way to do this exact type of thing: it's called a Markov Chain, and probably the most powerful way to generate made-up data that you can then use for fun and profit. The idea behind a Markov Chain is that you probabilistically generate a sequence of steps, numbers, words, etc. where each next step/number/word depends only on the previous one, which makes it fast and efficient to computationally generate. Usually Markov Chains are used for serious academic uses, but this ain't one of them: here they're used to randomly generate rap lyrics based on Kanye West lyrics.