PLAY PODCASTS
Linear Digressions

Linear Digressions

310 episodes — Page 2 of 7

Communicating data science, from academia to industry

For something as multifaceted and ill-defined as data science, communication and sharing best practices across the field can be extremely valuable but also extremely, well, multifaceted and ill-defined. That doesn’t bother our guest today, Prof. Xiao-Li Meng of the Harvard statistics department, who is leading an effort to start an open-access Data Science Review journal in the model of the Harvard Business Review or Law Review. This episode features Xiao-Li talking about the need he sees for a central gathering place for data scientists in academia, industry, and government to come together to learn from (and teach!) each other. Relevant links: https://hdsr.mitpress.mit.edu/

Dec 30, 201926 min

Optimizing for the short-term vs. the long-term

When data scientists run experiments, like A/B tests, it’s really easy to plan on a period of a few days to a few weeks for collecting data. The thing is, the change that’s being evaluated might have effects that last a lot longer than a few days or a few weeks—having a big sale might increase sales this week, but doing that repeatedly will teach customers to wait until there’s a sale and never buy anything at full price, which could ultimately drive down revenue in the long term. Increasing the volume of ads on a website might lead people to click on more ads in the short term, but in the long term they’ll be more likely to visually block the ads out and learn to ignore them. But these long-term effects aren’t apparent from the short-term experiment, so this week we’re talking about a paper from Google research that confronts the short-term vs. long-term tradeoff, and how to measure long-term effects from short-term experiments. Relevant links: https://research.google/pubs/pub43887/

Dec 23, 201919 min

Interview with Prof. Andrew Lo, on using data science to inform complex business decisions

This episode features Prof. Andrew Lo, the author of a paper that we discussed recently on Linear Digressions, in which Prof. Lo uses data to predict whether a medicine in the development pipeline will eventually go on to win FDA approval. This episode gets into the story behind that paper: how the approval prospects of different drugs inform the investment decisions of pharma companies, how to stitch together siloed and incomplete datasts to form a coherent picture, and how the academics building some of these models think about when and how their work can make it out of academia and into industry. Professor Lo is an expert in business (he teaches at the MIT Sloan School of Management) and work like his shows how data science can open up new ways of doing business. Relevant links: https://hdsr.mitpress.mit.edu/pub/ct67j043

Dec 16, 201927 min

Using machine learning to predict drug approvals

One of the hottest areas in data science and machine learning right now is healthcare: the size of the healthcare industry, the amount of data it generates, and the myriad improvements possible in the healthcare system lay the groundwork for compelling, innovative new data initiatives. One spot that drives much of the cost of medicine is the riskiness of developing new drugs: drug trials can cost hundreds of millions of dollars to run and, especially given that numerous medicines end up failing to get approval from the FDA, pharmaceutical companies want to have as much insight as possible about whether a drug is more or less likely to make it through clinical trials and on to approval. Professor Andrew Lo and collaborators at MIT Sloan School of Management is taking a look at this prediction task using machine learning, and has an article in the Harvard Data Science Review showing what they were able to find. It’s a fascinating example of how data science can be used to address business needs in creative but very targeted and effective ways. Relevant links: https://hdsr.mitpress.mit.edu/pub/ct67j043

Dec 8, 201925 min

Facial recognition, society, and the law

Facial recognition being used in everyday life seemed far-off not too long ago. Increasingly, it’s being used and advanced widely and with increasing speed, which means that our technical capabilities are starting to outpace (if they haven’t already) our consensus as a society about what is acceptable in facial recognition and what isn’t. The threats to privacy, fairness, and freedom are real, and Microsoft has become one of the first large companies using this technology to speak out in specific support of its regulation through legislation. Their arguments are interesting, provocative, and even if you don’t agree with every point they make or harbor some skepticism, there’s a lot to think about in what they’re saying.

Dec 2, 201943 min

Lessons learned from doing data science, at scale, in industry

If you’ve taken a machine learning class, or read up on A/B tests, you likely have a decent grounding in the theoretical pillars of data science. But if you’re in a position to have actually built lots of models or run lots of experiments, there’s almost certainly a bunch of extra “street smarts” insights you’ve had that go beyond the “books smarts” of more academic studies. The data scientists at Booking.com, who run build models and experiments constantly, have written a paper that bridges the gap and talks about what non-obvious things they’ve learned from that practice. In this episode we read and digest that paper, talking through the gotchas that they don’t always teach in a classroom but that make data science tricky and interesting in the real world. Relevant links: https://www.kdd.org/kdd2019/accepted-papers/view/150-successful-machine-learning-models-6-lessons-learned-at-booking.com

Nov 25, 201928 min

Varsity A/B Testing

When you want to understand if doing something causes something else to happen, like if a change to a website causes and dip or rise in downstream conversions, the gold standard analysis method is to use randomized controlled trials. Once you’ve properly randomized the treatment and effect, the analysis methods are well-understood and there are great tools in R and python (and other languages) to find the effects. However, when you’re operating at scale, the logistics of running all those tests, and reaching correct conclusions reliably, becomes the main challenge—making sure the right metrics are being computed, you know when to stop an experiment, you minimize the chances of finding spurious results, and many other issues that are simple to track for one or two experiments but become real challenges for dozens or hundreds of experiments. Nonetheless, the reality is that there might be dozens or hundreds of experiments worth running. So in this episode, we’ll work through some of the most important issues for running experiments at scale, with strong support from a series of great blog posts from Airbnb about how they solve this very issue. For some blog post links relevant to this episode, visit lineardigressions.com

Nov 18, 201936 min

The Care and Feeding of Data Scientists: Growing Careers

In the third and final installment of a conversation with Michelangelo D’Agostino, VP of Data Science and Engineering at Shoprunner, about growing and mentoring data scientists on your team. Some of our topics of conversation include how to institute hack time as a way to learn new things, what career growth looks like in data science, and how to institutionalize professional growth as part of a career ladder. As with the other episodes in this series, the topics we cover today are also covered in the O’Reilly report linked below. Relevant links: https://oreilly-ds-report.s3.amazonaws.com/Care_and_Feeding_of_Data_Scientists.pdf

Nov 11, 201925 min

The Care and Feeding of Data Scientists: Recruiting and Hiring Data Scientists

This week’s episode is the second in a three-part interview series with Michelangelo D’Agostino, VP of Data Science at Shoprunner. This discussion centers on building a team, which means recruiting, interviewing and hiring data scientists. Since data science talent is in such high demand, and data scientists are understandably choosy about where they go to work, a good recruiting and hiring program can have a big impact on the size and quality of the team. Our chat covers much a couple of sections in our dual-authored O’Reilly report, “The Care and Feeding of Data Scientists,” which you can read at the link below. https://oreilly-ds-report.s3.amazonaws.com/Care_and_Feeding_of_Data_Scientists.pdf

Nov 4, 201920 min

The Care and Feeding of Data Scientists: Becoming a Data Science Manager

Data science management isn’t easy, and many data scientists are finding themselves learning on the job how to manage data science teams as they get promoted into more formal leadership roles. O’Reilly recently release a report, written by yours truly (Katie) and another experienced data science manager, Michelangelo D’Agostino, where we lay out the most important tasks of a data science manager and some thoughts on how to unpack those tasks and approach them in a way that makes a new manager successful. This episode is an interview episode, the first of three, where we discuss some of the common paths to data science management and what distinguishes (and unifies) different types of data scientists and data science teams. Relevant links: https://oreilly-ds-report.s3.amazonaws.com/Care_and_Feeding_of_Data_Scientists.pdf

Oct 28, 201924 min

Procella: YouTube's super-system for analytics data storage

If you’re trying to manage a project that serves up analytics data for a few very distinct uses, you’d be wise to consider having custom solutions for each use case that are optimized for the needs and constraints of that use cases. You also wouldn’t be YouTube, which found themselves with this problem (gigantic data needs and several very different use cases of what they needed to do with that data) and went a different way: they built one analytics data system to serve them all. Procella, the system they built, is the topic of our episode today: by deconstructing the system, we dig into the four motivating uses of this system, the complexity they had to introduce to service all four uses simultaneously, and the impressive engineering that has to go into building something that “just works.” Relevant links: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45a6cea2b9c101761ea1b51c961628093ec1d5da.pdf

Oct 21, 201929 min

Kalman Runners

The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If you've ever run a marathon, or been a nuclear missile, you probably know all about these challenges already. IMPORTANT NON-DATA SCIENCE CHICAGO MARATHON RACE RESULT FROM KATIE: My finish time was 3:20:17! It was the closest I may ever come to having the perfect run. That’s a 34-minute personal record and a qualifying time for the Boston Marathon, so… guess I gotta go do that now.

Oct 13, 201915 min

What's *really* so hard about feature engineering?

Feature engineering is ubiquitous but gets surprisingly difficult surprisingly fast. What could be so complicated about just keeping track of what data you have, and how you made it? A lot, as it turns out—most data science platforms at this point include explicit features (in the product sense, not the data sense) just for keeping track of and sharing features (in the data sense, not the product sense). Just like a good library needs a catalogue, a city needs a map, and a home chef needs a cookbook to stay organized, modern data scientists need feature libraries, data dictionaries, and a general discipline around generating and caring for their datasets.

Oct 6, 201921 min

Data storage for analytics: stars and snowflakes

If you’re a data scientist or data engineer thinking about how to store data for analytics uses, one of the early choices you’ll have to make (or live with, if someone else made it) is how to lay out the data in your data warehouse. There are a couple common organizational schemes that you’ll likely encounter, and that we cover in this episode: first is the famous star schema, followed by the also-famous snowflake schema.

Sep 30, 201915 min

Data storage: transactions vs. analytics

Data scientists and software engineers both work with databases, but they use them for different purposes. So if you’re a data scientist thinking about the best way to store and access data for your analytics, you’ll likely come up with a very different set of requirements than a software engineer looking to power an application. Hence the split between analytics and transactional databases—certain technologies are designed for one or the other, but no single type of database is perfect for both use cases. In this episode we’ll talk about the differences between transactional and analytics databases, so no matter whether you’re an analytics person or more of a classical software engineer, you can understand the needs of your colleagues on the other side.

Sep 23, 201916 min

GROVER: an algorithm for making, and detecting, fake news

There are a few things that seem to be very popular in discussions of machine learning algorithms these days. First is the role that algorithms play now, or might play in the future, when it comes to manipulating public opinion, for example with fake news. Second is the impressive success of generative adversarial networks, and similar algorithms. Third is making state-of-the-art natural language processing algorithms and naming them after muppets. We get all three this week: GROVER is an algorithm for generating, and detecting, fake news. It’s quite successful at both tasks, which raises an interesting question: is it safer to embargo the model (like GPT-2, the algorithm that was “too dangerous to release”), or release it as the best detector and antidote for its own fake news? Relevant links: https://grover.allenai.org/ https://arxiv.org/abs/1905.12616

Sep 16, 201918 min

Data science teams as innovation initiatives

When a big, established company is thinking about their data science strategy, chances are good that whatever they come up with, it’ll be somewhat at odds with the company’s current structure and processes. Which makes sense, right? If you’re a many-decades-old company trying to defend a successful and long-lived legacy and market share, you won’t have the advantage that many upstart competitors have of being able to bake data analytics and science into the core structure of the organization. Instead, you have to retrofit. If you’re the data scientist working in this environment, tasked with being on the front lines of a data transformation, you may be grappling with some real institutional challenges in this setup, and this episode is for you. We’ll unpack the reason data innovation is necessarily challenging, the different ways to innovate and some of their tradeoffs, and some of the hardest but most critical phases in the innovation process. Relevant links: https://www.amazon.com/Innovators-Dilemma-Revolutionary-Change-Business/dp/0062060244 https://www.amazon.com/Other-Side-Innovation-Execution-Challenge/dp/1422166961

Sep 9, 201915 min

Can Fancy Running Shoes Cause You To Run Faster?

This is a re-release of an episode that originally aired on July 29, 2018. The stars aligned for me (Katie) this past weekend: I raced my first half-marathon in a long time and got to read a great article from the NY Times about a new running shoe that Nike claims can make its wearers run faster. Causal claims like this one are really tough to verify, because even if the data suggests that people wearing the shoe are faster that might be because of correlation, not causation, so I loved reading this article that went through an analysis of thousands of runners' data in 4 different ways. Each way has a great explanation with pros and cons (as well as results, of course), so be sure to read the article after you check out this episode! Relevant links: https://www.nytimes.com/interactive/2018/07/18/upshot/nike-vaporfly-shoe-strava.html

Sep 1, 201930 min

Organizational Models for Data Scientists

When data science is hard, sometimes it’s because the algorithms aren’t converging or the data is messy, and sometimes it’s because of organizational or business issues: the data scientists aren’t positioned correctly to bring value to their organization. Maybe they don’t know what problems to work on, or they build solutions to those problems but nobody uses what they build. A lot of this can be traced back to the way the team is organized, and (relatedly) how it interacts with the rest of the organization, which is what we tackle in this issue. There are lots of options about how to organize your data science team, each of which has strengths and weaknesses, and Pardis Noorzad wrote a great blog post recently that got us talking. Relevant links: https://medium.com/swlh/models-for-integrating-data-science-teams-within-organizations-7c5afa032ebd

Aug 25, 201923 min

Data Shapley

We talk often about which features in a dataset are most important, but recently a new paper has started making the rounds that turns the idea of importance on its head: Data Shapley is an algorithm for thinking about which examples in a dataset are most important. It makes a lot of intuitive sense: data that’s just repeating examples that you’ve already seen, or that’s noisy or an extreme outlier, might not be that valuable for using to train a machine learning model. But some data is very valuable, it’s disproportionately useful for the algorithm figuring out what the most important trends are, and Data Shapley is explicitly designed to help machine learning researchers spend their time understanding which data points are most valuable and why. Relevant links: http://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdf https://blog.acolyer.org/2019/07/15/data-shapley/

Aug 19, 201916 min

A Technical Deep Dive on Stanley, the First Self-Driving Car

This is a re-release of an episode that first ran on April 9, 2017. In our follow-up episode to last week's introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems for getting a car to drive itself 140 miles across the desert. Lidar? You betcha! Drive-by-wire? Of course! Probabilistic terrain reconstruction? Absolutely! All this and more this week on Linear Digressions.

Aug 12, 201941 min

An Introduction to Stanley, the First Self-Driving Car

In October 2005, 23 cars lined up in the desert for a 140 mile race. Not one of those cars had a driver. This was the DARPA grand challenge to see if anyone could build an autonomous vehicle capable of navigating a desert route (and if so, whose car could do it the fastest); the winning car, Stanley, now sits in the Smithsonian Museum in Washington DC as arguably the world's first real self-driving car. In this episode (part one of a two-parter), we'll revisit the DARPA grand challenge from 2005 and the rules and constraints of what it took for Stanley to win the competition. Next week, we'll do a deep dive into Stanley's control systems and overall operation and what the key systems were that allowed Stanley to win the race. Relevant links: http://isl.ecst.csuchico.edu/DOCS/darpa2005/DARPA%202005%20Stanley.pdf

Aug 5, 201914 min

Putting the "science" in data science: the scientific method, the null hypothesis, and p-hacking

The modern scientific method is one of the greatest (perhaps the greatest?) system we have for discovering knowledge about the world. It’s no surprise then that many data scientists have found their skills in high demand in the business world, where knowing more about a market, or industry, or type of user becomes a competitive advantage. But the scientific method is built upon certain processes, and is disciplined about following them, in a way that can get swept aside in the rush to get something out the door—not the least of which is the fact that in science, sometimes a result simply doesn’t materialize, or sometimes a relationship simply isn’t there. This makes data science different than operations, or software engineering, or product design in an important way: a data scientist needs to be comfortable with finding nothing in the data for certain types of searches, and needs to be even more comfortable telling his or her boss, or boss’s boss, that an attempt to build a model or find a causal link has turned up nothing. It’s a result that often disappointing and tough to communicate, but it’s crucial to the overall credibility of the field.

Jul 29, 201924 min

Interleaving

If you’re Google or Netflix, and you have a recommendation or search system as part of your bread and butter, what’s the best way to test improvements to your algorithm? A/B testing is the canonical answer for testing how users respond to software changes, but it gets tricky really fast to think about what an A/B test means in the context of an algorithm that returns a ranked list. That’s why we’re talking about interleaving this week—it’s a simple modification to A/B testing that makes it much easier to race two algorithms against each other and find the winner, and it allows you to do it with much less data than a traditional A/B test. Relevant links: https://medium.com/netflix-techblog/interleaving-in-online-experiments-at-netflix-a04ee392ec55 https://www.microsoft.com/en-us/research/publication/predicting-search-satisfaction-metrics-with-interleaved-comparisons/ https://www.cs.cornell.edu/people/tj/publications/joachims_02b.pdf

Jul 22, 201916 min

Federated Learning

This is a re-release of an episode first released in May 2017. As machine learning makes its way into more and more mobile devices, an interesting question presents itself: how can we have an algorithm learn from training data that's being supplied as users interact with the algorithm? In other words, how do we do machine learning when the training dataset is distributed across many devices, imbalanced, and the usage associated with any one user needs to be obscured somewhat to protect the privacy of that user? Enter Federated Learning, a set of related algorithms from Google that are designed to help out in exactly this scenario. If you've used keyboard shortcuts or autocomplete on an Android phone, chances are you've encountered Federated Learning even if you didn't know it.

Jul 14, 201915 min

Endogenous Variables and Measuring Protest Effectiveness

This is a re-release of an episode first released in February 2017. Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers? It's a tricky question to answer, since usually we need randomly distributed treatments (e.g. big protests) to understand causality, but there's no reason to believe that big protests are actually randomly distributed. In other words, protest size is endogenous to legislative response, and understanding cause and effect is very challenging. So, what to do? Well, at least in the case of studying Tea Party protest effectiveness, researchers have used rainfall, of all things, to understand the impact of a big protest. In other words, rainfall is the instrumental variable in this analysis that cracks the scientific case open. What does rainfall have to do with protests? Do protests actually matter? What do we mean when we talk about endogenous and instrumental variables? We wouldn't be very good podcasters if we answered all those questions here--you gotta listen to this episode to find out.

Jul 7, 201917 min

Deepfakes

Generative adversarial networks (GANs) are producing some of the most realistic artificial videos we’ve ever seen. These videos are usually called “deepfakes”. Even to an experienced eye, it can be a challenge to distinguish a fabricated video from a real one, which is an extraordinary challenge in an era when the truth of what you see on the news or especially on social media is worthy of skepticism. And just in case that wasn’t unsettling enough, the algorithms just keep getting better and more accessible—which means it just keeps getting easier to make completely fake, but real-looking, videos of celebrities, politicians, and perhaps even just regular people. Relevant links: http://lineardigressions.com/episodes/2016/5/28/neural-nets-play-cops-and-robbers-aka-generative-adversarial-networks http://fortune.com/2019/06/12/deepfake-mark-zuckerberg/ https://www.youtube.com/watch?v=EfREntgxmDs https://spectrum.ieee.org/tech-talk/robotics/artificial-intelligence/will-deepfakes-detection-be-ready-for-2020 https://giorgiop.github.io/posts/2018/03/17/AI-and-digital-forgery/

Jul 1, 201915 min

Revisiting Biased Word Embeddings

The topic of bias in word embeddings gets yet another pass this week. It all started a few years ago, when an analogy task performed on Word2Vec embeddings showed some indications of gender bias around professions (as well as other forms of social bias getting reproduced in the algorithm’s embeddings). We covered the topic again a while later, covering methods for de-biasing embeddings to counteract this effect. And now we’re back, with a second pass on the original Word2Vec analogy task, but where the researchers deconstructed the “rules” of the analogies themselves and came to an interesting discovery: the bias seems to be, at least in part, an artifact of the analogy construction method. Intrigued? So were we… Relevant link: https://arxiv.org/abs/1905.09866

Jun 24, 201918 min

Attention in Neural Nets

There’s been a lot of interest lately in the attention mechanism in neural nets—it’s got a colloquial name (who’s not familiar with the idea of “attention”?) but it’s more like a technical trick that’s been pivotal to some recent advances in computer vision and especially word embeddings. It’s an interesting example of trying out human-cognitive-ish ideas (like focusing consideration more on some inputs than others) in neural nets, and one of the more high-profile recent successes in playing around with neural net architectures for fun and profit.

Jun 17, 201926 min

Interview with Joel Grus

This week’s episode is a special one, as we’re welcoming a guest: Joel Grus is a data scientist with a strong software engineering streak, and he does an impressive amount of speaking, writing, and podcasting as well. Whether you’re a new data scientist just getting started, or a seasoned hand looking to improve your skill set, there’s something for you in Joel’s repertoire.

Jun 10, 201939 min

Re - Release: Factorization Machines

What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.

Jun 3, 201920 min

Re-release: Auto-generating websites with deep learning

We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurrent neural nets that generate descriptions and captions of the images. Our episode today tells a similar tale, except today we're talking about a blog post where the author fed in wireframes of a website design and asked the neural net to generate the HTML and CSS that would actually build a website that looks like the wireframes. If you're a programmer who thinks your job is challenging enough that you're automation-proof, guess again...

May 27, 201919 min

Advice to those trying to get a first job in data science

We often hear from folks wondering what advice we can give them as they search for their first job in data science. What does a hiring manager look for? Should someone focus on taking classes online, doing a bootcamp, reading books, something else? How can they stand out in a crowd? There’s no single answer, because so much depends on the person asking in the first place, but that doesn’t stop us from giving some perspective. So in this episode we’re sharing that advice out more widely, so hopefully more of you can benefit from it.

May 19, 201917 min

Re - Release: Machine Learning Technical Debt

This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the idea of technical debt, which are inefficiencies that crop up in the code when you're trying to go fast. You take shortcuts, hard-code variable values, skimp on the documentation, and generally write not-that-great code in order to get something done quickly, and then end up paying for it later on. This is technical debt, and it's particularly easy to accrue with machine learning workflows. That's the premise of this episode's paper. https://ai.google/research/pubs/pub43146

May 12, 201922 min

Estimating Software Projects, and Why It's Hard

If you’re like most software engineers and, especially, data scientists, you find it really hard to make accurate estimates of how long a project will take to complete. Don’t feel bad: statistics is most likely actively working against your best efforts to give your boss an accurate delivery date. This week, we’ll talk through a great blog post that digs into the underlying probability and statistics assumptions that are probably driving your estimates, versus the ones that maybe should be driving them. Relevant links: https://erikbern.com/2019/04/15/why-software-projects-take-longer-than-you-think-a-statistical-model.html

May 5, 201919 min

The Black Hole Algorithm

53.5 million light-years away, there’s a gigantic galaxy called M87 with something interesting going on inside it. Between Einstein’s theory of relativity and the motion of a group of stars in the galaxy (the motion is characteristic of there being a huge gravitational mass present), scientists have believed for years that there is a supermassive black hole at the center of that galaxy. However, black holes are really hard to see directly because they aren’t a light source like a star or a supernova. They suck up all the light around them, and moreover, even though they’re really massive, they’re small in volume. That’s why it was so amazing a few weeks ago when scientists announced that they had reconstructed an image of a black hole for the first time ever. The image was the result of many measurements combined together with a clever reconstruction strategy, and giving scientists, engineers, and all the rest of us something to marvel at.

Apr 29, 201920 min

Structure in AI

As artificial intelligence algorithms get applied to more and more domains, a question that often arises is whether to somehow build structure into the algorithm itself to mimic the structure of the problem. There’s usually some amount of knowledge we already have of each domain, an understanding of how it usually works, but it’s not clear how (or even if) to lend this knowledge to an AI algorithm to help it get started. Sure, it may get the algorithm caught up to where we already were on solving that problem, but will it eventually become a limitation where the structure and assumptions prevent the algorithm from surpassing human performance? It’s a problem without a universal answer. This week, we’ll talk about the question in general, and especially recommend a recent discussion between Christopher Manning and Yann LeCun, two AI researchers who hold different opinions on whether structure is a necessary good or a necessary evil. Relevant link: http://www.abigailsee.com/2018/02/21/deep-learning-structure-and-innate-priors.html

Apr 21, 201919 min

The Great Data Science Specialist vs. Generalist Debate

It’s not news that data scientists are expected to be capable in many different areas (writing software, designing experiments, analyzing data, talking to non-technical stakeholders). One thing that has been changing, though, as the field becomes a bit older and more mature, is our ideas about what data scientists should focus on to stay relevant. Should they specialize in a particular area (if so, which one)? Should they instead stay general and work across many different areas? In either case, what are the costs and benefits? This question has prompted a number of think pieces lately, which are sometimes advocating for specializing, and sometimes pointing out the benefits of generalists. In short, if you’re trying to figure out what to actually do, you might be hearing some conflicting opinions. In this episode, we break apart the arguments both ways, and maybe (hopefully?) reach a little resolution about where to go from here.

Apr 15, 201914 min

Google X, and Taking Risks the Smart Way

If you work in data science, you’re well aware of the sheer volume of high-risk, high-reward projects that are hypothetically possible. The fact that they’re high-reward means they’re exciting to think about, and the payoff would be huge if they succeed, but the high-risk piece means that you have to be smart about what you choose to work on and be wary of investing all your resources in projects that fail entirely or starve other, higher-value projects. This episode focuses mainly on Google X, the so-called “Moonshot Factory” at Google that is a modern-day heir to the research legacies of Bell Labs and Xerox PARC. It’s an organization entirely focused on rapidly imagining, prototyping, invalidating, and, occasionally, successfully creating game-changing technologies. The process and philosophy behind Google X are useful for anyone thinking about how to stay aggressive and “responsibly irresponsible,” which includes a lot of you data science folks out there.

Apr 8, 201919 min

Statistical Significance in Hypothesis Testing

When you are running an AB test, one of the most important questions is how much data to collect. Collect too little, and you can end up drawing the wrong conclusion from your experiment. But in a world where experimenting is generally not free, and you want to move quickly once you know the answer, there is such a thing as collecting too much data. Statisticians have been solving this problem for decades, and their best practices are encompassed in the ideas of power, statistical significance, and especially how to generally think about hypothesis testing. This week, we’re going over these important concepts, so your next AB test is just as data-intensive as it needs to be.

Apr 1, 201922 min

The Language Model Too Dangerous to Release

OpenAI recently created a cutting-edge new natural language processing model, but unlike all their other projects so far, they have not released it to the public. Why? It seems to be a little too good. It can answer reading comprehension questions, summarize text, translate from one language to another, and generate realistic fake text. This last case, in particular, raised concerns inside OpenAI that the raw model could be dangerous if bad actors had access to it, so researchers will spend the next six months studying the model (and reading comments from you, if you have strong opinions here) to decide what to do next. Regardless of where this lands from a policy perspective, it’s an impressive model and the snippets of released auto-generated text are quite impressive. We’re covering the methodology, the results, and a bit of the policy implications in our episode this week.

Mar 25, 201921 min

The cathedral and the bazaar

Imagine you have two choices of how to build something: top-down and controlled, with a few people playing a master designer role, or bottom-up and free-for-all, with nobody playing an explicit architect role. Which one do you think would make the better product? “The Cathedral and the Bazaar” is an essay exploring this question for open source software, and making an argument for the bottom-up approach. It’s not entirely intuitive that projects like Linux or scikit-learn, with many contributors and an open-door policy for modifying the code, would be able to resist the chaos of many cooks in the kitchen. So what makes it work in some cases? And sometimes not work in others? That’s the topic of discussion this week. Relevant links: http://www.catb.org/~esr/writings/cathedral-bazaar/cathedral-bazaar/index.html

Mar 17, 201932 min

AlphaStar

It’s time for our latest installation in the series on artificial intelligence agents beating humans at games that we thought were safe from the robots. In this case, the game is StarCraft, and the AI agent is AlphaStar, from the same team that built the Go-playing AlphaGo AI last year. StarCraft presents some interesting challenges though: the gameplay is continuous, there are many different kinds of actions a player must take, and of course there’s the usual complexities of playing strategy games and contending with human opponents. AlphaStar overcame all of these challenges, and more, to notch another win for the computers.

Mar 11, 201922 min

Are machine learning engineers the new data scientists?

For many data scientists, maintaining models and workflows in production is both a huge part of their job and not something they necessarily trained for if their background is more in statistics or machine learning methodology. Productionizing and maintaining data science code has more in common with software engineering than traditional science, and to reflect that, there’s a new-ish role, and corresponding job title, that you should know about. It’s called machine learning engineer, and it’s what a lot of data scientists are becoming. Relevant links: https://medium.com/@tomaszdudek/but-what-is-this-machine-learning-engineer-actually-doing-18464d5c699 https://www.forbes.com/sites/forbestechcouncil/2019/02/04/why-there-will-be-no-data-science-job-titles-by-2029/#64e3906c3a8f

Mar 4, 201920 min

Interview with Alex Radovic, particle physicist turned machine learning researcher

You’d be hard-pressed to find a field with bigger, richer, and more scientifically valuable data than particle physics. Years before “data scientist” was even a term, particle physicists were inventing technologies like the world wide web and cloud computing grids to help them distribute and analyze the datasets required to make particle physics discoveries. Somewhat counterintuitively, though, deep learning has only really debuted in particle physics in the last few years, although it’s making up for lost time with many exciting new advances. This episode of Linear Digressions is a little different from most, as we’ll be interviewing a guest, one of my (Katie’s) friends from particle physics, Alex Radovic. Alex and his colleagues have been at the forefront of machine learning in physics over the last few years, and his perspective on the strengths and shortcomings of those two fields together is a fascinating one.

Feb 25, 201935 min

K Nearest Neighbors

K Nearest Neighbors is an algorithm with secrets. On one hand, the algorithm itself is as straightforward as possible: find the labeled points nearest the point that you need to predict, and make a prediction that’s the average of their answers. On the other hand, what does “nearest” mean when you’re dealing with complex data? How do you decide whether a man and a woman of the same age are “nearer” to each other than two women several years apart? What if you convert all your monetary columns from dollars to cents, your distances from miles to nanometers, your weights from pounds to kilograms? Can your definition of “nearest” hold up under these types of transformations? We’re discussing all this, and more, in this week’s episode.

Feb 17, 201916 min

Not every deep learning paper is great. Is that a problem?

Deep learning is a field that’s growing quickly. That’s good! There are lots of new deep learning papers put out every day. That’s good too… right? What if not every paper out there is particularly good? What even makes a paper good in the first place? It’s an interesting thing to think about, and debate, since there’s no clean-cut answer and there are worthwhile arguments both ways. Wherever you find yourself coming down in the debate, though, you’ll appreciate the good papers that much more. Relevant links: https://blog.piekniewski.info/2018/07/14/autopsy-dl-paper/ https://www.reddit.com/r/MachineLearning/comments/90n40l/dautopsy_of_a_deep_learning_paper_quite_brutal/ https://www.reddit.com/r/MachineLearning/comments/agiatj/d_google_ai_refuses_to_share_dataset_fields_for_a/

Feb 11, 201917 min

The Assumptions of Ordinary Least Squares

Ordinary least squares (OLS) is often used synonymously with linear regression. If you’re a data scientist, machine learner, or statistician, you bump into it daily. If you haven’t had the opportunity to build up your understanding from the foundations, though, listen up: there are a number of assumptions underlying OLS that you should know and love. They’re interesting, force you to think about data and statistics, and help you know when you’re out of “good” OLS territory and into places where you could run into trouble.

Feb 3, 201925 min

Quantile Regression

Linear regression is a great tool if you want to make predictions about the mean value that an outcome will have given certain values for the inputs. But what if you want to predict the median? Or the 10th percentile? Or the 90th percentile. You need quantile regression, which is similar to ordinary least squares regression in some ways but with some really interesting twists that make it unique. This week, we’ll go over the concept of quantile regression, and also a bit about how it works and when you might use it. Relevant links: https://www.aeaweb.org/articles?id=10.1257/jep.15.4.143 https://eng.uber.com/analyzing-experiment-outcomes/

Jan 28, 201921 min

Heterogeneous Treatment Effects

When data scientists use a linear regression to look for causal relationships between a treatment and an outcome, what they’re usually finding is the so-called average treatment effect. In other words, on average, here’s what the treatment does in terms of making a certain outcome more or less likely to happen. But there’s more to life than averages: sometimes the relationship works one way in some cases, and another way in other cases, such that the average isn’t giving you the whole story. In that case, you want to start thinking about heterogeneous treatment effects, and this is the podcast episode for you. Relevant links: https://eng.uber.com/analyzing-experiment-outcomes/ https://multithreaded.stitchfix.com/blog/2018/11/08/bandits/ https://www.locallyoptimistic.com/post/against-ab-tests/

Jan 20, 201917 min