Data Shapley

We talk often about which features in a dataset a…

August 19, 201916m 55s

Audio is streamed directly from the publisher (feeds.soundcloud.com) as published in their RSS feed. Play Podcasts does not host this file. Rights-holders can request removal through the copyright & takedown page.

Original episode page

Show Notes

We talk often about which features in a dataset are most important, but recently a new paper has started making the rounds that turns the idea of importance on its head: Data Shapley is an algorithm for thinking about which examples in a dataset are most important. It makes a lot of intuitive sense: data that’s just repeating examples that you’ve already seen, or that’s noisy or an extreme outlier, might not be that valuable for using to train a machine learning model. But some data is very valuable, it’s disproportionately useful for the algorithm figuring out what the most important trends are, and Data Shapley is explicitly designed to help machine learning researchers spend their time understanding which data points are most valuable and why. Relevant links: http://proceedings.mlr.press/v97/ghorbani19c/ghorbani19c.pdf https://blog.acolyer.org/2019/07/15/data-shapley/

Topics

datasciencemachinelearninglineardigressions

← All episodes of Linear Digressions