Into the World of Genomics & Entrepreneurship — Adina Mangubat
Episode 7


Adina Mangubat, the CEO of Spiral Genetics, takes us into the world of genomics and how she works around the entrepreneurship side of things and more.

Deep Future · Deep Future

March 17, 2021 · 2h 6m

Audio is streamed directly from the publisher (deepfuture.tech) as published in their RSS feed. Play Podcasts does not host this file. Rights-holders can request removal through the copyright & takedown page.

Show Notes

Today we get to hang out with Adina Mangubat, a friend of mine that I know from salsa dancing and also from hanging out with computer hackers. She’s probably the youngest founder that I know. She’s been running her company, Spiral Genetics, for almost a decade since starting it in college at age 22.

It could be considered probably the most advanced bioinformatics technology for population genomics. What that means is DNA sequencing of massive populations, hundreds of thousands of people if you can, and then correlating that data to see what can be learned from it. It’s a huge frontier, and there’s so much that can be learned from doing this kind of work.

Adina is really at the forefront of that. It’s a really fascinating conversation where she breaks down all that stuff: what it means, what DNA sequencing is about, the potential for bioinformatics, the potential for population genomics, etc. If you don’t know anything about it, this is the perfect episode for you, because I’m asking Adina lots of dumb questions. You’re going to love it.

She’s also super entrepreneurial and a hustler, which is very inspiring. Adina has built this company. She actually sold it to a large biotech company and then spun it back out. She’s been through a lot as an entrepreneur and we talk about that a bunch.

And the other thing about Adina that’s super interesting to me is that she’s really committed to figuring out how you can create these transparent, high integrity, mission driven cultures in startups and small companies, and that’s pioneering work. It is really important and difficult work.

It’s unproven. We don’t know if it’s even going to work, but it’s so necessary to figure out how we make better companies. Some people have to be the ones to try that. We talk about that and I think there’s a lot to learn. At the end of this episode, Adina and I talk a little bit about adoption and parenting. I am kind of deep into that, having adopted a child and raised her to age 14, so far seemingly successfully. Adina is kind of early in that cycle. If you’re interested in that sort of thing, there’s a conversation about that at the end. I hope you guys like this episode and get a lot out of it.

Pablos: You seem to be possibly the youngest Founder/CEO that I know. I know other people who are young. You might not be the youngest now, but when you started, you were the youngest.

Adina: 22? I know some people that are younger.

It’s not common. I know people who started younger but they did not succeed at keeping it going for very long. You started at 22 and you’re still at it, which means you’re tenacious.

Maybe also stupid or crazy or all three.

I’m curious about that. First, I want to know how that happened for you. I don’t know if the track you’re on now is what you had planned. When you were a child, were your parents trying to convince you to be an entrepreneur?

No, definitely not. My family had planned on helping me out with grad school if I wanted to do that. I was like, “I don’t want to do that. I want to start this company instead.” My dad wrote me a tiny check and he slid it across the table. He’s like, “This is going to be the hardest and most educational year of your life.” I was like, “Really?” He was like, “Yes.” My mom was supportive but worried. She would call me every couple of weeks and be like, “How’s it going? Are you thinking about applying for a normal job?” After a while, she figured out this was clearly not a phase and that I was going to be okay.

I gathered later that the reason why she kept on asking is because she had started a company when she was young. She ran a CPR training business and trained a bunch of the Secret Service. Back then, she had connections and you could roll up to the White House at midnight and be like, “Can I get a tour?” They would be like, “Yes.” She’d had to go through the entire process of starting a company back then. She knew that it was going to be hard.

Were they right?

Yes, it’s hard.

Do you think that you believed them or you didn’t believe them?

I didn’t. The blessing and the curse of being a newbie is that you’re so naive that you don’t know what you don’t know. If I knew everything that I know now and I were given the option of starting this particular company, I don’t know. I wonder about if I move on to another company someday…

If you’d have the guts to take it on?

It’s a big one. I’m clear that I have a reasonable business acumen. There are a lot of other companies that I could start that would be easy by comparison, like stuff that is not this complicated. I have fantasies about easy companies. The reality is that I’d get bored running a festival earrings company or something like that. That’s an idea that has been sitting around for a long time.

One of the things that’s missing in our vernacular is a way of describing the difference between entrepreneurship and tech entrepreneurship because you could start a Taco Wagon or a festival earrings company or something that will be entrepreneurship but it’s something that’s been done before.

There might even be more fine-grained definitions. We have a term, “lifestyle businesses,” for the Taco Wagon or the festival earrings company or whatever, where the intention is to build a company that is going to feed your lifestyle and be fun generally. There’s then tech entrepreneurship, and it’s tech entrepreneurship of stuff that is hard but not bonkers hard: enterprise software, or a Fitbit device but for your pets, things like that. Things where you’re basically taking off-the-shelf technology and retooling it to a particular vertical or something. You’re doing some innovation but it’s not hard, never-ever-been-seen-before innovation.

I call the hard stuff deep tech now. All that other stuff, I call it shallow tech. If it’s iPhone apps or enterprise software or modified Fitbits, it’s shallow tech. I don’t know if that’s going to stick but that’s how I’m thinking about it.

It’s unlikely because a lot of people feel like, “My tech isn’t shallow.”

It’s revolutionary and it’s sprinkled with blockchain. You went to college and you finished college.

It’s nothing related to this. I got my degree in Psychology with a focus on Bio Psych.

Do you feel like you know about the brains of people?

I do. At the time people were like, “What on earth are you going to do with that?” It turns out that a lot of things in life have a lot to do with humans that have brains. Understanding how humans work and how they tick and all of that stuff is ridiculously useful.

That’s why I like computers. I just reboot them if things go wrong. They mystify me. You finished college for your parents. Was that here?

Yes, at UDub.

The plan was like, “I’ve got to find something to do and I don’t want a job.” How did you end up deciding to start a company?

I’d been involved in two companies while I was in college as an intern. One was a smart grid company that was looking at like, “How can you optimize the usage of energy that comes from green sources like wind energy?” Essentially, it was how can we use big buildings as batteries? If you know that the wind energy is spiking at a particular moment, you can overheat or over-cool your building a little bit, if it’s a large enough building, so you can use that.

Did it seem it would work?

It seemed it would work. They had a good run but didn’t end up making it in the end. It turns out that the tech for building control is ancient and terrible. When you’re trying to integrate with that, that’s more complicated and there were a variety of reasons. I was involved in that company. I did a bunch of marketing related stuff for them, initial business strategy sales.

How big was the company at the time?

Seven people, maybe.




That’s cool though, as an internship, especially because you get to see pretty much every part of the company. If you go be an intern at Microsoft, you get to see one dime and you don’t get to see the whole operation.

You learn about management, what works, what doesn’t work and all of those pieces. I was involved in that company and before that, my first job was an internship with a company that did home automation. Turning your lights on and off with a control panel. This is before cell phones. They were in direct competition with Control4. They did pretty well up until the housing crash. Nobody was looking to outfit their homes with cool smart tech.

I had a lot of friends because lots of my friends are nerds, especially when one of them leaves the company or sells their company or ends up with some free time, they dive into home automation and they try and install everything and they tell you all about it. It’s all amazing. They’re spending full-time integrating their home automation stuff. A year later, I’ll ask them about it. They’ll be like, “I had to tear all that shit out.” You’re going to be a sysadmin for yourself for your home light switches.

At some point you’re like, “I’m going to open the blinds by myself like an adult and it will be fine.” It was another very interesting experience. They were larger, probably 30-something people. I learned a lot about what worked and what didn’t work. Back then, I was doing video creation for them, teaching people how to use the product and stuff like that. I got to see the whole operation and have direct contact with all of the “upper-level management.” I was involved in those.

After that, it seemed like, “It’s easy to start a company. I’ll do that.”

No. It was cockier than that. It was like, “These guys are doing it wrong.” I had a whole thing about how people were managed. Especially as a psych person, I was like, “Maslow’s hierarchy of needs. Some of these people aren’t fulfilled.” I felt I could do that better. It turns out that that’s hard and complicated and they were doing the best that they could. There are definitely best practices but that stuff’s hard. That was part of it. I took an entrepreneurship course at UDub Bothell because UDub Seattle wouldn’t let me into the normal course because you need prerequisites like accounting and stuff like that. I wasn’t going to do accounting for taking one class.

Do you wish you had now?

No. Accounting’s not that complicated. If you need deep accounting, you can call somebody that loves that, which I don’t. Somebody else can do that. That’s cool. I need to know how much money I need in order to get to X and make sure I don’t run out of it before then. I commuted 40 minutes each way 2 to 3 times a week to take this class. Bothell is a very interesting campus because it’s got a lot more diversity in terms of its student population. One of the people that was up there, her name is Becky Drees and she was a non-matriculated student, Molecular Biologist, PhD at Berkeley. She had been in the industry for years and ran one of the labs at UDub.

On the first day, the way the entrepreneurship course works is you get up there and you either pitch your skill set, which is what I did because I didn’t have an idea, or you pitch your idea. She pitched this genetic analysis company that was very similar to a 23andMe style company these days. Her ultimate pitch was if we are going to be able to impact disease, we have to understand the code of who we are. For me, that was compelling both from an intellectual perspective but also from a personal perspective. I’d had people in my life that had been significantly impacted by things that are entirely genetic in basis. Cancer is quite literally the disease of the genome. I’d had my grandfather pass when I was thirteen from lung cancer. I remember at the time when that was happening.

Mine too. I was eleven.

Where was he located? Was he in Seattle or no?

No. I grew up in Alaska but he was mostly in California. That doesn’t seem that much in the way, but it was still my grandfather and it was hard. It was the first person I lost.

What was striking for me at the time was like my dad is a doctor. He’s a surgeon but when you’re thirteen, you don’t differentiate between surgeon and oncologist. You’re like, “Doctors should be cool,” and it wasn’t. As I got older, I learned that in large part it was because they didn’t know what to do. Even now, to a certain extent, if you get diagnosed with cancer, and this is a super oversimplification, they’re like, “You have this, we have X number of drugs that could be used to treat that. We’re going to try this first one. If that doesn’t work then we’re going to try the second one and we’re going to try and hope that we get the right one before it’s too late.” Becky’s basic pitch to me was if you could see what’s going on in there, then you could pick the right one the first time. That for me was far more interesting than any of the other companies that were being pitched. I didn’t want to do Fitbit for your dog or any of that stuff or a concierge service.

It sounds like I should go pitch my crazy ideas to the entrepreneurial class and I can pick up people like you to run with them.

People have a very interesting relationship or opinion about young students because I was 21, 22 at the time. I went back. I’ve maintained a good relationship with the professor that teaches it, Alan Leong. He now teaches at the UDub proper, the Seattle campus, but I’ve gone back and helped to judge classes, etc. Frankly, the younger they are, the better. I’ve even been in classes where there have been MBA students and freshmen and then they’re competing in a competition, a business plan competition together. I’ve got to tell you, the freshmen kick the MBAs’ butts every time. The vast majority of the reason why, in my opinion, is that the older we get, the more we fall susceptible to thinking, “This is that way. This is possible. That’s not possible,” whereas the freshmen, they don’t know. They’re like, “Let’s try this thing.”

There are probably two escape hatches, either personality or naiveté. Have you ever hired anybody from one of these?

We’ve had bunches of them as interns. We’ve definitely hired some pretty young people sometimes.

I want to dive into the problem with genetics. What happened to the woman who had the idea that you heard pitch it? Did you end up working on it with her?

We were cofounders. She and I won a business plan competition and that’s how this whole thing got started. It’s 2009. If you recall back then, the economy was not so hot. Here I am graduating with a Psych degree. My options are stupid, boring, shitty job or go to grad school, but I’d come to the conclusion that I didn’t want to get a PhD because it turns out that I’m impatient enough that research doesn’t appeal to me, or it was like start this thing. It seemed like a very similar risk profile at the time. Find a job, make a job. I was like, “How hard could it be?” That was my naiveté speaking. We went for it and we found our third cofounder Jeremy through a Japanese tea ceremony, which is how you usually find cofounders. I worked for Jeremy for a long time.

Maybe you should start with what you guys thought you were doing.

At the time we thought that we were doing essentially what is 23andMe now. 23andMe then came out and pro tip, don’t go head-to-head with essentially what is a Google-backed company. This was in 2009.

Was that when 23andMe came out?

They were even maybe right before 2009. I’ll have to look but they weren’t so big at that time. There were large enough announcements where it was like, “Somebody’s already doing this. Maybe we should pivot.” At the time, the plan was to do exactly what 23andMe is doing in terms of SNP chips. There are lots of ways to look at a genome. One way is to look at the whole three billion base pairs. Another way is to only look at certain markers that you’re interested in.

That’s what they were doing at the time.

They’re still doing that for the vast majority of things. It’s cheap to do that, etc.

I met them in January of 2009. I’m one of the first 100 people on 23andMe. Now there’s all this cool stuff they can do, but my sample was done so long ago that they don’t have as much data about me. There’s a bunch of things they can’t do for me. I have to redo it again.

The chips that they were using probably at that point didn’t have as many individual markers but even now, at least the last time I looked, they’re only doing about 500,000 markers, which sounds like a lot but out of three billion, it’s tiny. They’re single base pairs. They’re also usually the most commonly varying ones. That’s interesting for things like, “Are you a fast metabolizer of coffee and caffeine? What color are your eyes?” It is relevant for some medical things: BRCA1 and BRCA2, the breast cancer genes. That research is on lockdown. It’s very good. Whereas for your risk for diabetes or other things, the reality is that you might have a single base pair that has been changed that increases your risk right next to something that you didn’t look at that decreases your risk, and you would have no idea.
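To put the marker count in perspective, here is a minimal arithmetic sketch using only the round numbers quoted in the conversation (about 500,000 markers out of about three billion base pairs). The constant names are made up for illustration, and real chips and genomes will differ from these round figures.

```python
# Round figures quoted in the conversation; treat them as approximations.
GENOME_BASE_PAIRS = 3_000_000_000
CHIP_MARKERS = 500_000

# Share of genome positions that a marker chip looks at.
fraction = CHIP_MARKERS / GENOME_BASE_PAIRS
print(f"Share of the genome covered by the chip: {fraction:.4%}")

# Equivalent view: how many base pairs there are per marker on average.
one_in = GENOME_BASE_PAIRS // CHIP_MARKERS
print(f"Roughly one marker per {one_in:,} base pairs")
```

On these numbers the chip samples well under a tenth of a percent of the positions, which is the sense in which 500,000 is "tiny."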

It’s not high risk yet. If I go to a service that can do my whole genome because some of those exist, it could cost $2,000, maybe.

It’s $1,000 for chemistry, plus the analysis on the computational side.

That’s vastly less expensive than it was.

Back in 2009, it was $100,000.

If I did that, would I discover vastly more?

Probably not.

It’s because the analysis hasn’t been done on all those other things. The first 500,000 that the zillion people have done, we have a lot of data on.

The game that we’re focused on now is how do you figure out what the rest of it means. That’s what our business is focused on now. As a juxtaposition, we started out with like, “Let’s do a 23andMe-like thing.” What we do now is we make the software to compare large groups of whole human genomes. We’re going after the folks that are doing country genome sequencing projects, where they’re sequencing hundreds of thousands or millions of people and trying to sort out what’s going on in there.

Are they trying to do the whole genome for millions of people?

The United States is doing that.

Can you tell me a little bit about what that project looks like?

There are over 50 countries now that are doing this thing. That’s a fun fact that most people don’t know.

If countries are doing it, what’s the example of a country’s project like?

The one that’s furthest along is England, the UK. They have this project called Genomics England. It started out as 100,000 people and they did all of that. They did it for people generally that had gone through most of the medical system, hadn’t received a diagnosis and were still trying to suss out what’s going on. At that point, they were given an option of, “We don’t know what it is. You can either be at a dead-end or we can sequence your genome. It’s totally up to you. If we sequence you, you’re consenting to research.” They had 100,000 people say yes to that and they were able to solve 23% of the cases. It’s pretty good.

They were able to diagnose people who were unable to be diagnosed 23% of 100,000 people from genomic data.

Diagnosed, maybe we shouldn’t use that word. It’s like a candidate is what they call it, a candidate-able gene. They made a very large dent.

The essential problem then is even if you have the money and the people and you can do all these tests, it yields a ridiculous amount of data. How much data is sequencing three billion base pairs going to yield?

It’s 120 gigs per person. If you multiply that by 100,000 people, that’s a good chunk.

All the memory on my top-of-the-line iPhone would be full with my genome.

The raw data of your genome.

That’s just the raw data and it’s going to be more once I start trying to analyze it.
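As a rough sketch of the scale being discussed, the raw-data arithmetic can be worked through directly. It assumes the quoted figure of 120 GB per person and decimal units (1 PB = 1,000,000 GB); the function and constant names are hypothetical.

```python
GB_PER_GENOME = 120   # raw data per person, as quoted in the conversation
PEOPLE = 100_000      # size of the cohort used as the example


def cohort_size_gb(people: int, gb_per_genome: int = GB_PER_GENOME) -> int:
    """Total raw data for a cohort, in gigabytes."""
    return people * gb_per_genome


total_gb = cohort_size_gb(PEOPLE)
total_pb = total_gb / 1_000_000  # decimal units: 1 PB = 1,000,000 GB
print(f"{total_gb:,} GB, about {total_pb:g} PB of raw data")
```

That is raw sequencer output only; any intermediate files produced during analysis would come on top of it.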




In terms of getting into the tech side, this is where there’s a lot of debate about how should you do the analysis because the vast majority of the industry right now tries to take that 120 gigs and determine what’s important as part of it. They go through this whole process where they align your genome against a reference genome, which is a Franken genome of Craig Venter and a bunch of other people. They do a big diff and then they output the diff into what is essentially a glorified Excel spreadsheet. What they try and do is compare these Excel spreadsheets against one another. There’s a bunch of technical issues with doing that. One, you’re not going to see everything that’s in there. Two, if you’re trying to align against essentially what is mostly a European white male reference genome, you don’t see a bunch of stuff. There’s a lot of interesting challenges in the space of diversity within genomics and how do you get representative information about what’s going on in an individual?
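The "big diff" idea can be illustrated with a toy example. This is only a sketch under strong assumptions: both sequences are already aligned and have equal length, so only single-base substitutions are reported. The `Variant` type and `diff_against_reference` function are invented for illustration and are not any real tool's API.

```python
from typing import List, NamedTuple


class Variant(NamedTuple):
    position: int  # 1-based position in the reference
    ref: str       # base in the reference
    alt: str       # base observed in the sample


def diff_against_reference(reference: str, sample: str) -> List[Variant]:
    """List single-base differences between two equal-length, pre-aligned sequences."""
    if len(reference) != len(sample):
        raise ValueError("this toy diff only handles equal-length sequences")
    return [
        Variant(i + 1, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"
variants = diff_against_reference(reference, sample)
for v in variants:
    print(v.position, v.ref, v.alt)
```

Each output row is one line of the spreadsheet-like table. Note what the toy cannot express: sequence in the sample that has no counterpart in the reference never gets a row, which is the limitation described above.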

Are you making that up, or is there any evidence that this is the case? Are there actual projects where people have gone back and said that using the white male European reference turned out to make us miss this stuff?

There have been multiple efforts to make custom reference genomes for specific ethnic groups. There was a whole Han Chinese genome effort. The Japanese did a genome that was specifically for theirs. There’s an error of reference genome that’s in the middle of construction. They did their first version. They’re trying to do a second run.

It seems that would be the thing that you would make by taking all the Japanese folk’s data, go find the stuff that’s the same in all of them and drop that out. It’s basic compression algorithm and then go looking into the rest. What am I missing here? Why is there work?

How do you figure out what is present in there? We’re going to have to get a little technical in order to get there. Let’s start with what does the data look like when it comes off the sequencer? A lot of people have the misconception that when you sequence a genome, it comes off like a book. You read it from the beginning to the end. It doesn’t go that way. In a super squishy way, it’s like how they did the Human Genome Project, which is why we know what the order of a human being looks like, but it costs a lot of money to do it that way. It was slow, etc. They came up with this innovation called shotgun sequencing. What they do is they take your genome and they duplicate it 30, 40 times. They chop it up into 150 base pair long chunks.

I’m glossing over a lot of technical details. For any of your technical readers out there, it’s not exactly that way but for the average reader, it’s 150 base pair long chunks. What the sequencer reads is all of those 150 base pair long chunks. What comes off of the sequencer quite literally is a huge pile of text files. If you open up one of these text files, it’ll be 150 bases with a guess of the quality score for how likely it called that particular base correctly or not. That’s it. You don’t have any information about where that thing came from in the original. It’s the worst jigsaw puzzle of your entire life.
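The "worst jigsaw puzzle" can be mimicked with a small simulation. This is a simplified sketch, not how a sequencer works internally: it assumes error-free reads taken from uniformly random positions of a made-up genome, and the `shotgun_reads` function and its parameters are hypothetical.

```python
import random
from typing import List


def shotgun_reads(genome: str, read_length: int, coverage: int, seed: int = 0) -> List[str]:
    """Sample fixed-length chunks from random positions until the requested coverage is reached."""
    if len(genome) < read_length:
        raise ValueError("genome is shorter than one read")
    rng = random.Random(seed)
    n_reads = coverage * len(genome) // read_length
    last_start = len(genome) - read_length
    reads = [
        genome[start:start + read_length]
        for start in (rng.randint(0, last_start) for _ in range(n_reads))
    ]
    rng.shuffle(reads)  # the output carries no information about original position
    return reads


# A made-up 1,000-base "genome", purely for illustration.
genome = "".join(random.Random(42).choice("ACGT") for _ in range(1_000))
reads = shotgun_reads(genome, read_length=150, coverage=30)
print(len(reads), "reads of", len(reads[0]), "bases each")
```

The returned list is just a pile of strings: nothing in it says where each chunk came from, which is the point being made in the conversation.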

It’s 150 base pair and base four is the notation. It’s got something 256 bits of data or something like that.
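For the back-of-the-envelope number here: four possible bases means two bits per base, so a 150-base read works out to 300 bits, in the same ballpark as the 256 mentioned. A minimal sketch of that packing follows; the encoding table is an arbitrary choice made for illustration, and quality scores are ignored.

```python
BITS_PER_BASE = 2  # four symbols (A, C, G, T) fit in two bits
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # arbitrary assignment


def pack_read(read: str) -> int:
    """Pack a string of bases into one integer, two bits per base."""
    value = 0
    for base in read:
        value = (value << BITS_PER_BASE) | ENCODING[base]
    return value


def bits_needed(read_length: int) -> int:
    """Number of bits required to store a read of the given length."""
    return read_length * BITS_PER_BASE


print(bits_needed(150), "bits for a 150-base read")
print(pack_read("ACGT"))
```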

Plus, the quality scores of how likely was the sequencer to call it correctly? Was it 90% sure?

Where does that number come from?

The sequencer generates it when it looks at it. It’s like, I’m 90% sure it was an A.
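The conversation does not name a file format or scoring scheme. As an assumption, the sketch below uses the widely used FASTQ convention, where each base gets a Phred quality score Q = -10 * log10(probability of error), stored as a character with an ASCII offset of 33. Under that convention, "90% sure" corresponds to Q10. The record shown is invented.

```python
import math
from typing import List

PHRED_OFFSET = 33  # ASCII offset used by the common Phred+33 encoding


def phred_to_error_prob(q: int) -> float:
    """Probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str) -> List[int]:
    """Turn a FASTQ quality string into a list of integer Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality]


# A made-up four-line FASTQ record: header, bases, separator, qualities.
record = ["@read_1", "ACGT", "+", "+5?I"]
bases, quality = record[1], record[3]
scores = decode_quality_string(quality)
for base, q in zip(bases, scores):
    print(base, q, f"confidence {1 - phred_to_error_prob(q):.4f}")
```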

I probably could have 36 or 34 or whatever copies of the same 150 base pair in my data.

That’s how they fixed errors.

That’s my checksum that they are correcting against, because if I find ones that are a few bits off, then it’s easy to fix. If I’ve got essentially a 256-bit key, is it ostensibly looking at a unique chunk of the genome, or are there duplicates? There are lots of duplicates. Trying to figure out how to put them all together in the correct order is half the battle.
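The "many copies as a checksum" idea can be sketched as a per-position majority vote. This assumes the hard part is already done: the copies are known to cover the same stretch and are stacked in register. The `consensus` function is a hypothetical helper, not part of any sequencing toolkit.

```python
from collections import Counter
from typing import List


def consensus(reads: List[str]) -> str:
    """Majority vote at each position across equal-length reads of the same region."""
    if not reads or len({len(r) for r in reads}) != 1:
        raise ValueError("expected a non-empty list of equal-length reads")
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*reads)
    )


# Five copies of the same region; two of them carry a single miscalled base.
copies = ["ACGTAC", "ACGTAC", "ACCTAC", "ACGTAC", "ACGTTC"]
print(consensus(copies))
```

With 30 to 40 copies per position, an isolated miscall gets outvoted; it is the placement of the copies that remains the difficult step.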

The reason why they use the reference genome thing is they take every 150 base pair long chunk and they look for the nearest match. That works great for small changes. In my 150 bases, if two of them don’t match the reference genome, that’s great. I know where that goes, but what happens when 50 of them out of the 150 don’t match? You then get to a place where there’s no longer a unique match. Where does it go? Does it go in this place in the genome or that place in the genome or this other place? What happens if the person has genetic code that didn’t show up at all in the reference genome, which totally happens? You can have a full 5,000 base pairs of novel insertion, which I know I do because I sequenced my genome. You’re going to have this huge insertion.
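A toy version of "look for the nearest match" shows how uniqueness breaks down. The sketch below is a brute-force scan that counts mismatches at every reference position; real aligners use indexes and handle insertions and deletions, which this ignores. The function names and the short sequences are invented for illustration.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_placements(read: str, reference: str) -> Tuple[int, List[int]]:
    """Return the fewest mismatches found and every 0-based start position achieving it."""
    best, positions = len(read) + 1, []
    for start in range(len(reference) - len(read) + 1):
        distance = hamming(read, reference[start:start + len(read)])
        if distance < best:
            best, positions = distance, [start]
        elif distance == best:
            positions.append(start)
    return best, positions


reference = "ACGTACGTTTGACCA"
print(best_placements("TTTGAC", reference))  # one clear home
print(best_placements("ACGA", reference))    # several equally good homes
```

The first read has a single best position. The second one ties across several positions, so any choice is a guess, which is the situation described for reads that differ heavily from the reference.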

How do you figure out where that goes? Do I have to go do it the old-fashioned way?

No, not exactly. What everybody else does is they try and do this process where they fish the things out of the garbage bin because when they don’t align, then they try and take all of the things that didn’t align. After the fact, they try and put things back together and figure out where it might’ve come from because there’s overlap with every little chunk. You could try and stitch it together and figure out where it goes. By then, you have already placed these little chunks potentially in the wrong spot. You’ve already biased everything against the reference. We try not to do that.

I have an idea. I got my 150 base pair but if I have it overlap by 50 on each side and it’s only 50 in the middle, it’d be really good, but now I’ve reduced the efficiency of my shotgun approach by two-thirds but then I should get the order of stuff for free at the end. Why can’t we do that?

They do a version of this. As I said, skipping over some stuff, it’s not 150 straight up. What they do is it’s about 500 or 600 bases. They read in from both sides of the chunk. There’s a gap in the middle that they don’t hit. It’s about 300 bases. This is what is called a paired read. You have 150 bases, a gap of 300 where you don’t know what it is and then 150 bases. You can also get, to a certain extent, further with that because if one section aligns in this particular area and then this other one doesn’t align very well, then you can try and trace over that. It only works up to a certain extent. A 5,000 base pair insertion is longer than the length of the overall chunk. The likelihood they’re going to have some error or dropout is relatively high such that you can’t assemble over it.
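The geometry of a paired read (150 bases, an unread gap of about 300, then 150 bases) can be sketched as follows. This is a simplification: it ignores strand orientation (in practice the second read is typically reported on the opposite strand) and sequencing errors. The `paired_read` function and the fragment are hypothetical.

```python
from typing import Tuple


def paired_read(fragment: str, read_length: int) -> Tuple[str, str, int]:
    """Read in from both ends of a fragment; return (left read, right read, unread gap length)."""
    if len(fragment) < 2 * read_length:
        raise ValueError("fragment too short for two non-overlapping reads")
    left = fragment[:read_length]
    right = fragment[-read_length:]
    gap = len(fragment) - 2 * read_length
    return left, right, gap


# A made-up 600-base fragment, matching the sizes quoted in the conversation.
fragment = "ACGT" * 150
left, right, gap = paired_read(fragment, read_length=150)
print(len(left), gap, len(right))
```

Knowing that the two reads sit roughly 300 bases apart is the extra constraint that helps place one read when the other aligns well. It cannot bridge an insertion longer than the fragment itself.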

Now that we understand the problem, tell me why you don’t have that problem.

It turns out that there is another way of doing it. These little chunks are called reads. If you were to compare every read against every other read, you were essentially to create a probabilistic structure of how they could all go together without worrying about figuring out exactly how they go together. You could imagine that you could account for all of the possible paths and you’ll have a lot of information about what is more likely versus the other and you weight them. You could imagine that you could do that for one person. You could also imagine that you could do that for many other people and you could even overlay a lot of this information over one another. That’s part of the reason why population genomics is important.
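To make "compare every read against every other read" concrete, here is a generic overlap-graph sketch. It is not Spiral Genetics' method, which the conversation does not detail; it only shows the general idea of recording which reads could follow which, with the overlap length as a crude weight. Function names and the tiny reads are invented, and the all-pairs loop is quadratic, so it is for illustration only.

```python
from itertools import permutations
from typing import Dict, List, Tuple


def overlap_length(a: str, b: str, min_overlap: int) -> int:
    """Longest suffix of a that equals a prefix of b, or 0 if shorter than min_overlap."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads: List[str], min_overlap: int = 3) -> Dict[Tuple[str, str], int]:
    """Map each ordered pair of reads that overlap to the length of that overlap."""
    edges = {}
    for a, b in permutations(reads, 2):
        k = overlap_length(a, b, min_overlap)
        if k:
            edges[(a, b)] = k
    return edges


reads = ["ACGTAC", "TACGGA", "GGATTC"]
graph = overlap_graph(reads)
for (a, b), weight in graph.items():
    print(f"{a} -> {b} (overlap {weight})")
```

The graph keeps every plausible way the pieces could join instead of committing to one placement up front; a real system would attach probabilities rather than raw overlap lengths.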

That sounds like the big data approach to guessing more or less what the likely structure is. Are there metrics now on how well that works?

We’ve done testing on a bunch of what are called truth sets, like golden datasets, because one of the big challenges of knowing whether or not your stuff is good is how do you know what’s in there? There’s this one sample that has been sequenced a bazillion times with every single different sequencing technology out there. It’s run by this particular consortium out of NIST called the Genome in a Bottle Consortium. It’s very cutely named. They’ve done it with regular short-read sequencing, which is what we’ve described, but they also have long-read sequencing tech, which is much longer chunks, like 10,000 or 20,000 base pair long chunks, but it’s expensive, so nobody uses that at scale. Usually, it’s used for plant genomics, things like that.

There are other various chemical sequencing technologies that you can use to try and get at these golden datasets. We’ve done a bunch of testing regarding the golden data set approach. Other technologies can see about 33% of the genetic variations that are present. If you look at it on a base-for-base basis, we can see about 72% if you’re using any amount of population style data. If you think about it, it’s intuitive. The burden of proof for finding a variation the first time is pretty high, but if you’re looking for evidence of whether or not you’ve seen that thing that you’ve already seen before, then it’s much lower so you can pick out things.
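The conversation does not say exactly how the 33% and 72% figures are computed. One common way to score a method against a truth set is recall: the share of known variants that the method also reports. The sketch below assumes that definition; the variant tuples are invented data and do not reproduce either figure.

```python
from typing import Set, Tuple

Variant = Tuple[int, str, str]  # (position, reference base, observed base)


def recall(called: Set[Variant], truth: Set[Variant]) -> float:
    """Fraction of truth-set variants that the caller also reported."""
    if not truth:
        raise ValueError("truth set is empty")
    return len(called & truth) / len(truth)


# Invented example: four known variants, two found, plus one false positive.
truth = {(101, "A", "G"), (250, "C", "T"), (377, "G", "A"), (912, "T", "C")}
called = {(101, "A", "G"), (377, "G", "A"), (500, "A", "C")}
print(f"recall = {recall(called, truth):.0%}")
```

Recall says nothing about false positives (the call at position 500 here), so a full comparison would also report precision.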

If I do it the old school Craig Venter way, I get a complete and accurate genome. Is that right?

You’ll get about as good as you’re going to get.

That’s as close to 100% as we’re going to get. The best off-the-shelf technology we know of gets us to 33%.

If you’re going to identify variations, yes.

Your way, it’s 72%. That’s amazing.

You can characterize it like that.

Essentially the job of the company is to make the software tools to help us manage and analyze all this data at a large scale. What scale are we talking about here? How much data are you guys working with here?

It’s 100,000 people or 350,000 people. It depends on the country.

You guys have multiple countries you’re working with at this point. Is it a real business?

Yeah. I don’t think that people would keep giving me money if it wasn’t.

Are customers giving you money, or investors?

Both.

If I understand all this correctly, we think that for a lot of the problems that people have, various diseases, different types of cancer and things, by analyzing their genome, we can probably at least gauge what risk they have of having this problem. In some cases, we might know. For sure you’re going to have this problem. I have the cilantro gene that makes me think cilantro tastes like soap. It’s pretty bad in Seattle because cilantro is everywhere.

That’s not the worst one you could possibly have.

It turns out there are no other problems with me other than that. I got lucky. I suppose if we’d had a good analysis of my genome, I could have been warned before I discovered that the hard way. What I’m wondering is if this is an important part of what we would want for this type of work. It’s tools that can handle the scale of the data that the testing is putting out and allowing us to analyze it in ever smarter ways. We would also want better testing or better ability to test a genome that’s more cost-effective. What else is missing? More research on what each of these markers would mean. That’s a perpetual research project.

I would say that’s the research project now. To give you a perspective, at this point as a globe, we have characterized about 1% of all human variation and associated it with anything. When I say anything, I mean high breast cancer risk, everything. That means that there’s still a lot of stuff that we don’t understand even a little bit. If you think about it, to a certain extent, it makes sense why, because your search space is large. It’s three billion bases. If you do a study of 1,000 people, that’s not enough information to be able to sort out much of anything, unless you happen to be looking at something that is so strongly correlated that it’s going to be obvious. We’ve gotten some of those great, low-hanging fruit things like the breast cancer genes, BRCA1, BRCA2. Those have massive impact. You can discover that in a sample of twenty people, which is how it was originally discovered.

Is 100,000 people enough that we’re starting to feel confident or do you need 100 million?

It depends on the disease or the particular phenotype. For some of them, 100,000 is going to be sufficient. For some of them, it’s not and we don’t know yet. That’s a mystery that we’ll find out.

In some sense, to pick that mystery apart, you want to scale this up. In your life, what I suppose you’re hoping for is that over the next 1, 2, 3 decades, we get from counting hundreds of thousands to millions or tens or hundreds of millions of people who we’ve got in the database so that we can start going after these things that show up less frequently.

Sometimes you’re going to get those one hits. If you have this single base pair change in this particular location, you have this significantly increased risk. For a lot of it, it’s going to be something more like if you have this and this, then you have an increased risk. You’ve got to have all four.

You have to find all four to correlate them and that’ll take a lot of work. In the cases where we figure it out, what does that research look like? Is it lots of data and software finding it?

There are a lot of analysis techniques that people had to use when the datasets were smaller. Now we’re entering a time where there’s enough data that you can do things like machine learning or even deep learning. You could use that technology to be able to suss it out. Deep learning is a little bit more complicated because the search space is large and the number of examples you have is lower. You have to be smarter about how you do the feature detection, but you can make some of that stuff work. All of a sudden, you don’t have human beings that are trying to do basic statistical analyses anymore. You have things that can look for much more interesting, subtle patterns that frankly our little human brains can’t.

In some sense, given enough data and desktop computers, we’re going to be able to set these things free and let them go find everything for us.

To some extent. I don’t know if it’s that simple, but yes.

Why wouldn’t it be? Can you think of a reason? What’s missing from that?

Machine learning is complicated. It can overfit easily. There are a lot of nuances, I would say. Plus, biology is a lot messier than people anticipate.
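
To make the overfitting point concrete, here is a small sketch on purely synthetic data, assuming a toy setup that is not from the conversation: 2,000 random binary "markers", 20 training samples, and labels that are pure coin flips. All names are hypothetical. Picking the single marker that best matches the training labels looks impressive on those 20 samples, yet it is no better than chance on fresh data.

```python
import random


def make_data(rng, n_samples, n_features):
    """Random binary markers and random binary labels: no real signal."""
    X = [[rng.getrandbits(1) for _ in range(n_features)] for _ in range(n_samples)]
    y = [rng.getrandbits(1) for _ in range(n_samples)]
    return X, y


def accuracy(X, y, j):
    """Fraction of samples where marker j equals the label."""
    hits = sum(1 for row, label in zip(X, y) if row[j] == label)
    return hits / len(y)


def best_feature(X, y):
    """Index of the marker that best matches the labels on this data."""
    return max(range(len(X[0])), key=lambda j: accuracy(X, y, j))


def run_demo(seed=0, n_train=20, n_test=500, n_features=2000):
    rng = random.Random(seed)
    X_train, y_train = make_data(rng, n_train, n_features)
    X_test, y_test = make_data(rng, n_test, n_features)
    j = best_feature(X_train, y_train)  # "model" chosen on training data only
    return accuracy(X_train, y_train, j), accuracy(X_test, y_test, j)


if __name__ == "__main__":
    train_acc, test_acc = run_demo()
    print("training accuracy:", train_acc)
    print("held-out accuracy:", test_acc)
```

With many more candidate markers than samples, some marker will line up with the labels by luck alone. That is why results found this way need to be checked on data the model has never seen.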

I got interested one time. I saw this video of Danny Hillis talking about proteomics. He made a company called Applied Proteomics but his thesis was that cancer is a normal thing that your body does all the time. You got these cells mutating. Most of the time, nothing bad happens. They get flushed. Everybody is doing that. He is like, “Cancer shouldn’t be a noun. It’s more a verb. You’re cancering all the time.”

You’re repairing and occasionally you screw it up.

You don’t repair fast enough and things get out of hand and you end up with a tumor and then that breaks off and floats around your bloodstream and latches on somewhere, where it metastasizes and kills you. That’s the process. The way he described it was when we started the human genome project, we thought we were going to get the recipe for how to make the human. What we got after going through all that work and sequencing the whole genome is more like a list of ingredients, but we still don’t have the recipe. The recipe is proteins and how they interact with your DNA. What he believed the next frontier was is that we need to be able to sequence the proteins and figure out what they’re doing.

I would say that to think that you’re going to get the whole enchilada in just genomics is naive. Frankly, to think that you’re going to get it with just proteomics is also naive. You’ve got to have the whole thing. It’s going to be a combination of all of the “omics,” if you will. Genomics, transcriptomics, proteomics, metabolomics, you name it. As I said, biology is complicated. Back to your question of like, “Can we sort out everything? Can the computers work it all out?” Maybe if we gave it everything that it needed.

Eventually, it gets all the data but right now we don’t even get all the data.

It changes too. The genome, for the most part, seems stable-ish. It doesn’t change that much over the course of your lifetime. Certain cells might, right? The ones that become cancerous. But your proteomics and metabolomics change by the moment.

When I was a kid, we didn’t know any of this stuff. Not only did I not know it, the scientific community didn’t. It’s only in the last few years that we seem to have gotten our heads around things like the microbiome.

To a certain extent, we know that it’s important. Do we know what any of it does? No.

It hadn’t even been discovered. We thought your tummy is full of acid and then it eats up the food and then you poop it out. That was the entire understanding as far as I could tell.


A lot of people still think of it that way. It’s way more complicated than that.

I don’t know that much about it but as you start to learn, you’re like, “I eat food that feeds a bunch of microbes in my gut and then they spit out what feeds me.” There’s a layer of indirection in there that there’s no measurement for. My microbiome is different than yours. It changes over time and none of us knows what we’re randomly shoving in our mouths. It’s crazy.

It’s not that crazy. We have been doing it for thousands of years.

That’s why we’ve got thousands of different microbes, in case we eat that weird thing. This is also simplistic but coming from working on computers my whole career, I lived through these multiple progressions. I got the first digital camera with a CCD in it in 1990 or ‘91 or around then. I’m the world’s earliest adopter. I would go take pictures on the thing. It had a little Post-it note sized screen. I could show people immediately the picture that I took of them, which blew everyone’s mind because they’d never seen anything like that before.

Times have changed. Before that, it was Polaroid. That was the only other option.

Even Polaroids were slow. You had to stand up and wave them.

Which apparently makes no difference, by the way. It’s detrimental. You do not shake a Polaroid picture anymore.

The thing that happened is that camera sucked and the photos were 16k or something. Every year, it got a little better but you had this global scale argument going between photographers saying, “That digital crap will never be as good as real photography.” That progressed all through the ‘90s and into the 2000s. It was this asymptotic progression where the digital cameras got better, cheaper, higher resolution, with better color management, all that stuff. All those guys started keeping their mouths shut because what they didn’t realize is the chemistry was the best technology we had at the time to make photographs. Now we have a better technology and at the beginning, it’s low resolution. The sensors get better and the data collection gets better, and we can collect more data, at a higher resolution than essentially the thing we’re sampling. That’s exactly the same progression that went with audio too. Computer audio sucks, CD-ROMs aren’t as good as vinyl and all that.

People still have that argument.

They don’t understand how it works. It sounds warmer because there are imperfections in the vinyl. We go through those progressions but at some point with a lot of thing