A-Alpha Bio’s David Younger on machine learning in biotech, building cross-functional teams

This week on Founded & Funded, Partner Chris Picardo is talking with A-Alpha Bio Co-founder and CEO David Younger for our first Intelligent Application 40 Spotlight episode of 2023. We announced the 2022 IA40 winners in October, and A-Alpha was the first biotech company to make the list, which was no surprise to us. As one of our portfolio companies, we know the work David and his team are doing at the intersections of biological and data/computer sciences will change the world — but having a group of judges agree with us makes us all the more certain.

Protein interactions govern just about all of biology and A-Alpha uses synthetic biology and machine learning to measure and engineer protein-protein interactions, speeding up a traditionally slow wet lab process. The company’s proprietary platform — AlphaSeq — uses genetically engineered cells to experimentally measure millions of protein-protein interactions simultaneously, generating enormous amounts of data to inform the discovery and development of therapeutics.

Many answers will start to be found within that enormous body of data, which the company adds to every day, as it uses machine learning to train predictive models and begin predicting new antibody sequences that could be effective against different viruses and diseases, improving the way that we are able to discover drugs.

Chris and David dive into that future that A-Alpha is working toward and so much more — the power of data engineering, building business models around data, building cross-functional teams, having both tech and biotech-focused investors. It is an episode you won’t want to miss. So with that, I’ll hand it over to Chris to take it away.

Chris: Thanks everyone for listening today. My name is Chris Picardo. I’m a partner at Madrona, and I’m super excited to be here with the co-founder and CEO of A-Alpha Bio, David Younger. Welcome, David.

David: Thank you, Chris. Yeah. Wonderful to be here.

Chris: So, A-Alpha is one of our intersections of innovation portfolio companies. And for those who are new to that term, at Madrona, we use that to mean companies that combine machine learning and wet lab life sciences on a day-to-day basis as sort of a core part of what they do. And I think A-Alpha is an extremely good example of this. And building on that, they were named in the 2022 Intelligent Application 40 list — and were actually the first IOI, or life science, company named to the list. David, could you just share a little bit of background about A-Alpha Bio — kind of the founding story and how you started working on this problem?

David: I didn’t realize that we were the first IOI company on the top 40. That makes it even more special. So A-Alpha Bio is the protein-protein interaction company. We use synthetic biology, machine learning, and protein engineering as tools to improve human health. We use these technologies to measure, predict, and engineer protein-protein interactions for a variety of different therapeutic applications. Protein interactions govern just about all of biology, from how your cells communicate with each other to how genes are regulated. So for example, how coronavirus enters your cells, it’s through proteins on the surface of the virus binding to proteins on the surface of your cells. And so by understanding protein interactions, we can do things like design therapeutic proteins that will bind and block to prevent those proteins from interacting and thereby cure someone of coronavirus, for example.

Chris: Historically this has been a really hard problem for people to figure out, right? How does one protein interact with the other protein? Because, as you said, that dynamic is really key to understanding what’s going on but also designing possibly an effective therapeutic for some disease that you’re looking for. I’d love to just go a little bit deeper into why this has been a hard problem and how you guys uniquely approach mapping and understanding these interactions.

David: Across all of biology, the traditional approach is to do wet lab experiments, to express proteins, to purify proteins, to measure things more or less one at a time. And those approaches are powerful but very, very slow. So, I think a great analogy to this is determining the structure of a protein, where historically, once you have a protein sequence and you express that protein, you purify that protein, it can take months or potentially even years to figure out how to crystallize that protein and then solve a crystal structure to determine a three-dimensional experimental structure. What we have been seeing over the last couple of years is essentially the infusion of data science, the infusion of machine learning into this space that has now allowed groups like DeepMind, and like my alma mater, David Baker’s lab with RoseTTAFold, to develop software that can near-instantaneously predict the structure of a protein without doing any of those experiments.

With protein binding, the complexity of the experiments is maybe not quite as challenging as solving a structure, but they’re pretty close. We have to express proteins, purify them, and then one at a time measure to see whether or not they interact. With the example of coronavirus, because this is a virus that is mutating so rapidly, there are hundreds of different variants that have been observed. There are near-infinite numbers of combinations of those mutants. And there are vast numbers of mutations that potentially could occur in a coronavirus that haven’t ever been observed. And so, if we really want to understand how an antibody or multiple antibodies are going to bind to the coronavirus, we can’t possibly use experimentation to measure antibody binding against all of those different variants. So, machine learning becomes a really effective tool to measure subsets of those interactions and then train models that can essentially infer the binding properties of the remaining ones.
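To make the "measure a subset, infer the rest" idea concrete, here is a minimal sketch. It is not A-Alpha's method: it assumes, purely for illustration, that each mutation shifts a binding score independently (an additive model), and the mutation names and scores are invented placeholders.

```python
# Toy illustration only: mutation names and scores are invented placeholders.
# Assumption: each mutation's effect on binding is additive and independent,
# which real biological data frequently violates.

def fit_additive_effects(wild_type_score, single_mutant_scores):
    """Estimate each mutation's effect as its shift from the wild-type score."""
    return {m: s - wild_type_score for m, s in single_mutant_scores.items()}

def predict_binding(wild_type_score, effects, mutations):
    """Infer the score of an unmeasured variant by summing known effects."""
    return wild_type_score + sum(effects[m] for m in mutations)

wild_type = 10.0
# The "measured subset": one experiment per single mutant.
measured = {"mut_A": 9.0, "mut_B": 6.5, "mut_C": 8.0}
effects = fit_additive_effects(wild_type, measured)

# A double mutant that was never measured experimentally.
predicted = predict_binding(wild_type, effects, ["mut_A", "mut_B"])
print(predicted)
```

A production model would need to capture interactions between mutations, which is one reason large training sets matter; the sketch only shows the shape of the workflow.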

Chris: You talk about the old version of experimentation — one-to-one throughput. You know, it might be worth mentioning a little bit about how A-Alpha differs on the scale side of your capabilities.

David: A great way to kind of put this into perspective at a very high level is to say that the largest public repository of protein-protein measurements is a database called BioGRID. Any measurement of protein interactions that anyone publishes gets collated into BioGRID. And BioGRID currently contains about a million and a half protein interaction measurements. At A-Alpha, with a relatively small team of 39 folks, we’re measuring about 6 million protein interactions each week. So, we have a database that now has over 200 million protein-protein interactions. We expect that it’s by orders of magnitude the largest repository of PPI data in the world. And in each assay that we perform, because we’re able to leverage synthetic biology and next-generation sequencing and advances in DNA synthesis to really scale these experiments up, we’re able to measure millions of interactions at a time instead of one at a time.
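The scale comparison above can be checked with quick arithmetic. The sketch below uses only the approximate figures quoted in the conversation; the variable names are just for illustration.

```python
# Approximate figures as quoted in the conversation.
biogrid_measurements = 1_500_000   # "about a million and a half"
weekly_throughput = 6_000_000      # "about 6 million ... each week"
database_size = 200_000_000        # "over 200 million"

# Days of assay output needed to equal all of BioGRID.
days_to_match_biogrid = biogrid_measurements / weekly_throughput * 7

# Weeks needed to regenerate the whole database at the current rate.
weeks_to_rebuild_database = database_size / weekly_throughput

# How many times larger the database is than BioGRID.
size_ratio = database_size / biogrid_measurements

print(f"{days_to_match_biogrid:.2f} days to match BioGRID")
print(f"{weeks_to_rebuild_database:.1f} weeks to rebuild the database")
print(f"{size_ratio:.0f}x the size of BioGRID")
```

At the quoted rate, a BioGRID-sized set of measurements corresponds to under two days of output, and the database is roughly two orders of magnitude larger.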

Chris: It just impresses me every time you say the numbers. When we talk about data generation and the life sciences and in biology, A-Alpha is a really good example of differentiated data at scale. And I think that brings us to this machine learning point, which is I don’t think that what you guys are doing on a daily basis is really possible to parse through without machine learning. So, I’d love to get your take on how do you think about the role of machine learning in A-Alpha and how it fits into the way that you run the company and you run your assays on a daily basis.

David: I think that’s exactly right. I mean, we think about the power of our platform kind of in two different scales. So, one of those scales is what we call our platform advantage. And essentially our platform advantage is a technology that allows us to measure protein interactions faster, more quantitatively, and at higher throughput than other techniques. And even without machine learning, we can use our platform advantage to solve high-value problems across the pharmaceutical industry better, faster, and cheaper. It’s not that without machine learning there isn’t value to the data that we generate. But without machine learning, we’re leaving so much on the table because we are only able to extract insights from a tiny, tiny subset of the data that we generate. So, if we’re measuring a million interactions in a single assay, sure, we can find the interactions that are the strongest, or we can find the interactions that fit a particular profile most closely. We can do those things without machine learning, but what we can’t do is uncover all of the nuances of the patterns behind each possible amino acid mutation and how it influences binding, stability, expression, and all of these other really important biological characteristics.

So, if we start to get to, how do we start to optimize proteins faster? How do we do multi-parameter optimization for different properties like affinity and specificity and cross-reactivity and epitope engagement? To solve these types of problems without machine learning, we’d be living in the Stone Age.

Chris: A couple of things David just said came in at rapid fire, but I think one of the ways to understand this is that there are a bunch of different ways that proteins can interact with each other, from the strength of the binding to the behavior to the specific physical places where they actually touch. What David’s talking about here is that A-Alpha has built a way to interrogate all of those types of interactions at scale in a way that nobody else can. And I think one of the interesting things too is, for a long time, a lot of these experiments, as you talk about, were kind of binary one-on-one. Put two things in a tube and see what happens. And I think with A-Alpha, you get all of these other variables. You get the stuff that doesn’t happen, you get the stuff that kind of happens. And so, as you parse that apart, it’s really this machine learning that kind of takes over. And I’m curious, as you start to observe those types of things, where is it going? Now you’ve got a 200-plus-million-data-point, let’s call it, training set, you’ve got the ability to screen 6 million PPIs every week, and it’s getting bigger. Where are you pointing this? What’s ML going to start to really unlock when you think about what the outputs look like down the line?

David: So, the first place where we’ve already seen our database have a substantive effect is in understanding what antibody sequences don’t work. Being able to rule out particular sequences, particular patterns of amino acids that just don’t produce good binders or that don’t produce stable antibodies. By ruling these things out, we can design libraries that are essentially enriched for things that are better, or we can computationally knock down observations that we see about sequences that we know are not going to behave well.

If, for example, we generate a lot of data around a library of antibodies binding to a number of different targets, and now we want to generate a second set of newly predicted antibodies that have an improved binding profile, we can use all the data from that experiment, but we can also essentially reference all of the historical data that we’ve generated in order to take those predictions and screen them for how likely is this antibody actually going to function properly. So that allows us to essentially move faster, and we can optimize these really complex problems in fewer iterations, which means cost savings and time savings.
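As a loose sketch of the screening step described above, newly predicted sequences can be checked against patterns that historical data has flagged as problematic. Everything here is hypothetical: the motifs, the candidate sequences, and the simple substring rule stand in for whatever learned model or criteria a real platform would use.

```python
# Hypothetical placeholders: these motifs stand in for patterns that
# historical assays might have associated with poor binding or stability.
KNOWN_BAD_MOTIFS = ("NG", "CC", "WWW")

def passes_history_screen(sequence, bad_motifs=KNOWN_BAD_MOTIFS):
    """Return True if the sequence contains none of the flagged motifs."""
    return not any(motif in sequence for motif in bad_motifs)

def screen_candidates(candidates, bad_motifs=KNOWN_BAD_MOTIFS):
    """Keep only the predicted sequences that pass the historical screen."""
    return [s for s in candidates if passes_history_screen(s, bad_motifs)]

# Invented candidate sequences from a hypothetical prediction round.
candidates = ["ARDYYGSSW", "ARNGYYDSW", "ARDWWWFDY", "ARGLTYFDV"]
kept = screen_candidates(candidates)
print(kept)
```

Filtering before the next wet lab round is what lets each iteration start from a library enriched for sequences more likely to behave well.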

Where we see this going eventually is getting to a point where we’re doing more and more in silico prediction of new sequences. So coming back to the coronavirus example, if we have a dataset that consists of thousands or millions of different antibodies binding to thousands or millions of different coronavirus variants, we can start to map a landscape such that if a new coronavirus variant crops up, even if it’s one that we’ve never seen before, we can take all of the data that we’ve generated historically and predict a new antibody sequence that’s likely to be an effective drug against that never-before-seen virus. That’s an incredibly exciting promise of what we can really harness this quantity of data for.

Chris: I think that’s just such an important point that you just made. In the sort of early drug discovery process, it’s just been a very difficult search problem, where you have to have, to use the industry’s term, a library of specific things that you want to try out, and then you search it slowly, one by one. And if nothing works in that library, you have to find another library. And the way you’ve reframed the problem is: hey, if something kind of works or if a subset of things totally doesn’t work, we go get another library, but we build it based on all of the information that we just learned and all of the information that we previously screened. I think that’s a powerful new paradigm in this entire space. And I think A-Alpha is the type of company that’s pushing that forward from a “how do we change the game?” perspective.

David: I think that’s exactly right. The technologies that have generated many of the antibodies that are currently in the clinic as cancer therapies or autoimmune therapies work by essentially throwing millions and millions of darts at a dart board and picking the ones that hit the bullseye and ignoring all the others. We’re in a different day and age now, and our approach at A-Alpha is: Sure, we’re going to pick the ones that get closest to the bullseye, but we’re also going to take a snapshot of that entire landscape and learn from it so that when we’re developing the next therapy, we can get more shots on goal.

Chris: Yeah, every time you do it, you’re just making better darts, and you’re really figuring out what that looks like. It’s such a cool way to think about it. One other thing I wanted to ask you about here is just the power of data generation. I would say it’s something that we hear about a lot in biology and life sciences, but it’s hard to wrap your brain around it. How important do you think novel data generation is? And if you look forward 10 years, is every company going to be just figuring out, like, how do we produce huge novel data generation at scale? What’s the sort of importance level in terms of how you think about that as a component of strong discovery?

David: Yeah, I think it’s central to everything that we do. It’s central to many companies that are taking a similar thesis to ours of really being a data-driven drug discovery company. And I think that this isn’t just true for companies that are developing therapeutics. This is really a paradigm shift across all companies that are doing biological research and all academic groups that are doing biological research. The tools that are at our disposal today are enabling just a step-function increase in the pace of scientific discovery. And that’s because experiments don’t have to be run one at a time anymore. We can now synthesize millions of different defined DNA sequences that are all in a test tube that arrive in a week and at an affordable price. And we can build experiments essentially thousands or millions at a time and then use next-gen sequencing as an output. No longer do we have to create a single controlled experiment that tests for one hypothesis that focuses on just one question. We can now ask a million questions simultaneously, which allows for very, very rapid discoveries for therapeutic applications, but also just in accelerating our understanding of biology.

Chris: And it becomes such an important machine learning problem because historically, there have been a couple of data sets, like you mentioned BioGRID, and you could try to throw the 500th algorithm at that and see if you can find anything interesting, or you can go generate totally novel training data and do that iteratively. I think this is an interesting transition that when you think about applying machine learning to this space — this is something that’s really compelling to people who want to do that. And so, when you think about that, something that you’re doing obviously every day is building a cross-functional team — software engineers, ML engineers, wet lab scientists — how do you think about this? It’s not a challenge that every company faces. It’s very multidisciplinary. What is your thought on team building and bringing different personas and backgrounds together?

David: I think the most important thing in our experience is that the teams have to be really excited and passionate about working with each other. It’s not necessarily that a data scientist needs to come into A-Alpha knowing everything about biology, it’s that they have to be incredibly excited about getting into the weeds and learning about it, at least to the depth that’s needed for them to understand the biological context and to ask the right ML questions.

The same is true for biologists and biochemists. They need to understand what the data science is capable of and the parameters by which we need data to come in, in order to effectively train models. So having a very close collaboration between those teams and a mutual interest in understanding what each other does is essential for the company to be effective. I think if we were building a team that was just wet lab and we were outsourcing all of our data science or vice versa, building just a data science team and outsourcing all of our wet lab, we would not be able to do what we do as effectively and certainly not nearly as quickly.

Chris: That leads me to a question, we hear this a lot now, right? That people are encouraging ML scientists to go work on life science. And the problem can seem scary. You’re walking into something where there’s all these biological terms, they don’t really make any sense — it feels like people are speaking a foreign language. Is it possible to teach an ML scientist biology and vice versa?

David: I think both are possible, but both are hard. In my own experience, I was trained as a wet lab biologist and would certainly not consider myself a computational biologist at all. But during my graduate work, I realized that in order to stay relevant long term in biological science, there is a need to figure out how to get more proficient at data science, because biological science is moving more and more toward these massive data sets that are impossible to parse without some sort of data science or bioinformatics tools. Without those skills, it would be very hard to stay relevant. And so, from that perspective, there’s a good amount of sort of healthy pressure on biologists to pick up some of those skills. If a biologist does not have any of those data science skills, over time, they’re more or less going to be relegated to generating data and then not having the tools to be able to play with that data. And I think, from the perspective of just about any scientist, the fun part of the job is not just generating the data, but it’s digging into that data and trying to get insights. From a data science perspective, I think a lot of people who choose to work in companies like ours have some reason to be passionate about biology. Maybe it was interesting from high school. Maybe it’s that a family member had a particular disease and that launched a passion to get involved in some healthcare-type aspect of a career. But you know, typically, there is some driving force that leads to that interest.

Chris: I’ve watched you between these two, so I can tell everybody that David is more than capable of it. But I do think these worlds are colliding in a very interesting way. I think what you mentioned about the scientists realizing that the quantitative tools are becoming table stakes is getting more and more true. And the ML engineers and the ML scientists are realizing there are some interesting problems to go solve on the life science side where you’re just going to get handed a pile of novel, interesting, totally untouched data, and you’re going to go generate new insights. And that has to be a powerful message, right?

David: Absolutely. Part of it is the impact, right? You’re potentially involved in better understanding biology or curing disease. But I think also there is something that’s just innately complicated and messy and noisy about biological data. It’s just a very exciting source of data that I see as very much the next frontier of data science.

Chris: Yeah, as we talk about, you know, the Intelligent Applications 40 and the companies that are on it, this is a core theme right here, especially as the world of biology and life sciences moves forward: you’re not going to be able to interrogate the data without machine learning tools. It will be too high scale. You won’t be able to do it manually in the ways that you used to. And so, that really brings up another question, which is on the company creation side. It used to be that life science companies generally spun out of an academic lab, got a really core life science-focused investor, and then tried to get something into the clinic. A-Alpha and companies of your style have been built differently than that, right? You did spin out of an academic lab, but the funding and the investors are different and the goal is a little bit different. So, I’d just love you to talk a little bit about how the kind of different company-building process here has worked. You have a tech-focused investor on your board, us at Madrona. You have a much more life-science-focused investor on your board over at Xontogeny. What has that dynamic been like, and how has that influenced the way you think about building the company?

David: We are very much a platform company in the biotech universe, so we are not focused just on getting single drugs to the clinic. We’re focused on building a platform that can be used over and over again to glean biological insights and to develop multiple therapeutics across potentially many different disease areas. It was very important to us to have the credibility and the know-how of a group like Xontogeny Perceptive and Ben Askew, who is on our board: traditional life science investors who know that process of taking biological data, finding targets, finding drugs, getting those drugs into the clinic. That is really core expertise that we absolutely need. But we’re also thinking again about building this platform that can have a long-term impact across lots of therapeutic programs, which is a big part of all of these efforts in machine learning to train predictive models and improve the way that we’re able to discover drugs over time. This is all sort of core to our platform thesis. And so, bringing Madrona and you, Chris, and Matt onto the board has really helped to bring that sort of platform perspective. And it has led to very productive conversations with some amount of healthy tension between that traditional life science investor and the kind of longer-term tech-enabled VC perspective.

Chris: As someone who gets to go to the board meetings, it’s been so fascinating to get to see these discussions play out across a bunch of different angles because, at the end of the day, right, you’re one way or the other going to help a therapeutic get to market. It might come directly off of the platform, you might enable someone else to do it, but that would be the goal. However, you’re going to use all of these modern approaches and tools to do it. And so, you kind of have to think equally about, how do I build the software and ML capabilities of my business. How do I build them in line with the life sciences and the biological capabilities of the business? And it seems like this paired investor approach, at least in your case, has worked really well. Is that something you’d recommend to other companies — this kind of hybrid style of investors?

David: I think that if you’re building a company that really lies at this intersection of data science and biological science, and your goal is to build a platform company, I think having investors, having board members who have experience in those two different domains is incredibly valuable. I think a good example of that, right, is that across the pharma industry, across biotech, there is still a lot of uncertainty around how to structure business models around data. So, you have all of these companies that are generating massive data sets and they’re using those data sets internally to discover drugs. But one of the things that has been fun about many of our board meetings is that we’ve started to have creative discussions about, you know, how else might we be able to use that data in creative ways. And I think those conversations only happen when you have folks with very different perspectives and different experiences at the table.

Chris: Yeah, you beat me to the question because before we jump into the lightning round, I was going to ask you about changes in business models. Historically there have been three ways to build a business model in this platform space. One, you build a drug internally, you take it into the clinic, and then likely you license it to someone else. Two, you have amazing platform capability, generate insights, and you partner with people who have interesting things for you to work on, and hopefully, they take a drug to the clinic. And three, you can be a little bit more, call it service-like or CRO-like (contract research organization), which has often been like, hey, someone needs some stuff done, and so they’ll send it to you. And A-Alpha, I think, has been very creative in figuring out how to create different versions of all of those models. And I’m curious, how are you seeing the business models evolve, and is there a type of business model here that you’re more excited about or that you are really excited to see emerge?

David: So, because we have a highly differentiated technology that really allows us to enable both ourselves and potential partners to discover new drugs that they wouldn’t be able to discover otherwise, we are able to work with partners in the context of a service-like model. But it’s sort of a very high-valued service. It’s a partnership that is structured around work done by A-Alpha but then comes with an upfront payment and milestone payments, and eventually royalties, which is that traditional pharma partnership model. We also can use our capabilities, kind of turn them inwards and build a pipeline of our own and there are certainly exciting opportunities there. What we’re also starting to think more about is, again, how we can really leverage data and data generation as an asset in and of itself. So, talking with partners about the potential to build data sets together that can be used to train predictive machine learning models in a way that really is only enabled by the type of tool that we have that gives us that competitive advantage for generating data sets that no one else is able to generate. I think that there are a number of different exciting ways, over different time scales, in which we are able to leverage the types of data, the types of capabilities that we have, both directly for discovering and optimizing therapeutics, but also for moving the whole field forward by generating these massive data sets.

Chris: Personally, like you just said, I think these data partnerships are going to be crucial for how this industry as a whole pushes forward with machine learning. As these data sets get created and these models get trained, we really will start to see these new types of deals emerge, and I think it’s emblematic of what companies in the IA40 are doing in general, which is pushing the boundaries of the existing business model by leveraging data, one way or the other, right? Data plus intelligence. So, I think you guys being out on the leading edge of that is particularly telling, and it makes sense why you are part of the IA40.

David: I think as well, in the pharmaceutical industry, there has been a very strong historical resistance to sharing any type of data. Targets are confidential. Data is confidential. I mean, everything is behind closed doors, under lock and key. I think because you’re starting to see more and more proprietary sources of kind of niche data sets that help to explain different aspects of biology, that are hugely valuable in and of themselves but might be even more valuable when they’re all combined together, it really starts to create better incentives for companies to get creative about how we can leverage data sharing or different ways to enable the entire industry to grow by getting more creative with data partnerships.

Chris: And I just think it’s going to be such an exciting thing to watch going forward. To wrap up, we’re going to do a lightning round of the three questions that we’re asking all of the IA40. So, I’ll start with the first one: aside from your own company, what startup or company are you most excited about in the intelligent application space and why that company?

David: There are so many of them. I’m most familiar with biotech, biopharma, so I’m going to give one from that space, which is Octant. This is a company that I’ve been a fan of for a long time — founded by Sri Kosuri, who I’ve also been a fan of for a very long time. They are using synthetic biology to engineer cells to essentially develop different disease models. They’re leveraging the power of DNA synthesis and DNA sequencing — but to create cellular models so they can test small molecule drugs in a much higher throughput setting.

But they, like us, can use the power of high throughput experimentation to develop these massive data sets and essentially improve their platform over time by training predictive machine learning models. So yeah, a very, very cool company that has some analogies to what we do but really focused on the small molecule drug discovery space.

Chris: Totally agree, Octant is a super cool company, pushing the boundaries. Second question. So, outside of enabling and applying artificial intelligence to solve real-world challenges, what do you think is going to be the next greatest source of technological disruption and innovation in the next five years?

David: Keeping this close to home and in kind of the realm of biology and biotech, I think it’s DNA synthesis. That has been a major limitation for many decades around, you know, how quickly you can do experiments, how high throughput you can do experiments. We’ve made a lot of progress even over the last few years. We’re now at a point where we can order from a company like Twist on the order of tens of thousands of 300 nucleotide oligos and get them back in about a week. But that’s only long enough DNA for us to produce very, very short proteins. So over time, what we really want to see at A-Alpha is advances in DNA synthesis so that we can synthesize arbitrary-length proteins and massive, massive libraries of those arbitrary-length proteins. I think that that’s going to be a major driver for even higher throughput experimentation, even more precise experimentation, because then we can fully define those proteins that we make. And it’ll just have a really, really huge impact across biotech, biopharma. And there are a lot of companies working in this space. I think there’s a good chance that there’s going to be at least one company that really cracks this.
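The limitation David mentions follows from the genetic code: each amino acid is encoded by a three-nucleotide codon, so an oligo's length caps the protein it can encode. The sketch below works through that arithmetic. The `flanking_nt` parameter is a hypothetical allowance for non-coding regions such as primer sites, and the value of 40 is an arbitrary example, not a figure from the conversation.

```python
NUCLEOTIDES_PER_CODON = 3  # one amino acid per three-nucleotide codon

def max_protein_length(oligo_nt, flanking_nt=0):
    """Upper bound on amino acids encoded by an oligo of the given length.

    flanking_nt is a hypothetical allowance for non-coding sequence
    (for example primer or adapter regions) that reduces the coding space.
    """
    coding_nt = oligo_nt - flanking_nt
    if coding_nt < 0:
        raise ValueError("flanking regions exceed the oligo length")
    return coding_nt // NUCLEOTIDES_PER_CODON

# A 300-nucleotide oligo used entirely for coding sequence.
print(max_protein_length(300))

# The same oligo with an assumed 40 nucleotides reserved for flanking regions.
print(max_protein_length(300, flanking_nt=40))
```

So a 300-nucleotide oligo encodes at most 100 amino acids, and fewer once any flanking sequence is reserved, which is why longer synthesis would open up larger proteins.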

Chris: All right. Last question. What is the most important lesson or something that you look back on and you’re like, boy, I wish I could have done that better, that you’ve taken away from your journey building A-Alpha so far?

David: One of the things that I’ve gotten to eventually, but that maybe took me a little bit too long, is finding the right balance of stepping away from all of the technical details. I think being a technical founder is a blessing and a curse. It’s a blessing because you understand the ins and outs of the system, of the platform. You can be involved in those technical conversations and help to steer the technical strategy of the company. But I think, at a certain point, there is a drawback to being too involved. Partly that’s a bandwidth issue, right? I mean, I have other things that I need to be focusing my time on. But I think the other thing that it took me maybe too long to realize was that when I’m in the room, the conversation is inherently different. So, sometimes, to have the best technical discussions, it’s important for the CEO not to be in the room. That was something that probably took me too long to figure out, and I’m glad that I’ve gotten there eventually.

Chris: I think that’s a really important and thoughtful insight for similarly technical founders out there, and it’s a good one to end the conversation on and a good note for everybody to take home. So, David, we really appreciate the conversation and you being on this episode of Founded & Funded.

David: Thank you, Chris. Wonderful to be here.

Coral: Thank you for listening to this IA40 Spotlight episode of Founded & Funded. To learn more about the IA40, please visit IA40.com. To learn more about A-Alpha, visit AAlphabio.com. That’s A-A-L-P-H-A-B-I-O.com. Thanks again for listening and tune in in a couple of weeks for our next episode of Founded & Funded.

Related Insights

    Putting the Tech in Biotech: Why Now is the Time to Build Tech-Enabled Life Science Companies
    dbt Labs’ Tristan Handy on the Modern Data Stack, Partnerships, Creating Community
    Data Boundaries are Blurring in a Multi-Cloud World
