Today we’re bringing you a special live episode of Founded & Funded hosted by Madrona Partner Jon Turow, featuring Carina Hong, Founder & CEO of Axiom, and Byron Cook, Vice President & Distinguished Scientist at AWS.
Carina is building foundation models trained on verified proofs rather than human-written reasoning. Byron leads the automated reasoning group that secures AWS’s massive infrastructure by teaching machines to reason about real-world systems.
In this conversation, Jon, Carina, and Byron explore what it really means to move from models that appear to reason to systems that can prove they are right. They dig into why verification becomes essential as AI moves into high-consequence domains, how formal methods and machine learning are converging for the first time, and what happens when reasoning shifts from a scarce human resource to something that can scale with machines.
Listen on Spotify, Apple, and Amazon | Watch on YouTube.
You can also read Jon’s takeaways for founders here.
Jon: So help me orient, we’re in a world where we’ve been talking about reasoning machines all day. And actually the models that we have that are sentence completion, they seem to reason, they seem to think.
So why do we need to be talking about next generations of reasoning when we have pretty amazing models already. Why do we need to be doing advanced ambitious new moonshots?
Carina: Yeah, for sure. I think models are getting really good at pattern matching, and matching specifically for human preference. They are trained on chain-of-thought data, which is written by experts: lines and lines of chain of thought for approaching a specific problem.
However, there are certain areas where human preferences don't cut it. Those areas require objective truth: absolute right-or-wrong answers that can't be settled by some humans preferring one answer over another. One example is mathematics. Other examples are engineering, coding, the quantitative finance sector, those more quantitative, scientific areas. And reasoning models fall short at that.
For example, if you ask a model to approach a really difficult high-school-level mathematical proof, it actually cannot generate the correct intermediate steps. And that's a really sharp contrast with some other domains, where it achieves expert-level performance.
And there's also a question about scaling. Currently, we see them doing okay, but as you try to hill-climb in these domains, it gets harder and harder to make reasoning models continue to work, because currently the way to give them some signal about whether they're getting it right or wrong is by having 10,000 or 100,000 large language model agents voting yes or no.
And that gets extremely expensive as you try to scale up to harder and harder difficulties in those domains. So verification must come in, which is objective truth, right or wrong. I'll leave it at that.
Jon: Byron, what do you think?
Byron: So I’m coming in, I’ve been at Amazon for 10 years applying proof tools in AWS infrastructure and also for customers. And so I come at it from the other side where an incorrect answer is a crime, so you can’t give the wrong answer. So we apply mathematical logic to-
Jon: $100 billion of billing in a year. We got to bill it right.
Byron: Exactly, yeah. The cryptography needs to be correct. The virtualization needs to be correct. The storage needs to be correct. The interpretation of the policies needs to be correct. So that’s where I’m coming at it from.
But the sort of challenge for us has been that the tools that you use to achieve that 100% correctness were harder to use and required PhDs. So for me, what’s quite exciting about the recent trends is that suddenly these tools in combination with machine learning become actually potentially much easier to use.
Jon: Can you say a little more about that? One thing that’s amazing about this field, applying reasoning and formal verification in technology has been around for a very long time. I think even before you got started with it, which was some time ago, but it has been an artisan craft and it required experts to verify line by line that this airplane autopilot system is going to work as designed.
So we couldn’t use it everywhere because we had to use it only where we could find the resources to do it, justify the resources, find the people. So what changed, Byron?
Byron: First of all, as computers were networked and as society began to take more dependence on computers, the need for that work became more and more important. So as people move to the cloud, for example, proving the correctness of the cryptography became the reason the customers actually came to AWS, proving the correctness of the policy interpreter became the reason that people came to AWS. And so it helped AWS grow. It helped customers move and that became an area of importance.
And then we at AWS have been able to make the tools more and more usable, but there have also been other advances. For example, the Rust type system is essentially separation logic combined with really good error messages, so there's been effort there. And then more recently, with machine learning tools, that's really been a breakout. The other thing to keep in mind is that the models are often trained with data from proofs, either synthesized or discovered.
And so the models are actually often quite good at using, for example, the Lean theorem prover. And so that combination of the human and the AI tools, understanding Lean or Isabelle makes the tools much more approachable.
Jon: I want to put a fine point on this because it's something that took me a while to get my head around. Chasing bugs, chasing incorrectness in something like a billing system or a security system, or even an autopilot system, was for a long time a bit like being a goalie: you're just trying to stop every vector of something that could go wrong, trying to think of all the things that could possibly go wrong and knock them down one by one.
And you’re using a word proof, which is a big word, which says, “It’s not just everything that I’ve thought of that is going to break, but this will not break. This works and I can prove it.” Can you say a little more about that?
Byron: So you can identify properties that you want to hold in a program. So, for example, all data at rest must be encrypted or there’s no memory corruption errors or that strong consistency is actually guaranteed as opposed to eventual consistency.
And then you can prove that property using techniques like mathematical induction, which is the same technique we use to prove the Pythagorean theorem or the Four Color theorem. It's the foundation of how we know what's true of the infinite, how we know there's an infinite number of primes. It's the foundation of truth in our society.
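To make that concrete, here is a minimal sketch of a proof by induction, written in Lean 4 using only the core library (the theorem name `zero_add'` is illustrative; `Nat.add_succ` is the standard core lemma). It is the same shape of argument, a base case plus an inductive step covering every successor, that underlies results like the infinitude of primes:

```lean
-- Proof by induction that 0 + n = n for every natural number n.
-- (In Lean, addition recurses on its second argument, so this direction
-- is not true by definition and genuinely requires induction.)
theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl                 -- base case: 0 + 0 = 0
  | succ k ih =>                -- inductive step, hypothesis ih : 0 + k = k
    rw [Nat.add_succ, ih]       -- 0 + (k + 1) = (0 + k) + 1 = k + 1
```

The same mechanism, stated property plus inductive argument, is what scales from toy lemmas like this up to properties of cryptography or storage systems.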
Jon: And so when we say proof, we mean it in a mathematical sense that yes, your AWS bill’s going to be right?
Byron: Yeah. What's amazing now is that, with the usability of generative AI, the world and society have discovered this problem too. People are trying to replace sociotechnical mechanisms with agentic tools or chatbots, and that notion of, "Oh, the answer is wrong. How do I know the answer is right? Can you give me an argument for why the answer is right?" is something the world is waking up to the importance of, whereas before it was quite an obscure area.
Jon: Yeah. So, Carina, there’s something we’ve spoken about a bunch of times, and I want to sort of share this with the group, that there is a market for reasoning on this earth in this world, and there is not enough really sophisticated reasoning to go around to solve the important problems that we face.
How in the world is that true? What does that mean? And what does it look like when we go from a world of scarcity to abundance in reasoning?
Carina: Yeah, 100%. To build on the point Byron was making: for a lot of programs, we currently have an output and a property. Large language models are good at generating the output; they're unable to prove that the output satisfies the property. So that critical link is missing, and that's why proofs, in addition to computation, become extremely important.
Only when a model is trained on proofs that are themselves executable programs can you know for sure whether it's right or wrong. The real technical breakthrough happened in formal languages like Lean and Isabelle (there are a few variants of theorem-proving languages), which turn mathematical proofs into computer programs via the Curry-Howard correspondence. That means for every single math problem and the proof associated with it, I can turn it into a computer program, which I can then run.
After it finishes running, I see either a check mark, which tells me, congratulations, the proof is good to go, or an error message, which I can then feed back to the model to help it revise and adjust toward the correct answer next time. It keeps doing so, and eventually I get a check mark: wow, congratulations, this proof is actually logically sound.
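A toy illustration of that check-mark-or-error loop, sketched in Lean 4: the checker silently accepts a valid proof, while a false claim produces a precise error message, and that error text is exactly the training signal that can be fed back to a model.

```lean
-- A correct proof: the Lean checker accepts this silently (the "check mark").
theorem two_add_two : 2 + 2 = 4 := rfl

-- A bogus claim: uncommenting the line below makes the checker emit a
-- type-mismatch error, because 2 + 2 does not reduce to 5. That error
-- message is the feedback a model can use to revise its next attempt.
-- theorem two_add_two_wrong : 2 + 2 = 5 := rfl
```

Because acceptance is decided by the proof checker, not by human judgment, the yes/no signal stays cheap no matter how hard the underlying problem is.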
At Axiom, we are trying to solve the problem of the scarcity of outlier human reasoners. If you think about the history of scientific discovery, I keep thinking back to certain mathematicians' lives and their legacies. Ramanujan died, I think at the age of 32, from starvation and ill health, and he left tons of notebooks. Those notebooks contain formulas that took subsequent mathematicians decades, if not centuries, to prove.
And I also think about Galois, the guy who really pioneered group theory. He died incredibly young, I think at 22, in a duel over a girl he loved. These are the kinds of outlier human reasoners that society lacks for true scientific breakthroughs to happen and accelerate into market applications. Think about the abacus giving rise to accounting, or integral calculus giving rise to thermodynamics and mechanics and then the industrial revolution. Think about the Babbage engine, a mathematical tool to compute log tables faster; that's the prototype of a computer. But there's a significant time lag of 200, 300 years.
AI compresses everything. Consider what happens if billions of AI mathematician agents go to every single complex system that is currently not understood and work with applied scientists and domain experts. In chips, for example: have a theorem prover help formally verify chips. With the best software engineers: have a theorem prover help formally verify code. Or work with people who really understand the economics. Quant trading has previously been restricted to the domain of high-frequency or some mid-frequency quantitative trading; I was previously working at XTX Markets, a leading hedge fund. But why has no one thought about bringing those really outlier quantitative methods to more traditional, say, day trading? What happens if you have an AI Terry Tao, an AI math whiz, working with every user on Robinhood? I think that's the question we need to ask ourselves as we enter the era of math intelligence, of outlier math reasoners grounded by verification, because we can't afford billions of agents whose answers we don't really know are right or wrong.
Okay, I see a 1,000-line math proof and you tell me I have no way to check it. Surely I cannot have humans check it; that's extremely expensive when it comes to scaling. But once you have these thousands of math proofs and you know there is no chance they're incorrect, that they're good for use, that's incredibly impactful. What's even better is that, in an era of agents, I can make sure the stuff I pass from agent A to agent B is correct. I couldn't really afford to pass incorrect, lousy work to my teammates during my legal training at Stanford Law. But what happens if you have verification, if you have agentic workflows, if you have the ability to conjecture and generate new problems to think about and then have the verifier ground and prove them? This recursive, self-improving loop, I think, is the amazing future that building a reasoning engine at the model layer helps unlock.
Jon: So let me try to make this really tangible. Take me back to your work at this quant fund, XTX, which hires people who are really, really good at math, and basically hires every person good enough at math that they can find on earth. And they can't find enough people.
And there are other fields that do this too, by the way, but let’s focus on one field and one firm. They’re in a world of scarcity of reasoning. What happens to a place like that when they can suddenly turn on more reasoning like a tap of water, like a tap of electricity?
Carina: I think the future for these players is quite interesting, and the dynamics differ for each of them. So imagine you are a firm that doesn't have access to the top MIT or Princeton math PhD students. You just do not have the talent pipeline to employ these extraordinarily bright young minds as quant traders.
Suppose you are a hedge fund in a market whose total trading volume is quite low, which means low-hanging fruit is everywhere; alpha is really not so hard to derive, but you just don't have the talent to help unlock it. I think that's really meaningful for these players, first of all.
Second of all, the players that already have amazing talent now have these AI mathematicians to collaborate with. They avoid the sort of problem I've read about at some other hedge funds (I've never worked at one, I've just read about it in all sorts of books) where different desks have different trading strategies.
For example, you can be quite isolated, developing one thing that you own fully, end to end. Now you have these AI mathematicians, bound to be correct, to bounce ideas off. You can imagine talking to the AI at tea time and generating really interesting quantitative methods with your AI collaborators.
So having this sort of strong supervision from the existing amazing talent, the outlier human reasoners, is only going to accelerate discoveries and breakthroughs in the algorithmic, scientific sense.
Jon: Yeah. One thing that I think we've seen with the first waves of GenAI is that the marginal cost of creativity and experiments is dropping. We saw this in the cloud, Byron, at AWS: the marginal cost of experimentation for a cloud developer dropped, and I could run an experiment much faster without ordering racks.
And now I can do that in all kinds of creative fields. And I think, to your point, Carina, if I'm now doing it in fields where being right super matters, be it because I'm betting billions of dollars, or people are going to get hurt if I'm wrong, or any number of reasons, you start to get to that next level if you can move from a world of reasoning scarcity to reasoning abundance.
Carina: For sure. The Jevons paradox tells us that when the cost of a tool becomes lower, new markets and use cases get unlocked. And I think we are really in an era where the Jevons paradox is our opportunity. What happens if every cloud compute provider can afford, say, the Facebook engineers who are currently paid two million dollars a year to work on routine, mundane IT optimization ticket tasks?
Just think about what that means. What if a hedge fund in a market with a trading volume of only, say, eight million dollars a day can now afford an AI trader as good as the people currently paid 20 million a year as a starting salary at certain quant shops, because it can afford these AI mathematicians and only has to pay them $5 an hour?
Jon: So help me. I'm laughing, with maybe some other people, at the phrase "only eight million." Help me understand: how broad is the application of reasoning with math? Is this about creating more math for its own sake, or is there huge generalization here? I think that's counterintuitive; at least it was for me, until you guys explained it to me.
Byron: So there's a little bit of survivor bias going on, because I'm pulled into the conversations where it's needed. But from what I see: I've worked on reasoning about genetic regulatory pathways, railway switching safety, microprocessor design, operating systems, virtualization, storage systems, networking.
So for any question where there's a system and the system is evolving, there are questions around termination and reachability, and most of those turn into questions about unbounded, infinite, or intractably large problems. I haven't really talked more about Lean, but there are also the propositional satisfiability solvers and SMT solvers, which handle combinatorial reasoning. So the problems are typically NP-complete or undecidable, and these tools eat those problems up for lunch. Many, many problems are translatable to state-space reachability or termination.
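To show what "combinatorial reasoning" means at its simplest, here is a brute-force sketch of propositional satisfiability (SAT), the canonical NP-complete problem Byron alludes to. Real SAT and SMT solvers use far smarter search than this exhaustive enumeration, but the problem statement is the same; the function name and clause encoding here are illustrative choices, not any particular tool's API.

```python
from itertools import product

def satisfiable(clauses, num_vars):
    """Check CNF satisfiability by brute force.

    `clauses` is a list of clauses; each clause is a list of ints,
    where i means variable i is true and -i means variable i is false.
    Enumerating all 2^n assignments is exponential in num_vars; that
    blowup is exactly why industrial solvers need clever search.
    """
    for bits in product([False, True], repeat=num_vars):
        assign = {i + 1: b for i, b in enumerate(bits)}
        # A clause is satisfied if any of its literals matches the assignment.
        if all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses):
            return assign
    return None  # no assignment works: the formula is unsatisfiable

# (x1 or x2) and (not x1 or x2) and (not x2 or x3): satisfiable
print(satisfiable([[1, 2], [-1, 2], [-2, 3]], 3))  # {1: False, 2: True, 3: True}
# (x1) and (not x1): unsatisfiable
print(satisfiable([[1], [-1]], 1))  # None
```

Questions like "can this system reach a bad state?" are routinely translated into exactly this kind of formula, which is how reachability and termination questions end up on a solver's desk.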
Jon: Okay. Yeah. So if I follow this line, I have tools and techniques that are broadly generalizable: really, really powerful, superintelligent brains that we can turn on like a tap. Each of you is a leader who has to straddle two worlds, because you're doing advanced research to push the frontier.
And I know because each of you have told me that you’re also pushing really hard to apply this and actually put it in front of customers and make it land as opposed to stay back in the lab. And so maybe you can talk about the leadership experience of that and how you balance your resources and your team and even your culture about the relative importance of awesome math and code for its own sake versus the impact it’s going to have for customers.
Byron: So Strachey, who was a contemporary of Turing and founded the Oxford computer science department, said that the division between theory and practice in computing is injurious: the theoreticians essentially don't know what problems to work on, and the practitioners don't have a grounding in what they're doing. What I have found, even when I was in ivory-tower, blue-skies research labs, is that those who grounded themselves in customer problems were the most successful, because they were able to extract theoretical challenges from them.
So for example, I worked on termination proving, driven by the observation that device drivers need to terminate and that genetic regulatory pathways need homeostasis, and those are the same problem. What I've done with the scientists who work at Amazon, and throughout my career, is help people see that it's an and, not an or: work in the theory and be applied.
The other thing I’ll say is that one of the big challenges for the space is figuring out what it is you want, what should be true. So I can systematically remove incorrect statements from a chatbot about the Family Medical Leave Act, but first I need to encode what are the rules of the Family Medical Leave Act and to do that has typically been a challenge. So you end up spending a lot of time identifying what is the correctness of the AWS IAM policy language, what are the semantics of AWS VPCs, what is the semantics of Rust precisely? And that actually turns out to be surprisingly hard to do.
And so to make practical progress, you have to drive on that and then sometimes make some compromises. One of the very nice things about the rise of generative AI is that these tools now can help you do that. So you can get them to help translate the natural language into logic and help you figure out what the logic means and figure out what it is you want to try and improve and then maintain that over time.
So you might maintain your formalization of the Family Medical Leave Act and then as you deploy your agentic system or your chatbot based on that formalization, over time you might realize you got it slightly wrong and fix the formalization.
Jon: Got it. Let me turn this upside down before I bring it to you, Carina. Byron, explain it not to your team, who are going to want to know how much they should be doing awesome research versus applying it. How do you explain this to Andy and Matt Garman: why in the world should Amazon be investing at this level? Because Amazon has been investing bigger, heavier, and longer in formal methods and automated reasoning than anybody. How do you explain why that matters, why that's so important?
Byron: The customers did it for us. So I was in the Blue Skies research lab explaining things on a technical level in the past. And then when I joined Amazon, because I’m in New York City, I was pulled into conversations with financial services.
When they discovered the work we were doing, they went and drove that discussion. Actually, they went to other customers and helped them appreciate the importance of the work. And they drove those discussions with Andy and so on.
Jon: Because they said,
Byron: This is why we’re going to move to AWS.
Jon: This is why we’re going to move to AWS?
Byron: Yeah. Orders of magnitude more workload, because now we understand that the cryptography has been proved correct. We understand the story around the virtualization, we understand the durability story, and this is why we're now comfortable moving our workloads over. And this is something we would never be able to do ourselves.
Jon: Because previously it was a list of things that might go wrong, crossed off the list. And now Byron comes with a proof that says, no, seriously, we mean it.
Byron: Yeah. And you can also, it’s a transparent way of describing to your auditors or to countries why certain things are impossible. You can provide a proof and mathematical logic without showing the data why data can’t flow to certain places. And that’s important to certain governments and customers.
Jon: Got it. So, Carina, if I bring it to you, you’ve assembled a team of mathematicians and developers and researchers who like things that are awesome and they want to do advanced cutting edge work and push the boundaries of reasoning. How do you set the point with them that we’re going to apply this? And how do you set that balance and how do you calibrate what inputs they’re getting?
Carina: For sure. I think this is actually a contrarian panel in a way. The call to action of formal methods applied to reasoning is not something that has been talked about and appreciated enough, and I face this challenge when I'm talking to investors, when I'm talking to talent, when I'm talking to my mom: what exactly are we doing? And so we launched yesterday morning; Axiom came out of stealth.
There's a Forbes article you can read about it. We're building a self-improving, superintelligent reasoner, starting with an AI mathematician. All these terms are contrarian. VCs don't know why they should invest in an AI mathematician. People don't know what self-improving AI means, besides it being in the first paragraph of Zuckerberg's personal superintelligence memo. And what is superintelligence anyway? I think we know for sure this technology is likely to work, more likely than not.
And this is the reason why Axiom has been a hiring magnet. I think we are able to attract talent at the caliber of directors at Facebook AI and senior technical managers at other companies, because they believe in the mission. They believe in working for a company that is developing the next step-function change in technology.
That is, if you compound formal verification, like we talked about, with better and better learning and search, then you really have a decent shot at superintelligence, defined as a model that can generate new problems for itself to work on with 100% correctness and then generate more interesting, harder problems based on how it did last time. So the technical vision being sound has really helped with hiring.
And in terms of justifying it to investors and customers, I think people are seeing three things coming together right now as consensus in the AI field. One is that reasoning models, on the informal reasoning side, have gotten stronger and stronger, giving us a good backbone to build on.
Second is that formal verification tools like Lean have become really widely accessible, and so many Lean developers in the world are trying to build tools within Lean or to expand the mathematical and CS libraries in Lean. That has really grown from a subculture into a phenomenon. Two years ago, people who were Lean developers would talk to each other.
It's almost like, "Wow, I can't believe you are also doing this. This is a language that no one has heard of." But now we have a vibrant community of Lean users. So that's the second part. The third part is that code generation has gotten extremely, extremely strong. Reinforcement learning applied to a verifiable domain yields performance gains, as has been shown in coding.
And now, by turning math into programs for the first time, we can see reinforcement learning becoming good at math, logic, and reasoning. So these three trends powerfully come together, which is also the hiring thesis at Axiom: we need experts from three fields, from applied AI and deep learning, from mathematics, and from programming languages.
Really, it's the combination of these three fields, where math obviously includes both mathematicians and Lean experts, who usually are mathematicians. These three pillars are what it takes to build an AI mathematician as the first step toward a superintelligent reasoning engine.
Jon: So we’ve got time for I think one question from the group. Who would like to hear from Carina and Byron? Yes, sir.
Question: It's fascinating to hear from you. So, Carina and Byron, with your outlier intelligence or superintelligence, what problem domains are you most excited about going into or solving problems in? It would be good to hear some examples.
Byron: I'll characterize them as the problems where the space of satisfying assignments is very sparse. I appreciate that's probably a mathematician's way of describing the space, but in biological systems, in physics, and in program security, there are many, many programs that are incorrect, buggy, and insecure, and very few that are secure.
And so the task is to balance the different dimensions of privacy, sovereignty, security, availability, and durability, to write down those specifications, and then to find the programs that meet them. Especially in the enterprise, that space becomes increasingly sparse. So I think generative AI combined with formal reasoning, neurosymbolic AI, is really great there. An example which would not be great is poetry.
So I don’t think that formal verification is going to be especially good at proving the correctness of a song or a poem or something like that. So there’s a spectrum for sure, but those areas where very large amounts of money is on the line, people’s safety’s on the line, security, privacy, sovereignty, and where the consequences of getting it wrong are pretty bad is I think the area where the methods are a best fit.
Carina: Yes. What I'm most excited about really has two parts when it comes to go-to-market. The first is scaling formal reasoning in all existing markets, really looking around the corners, because we're talking about things where previously people just accepted that the verification of a protocol could take three years. And now you're telling me, wow, it takes two weeks. What happens then?
So I think scaling it up means looking around the corner, and that really depends on collaboration with various industry experts. We really appreciate customer inbound along the lines of, "Hey, you have this problem that I've tried to solve, and I think maybe formal reasoning is helpful. I don't really know, Carina." And then we can talk. These conversations starting up are a generally promising sign that perhaps we can find a corner that's immensely valuable.
And then the other part is: once this era of math intelligence is unlocked, what does it mean for unlocking new markets and use cases? That part requires some dreaming about the future. I read a lot of sci-fi novels, and I was actually asking sci-fi authors this question: what does it mean? No one could quite figure out exactly what it's going to mean, but that part is definitely something I'm extremely excited about.
On the technology side, I want to emphasize that the technology is extremely hard. We have a data scarcity problem. If we think about how many tokens of Python code there are on the internet, it's probably about one trillion; I think that's a safe estimate. And how many tokens of Lean code, the programming language for proofs? Probably 10 million. So that's a 100,000-times data gap. And we know that scaling works really well in AI.
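The back-of-envelope arithmetic behind that gap, using the estimates given in the conversation (both figures are Carina's rough estimates, not measured counts):

```python
python_tokens = 1_000_000_000_000  # ~1 trillion tokens of Python on the internet (estimate)
lean_tokens = 10_000_000           # ~10 million tokens of Lean (estimate)

gap = python_tokens // lean_tokens
print(gap)  # 100000, i.e. the "100,000 times" data gap
```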
So when your data is scarce, that generally means the technology is going to have a cold-start problem, which is why we are taking bold data bets: instead of relying on human experts to write the Lean code, we automatically, synthetically generate data that is high quality and extremely interesting for model training. I'll highlight that as one pretty difficult technological problem.
Jon: Cool. I think we can keep this going for a long time. I’m going to be thinking about what is the most true and correct song that I know. We can leave that-
Carina: The song of pi, perhaps?
Jon: Good, good. So anyway, that's a discussion for later. But thank you so much to Byron Cook and Carina Hong. It's been a fascinating conversation. Thanks everybody.