Jake Beck, Alex Goldie, & Cornelius Braun on Sutton's OaK, Metalearning, LLMs, Squirrels @ RLC 2025
Speaker 1:So we're just outside the lecture hall at RLC twenty twenty five in Edmonton at the University of Alberta, where Rich Sutton just presented his Oak architecture, and there have been some interesting discussions and comments about it. So, first of all, we all love Rich Sutton, and we credit him for founding the field, and we're glad he did.
Speaker 2:Much respect to Rich Sutton.
Speaker 1:Jake Beck, do you have any comments on the lecture we just saw?
Speaker 2:Yeah. I love Rich. I really appreciated the talk. He spent a long time talking about meta-learning and the bitter lesson, how we should avoid baking in structure and inductive biases when we can just learn those from scratch. And what I found a bit odd was that he then went on to talk about all this structure in the Oak framework and how we should impose all this specific structure.
Speaker 2:It really seems to me that we can meta-learn all these things. We can learn learning algorithms and have them learn all the things he wants them to do. They can learn to plan. They can learn to do model-based reasoning. We've seen that in early meta-learning work. And we also have large language models that are doing this.
Speaker 2:Not that I think large language models are necessarily the thing we have to start from if we want to get competent agents, but we know the right recipe. It's lots of data and some sort of sequential algorithm with memory, trained over many, many tasks. And we already have large language models that can do in-context learning, which is therefore meta-learning. I don't know why we need to go back in time and impose all that structure on the algorithms we have that are already working.

Speaker 1:Alex Goldie, do you have any comments on the lecture we just saw?

Speaker 3:I think that's key. Right? So in general, most breakthroughs throughout the last, say, ten years of deep learning have really come from reducing structure and leveraging data more. Right?
Speaker 3:I mean, that's what Rich calls the bitter lesson. And you can try and make arguments about what is or isn't needed for intelligence: hierarchical skills, having a model, etcetera. But I think what has been shown through LLMs, even if
Speaker 4:What the hell?
Speaker 3:they are not necessarily what you think is the correct route to intelligence, is that given a sufficiently expressive system and enough data, the kinds of behaviors you need to complete a task will emerge. So, exactly as Jake was saying, if you were just working towards the goal of maximizing performance, but you ensured you had an architecture that was able to plan, have a model, etcetera, all of this behavior would emerge without you having to impose the structure yourself. The key thing to me is that the more we rely on human intuition and handcrafted rules, as in Oak, the more we're going to limit our maximum potential.

Speaker 1:Cornelius Braun, do you have any comments on the lecture you just saw?

Speaker 4:Yeah. I think I'm a bit less spicy about it than the other two, but I also don't do meta-learning. So I think that everything he said seems like a valid recipe for going very far. I think it might not be optimal, as Alex just said.
Speaker 4:But I think generally the idea of using planning is very valid. I'm kind of a fan of it. And a lot of what he said also reminded me of Yann LeCun's big idea of how AI should work. I was actually surprised by that.

Speaker 3:I also think something worth recognizing is that I don't necessarily disagree with the idea behind Oak, that many of these behaviors are what we need to perform maximally in an artificial intelligence scenario. I just think it's worth asking whether we should actively be encoding that into the system, or working to the assumption that, as long as the system is smart enough, these behaviors will emerge if they are necessary.
Speaker 2:To build off what Alex was saying, and also to tie it back to the keynote given by Dale: we just need to make sure that the algorithm we are learning has the right computational setup so it can learn all the things Rich was talking about. So it can learn to plan, it can learn to learn. Once we make sure that is in the hypothesis class we are focusing on, all of that can be meta-learned, and we have demonstrations from large language models that it is meta-learned. Sure, we could scrap large language models and do all of this with RL, meta-RL from the get-go: just many RL tasks and a system with memory.
Speaker 2:And I think all of this can be and will be learned in the dynamics of whatever architecture we give it, as long as, as Dale said, it can support the computation necessary. We're not expecting quadratic-time computation from a constant-time algorithm.
Speaker 3:I think that's exactly right. And the nice thing about this talk is smart people thinking about what we need to ensure our systems are capable of doing. The question is just whether, when we design the systems, we should be doing that by hand or, as you say, ensuring that within the hypothesis class of the models we create, these capabilities can emerge. I think that's the key. Right?
Speaker 1:And this is Robin Chohan. I guess I'm surprised that multi-agent didn't have a bigger role, or much of a role, to play in this, since the whole idea that human intelligence may well have evolved from our social needs seems like it could be relevant here. Maybe I see this as a scaffold. Who, before LLMs, would have predicted that the most intelligent thing we've constructed so far would be an LLM? It wasn't many years ago that we wouldn't have imagined that.
Speaker 1:So to think that we can envision the ultimate architecture today, maybe that's out of reach. But maybe this is one part of a scaffold that'll let us envision the next thing, the same way LLMs were. I'm not sure anyone here really thinks that scaling LLMs is going to get there. That was a controversial point from yesterday, but at this conference it seems to me that most people don't believe just scaling up LLMs is going to be enough.
Speaker 2:I think we are running out of data for LLMs. I think the median estimate is that by around twenty twenty eight we will have used up all of the open data on the Internet for training large language models. And I think RL is a really promising way to get around that data bottleneck. Grok, I think, was trained roughly fifty percent with reinforcement learning in its last iteration. So it certainly seems like RL is becoming a bigger portion of the pie.
Speaker 2:It's unclear to me whether that is mostly for learning the correct chains of thought, or whether we can actually go beyond that and learn to tackle the difficult exploration problems we hope large language models will solve for us with the RL methods we have. But I think we are certainly going to hit a wall if we don't do something, and RL, at least in theory, should be able to bridge that gap.
Speaker 3:The thing that Rich Sutton will never, ever be wrong about is that to be able to improve continually, we have to learn from experience. And I think it's a pretty unanimous consensus nowadays that just increasing data and just increasing model size is going to saturate, or maybe has already begun to saturate. So we are going to have to start thinking about how to incorporate reinforcement learning into the process. Whether that uses current techniques, and whether training generalist agents will involve a language model or not, is up for discussion, and different people will agree or disagree. But I definitely think that at some point we are going to have to focus on making things better by learning from doing.
Speaker 2:One thing that is really cool about large language models is that not only do they give us a useful starting point for text data, they also show us that this general recipe can work. Not only can large language models answer very complicated questions, but within the activations of their transformer model they can do learning. That's in-context learning. They can do meta-learning. They can do continual learning.
Speaker 2:So all the things Rich really wanted, we've shown they can already happen inside a sequence model with this recipe, and we can just take this recipe and now apply it to RL.
Speaker 1:So I talked to Joseph Modayil, who is with Openmind Research, recently, and he pointed out: look at that squirrel over there. It didn't learn from reading the Internet, so how did it learn to be intelligent? Do you find that a compelling argument, or is there something missing here? Do we need to replicate squirrel intelligence before jumping to textual intelligence and thinking that we're going to get there?
Speaker 1:Well, I
Speaker 2:don't think we need to start with LLMs per se. I think that what they really do, as I was trying to hint at, like, they give us a recipe. It's a sequence model with memory plus tons of data and tasks. And we can do that in reinforcement learning. We don't need to read the whole Internet.
Speaker 2:But it also seems like we're throwing away a present someone has given us if we just completely throw LLMs in the trash.
Speaker 3:I agree that squirrels don't learn from reading the Internet. But it's worth pointing out that a baby squirrel also isn't a randomly initialized network; there's been a long process of evolution, which encodes a bunch of behaviors from which the squirrel can begin to learn from experience. And to me, that's one of the really compelling things about meta-RL. Right? It's learning to learn.
Speaker 3:I think in many ways, biological evolution is just another example of meta-learning.

Speaker 2:He talks a lot about meta-learning, and he talks a lot about runtime learning versus design-time learning, or what we'd call pre-training with large language models. And I think the irony here is that the more you do meta-learning, the more you shift into the paradigm of pre-training and large-scale learning before you deploy, because you're learning the learning algorithm that will then be effective at deployment, and you do that by moving training into pre-training. So these two things really felt in tension, and I don't feel like that tension was resolved in the talk.

Speaker 3:I think it's worth saying that Rich Sutton is the guy who did give us RL, and so I am also a huge fan.
Speaker 1:Hey, me too. I don't think anyone disagrees on that part. Yeah. That's the one thing we can all agree on.

Speaker 2:I would like to make it very clear that I am thankful to Rich Sutton, not only for the talk, but also for the field of reinforcement learning.
Speaker 3:One of the key things I've learned throughout my PhD is that when it comes to intelligence, even if you don't necessarily agree on the way to get there, it's never a good idea to bet against what Rich Sutton thinks you need. So one of my key takeaways from the talk was not necessarily that we need to design our systems such that they can plan and have options and all of this stuff, but that we should always make sure these capabilities are possible when we're designing our systems. I think that's the absolute key takeaway from what he was saying.
Speaker 2:I think we should meta-learn Rich Sutton.
Speaker 4:I mean, to your point, he also said that himself. Right? When people were asking about inductive biases, he said he doesn't have a problem with people putting them in, but you shouldn't be proud of it, which I guess aligns very well with what you said.