Ian Osband

Robin:

TalkRL. TalkRL Podcast is all reinforcement learning all the time. Featuring brilliant guests, both research and applied. Join the conversation on Twitter at TalkRL Podcast. I'm your host, Robin Chauhan.

Robin:

I am very excited today to be joined by doctor Ian Osband. Ian is a research scientist at OpenAI working on decision making under uncertainty. Thank you so much for being here, Ian Osband.

Ian:

Thanks a lot, Robin. Thanks for having me. I feel like a real celebrity here today. Big fan of the podcast.

Robin:

Thank you so much. Well, you absolutely are a celebrity, and I've been looking forward to meeting you and talking to you for actually many years. I've been following your work. I think a lot of people listening to the show will know your name, and, you know, I've always thought of you as a thought leader for dealing with uncertainty in RL and, specifically, epistemic uncertainty, which is really so central to the challenge of RL, and yet it gets glossed over so often as well. I've been a big fan of your work going back to bootstrapped DQN and your randomized value functions.

Robin:

I remember when OpenAI's random network distillation came out. That was back in 2018. I remember when that came out and thinking it was so cool. It was a solution for exploration. But I only recently realized, preparing for this episode, that they're building on your work, and they say that in their paper, which I was excited to see.

Robin:

And I also remember when bsuite came out and seeing that your agents did so well on the exploration tasks, the behaviour suite. But I only realized now, preparing for the show, that you were the first author on that paper as well, which I did not realize at the time.

Ian:

Well, I don't know if that makes it more impressive. That probably makes it less impressive because I could pick what went into it. But, yeah, I do like that work. Thanks.

Robin:

Yeah. You maybe have had a little bit of a head start by writing the paper, but, yeah. And I also remember watching your PhD thesis, which I came to a couple years late, in 2016. And we'll link to that as well, and these other things, in the show notes. But all this to say, super happy to finally get to talk to you, Ian.

Robin:

And so, in your own words, how do you like to describe your area of focus?

Ian:

Okay. Well, yeah. Thanks a lot, and, you know, really great overview. I think I'm interested in agents that learn to make good decisions and take good actions. So that's what I view as, like, the key thing in reinforcement learning.

Ian:

And I guess for me, you know, reinforcement learning is commonly used, kind of confusingly, both for the solution methods and the problem. And I'm definitely much more interested in the problem. And the key elements, I think, for reinforcement learning compared to standard supervised learning: you have to be able to generalize from datasets. That generalization learning, that's standard supervised learning. But when you take actions, there are two other problems.

Ian:

So one of them, people call exploration, is that basically you may only learn about actions that you actually take. You don't get the law of large numbers or the central limit theorem when you don't gather data. So you won't necessarily find out what's good just from observing the data. And then the second thing is delayed consequences or long-term consequences. If I take an action now, it can influence the state of the system in the future.

Ian:

And I think in dealing with both of those extra challenges, I mean, dealing with uncertainty, I think, is important in generalization as well. But I think it's critically important for dealing with exploration and long-term consequences. And so that word you use there, epistemic uncertainty, I think it's a good word for people who know what it means, but it sounds a bit pretentious. I think there's this question. It's like, hey, when there are things I don't know in the world, are there two different ways of knowing what I don't know?

Ian:

So you might say that there's, you know, I've got a fair coin. I flipped it a million times, and half the time it comes up heads and half the time it comes up tails. And so if I flip it, I think it's gonna be 50/50. I know it's 50/50, and it's 50/50 due to chance. And some people call that aleatoric uncertainty because alea is Latin for dice.

Ian:

The idea is it's chance uncertainty. And then there's this other idea of, like, okay, you know, someone tells me they've got a pet, and it could be a dog or a cat. It's one or the other. Could be 50/50, but I don't know which one.

Ian:

And that's sort of uncertainty due to knowledge. There's nothing chance about it. And episteme is like ancient Greek for knowledge, and so people call that epistemic uncertainty. Or it's like, I'm gonna flip the coin, but it's biased. I don't know which way it's biased.

Ian:

And you kind of need to handle those things differently depending on the source of uncertainty. And I'd say that dealing with that has been kind of the core of my work in reinforcement learning.
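
A tiny worked example of that difference, for what it's worth. This is an editorial illustration, not from the episode: both coins look 50/50 before any data, but only the one with epistemic uncertainty moves when you see a flip (assuming a uniform prior on the unknown bias).

```python
from fractions import Fraction

# Coin with known fairness: chance ("aleatoric") uncertainty, p = 1/2.
# Its predictive probability of heads stays 1/2 no matter what we observe.
fair_next = Fraction(1, 2)

# Coin with unknown bias p, uniform (Beta(1,1)) prior: knowledge ("epistemic")
# uncertainty. After observing one head, the posterior is Beta(2,1) and the
# predictive probability of heads is 2/3 (Laplace's rule of succession).
heads, tails = 1, 0
unknown_next = Fraction(1 + heads, 2 + heads + tails)

print(fair_next, unknown_next)  # 1/2  2/3
```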

Robin:

Now you moved from DeepMind to OpenAI recently, I gather. Did your focus adjust after that move, like, maybe more towards LLMs? Or can you talk about how your focus is evolving?

Ian:

Yeah. I think I've always been interested in trying to get these methods, even from my PhD, it's like, hey, I wanna get these insights from maybe toy or simple settings, but I wanna scale them up to the biggest challenges and the largest neural networks and, you know, the leading edge of AI. And, definitely, that's what I was doing.

Ian:

You know, we were working on that at DeepMind, and I think that's definitely still the focus at OpenAI. Actually, for the past few years, I was working with a team called the Efficient Agent team at DeepMind, and we were working on these ideas around LLMs. So I think that, you know, broadly speaking, the interest is still the same. Maybe I'm a little bit closer to some of the product work, and, you know, we kinda feel more in the trenches. But both are amazing places, and I'm really happy with the new team at OpenAI.

Robin:

So we've seen various approaches to dealing with uncertainty and different types of uncertainty in machine learning. And I think when I was coming to the field, I realized there's this giant gap between this kind of world of Bayesian methods, where uncertainty is handled in a very principled and correct way, but they're very costly for even dealing with small problems. And then the deep learning crowd, where people wish they could do the Bayesian thing, but they can't because it's so costly, and they need to scale up. And then the quality of the uncertainty estimates is just not nearly as good. It seemed like almost two different cultures, and there wasn't that much in between.

Robin:

And the efforts to bridge them didn't seem very practical. But I gather you're trying to bridge these two worlds. Is that one way of looking at what you're doing?

Ian:

Yeah. Yeah. I think so. I think so. And I think that although some of these things have kind of evolved a little bit separately, I think sometimes that happens because people really focus on specific solution methods.

Ian:

And so you go to the conference and it's like, hey, we're people who do Bayesian methods, or we're people who do this type of method. And I think that the fields that have made the most progress are where they really focus on the problems rather than the solution method. So, like, when they did ImageNet, right, a lot of people were really focused on doing well at ImageNet, and at some point it's, hey, I'm an SVM guy or whatever.

Ian:

But, you know, when the results came out, everyone moved on to deep neural networks, and I think they've made a lot more progress because of that. And so I think that the progress in LLMs and, you know, AI methods is just astonishing. And yet, I think that there's still a lot of untapped potential to make even better machines by taking some of these insights that are maybe handled more elegantly in the Bayesian literature. And, you know, this can sound a bit pretentious or high level, but I think the core insights are really pretty simple. It's like, hey.

Ian:

It's important when you're learning to be able to tell the difference between something you know and something you don't know. And I guess we're gonna talk about some of that today, but I'd say, surprisingly, even some of our best models at the moment aren't built with this from the ground up or can't do it in a natural way. And I think that if we can improve on that, then maybe we're gonna unlock the next, you know, step change in learning algorithms.

Robin:

K. So today, we're gonna focus on a few of your works in specific and also your themes. But we're gonna start with a paper, Reinforcement Learning, Bit by Bit. This is from 2021. First author, Lu, and with yourself as a co-author.

Robin:

For me, it's been interesting to see this relationship between information theory and RL developing, and you made some really interesting connections here. My background is computer engineering, so we studied some information theory from, like, the Claude Shannon perspective and communications and noisy channels. But in RL, I guess the challenge is not just the noisy channel. I guess there can be noise in RL. Maybe rather the idea of, like, actively steering the agent towards whatever information is more valuable, which is maybe not what Claude Shannon had in mind.

Robin:

So can you talk about the connection a little bit? Is information theory obviously the right framework for tackling exploration in RL? And how do you frame the RL problem in terms of information theory?

Ian:

Yeah. Well, that's a great question. So, I mean, I think this paper, by the way, Reinforcement Learning, Bit by Bit, I definitely wanna recommend it. I think, you know, I'm not the first author, but I think it's probably one of the best and most substantial pieces of work I've been involved in. I think that information theory gives you a really elegant way to think about uncertainty.

Ian:

And, actually, when I did my PhD, you know, if people out there have seen these square root T papers, I did a bunch of them. Epsilon-delta concentration bounds or whatever. And I think that some of those techniques, they can be good, but they're a little bit inelegant. Whereas sometimes, analyses through information theory just feel like they're so much nicer and allow you to think about things more generally in terms of, you know, generalization. Whereas a lot of the other tools were specific, you know, they started with, oh, imagine you have a finite number of states, a finite number of actions, and then you go up from there. I think that information theory has really handled generalization in a great way.

Ian:

Now as you say, you know, Claude Shannon, kind of genius, brought this field almost in one shot. And what I think that RL brings to the table is like, well, we're not actually seeking out information in RL. I think what you're seeking out in the RL problem is this notion of reward. Right? And it's up to you to define what the reward is that you want.

Ian:

You know, that just defines the problem. But information is sort of a very important instrumental goal in getting that. Because if you knew everything about the system, then it would just be a planning problem or something like that. And information is really a good way to think about learning. So I think this paper shows a more elegant and kind of coherent way to frame the whole problem and to think about things.

Ian:

Because sometimes if you're not careful, there are little details that don't fully make sense. And I think that if you read through this, it's a better way to frame things.

Robin:

Okay. This paper talks about the idea of the information ratio. Can you explain what that is, and what does it measure?

Ian:

Okay. Well, yeah, before we get to the information ratio, I guess there's a few ideas in this paper. I think one of the big ideas is to realize that whenever we're thinking about the world or the agent interacting with the environment, there are two key notions. One is this idea which we call an environment proxy. And so that's something like a model of the environment, but it doesn't need to be an explicit model like, oh, given a state, given an action, get a new state.

Ian:

So in the case of something like DQN, it could just be the Q-function and the replay buffer or the neural networks that you learn. But kind of formalizing that, hey, there's a separation. Or it could be a finite-state MDP. Right? An agent, you can be interacting in the world and model warehouse levels and inventory with, like, discrete states, but that doesn't mean it's the real environment. Right?

Ian:

So the real environment is the real world. And then you have this proxy that's your proxy for the real world. And so, you know, I think a lot of the framings, you know, in Sutton's book or something like that, they don't really make that distinction clear. They sort of act like the model you have is the real MDP, or it's sort of not clear. And so that's one notion, the environment proxy.

Ian:

And then another idea is a learning target. So it's like what the agent is gonna try to learn about. Now, again, these are kind of complicated things, but it's pointing out more explicitly, like, hey, when I'm learning, I don't wanna learn about everything. You may want to learn about the optimal policy. Right?

Ian:

Because when you talk about information gain, you need to say what are you gaining information about? And that's the learning target. So you might say, hey. I wanna learn the optimal policy. Or you might say, hey.

Ian:

I wanna learn the optimal Q-values. So this paper is unusual because it's more of a big-picture kind of framework that a lot of sub-algorithms could fall out of. But I think that thinking in this more general way helps to organize your thinking more clearly. And so then the information ratio is this term that comes out both in our analysis of certain algorithms and as something that you can optimize. And it's a ratio between how much you think you're gonna learn by taking an action and how much worse it is than what you think the best possible action is.

Ian:

And so it's quite a natural term. It's something like exploration versus exploitation. So exploration being trying out things to learn, and exploitation trying to do well given what you know. And this information ratio is this term that comes out as sort of fundamental to these analyses and also suggested a new algorithm, or family of algorithms, which we call information-directed sampling.
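
To pin down the quantity being described: in the standard formulation from Russo and Van Roy, the information ratio of a randomized action choice is its expected regret squared, divided by its expected information gain about the learning target, and information-directed sampling picks the action distribution that minimizes it. A minimal sketch with made-up numbers (an editorial gloss, not code from the paper):

```python
import numpy as np

def information_ratio(probs, expected_regret, expected_info_gain):
    """(Expected regret)^2 / expected information gain of a randomized action choice."""
    regret = probs @ expected_regret
    info = probs @ expected_info_gain
    return np.inf if info <= 0 else regret ** 2 / info

def ids(expected_regret, expected_info_gain, grid=501):
    """Information-directed sampling: minimize the information ratio over
    mixtures of two actions (two-point support is known to suffice)."""
    n = len(expected_regret)
    best_ratio, best_probs = np.inf, None
    for i in range(n):
        for j in range(n):
            for q in np.linspace(0.0, 1.0, grid):
                probs = np.zeros(n)
                probs[i] += q
                probs[j] += 1.0 - q
                r = information_ratio(probs, expected_regret, expected_info_gain)
                if r < best_ratio:
                    best_ratio, best_probs = r, probs
    return best_probs

# Toy numbers: action 0 looks best but teaches us almost nothing,
# action 1 looks a bit worse but is very informative, action 2 is just bad.
print(ids(expected_regret=np.array([0.1, 0.3, 1.0]),
          expected_info_gain=np.array([0.01, 0.5, 0.1])))
```

With these numbers the minimizer mixes the "exploit" action and the "informative" action rather than committing to either, which is the trade-off the ratio is meant to capture.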

Robin:

Okay. And can you talk about information-directed sampling? What is it trying to do? And maybe how does it differ from other major formulations of curiosity that we've seen?

Ian:

So I think, okay, at a super high level, this idea of trading off exploration versus exploitation is captured by a lot of different algorithms. And in some sense, the solution is trivial. You can write down your Bayesian beliefs, and you just maximize your expected return. And that is the optimal, the Bayes optimal solution. The problem is that doing that is essentially impossible, because it's got an infinite set of states, and it's, you know, it's impossible.

Ian:

And certainly when you think about, like, the real world or trying to make a chatbot or something like that, it just doesn't really make any sense to think about doing it. You can write it down on a whiteboard, but it's not something you can actually do.

Robin:

Mhmm.

Ian:

Now on the flip side, the other thing is like, hey, I'm gonna do the greedy solution, or what some people call estimate-then-optimize. It's like, hey, I don't know what the world is like. The world's too complicated, but I'll just come up with an estimate for, you know, the Q-values or something like that.

Ian:

A point estimate, and I'll say, hey, this is how good different policies are, and then I'll take the one that I think is best. And so that's called being greedy. And then people know that if you're just greedy, you gotta try out other things sometimes, so they add in a few random actions. And they're kind of random in a, you know, undirected way.

Ian:

We call that dithering. And so, like, epsilon-greedy, Boltzmann exploration, all of these things are like that. Even if they sound kind of fancy, or they might be sort of good heuristics in some cases, they can be really, really bad. It can take, like, exponentially long to learn in general. Okay.

Ian:

So I think that, but at least you can run it, you know, it's more computationally tractable. So the game is kind of about trading off statistical versus computational complexity. And something that I've worked on a lot, and people might have heard of, is Thompson sampling, which is this idea that, oh, I don't know what the best thing to do is, so I'll randomly choose things according to the probability I think they're the best thing to do. That's Thompson sampling. Or some of these curiosity or optimism approaches, which are like, hey.

Ian:

I don't know how good things are, but I'll assume that they're as good as they could possibly be. And if I assume that it's good, then I'll at least try it. And if it's good, then great. And if it's bad, at least I find out. And so those are a little bit better or can be good in some scenarios.

Ian:

But they miss out on some of the juice, because they're a very heuristic way of, their proxy is, if I try things that I think are good, then I'll learn a lot. But there are some things in the world that don't work like that. Right? So for example, imagine I'm late for my interview, and I'm on a new floor. I've never been there before.

Ian:

And I know that I've gotta be in one of the rooms on the floor, but there are a hundred rooms. What would the Thompson sampling approach do? Like, I don't know anything else. I think you could reasonably argue and say, hey, I'm equally likely to be in any of them, so I'll just pick one at random and go into it.

Ian:

Right? And, oh, no. It's not that one. Okay. Go again.

Ian:

Try another one. Right? And I just do that through. Eventually, I'll find it. Okay.

Ian:

Great. I will find it, but it'll take me about as many tries as there are rooms. Right? But what you'd like to do is come in and say, hey, I don't know where I am.

Ian:

Let me look on the map. Right? Now looking on the map is never the optimal solution. It takes a bit of time. You'll you know?

Ian:

I know that the, the meeting is not at the map. Right? I know that for sure. But you'd like an algorithm to say, hey. I really don't know what's going on.

Ian:

Let me let me get the information I need to make the right decision. And and that kind of thing, information directed sampling can do. And it's also a tractable algorithm. So that should be the idea.

Robin:

So is IDS a form of intrinsic motivation?

Ian:

Some of these things like curiosity or intrinsic motivation, it's hard to say exactly what they mean. I think that, yes, in a very high-level sense, it could motivate you to do things, but it's actually very motivated by what you wanna learn about. So IDS is giving you this way to balance what you wanna learn about versus how much it's gonna cost you. And it kind of looks at those two terms and makes decisions based on that. Whereas some intrinsic motivation approaches, you know, they really struggle on how to do this balancing, and I think that's where the interesting thing comes from this approach, and that it's tractable.

Ian:

So if you look in that paper, we've got examples of actually running a version of this with deep neural networks. With, you know, open source code, you can have a go with it. And, you know, it's still proof of concept at the moment, but it shows it's something you can actually do.

Robin:

On a high level, how do you think about what you can learn in the very next step? Like, I guess Thompson sampling works very well in the bandit setting, where it's only a one-step environment, or really there's no steps involved. Whereas in RL, there might be something that you don't know that is a few steps down the corridor. Yeah. So how do you think about the difference between what you're gonna learn in the next step versus what you may learn at some point in the future, even if the next step is familiar?

Robin:

Like, how do you how do you think about time in that sense?

Ian:

Yeah. Actually, you've given me a bit of a layup here, but thanks for that. So the title of my, my PhD dissertation was deep exploration via randomized value functions. And this distinction you're making about what you can learn right now, you know, you take one action and you learn something from that versus sometimes you have to take a sequence of actions, and you won't learn anything until the end. And we call that deep exploration.

Ian:

You know? It's obviously a bit evocative of deep learning, and it's a good word because everyone wants to be deep, you know, nobody wants to be shallow. But, yeah, it's deep because it is over multiple time steps. And just to give an example of that, it's like, okay. You know, imagine, well, I hate to use the video game analogy, because I think that people kinda get confused and think RL is only about video games.

Ian:

But, you know, in Montezuma's Revenge, this is a sort of famous example, you may need to take hundreds of actions, you know, one by one by one, in order to get to a new room. And probably for the first ten of those actions, you're not actually learning anything new, because you haven't gone anywhere new; you're just exploring what you already know. You're doing it because you're gonna set yourself up to learn something in the future. And that's what deep exploration is about. And I think it's an important aspect of the reinforcement learning problem and something that makes this exploration problem even worse for RL. I guess that's why I'm interested in working on these exploration or uncertainty things, because when you get this right, you can make the solutions, you know, thousands of times faster or millions of times faster.

Ian:

You can solve problems that you can even prove that even MuZero can't do, and it's not just proving, you know, it's also if you run it, they can't do it. You can solve these problems millions of times faster, even compared to the best agents we have. And so that's, like, super exciting to me. This is potentially something that doesn't give you a 5% boost. It doesn't necessarily give you a 10% boost.

Ian:

It can be an absolute deal breaker. And so that's why I think it's an exciting thing to research.

Robin:

Okay. So let's move on to the idea of joint predictive distributions from one of your papers. This is From Predictions to Decisions: The Importance of Joint Predictive Distributions, by Wen et al., with yourself as a coauthor. Now this phrase, joint predictions, comes up often in your work. And the first time I came across this idea of yours, I think I wasn't sure how to interpret the joint here.

Robin:

Like, what is being joined over? And I gather that it's over something you're calling an epistemic index. Is that right? What do you mean by joint predictions, and can you give us the gist?

Ian:

This has been a confusing thing. And so, definitely, we didn't go into this thing saying, hello, I'm super interested in joint predictions. Right?

Ian:

Absolutely not. It was more that only after kind of years of thinking did this emerge as a key concept. So, okay, here's the thing. Most machine learning is single input, single output.

Ian:

Like, you get one image, you give me one label. You get one image, you give me one label. Okay? And so that's the thing that we're calling marginal prediction. Marginal because it's like there's one.

Ian:

I'm gonna get one image. I'm gonna make one prediction. Joint prediction is I'm gonna get multiple images, and I'm gonna make multiple predictions. Okay? So instead of one image, I'm gonna get ten images.

Ian:

And you're gonna have to give me ten labels at the same time. And it's not obvious that that has anything to do with decision making, and that's why I guess we wrote these papers. But I can give you an example of this. Is that making sense for what joint prediction means?

Robin:

When you say there's multiple predictions given, like, there's some assumption about the independence of those different predictions or the information that each

Ian:

That's the point. So that's the point. That's entirely the point. You know, so much work is, like, I got one image, I gotta make one prediction.

Ian:

But the importance of joint prediction is how independent or not independent the predictions are. Right? And I think that this is what information theory, you know, makes so elegant. They've got this notion of mutual information, and that's the KL divergence between the joint distribution over these multiple things and the product of marginals. So the mutual information of x and y is how different the joint distribution of x and y is from what it would be if x and y were independent.

Ian:

So imagine if x tells you everything about y, then they have high mutual information.
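
For reference, the definition being appealed to here is the textbook one (editorial notation, not a quote from the paper):

```latex
I(X;Y) \;=\; D_{\mathrm{KL}}\!\left(P_{X,Y} \,\big\|\, P_X \otimes P_Y\right)
       \;=\; \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)},
```

which is zero exactly when X and Y are independent.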

Robin:

Like, there's something that's making them independent. Right? Either they're judging in a different way or they have access to different data. Like, why do we have multiple judges on a Supreme Court?

Ian:

So it's still only one. It's not that. So the idea is there's still only one judge. Right? So you are the judge.

Ian:

But imagine I come to you with an image. I come to you with one image, and it's an image of, like, this small blue alien. And I say, hey, do you think this is a bleeb or a blurb? Right?

Ian:

So you're a bleeb-or-blurb classifier. And you'll say, I don't really know. 50/50. Right? You don't know anything.

Ian:

Right? But now I'm gonna come to you. I'm gonna say, okay, cool. 50/50.

Ian:

Great. Now I'm gonna come to you with ten images, and five of them are these blue, small, round things, aliens, and five of them are these yellow, spiky, bigger things. Okay? And I'm gonna ask you a different question. I'm gonna say, hey.

Ian:

Give labels. What are the possible labels for all ten of these? Which of these are bleebs and which of these are blurbs? Okay? So when you're only classifying one, 50/50.

Ian:

Right? But you could make a prediction for all ten. You could say, hey, each one's 50/50. 50/50, 50/50, 50/50, all the way down.

Ian:

Great. But I think if you noticed that there were obviously, like, two types, you might make a prediction that's like, hey, either all these five are bleebs and the other ones are blurbs, or it's the other way around. Right? That's a very different prediction. Right?

Ian:

In one case, if it really works like that, you've got a 50/50 chance of getting all ten right. In the other case, you've got about a one in a thousand chance of getting all ten right. So does that make sense, what the joint prediction means? This is back to this epistemic-aleatoric thing we said at the beginning.
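
The arithmetic behind those two numbers, spelled out (an editorial sketch, using the made-up bleeb/blurb labels from the example):

```python
# Ten images, each 50/50 marginally. Two joint predictions with identical
# marginals but very different joint structure:

# 1) Treat the ten labels as independent coin flips: any particular full
#    labeling gets probability 2^-10.
p_independent = 0.5 ** 10
print(p_independent)   # ~0.00098, i.e. about a 1-in-1000 shot at all ten right

# 2) Notice the two visual clusters and predict "either the five blue ones are
#    bleebs and the five yellow ones are blurbs, or the other way around".
p_clustered = 0.5
print(p_clustered)     # a 50/50 shot at getting all ten right
```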

Ian:

So I've got one coin that I've flipped loads of times, and it's heads or tails. A million heads, a million tails came up. I know it's 50/50. Right? And I've got coin b.

Ian:

Now coin b is like really weird. I bring it out of a bag. I've got a funny grin on my face and I say, hey. You wanna flip to this coin instead? And, basically, what I'm trying to suggest is that this is a biased coin.

Ian:

If you flip it, it will always be tails or always be heads, but you don't know which way around it is. Okay? Now if I ask you, coin a or coin b, I'm gonna flip them once, well, what's your probability of heads? It's 50/50.

Ian:

Right?

Robin:

Makes sense.

Ian:

So if you have a classifier, a neural network that takes in the coin and outputs a probability of heads, you can't tell the difference. Right? Mhmm. But now you and I are gonna play a game. And the game is we're gonna have a hundred rounds, and each round you get to pick a coin, a or b, and you get to flip it.

Ian:

If it's heads, you get $1. If it's tails, you lose $1. Okay? Or we can call it 0. But let's say you win 1 or you lose 1.

Robin:

K.

Ian:

Now I say, hey, what do you wanna do? What's your choice? And I'm gonna ask you right now, like, what would you wanna do?

Robin:

I would have to think about it a bit more. Yeah.

Ian:

It's hard. It's hard. Okay. Well, I guess what I'm saying is, okay, if we say you're gonna gain 1 or lose 1, if you pick coin a, its expected value is 0. Right?

Ian:

So it's not really good. It's not really bad, whatever. 50/50. Now coin b, on one flip, it's also either gonna be heads or tails, 50/50. Right?

Robin:

You're either gonna win every time or lose every time on the one coin. Right? And the other coin... Exactly. ...it's gonna even out.

Ian:

Importantly, if you flip coin a, you don't learn anything. Right? You don't learn anything from flipping coin a. But if you flip coin b and you get heads the first time, hey, you're in the money. You're gonna make $100. Right?

Ian:

Great. If you flip coin b and you lose $1, you're not gonna flip coin b again. Right? You're gonna go back to a. Do you see what I mean?

Ian:

Because you can just flip a for the rest of the 99. So the best decision is to flip coin b first. If it gives you money, I keep doing that. If it's bad, I go to a. So it's clear that that strategy has a much higher expected reward.

Ian:

Right? Now with a neural network that can only tell you the probability of heads, you can't do that strategy, because you can't tell the difference between a and b. Over one flip, over one marginal prediction, it's just 50/50.
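
Spelling out the expected values behind that strategy (editorial numbers, following the payoffs just set up: 100 rounds, win $1 on heads, lose $1 on tails):

```python
ROUNDS = 100

# Always flip coin a (known fair): every flip is worth 0 in expectation.
ev_always_a = 0.0 * ROUNDS                                  # = 0

# Flip coin b once. If heads, b is always-heads, so keep flipping b all 100 rounds.
# If tails, b is always-tails, so eat the $1 loss and flip a for the other 99.
ev_b_then_adapt = 0.5 * (1 * ROUNDS) + 0.5 * (-1 + 0.0 * (ROUNDS - 1))

print(ev_always_a, ev_b_then_adapt)                         # 0.0 49.5
```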

Robin:

Makes sense.

Ian:

And I can ask

Ian:

you a different question. Yep. If I said, hey, what are the possible distributions over two flips, or a hundred flips? But let's say two flips.

Ian:

For the one case, for case a, you say, hey, it's equally likely, you know, to be heads-heads, heads-tails. You know, if the first one's heads, the next one could be 50/50. If I do the other coin, you say, hey, well, the first one could be anything, but conditioned on the first flip of b being heads, I know the second one is heads.

Ian:

And so then you can tell them apart. Right? That's another way of saying the information you gain, you can compute this. I want to learn about which arm is the best arm, a or b. If I pick b, I'm gonna gain a lot of information about arm b.

Ian:

And that's why information-directed sampling would sample that. So that's what joint prediction means. Joint prediction is asking, hey, what would happen if I have two flips rather than just one?

Robin:

So, yeah, I haven't heard joint predictions from other researchers very often. Can you help us understand how these ideas relate to other uncertainty approaches that we've seen?

Ian:

Yeah. So it's a bit weird, as I said, it's not something we came into saying, hey, I need to care about joint prediction. But hopefully, that coin example with a or b is about as simple as it gets. People will talk about it more as epistemic versus aleatoric uncertainty. So that coin, if we remember, a was the one where it's 50/50 and you know it's 50/50, and b is the one that could be heads or it could be tails always, but you don't know which one.

Ian:

So people would classically refer to that as epistemic versus aleatoric uncertainty. Coin a is aleatoric and coin b is epistemic. Right? And so you want to seek out the things with epistemic uncertainty, and you maybe don't want the things with aleatoric uncertainty. The problem is, I think that's a great way to think about it if you can agree on a model.

Ian:

So if we can agree that this coin is a Bernoulli random variable with probability p, then condition on that model, I can say, hey. A has no epistemic uncertainty because p is known and it's 0.5. And b has a lot of epistemic uncertainty because I don't know p. But different people can have different models of the world. For example, you could say, hey, there's no such thing as aleatoric uncertainty in the world.

Ian:

There's only epistemic uncertainty. And if I could just measure every single atom and all the forces when you flip it, then I'd know exactly what would happen when you flip coin a. So coin a is not aleatoric at all. It's just epistemic, and I don't know the answer. Right?

Ian:

And certainly, as we start thinking about building more and more complex AIs, I don't think we want to think about building it as, this much is epistemic, this much is aleatoric. It doesn't really make sense. There's not really the clear line that you think there is, because it depends on your choice of model. Even for a linear regression, if you and I are using different features, we can disagree on what's epistemic versus what's aleatoric. And so I think that because of that, a lot of, you know, the Bayesian deep learning workshop in Europe, they say, hey.

Ian:

We've got a model and we've trained a ResNet, a Bayesian ResNet. We used loads and loads of TPUs. We used a certain Monte Carlo scheme, and this is the true posterior. And that's the epistemic uncertainty. And so I'm gonna rank you on how well you approximate this method's epistemic uncertainty.

Ian:

So conditioned on a certain model class, how well do you do? But the only thing that's real is the data. Right? And it's kind of perverse to say, oh, it's this specific convnet and these specific parameters. Why this way versus that?

Ian:

The only thing that's real is the coin. Did you flip it? How many heads came up? And joint prediction is something you can measure on actual data. And so it's not an obvious thing to come to.

Ian:

Oh, I should measure this joint prediction. But one of the good things is that if you want to minimize this joint prediction, this joint prediction loss, so you can measure that just with the usual log loss, except now you're doing it on the joint predictions. The optimal solution is the Bayesian solution. Right? So that is the gold standard.

Ian:

So they agree on what's the best possible thing. But one of the nice things about measuring this is you don't need to, in inverted commas, "be Bayesian." You just need to make good predictions. Although, fair question, what does it even mean to be Bayesian? It might mean making good predictions.

Ian:

So the hairs on my neck go up a little bit when people ask, oh, is this Bayesian? Is that Bayesian? I don't really know. But it's clear, I think, that this offers a coherent way to think about these things.

Ian:

Like, what's epistemic and what's aleatoric. The more we pushed on it, I think Rich Sutton and Dave Silver were really good about pushing back on this. They were like, no, that's not a real thing. It's not a real thing. You know, the only thing that's real is the data.

Ian:

And so I think they're mostly right, and they were much more right than I gave them credit for at the beginning. But this notion of joint prediction, I think, is novel and important.

Robin:

There's some kind of beta distribution there. And when you have a joint prediction, then there could be multiple beta distributions. Is that right or no?

Ian:

So I guess if you're gonna model this in the Bayesian sense, the conjugate prior is you say, hey, I don't know the probability p, and that's following some beta. And so if I wanna make a joint prediction, well, I know that it's gonna be the same p for all of the realizations. So first, I sample a p, and then I sample a hundred flips. And I can do this multiple times. And that could be my empirical distribution.

Ian:

And that gives me my joint predictive distribution. Right? But I guess what we're saying is that's one way to make a joint prediction that isn't the product of marginals. Right? So for example, okay, if I'm a Bayesian agent, a perfectly Bayesian agent, I say, I wanna get a few samples from my posterior.

Ian:

I say, well, p is either 1 or 0 for b. I'm gonna sample it as 1. If p is 1, then they're all heads. Okay. Do it again.

Ian:

P is 1. They're all heads. P is 0. They're all tails. Cool.

Ian:

For coin a, I know p is a half. So I sample p is a half, and I say, oh, 50/50, 50/50. Alright? And that shows a way that you can make joint predictions. But in terms of, like, the interface, we're not assuming that you have a beta distribution, right, or something like that.
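
A minimal sketch of that two-step "sample a p, then sample the flips" procedure (editorial code, using the two coins from the running example rather than a full beta posterior):

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_samples_coin_a(n_flips, n_samples):
    """Coin a: p is known to be 0.5, so the flips really are independent."""
    return rng.binomial(1, 0.5, size=(n_samples, n_flips))

def joint_samples_coin_b(n_flips, n_samples):
    """Coin b: first sample p from the posterior (0 or 1, equally likely),
    then sample every flip with that same p. Flips are independent given p,
    but strongly correlated in the joint prediction."""
    p = rng.choice([0.0, 1.0], size=(n_samples, 1))   # one posterior sample per row
    return rng.binomial(1, p, size=(n_samples, n_flips))

print(joint_samples_coin_a(5, 3))   # rows mix heads and tails
print(joint_samples_coin_b(5, 3))   # each row is all heads or all tails
```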

Ian:

That that seems kind of backwards. What if you have you know, people only picked beta because the exponentials worked out nicely in the update. Right? What if you have this other weird intractable heuristic thing or whatever, and the world's way more complicated and happens to be windy on this day, and there's a magnet. You know?

Ian:

Who knows? Right? And the neural network might learn all these things. So rather, I'm just gonna look at your predictions, and I'm not gonna care how you made them. Right?

Ian:

Obviously, later on, and we're gonna talk about that, that's our work on epistemic neural networks. It's about saying, hey, can I make neural networks that make really good joint predictions, in this sense of, like, getting the things that I want out of a Bayesian neural network? But I'm not gonna assume, I'm not gonna force them to do it the Bayesian way.

Ian:

And I think that's a more productive way to do it.

Robin:

Your epistemic neural networks paper from 2021, can you tell us about that paper, and what is an epistemic neural network?

Ian:

Yeah. Got it. So, yeah, this whole area is confusing. And I think the main thing that's confusing is this idea of joint prediction. But hopefully, if we go through those examples, then that's kind of making sense.

Ian:

Like, you wanna be able to tell the difference between something that you know is 50/50 versus something that you think has a 50% chance of being way better or a 50% chance of being way worse. And that problem is very relevant for large language models, right? Because a classic thing is these reward models that wanna tell, hey, is this response better or is that response better? We wanna be able to tell the difference between these two cases. Those are, like, mapping to these two coins, a and b.

Ian:

Something where you know they're basically 50/50, and something where you think there's a 50% chance it's way better or a 50% chance it's way worse. And the classic way, as we said, that people have been taking this on is they said, hey, well, you've got a neural network that makes predictions, but actually it doesn't tell you your epistemic uncertainty. So instead of having a normal neural network where you have weights and make predictions, you've gotta have a Bayesian neural network. And so instead of one set of weights, you're gonna have a distribution over plausible weights, and that's like your posterior. And for any one of these possible sets of weights, you make different predictions.

Ian:

And for the different plausible weights, I'll get these different outputs, and that's how I'll tell how uncertain I am. But that assumes a particular form where basically I'm gonna have epistemic uncertainty over these weights and then aleatoric uncertainty in the outputs. But what we just said before was the only thing that you really need to do to make good decisions is make good joint predictions. So I need to make good joint predictions over the coin, if it's gonna be all heads or all tails. But it's not super important that I have a good distribution over a parameter, certainly not for big neural networks like a ResNet where you don't really know what's going on.

Ian:

And maybe you don't care that much as long as you make good predictions. So an epistemic neural network is a very broad thing, it's like an interface for allowing neural networks to make these kinds of predictions. And just to stress again, you say, oh, well, don't neural networks already make these predictions? Well, think like ImageNet, ResNet. You give it an image.

Ian:

Is this a rabbit or is it a duck? You just get a probability, a marginal prediction, and you say 50/50. It doesn't tell you which type of 50/50 that is. If you wanna know what type of 50/50 that is, you need to make joint predictions. And that's what an epistemic neural network allows you to do.

Ian:

Now any neural network, sort of by definition, you can make it an epistemic neural network by making independent predictions. But clearly, those are not necessarily good. Right? And the Bayesian neural network is an epistemic neural network because you can do this thing that we said before. You sample the weights, then you make a prediction.

Ian:

You sample the weights, you make a prediction. But it's not essential that you follow that two-step process for an epistemic neural network. We're kind of treating it, you know, at a high level. It's more of a black box, and we say, hey, it's a neural network and I need to be able to make joint predictions.
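
As a rough sketch of what that interface amounts to (hypothetical class and method names, not the API from the paper's open-source code): the forward pass takes the input together with an "epistemic index" z; marginal predictions average over z, and joint predictions come from holding the same z fixed across many inputs.

```python
import numpy as np

class EpistemicNetwork:
    """Sketch of the ENN interface described above (illustrative only)."""

    def __init__(self, predict_fn, sample_index):
        self.predict_fn = predict_fn      # f(x, z) -> probability of class 1
        self.sample_index = sample_index  # draws z from a reference distribution

    def marginal(self, x, n_index=100):
        """Single-input prediction: average the z-conditioned predictions."""
        zs = [self.sample_index() for _ in range(n_index)]
        return float(np.mean([self.predict_fn(x, z) for z in zs]))

    def joint(self, xs, n_index=100):
        """Joint prediction over many inputs: one row per sampled z,
        reusing the SAME z for every input in that row."""
        zs = [self.sample_index() for _ in range(n_index)]
        return np.array([[self.predict_fn(x, z) for x in xs] for z in zs])

# A conventional network corresponds to a predict_fn that ignores z
# (independent predictions); a Bayesian neural network corresponds to z
# indexing a sampled set of weights.
```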

Robin:

I remember from a few years ago at NeurIPS, Dustin Tran at Google trained this giant transformer, a Bayesian giant transformer, and it took a huge amount of TPU. And I was both, like, very impressed and actually kind of horrified by the amount of, like, resources. Yeah. And I'm not really sure what came out of it, like, how much better things were with the uncertainty. But I was just thinking, man, if this is how things are going, where are we gonna get all this compute?

Robin:

And so, but I understand your approach here is a lot more compute efficient. And you have this EpiNet component, which sounds very efficient. So is it true that you can kind of augment an existing large neural network with your EpiNet, and it doesn't really cost that much to get this type of uncertainty?

Ian:

Okay. So great. So, yeah, as you say, some of these Bayesian approaches, they're just burning the whole rainforest, and they don't really work that well. I think you kind of have to look around. Right?

Ian:

And you have to look at how much progress has been happening in deep learning. It's amazing. There's so much, ChatGPT, all the video models, Gemini. It's incredible. You wouldn't have thought this was possible.

Ian:

And pretty much none of it is happening with the so-called Bayesian neural networks. Right? Oh, you know, fair play, so far not much of it's happening with epistemic neural networks either, but, you know, let's look around, and that's not happening. And I think one of the reasons is because one of the key things we're seeing are these scaling laws.

Ian:

Basically bigger networks, more compute, more data are better. Okay? And maybe some of these Bayesian approaches are a little bit better on certain metrics, but they don't match the scaling laws, right? So they're way too expensive. And one of the approaches people do instead is they say, oh, training an ensemble is a bit like being Bayesian or a bootstrapped ensemble.

Ian:

And actually, I think my PhD was probably one of the first of that in the modern deep RL setting. So I'm not against it. It's kind of a nice idea. It's like, hey, I don't know how to make an uncertainty estimate, so I'll just train a bunch of different neural networks.

Ian:

And if they disagree a lot, then I'm uncertain. It's a cool idea. Sort of elegant.

Robin:

I've done it. It's expensive. My god.

Ian:

It's expensive. It works.

Robin:

But yeah.

Ian:

It works. It kind of works. It kind of works, but you gotta train a hundred networks. Who's gonna train a hundred networks? You might train ten.

Ian:

And even then, well, why don't you just make the network ten times bigger? Right? GPT, I can reveal, you know, GPT-4 is not made by making a thousand GPT-3s. Right?

Ian:

You have a bigger network. Bigger is better. And so I think that that's one of the problems with these approaches. So when we set out, we said, okay, look, those other approaches kind of set the baseline. It's like, first I'm gonna have uncertainty over my weights, and then I'm going to get uncertainty over my outputs.

Ian:

So if you think like that, then you basically need, at minimum, twice the set of weights, a mean and a variance for each one. Right? But that's even if it's independent. And, like, these things are obviously, you know, the neurons are incredibly interrelated. That's kind of the whole point.

Ian:

So these ensembles, you know, if you have a hundred, it takes a hundred times as long. So we said, hey, can we get an approach that costs only slightly more than the single network, but gets performance better than an ensemble of size 100? And that was the idea, the kind of moonshot.

Ian:

And I'm happy to say that, you know, according to this metric of joint prediction, we smashed that with this approach called an EpiNet. And the EpiNet, the way it works is you can take a pretrained large model and you can add on a few extra layers on the side. It's really small compared to the original network. And because of some, like, kind of clever approach that we've got, some of it's a bit hacky, some of it we prove that, oh, if you do this in the linear case, it recovers the Bayes optimal solution.

Ian:

But if you do this, we get way better performance than even a full ensemble of size 100, but at a cost only slightly more than a single network. So we're really excited about that.

Robin:

That is an incredible result.

Ian:

Yeah.

Robin:

How does it do that?

Ian:

Okay. So I think to get into the feel of it, you gotta think, okay, well, what's weird about the ensemble approach? And, you know, I'm not picking on that, because I view myself as, like, a proponent of it. But okay.

Ian:

Well, one thing that's weird is if I'm training a hundred particles, and, like, you should think about each particle, maybe think about it as GPT-4. Costs a lot of money. It costs a lot of TPUs. And I'm coming in saying, hey, guys, it's fine. We're just gonna train a hundred GPT-4s.

Ian:

Okay. That doesn't sound right. But imagine I've trained 99. I've spent a lot of money training those 99. Now when I come to particle 100, I don't reuse anything that I learned.

Ian:

Right? It seems weird. Like, each one of these models, you know, learned all these important features and sentiment extraction and all these things. Now you say, okay, just do it again. It seems wasteful.

Ian:

So instead of doing that, and maybe one of the challenges of Bayesian deep learning is how complicated neural networks are. We don't fully understand what's going on. Obviously, it's great, people try to understand. But then you say, okay.

Ian:

Now this thing I don't really understand, now I'm gonna try to do Bayesian inference over this magical black box. Bayesian inference is hard to begin with, even when it was a single variable. Right? So that doesn't seem like a great strategy when you've got, you know, billions and billions of weights.

Ian:

So instead of that, rather than thinking, like some of these Bayesian deep learning approaches, hey, the model is a neural network and we're gonna be Bayesian about the model, maybe another way to think about it is saying, hey, what you're really trying to do is, I wanna be Bayesian about this image classification, or I wanna be Bayesian about, like, yeah, what words to say. Right?

Ian:

And if I was gonna set up some really complicated equations and prior and Monte Carlo, I could work out the Bayes optimal solution, and that would be great. And then I'd have a really good answer. But I don't do that because that's not tractable. So instead, what I do is I train this really, really big neural network. I train it to predict the next token.

Ian:

I train it to, whatever, classify images. And so one way you could look at that is you could say, hey, well, I really wanna be Bayesian, but someone handed me this magical black box that does approximate Bayesian computation. And the reason I say it's approximately Bayesian is because it's getting the lowest log loss. It's getting the lowest log loss on the next token, right?

Ian:

And obviously the Bayes optimal is the optimal, but this is as close as we can get to it with, whatever, GPT-4. So rather than saying, okay, I wanna be Bayesian about this magical machine, say, hey, wait, this machine is really good at that.

Ian:

Can I alter it in a slightly different way so that I can get something like joint predictions instead of just a marginal prediction? Right? And that's what the EpiNet idea is. And the idea is that we're gonna perform some kind of randomization. You mentioned this random network distillation.

Ian:

You know, it's a great paper from OpenAI. We had some similar ideas. I think that was motivated by this idea of randomized prior functions, but the exact mechanisms are less important than the idea, which is to say, like, okay, well, I've got this function that's in this certain high-dimensional space, but I'm not exactly sure about what the function is. So maybe there are, like, slightly different versions of the function that would also be plausible.

Ian:

And this EpiNet, by adding this on, you're kind of altering your predictions, but you alter it in this consistent way that leads to good joint predictions. And that's the idea. And the idea is that, okay, well, you know, imagine if the last layer of GPT-4 extracts some feature that says, oh, the sentiment. Right?

Ian:

It would be very hard to, you know, you train an ensemble and maybe the different ensemble particles would judge the sentiment slightly differently. But instead of doing that, we could just add on these small, you know, MLPs to some hidden activations. And if you do it carefully, you might say, hey, I'm not exactly sure about how the sentiment is being calculated. And that's the kind of thing we hope an EpiNet is able to do.

Ian:

This is hand waving, though, I don't put that in the paper because it's not proven or whatever, but that's the idea. What is in the paper: we've got open source code, it's an easy thing to do, and we get way better joint predictions than even a size-100 ensemble.
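
A schematic of the architecture being described, in the spirit of the paper but with hypothetical names and shapes (the real open-source implementation differs): a frozen base network supplies features and logits, and a small learnable MLP plus a small fixed "prior" MLP each take the features together with the epistemic index z; their sum is added to the base logits.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    """Tiny randomly initialized MLP: returns (params, apply_fn). Illustration only."""
    params = [(rng.normal(0, 0.1, (m, n)), np.zeros(n)) for m, n in zip(sizes[:-1], sizes[1:])]
    def apply(params, x):
        for w, b in params[:-1]:
            x = np.tanh(x @ w + b)
        w, b = params[-1]
        return x @ w + b
    return params, apply

feature_dim, index_dim, num_classes = 64, 8, 2
learn_params, learn_apply = mlp([feature_dim + index_dim, 32, num_classes])  # trained
prior_params, prior_apply = mlp([feature_dim + index_dim, 32, num_classes])  # kept frozen

def epinet_logits(base_logits, features, z, prior_scale=1.0):
    """EpiNet-style prediction: base logits plus a small z-dependent correction."""
    inp = np.concatenate([features, z])
    correction = learn_apply(learn_params, inp) + prior_scale * prior_apply(prior_params, inp)
    return base_logits + correction

# One epistemic index per "hypothetical world"; reuse z across inputs for joint predictions.
z = rng.normal(size=index_dim)
logits = epinet_logits(base_logits=np.zeros(num_classes),
                       features=rng.normal(size=feature_dim), z=z)
print(logits)
```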

Robin:

In your Stanford RL forum talk, which we'll link to, at one point you said you're not sure why ENNs work. I think that's roughly paraphrasing your quote. Can you say more about that? Like, what is it, is it just the fact that, I mean, this is an empirical science, I guess. Right?

Robin:

AI and machine learning. So is it is it more about running more experiments?

Ian:

So this is, I think, a confusing thing about the paper: the paper is called Epistemic Neural Networks. And so basically anything can be an ENN. It doesn't mean it's good. It doesn't mean it works. It doesn't mean it doesn't work.

Ian:

And then we've got this specific ENN architecture, which is called an EpiNet. And we say, oh, that works well. And I guess I was at a Stanford, you know, talk, and I say, oh, we don't really know why it works. There's different levels of knowing. But, yeah, I don't know.

Ian:

I kind of just gave my best intuitions for why I searched in that area. But, no, I definitely don't know.

Robin:

I mean, people say they don't know why neural networks work at all. So that's...

Ian:

Exactly. I think we're in great company. But I think maybe when you zoom out, okay, put it this way. A lesson from deep learning and all these things has been, like, make things more end to end. Right?

Ian:

At first people were doing, oh, you first gotta get good features, and then you gotta do this, and then you're gonna predict the image, and then you're gonna autoencode. But so much of the deep learning success has been like, no, get more compute, get more data, train end to end. You know, I think GPT is like the culmination of this next-token prediction and just how much has come out of that. They don't have, oh, an auxiliary, you know, sentiment loss or whatever.

Ian:

It's like, end to end. Go for the thing you want. Now, obviously, next-token prediction isn't exactly what you want, but it's pretty cool. So I think that with us kind of realizing the importance of this joint prediction in decision making, we said, hey, let's make something that's just really good at doing joint prediction.

Ian:

And so we really targeted that, and that's what this ENN and EPI network does. Whereas other approaches like Bayesian deep learning, maybe they've got these other things that they're aiming for. I'm gonna do this. I'm gonna do that. I'm gonna have uncertainty over the weights.

Ian:

An EpiNet does not give you good uncertainty over the weights. There's no such thing as that. It's terrible. Right? That's not even a thing that it gives.

Ian:

But by focusing on this joint prediction, we're much, much better at that. And so maybe that's where I think the key observation is. And I guess, okay, I also view this as very preliminary work, because no one actually really cares about joint prediction, just zooming out again. The thing I want to do is to be able to make good decisions in the real world.

Ian:

Right? I wanna be able to come to that question I had, hey, coin a or coin b, and pick the right one and get really good performance. I wanna make a, you know, ChatGPT that gets labels from humans on the right things, that tries out the right things with users and learns to be way better, way quicker. And so, ultimately, that's gonna be the proof of the pudding, and I think some of that is coming out. I've been involved in some papers.

Ian:

Some of my other coauthors have done more, hard at work, you know, trying to get it in for the next iterations here. But I just do wanna clarify this. I'd view this joint prediction as, like, a kind of intermediate goal. It's a good sanity check.

Ian:

But the proof of the pudding is gonna be, you know, when you release something like Sora or ChatGPT and it's like, wow, this is so much better than what people had before. And I don't think that we've done that yet.

Robin:

So I see you use EpiNets, I guess, to do something like Thompson sampling in RL?

Ian:

Exactly.

Robin:

Yeah. Okay. So I gather that, like, doing fully Bayesian Thompson sampling in RL is unobtainable. It's, like, the best thing to do with bandits, but we just can't do it with RL. We wish we could.

Robin:

But here, you're able to kind of approximate Thompson sampling in RL. Is that right?

Ian:

Yeah. Exactly. So...

Robin:

What does that mean, by the way, to do Thompson sampling in RL?

Ian:

Exactly. So maybe this is linking back to a piece of work that you mentioned before, which is bsuite, this idea of a behavior suite for reinforcement learning. What does it mean to do Thompson sampling? What does it mean to do good exploration? Well, one of the ideas behind that work was that we would curate little toy experiments, admittedly toy, but that are somehow evocative of, like, the key issues.

Ian:

Right? So, for example, if you wanted to test, can the agent remember things over multiple time steps, then you would set up this very synthetic task of, like, observe a digit, wait one step. Can you remember it? Observe a digit, wait 10 steps. Can you remember it?

Ian:

And you can just empirically look at the scaling of memory or something like that by this definition, and then you can compare agents like for like. There are a bunch of different attributes, and the idea was that it's an open source thing that different people can contribute to. But one of the tasks, which we were using to test if you could do deep exploration, this idea of temporally extended exploration, we called it deep sea. You kind of imagine that you're a diver and you're diving down, kind of like this Tetris environment. Basically, it's a needle in a haystack problem.

Ian:

It's set up so that there's a grid of size n by n, and there's basically only one policy, one policy that will get you the optimal reward of 1. So it's just set up to be, like, maximally perverse and hard. But it's not meant to be a real problem. It's meant to be symbolic of something else.

Ian:

And for size n, it has a natural scaling factor. For n equals 1, it's easy. Right? You just pick either right or left.

Robin:

It's a bandit.

Ian:

It's a bandit. Exactly. But then for 2 and whatever, 2 is pretty good. But the point is that if you just select policies at random, or any of these dithering approaches, it will take you 2 to the n episodes to learn. So even for n equals 20, that's gonna be about a million episodes of picking randomly.

Ian:

But because the actual world is like an n by n grid, you should be able to learn it in something that's polynomial in n. Right? There are algorithms that can learn it in more like n squared or whatever. And it turns out that if you correctly do Thompson sampling with a reasonable prior, then, yeah, you'll learn it in something that scales like that. Now, this problem is not super interesting if you just encode it in a tabular setting.
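
[Editor's note: to make the deep sea setup concrete, here is a minimal Python sketch written from Ian's description above, not from the actual bsuite source; the class name, the small move cost, and other details are illustrative only.]

```python
class DeepSea:
    """Minimal sketch of a deep-sea-style needle-in-a-haystack environment.

    Written from the description in the conversation, not from the bsuite
    source: an n x n grid where a diver starts at the top-left, descends one
    row per step, and only the policy that moves right on every step reaches
    the rewarding cell in the bottom-right corner. Moving right also carries
    a tiny cost, so dithering exploration is actively discouraged.
    """

    def __init__(self, size: int, move_cost: float = 0.01):
        self.size = size
        self.move_cost = move_cost

    def reset(self):
        self.row, self.col = 0, 0
        return (self.row, self.col)

    def step(self, action: int):
        # action: 0 = left, 1 = right; the diver always descends one row.
        reward = 0.0
        if action == 1:
            self.col = min(self.col + 1, self.size - 1)
            reward -= self.move_cost / self.size  # small penalty for moving right
        else:
            self.col = max(self.col - 1, 0)
        self.row += 1
        done = self.row == self.size
        if done and self.col == self.size - 1:
            reward += 1.0  # only the always-right policy ever reaches this cell
        return (self.row, self.col), reward, done


# A uniformly random policy hits the reward with probability 2**-n per episode,
# so for n = 20 you expect on the order of a million episodes of dithering.
```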

Ian:

It's interesting as a thought experiment. But bsuite kind of tries to bridge this gap between the deep learning world and the theory world by turning this into we make a version of this game with, like, a Tetris-like interface. And you try to solve it with deep neural networks. And now we say, okay, can I make a neural network approach that can solve this?

Ian:

And I'll tell you for free that even MuZero, even, you know, MuZero that solved everything, even if you throw all the TPUs in the world at it, it's gonna take thousands and thousands of years to solve it. Right? And it's just because of how it's set up. And even if it did solve it, I could make n one bigger, and it would take twice as long again. Right?

Ian:

It just doesn't work, even more than twice as long in practice. But if you do some of these, like, Thompson sampling approaches, like I studied in my PhD, then you can do it. And so, basically, in my PhD, we made approaches like bootstrapped DQN that solved it. But instead of having a neural network Q value, you've got an ensemble of Q values, all neural networks. And what we showed in some of this paper is like, hey.

Ian:

You can replace it. You know, we said before that the epinet could get as good joint predictions as an ensemble of size 100. And you're like, oh, that's kinda cool, but I don't really care about joint prediction. But, yeah, I guess, if you like that. And I say, okay.

Ian:

Now you can drop in the epinet for the ensemble, and it gets just as good performance on this exploration task. And now maybe you're more interested. Just say, hey, wait, I can do as well as I did before, but it costs me a hundred times less.

Ian:

And in that paper, we also compare with some other methods, like dropout or this or that. And we also show there's kind of an empirical relationship between how well you're doing on the joint prediction and how well you do on these tasks. So that's kind of cool.
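
[Editor's note: as a rough illustration of swapping the ensemble for an epinet, here is a hedged sketch in the spirit of an epistemic Q-network, not the published architecture; the layer sizes, `index_dim`, `prior_scale`, and the `act` helper are all made up for illustration. The agent samples one epistemic index per episode and acts greedily under it, which gives the Thompson-sampling-like behavior described above.]

```python
import torch
import torch.nn as nn


class EpinetQ(nn.Module):
    """Illustrative epistemic Q-network: base network plus index-conditioned head.

    A sketch in the spirit of an epinet, not the published architecture:
    q(s, z) = base_head(features) + learnable_head([features, z])
              + prior_scale * frozen_head([features, z]).
    Sampling z gives one random but self-consistent Q-function, playing the
    role of a single ensemble member in bootstrapped DQN.
    """

    def __init__(self, obs_dim: int, n_actions: int, index_dim: int = 8,
                 prior_scale: float = 1.0):
        super().__init__()
        self.index_dim = index_dim
        self.prior_scale = prior_scale
        self.base = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.base_head = nn.Linear(64, n_actions)
        self.epi_train = nn.Linear(64 + index_dim, n_actions)
        self.epi_prior = nn.Linear(64 + index_dim, n_actions)
        for p in self.epi_prior.parameters():  # fixed randomized prior function
            p.requires_grad_(False)

    def forward(self, obs: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        feats = self.base(obs)                                   # (batch, 64)
        joint = torch.cat([feats, z.expand(feats.shape[0], -1)], dim=-1)
        return (self.base_head(feats)
                + self.epi_train(joint)
                + self.prior_scale * self.epi_prior(joint))


def act(q_net: EpinetQ, obs: torch.Tensor, z: torch.Tensor) -> int:
    """Greedy action under the Q-function indexed by z (sampled once per episode)."""
    with torch.no_grad():
        return q_net(obs.unsqueeze(0), z).argmax(dim=-1).item()


# Thompson-style usage: draw z = torch.randn(q_net.index_dim) at the start of
# each episode, then act greedily with that same z for the whole episode.
```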

Robin:

So you've developed this whole framework for uncertainty, and I hear you've brought it to the world of LLMs now as well. So can you tell us more about that? How can LLMs benefit from epinets or your approach to uncertainty?

Ian:

Yeah. So on this, I can definitely point you to some, you know, published work, and I've been involved in some. But after I left DeepMind, a really great paper came out from some of my coauthors. I think Vikram is the first author, with Ben and Mohammed. Really good work on efficient exploration in LLMs.

Ian:

And I think that this modern regime of, like, chatbots and LLMs and RLHF, reinforcement learning from human feedback, I think that's a scenario where this idea of efficient exploration really matters. And in this case, the exploration is, well, what do you choose to show to either the human users or the human raters who are gonna give you feedback? And the idea being that if you prioritize what you show to the rater, hey, I want to maybe show completions where I'm more uncertain. Right?

Ian:

Or where I can learn more about what the best solution is. And I guess they show very clearly and very nicely that using these styles of approaches, using these sorts of ENNs for uncertainty, you can do much better. You can get much better performance using far fewer labels. And the emphasis isn't only on using fewer labels. The emphasis is on scaling better to, like, bigger systems and getting better performance, and this allows you to do that.
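
[Editor's note: a rough sketch of the uncertainty-prioritized labeling idea described here, not the method from that paper; the `reward_enn(prompt, completion, z)` interface, `sample_index()`, and the disagreement score are illustrative assumptions.]

```python
def pick_pairs_for_raters(reward_enn, prompts, completions_a, completions_b,
                          n_index_samples=30, budget=16):
    """Rough sketch of uncertainty-prioritized labeling for RLHF.

    Assumes a hypothetical reward_enn(prompt, completion, z) -> float and a
    reward_enn.sample_index() that draws an epistemic index z. For each pair
    of candidate completions, we check how often the preferred one flips as z
    is resampled; the pairs where the model is most unsure of the preference
    are the ones sent to human raters first.
    """
    disagreement = []
    for prompt, a, b in zip(prompts, completions_a, completions_b):
        prefer_a = 0
        for _ in range(n_index_samples):
            z = reward_enn.sample_index()
            prefer_a += reward_enn(prompt, a, z) > reward_enn(prompt, b, z)
        p = prefer_a / n_index_samples
        disagreement.append(p * (1.0 - p))   # largest when the preference is a coin flip
    ranked = sorted(range(len(prompts)), key=lambda i: -disagreement[i])
    return ranked[:budget]                    # indices of pairs to show to raters
```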

Robin:

So, hopefully, the next ChatGPT might have your epinets inside?

Ian:

Top secret. But, yeah, of course, we're all hoping for that.

Robin:

You heard it here first, folks. Do you feel like the framework you've developed here is really the solution that you need for epistemic uncertainty, or do you feel like there's a lot of loose ends? Or do you feel like it's more just building this out and proving that this is the way?

Ian:

Are there a lot of loose ends? Yes. There are a lot of loose ends. What do I think? I think that some of these ideas of thinking about joint prediction, thinking about information gain, I think those are really good ideas. Because we started by saying, oh, you're interested in epistemic uncertainty.

Ian:

I think that that's actually a bit of a quagmire, because you get bogged down in, well, what's your model? Who's to say what's epistemic? Oh, I've got this model. You've got that model. What's your prior?

Ian:

And I think that actually looking at joint prediction is a more healthy way to do it, because you're kinda focused on solutions. I definitely am bullish on some of these ideas like the epinet and, you know, the potential for different ENN architectures, for sure. I think that's a really cool thing because it opens up, you know, the ingenuity in network design, the same way that there were convnets and whatever. It's like, oh, I'm gonna design a different kind of network architecture for uncertainty modeling, you know, the transformer or whatever, stuff like that. I think there's a lot of potential work that could be done there, and I'd be super excited for people to blast the epinet approach out of the water, and then I'll switch over.

Ian:

So, I think it's pretty cool. It's a cool space to be in, to kinda focus on the solution rather than on the problem.

Robin:

So you talked about some things you're looking forward to. Do you wanna say more about your plans for your own future work?

Ian:

Broadly speaking, you know, I think my old adviser, Ben, has been really great in, you know, setting an example, showing us that if you wanna work on a problem, it's good to work on problems that you'd be happy to work on for a long time. And so for me, this problem of, like, building artificial intelligence, or dealing with uncertainty and how to think about what to learn, I think it's just, like, the biggest and most exciting problem around. Because, you know, whatever, I'm not that smart, and, ideally, I can make something or help make something that will help us all learn better and will solve those problems. So in a way, it's like a meta meta solution to every other problem. That's why I think AI and intelligence is so exciting.

Ian:

But maybe what I'm excited about is making AI systems that prioritize their own learning. At the moment, it's all the kind of experts and geniuses at OpenAI, at DeepMind coming up with different approaches. And I think if you could really have these kind of superhuman AIs, you know, optimizing their own learning, maybe that opens up a whole new level of progress. It also helps with safety. So I have not talked about safety or all these things.

Ian:

But, you know, a lot of people talk about the worries about AI safety. I don't view them as at odds with each other. One way to talk about this uncertainty and this RLHF is, you could call it aligning the AI, aligning it to do what you want it to do. And I think that dealing with uncertainty helps with a lot of these weird thought experiments, like, oh, it's gonna turn the whole universe to paper clips. I think that if you add in a bit of uncertainty, like the model going, hey, they said to make as many paper clips as possible, but did they really mean as many as possible?

Ian:

Like, maybe that was a linearization in this regime. Right? I think that if you get into stuff like that, then for a lot of those weird thought experiments, I think this is the way to solve them. I don't think they go away, but I think that working on these types of capabilities will actually help us with safety rather than not.

Robin:

So if epinets can keep GPT-7 from turning us all into paper clips, I'm all for it. Exactly. Okay. So, besides your own work, Ian, can you talk about any other things that are happening in RL these days that you find pretty interesting?

Ian:

I am so enthusiastic about the work in, yeah, RL in general. I mean, by far, the thing that got me most excited probably was ChatGPT. You know, that's really kind of why it was a difficult decision, you know, to leave Google, to leave DeepMind. But I think there's a feeling for some people in the field like, oh, RL had its heyday, and the Deep RL workshop was so exciting.

Ian:

Everyone did Atari, and you had this PPO variant, and that was exciting and all that. But, for me, those were kind of approximate policy solutions, or it's just, like, optimizing to get these policies on these games. And the real RL is these systems. You know, the real problem is interacting in the world and accomplishing goals. And so it's almost like the game is finally real, and RL, the problem, has never been more relevant.

Ian:

Right? So by that, I mean, hey, we need to be able to generalize from this kind of data. We need to be able to explore and take actions in a space that's really high dimensional. Hey.

Ian:

We need to worry about long term consequences. Because when I have a conversation, you know, all the models that people are doing for this RLHF, they're basically bandit models. Right? They try to optimize the one step return. They wanna get an answer after one step.

Ian:

You type in something, they give you the answer in one shot. But if I talk with you, you try to, like, set up a really good conversation. And if someone's trying to teach you something, they don't just jump to the answer in one go. Right? We're not doing that with our current chatbots.

Ian:

So for me, it's obvious that RL, the problem, has never been more relevant, has never been more exciting. I'm amazed that this is really hitting the big time. And also these RLHF solutions are doing better than, you know, just pure supervised learning. So that's amazing. And, you know, I'm not super worried.

Ian:

I think that worrying too much about, oh, am I doing A2C? Am I doing PPO or whatever? You might be missing the forest for the trees. So I'm super excited about, yeah, this RL, AI, and the next generation of AI systems.

Robin:

So, Ian, while you're here, is there anything else you wanna share with the audience today?

Ian:

Well, you know, thanks so much for having me on again. This has been really fun, and I'm definitely open for, you know, feedback or questions if this gets you interested or you think, hey, Ian, you're full of shit. You know? Maybe I shouldn't have sworn there [Editor: lol]. Anyway, but, yeah, reach out to me on Twitter, or send me an email, and I'll be happy to chat.

Robin:

Doctor Ian Osband, this has been an absolute pleasure for me. A long time coming. Thanks so much for taking the time today.

Ian:

Thanks a lot.
