Pierluca D'Oro and Martin Klissarov

Robin:

TalkRL Podcast is all reinforcement learning, all the time, featuring brilliant guests, both research and applied. Join the conversation on Twitter at talkRL podcast. I'm your host, Robin Chauhan. I'm very excited to welcome our guests today. We have Pierluca D'Oro, a PhD student at Mila and visiting researcher at Meta.

Robin:

And we have Martin Klissarov, a PhD student at Mila and McGill, and a research scientist intern at Meta. Welcome.

Martin:

Thank you.

Pierluca:

Thank you for having us.

Robin:

It's absolutely my pleasure. So we're here to talk about Motif, your recent work. The title of the paper is "Motif: Intrinsic Motivation from Artificial Intelligence Feedback". I was very excited to see this work on Twitter. I understand you're gonna be presenting it at NeurIPS.

Robin:

So let's talk about some of the details of this work. So Martin, can you give us a brief overview of this work? What is Motif?

Martin:

I guess the starting point of Motif is that we don't wanna start from scratch in reinforcement learning, because that's something that is very difficult to do. And we have these language models out there that know a lot of stuff about a lot of domains that we care about. So they have this kind of prior knowledge that we could leverage, but it's not trivial to use it directly in a reinforcement learning situation. So the idea with Motif is: how do we distill that knowledge for decision making without having the language model directly interact with the environment, and instead use reinforcement learning to discover a lot of stuff through that prior knowledge?

Robin:

So I understand you use an intrinsic reward here, and you had a separate reward model. Can you tell us how that reward worked? What was the intrinsic reward, and how was it trained?

Martin:

Right. The idea is to distill this knowledge for a reinforcement learning agent. And the big question I haven't addressed yet is: how do we give this knowledge to the agent? One of the most natural ways to do that is to put it in the reward, as a kind of intrinsic reward. What I mean by intrinsic reward is essentially a reward that is proper to the agent. It kind of drives its own exploration, and possibly credit assignment.

Martin:

So this is a reward that you add to another reward, which is the extrinsic reward, and the extrinsic reward is essentially the reward that you get from the environment. So if we take the example of an agent that, I don't know, has to reach a certain goal, that's the extrinsic reward: you have to get there. Let's say you only get plus 1 there and everywhere else it's 0. Well, the intrinsic reward is gonna help you do a lot of other stuff, because it's gonna let you understand how the environment works.

Martin:

It's gonna let you reach that goal, and it's gonna let you explore a bit better.
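For concreteness, the combination Martin describes can be written as the environment's extrinsic reward plus a weighted intrinsic term; the coefficient and any normalization used in the Motif paper may differ from this sketch.

```latex
% Combined reward: extrinsic (environment) reward plus a weighted intrinsic
% reward distilled from the language model. The coefficient \alpha is illustrative.
r(s, a) = r_{\text{ext}}(s, a) + \alpha \, r_{\text{int}}(s, a)
```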

Robin:

I am used to hearing about intrinsic reward often associated with exploration, curiosity, and things like that. But I understand that this intrinsic reward is a little bit different, and it relates closely to the specific environment that you were working in here. So can you talk a little bit more about that, about the environment and how the intrinsic reward relates to it?

Martin:

Usually when we think about intrinsic reward, we think about curiosity, like you mentioned, and this is based on, for example, prediction error or count-based methods, or these kinds of things: some kind of statistic about what the agent experiences in the environment. In this case, what's different is that this intrinsic reward comes from the prior knowledge that the language model has about a certain task. NetHack is the game where we instantiate this method, and there's a lot of information about NetHack on the Internet.

Martin:

And since language models are trained on the Internet and that information is available on the Internet, the language model essentially knows some things about the NetHack game. So now the question is, how can we use that as an intrinsic reward? How do we get from this abstract, high-level knowledge to a step-by-step reward? And essentially, the method is pretty straightforward. It bases itself on RLHF, but in this case, since there's no human feedback, it is AI feedback, so RLAIF.

Martin:

And essentially, it goes like this. You have this language model that has prior knowledge about what is good or what is bad in a game of NetHack, what are good situations and what are bad situations. You present the language model with two situations in the game, and these situations in this case are described through captions. When you play NetHack, you go through a lot of situations, and in some of them there are messages that appear at the top of the screen.

Martin:

Let's say you have killed a certain monster, or you have found a hidden door, or things like that. So through these captions, we essentially get information about what might be happening in the game. You take two of these captions and give them to the language model, and you use some kind of chain-of-thought prompting.

Martin:

You let it reason about what each of these messages or captions represents in terms of the underlying situation. And at the end of all that, you ask it for a preference. You ask it to say whether it prefers the first situation, the first caption that it sees, or the second situation.
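As a rough illustration of the annotation step Martin describes, the sketch below builds a chain-of-thought prompt over two captions and parses out a preference. The prompt wording, the `query_llm` helper, and the parsing are hypothetical placeholders, not the exact prompt or interface used in the paper.

```python
from typing import Callable, Optional

def build_preference_prompt(caption_1: str, caption_2: str) -> str:
    # Hypothetical chain-of-thought prompt; not the paper's exact wording.
    return (
        "You are playing NetHack. Two game messages are shown.\n"
        f"Message 1: {caption_1}\n"
        f"Message 2: {caption_2}\n"
        "Think step by step about what situation each message implies, "
        "then answer on the last line with 1, 2, or 'tie'."
    )

def annotate_pair(caption_1: str, caption_2: str,
                  query_llm: Callable[[str], str]) -> Optional[int]:
    """Return 0 if the model prefers caption_1, 1 for caption_2, None for a tie."""
    answer = query_llm(build_preference_prompt(caption_1, caption_2))
    verdict = answer.strip().splitlines()[-1].lower()
    if "tie" in verdict:
        return None
    if "1" in verdict:
        return 0
    if "2" in verdict:
        return 1
    return None  # unparseable answer; a real pipeline might re-query or skip the pair
```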

Robin:

Sounds very similar to RLHF, of course, with just the LLM in place of the human giving preferences?

Martin:

Yeah, it's very similar. So essentially you ask for preferences on pairs of observations. And there's a lot of work on that in the literature, learning from preferences.

Martin:

Most of the work is about RLHF. There's been some recent work on RLAIF, mostly in the space of language models. But I guess what is important here is that the same idea, the same kind of flavor of research, can be applied to a very different setting, which is reinforcement learning, where you cannot actually get a whole episode. Why I'm talking about the episode is because usually when you do RLAIF with language models, you ask for a preference over complete generations. In this case, you only have events.

Martin:

So it's a very very short description. There's a lot of missing information.
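For readers who want to see what learning from such preferences looks like concretely, here is a minimal sketch in the usual Bradley-Terry, cross-entropy style. It assumes captions are already embedded into fixed-size vectors; the actual Motif reward model, its text encoder, and its handling of ties will differ.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a caption embedding to a scalar reward."""
    def __init__(self, embed_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, caption_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(caption_embedding).squeeze(-1)

def preference_loss(model, emb_1, emb_2, prefer_second):
    """Cross-entropy on which caption was preferred (0 = first, 1 = second)."""
    logits = torch.stack([model(emb_1), model(emb_2)], dim=-1)  # shape (batch, 2)
    return nn.functional.cross_entropy(logits, prefer_second)

# Toy usage with random data, just to show the shapes involved.
model = RewardModel()
emb_1, emb_2 = torch.randn(32, 128), torch.randn(32, 128)
prefer_second = torch.randint(0, 2, (32,))
loss = preference_loss(model, emb_1, emb_2, prefer_second)
loss.backward()
```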

Robin:

Can you give an example of what type of captions we're dealing with here? What is the reward model seeing in terms of captions? Like, what are a couple of examples of captions?

Martin:

Maybe the first thing about captions in NetHack is that they don't appear very often. Most of the time the message is empty. We don't filter these out; we still give them.

Martin:

But in the other cases where there are messages, you have a wide variety of messages that appear. Sometimes it's about having killed a certain monster or having found a certain item. In other cases, it's messages that are completely useless, like "this is a wall," just because you're facing a wall. And in some other cases, it's messages that appear because you're interacting with characters in the game, for example shopkeepers, or things like that.

Martin:

So there's a wide variety of messages that you can get.

Robin:

So I have to admit, I have not played NetHack myself. You know, when I got to university and I saw NetHack, I said to myself, Robin, you'd better never play that game, because I'll get so addicted. I've been addicted to many adventure games, and I could see how deep this one was. So I don't know the details of NetHack, but I can see enough to see that it's quite deep. There's a lot going on.

Robin:

There's a lot of complexity. It's not an easy game. It's very different from Atari in that sense. But it's also text based. And there was a section in your paper where you mentioned the LLM knows a lot about NetHack. Like, you could ask the LLM about NetHack, and it knows what NetHack is and a lot of the details about it.

Robin:

Was that the secret sauce that allowed the model to produce a meaningful reward function? Because it really kinda already knew the game.

Martin:

Yeah. That's a great question. Yeah. So first, NetHack is definitely an extremely difficult game. It is an open ended game.

Martin:

It is procedurally generated. We tried to play the game ourselves, and, honestly, we're not much better than the Motif agent.

Pierluca:

Yeah. Even though I have to say that Martin is better than me at playing NetHack, to be honest.

Martin:

I guess I spent a bit more time on it.

Pierluca:

They got better recently.

Martin:

Yeah, I'm sure we could reach pretty good levels if we tried real hard. But yeah. So NetHack is actually great for reinforcement learning because it has all these things that we care about: exploration, credit assignment.

Martin:

It is always different. It is stochastic. It is continual learning. It is a great benchmark, and it runs extremely fast. So that's for NetHack.

Martin:

To go back to the question of how much the method relies on the language model knowing about NetHack: well, actually, the language model that we use, Llama 2, definitely knows about NetHack. It is definitely part of the training set, but I wouldn't put the knowledge of the language model at the level of an expert. It is very far from that, in my opinion. Perhaps Pierluca, who has also tried prompting it many times, can say more.

Martin:

I'm sure you could give some insights.

Pierluca:

Yeah, I totally agree with that. As you said, in the paper we have put some, let's say, answers from the language model, and the language model knows what the game is, knows what kinds of objects you can find in the game, what the goal of the game is. So this is high-level information that the model knows, for sure.

Pierluca:

But then, you know, NetHack is an incredibly complex game, so sometimes expert players really have to think about long-term strategy in ways that are completely counterintuitive and just NetHack specific, and the model sometimes doesn't have that kind of knowledge. But at the same time, it has an unreasonable amount of human common sense. Many of the things that happen in NetHack, of course, deal with everyday objects, like opening doors, or encountering some monster. You know, monsters are generally bad even if you don't know anything about NetHack. So we harvest a little bit of the knowledge that is NetHack specific, but then a lot of just the common sense that the model has about language and the physical world and all of this.

Pierluca:

And we believe this is the reason why the method works well. It's not purely the NetHack-specific knowledge that the model has. It's a little bit about that, but it's mostly this common sense.

Robin:

Okay. And just to be clear, it's not that you guys just hooked up the Llama LLM straight to NetHack and are making progress that way. There's a lot more mechanism here, in terms of separately training the reward model and separately training the RL agent. But would that simple setup work? Like, if you just plug Llama into NetHack and try to make it play, do you think you'd get anywhere, or would that be just hopeless?

Pierluca:

Yeah, we tried, and it doesn't seem to work as easily. We don't think it's impossible to get something out of it, because, as you said, NetHack is a text game in the end, even if the text is used to represent a visual space. But we tried, and it doesn't seem to be easy, because NetHack is partially observable, so sometimes you would need an incredibly long context, but also it's not easy to interpret for a language model that maybe knows about NetHack, maybe has seen something about NetHack, but it's a different thing to interpret each individual symbol.

Pierluca:

So we went this route with Motif, of building a bridge between the low-level world, which might be arbitrarily complex, and the high-level knowledge of the language model, by not giving all the details to the language model but instead creating this bridge through the reward function. So you just need the event captions, and you don't need the language model to fully understand the observations.

Robin:

I'm interested in the fact that you're just looking at the captions, and you mentioned that a lot of times in NetHack, you don't get a caption. So you're just seeing a text-rendered map, is that right? Or a text-rendered screen.

Pierluca:

Yep. Correct.

Robin:

Your reward model is being trained on just the captions. Is that right? Not the screen, the map, and the rest of the observation?

Martin:

Yeah. That's right. It's just the messages. Just the captions that appear on top.

Robin:

And then does the agent itself get the full observation?

Martin:

Yep. The agent does get the whole observation, because that's necessary to take these fine-grained actions step by step.

Robin:

How does exploration work in this setting?

Martin:

So in NetHack, you do sometimes have messages that appear because you've taken an exploratory action. The maze is generally procedurally generated, and at some point you hit a dead end and you have to use the search action many times in a row to reveal the next tile, and then you can proceed and continue exploring. When you do these kinds of things, there are messages that appear that say, oh, you have found a hidden passage, or you have found a hidden door. And these are actually some of the things that are most highly rewarded by Motif, because it understands that you have done something that is very useful for progress in the future.

Robin:

Right. And I think you mentioned in the paper that sometimes the reward function acts more like a value function, like it's actually telling you there's value in the future, as opposed to this particular event having high reward. So that's quite interesting: a reward function that's kind of in between a reward and a value function. Is that how you think about it?

Martin:

Yeah, absolutely.

Martin:

I think this is probably one of the most interesting things that we found working with Motif. Usually we think about reward functions as being something very different from a value function. But when you ask the language model to give you preferences to extract the reward function, the way that it reasons about its preferences is almost always with respect to the future. There's this debate on Twitter about whether language models have a world model or not. I think what we found is pretty compelling evidence that they do. It is definitely an abstract world model.

Martin:

It's not a world model in a step-by-step sense. It's more like: you have shown me this message and that other message, and I prefer the first message, because if you find a hidden door, then there are possibly other rooms that you're gonna find. So it assumes that if you find a hidden door, you're actually gonna open the door and continue exploring. It thinks about the future, and it also conditions on some kind of reasonable policy.

Martin:

And this is probably one of the key reasons why the reward from Motif is especially useful: it's closer to a value function. And maybe just to put it on more theoretical ground, if you look at the literature on potential-based reward shaping, you see that the optimal potential function is the optimal value function. So in a sense, the best reward that you can hope for is the reward that guides you with respect to the future.

Martin:

So it's the value function, which is what Motif tends to give.
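For reference, the potential-based reward shaping result Martin alludes to (Ng et al., 1999) can be written as follows: shaping with a potential function preserves the optimal policies, and choosing the potential to be the optimal value function gives the easiest shaped problem.

```latex
% Shaping with a potential \Phi preserves the set of optimal policies:
\tilde{r}(s, a, s') = r(s, a, s') + \gamma \, \Phi(s') - \Phi(s)
% With \Phi = V^{*}, the shaped reward effectively hands the agent the value of
% the future, which is the sense in which the best reward "behaves like a value function".
```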

Pierluca:

If you remember, we have in the paper this result about agents trained with the intrinsic reward only, the one coming from the language model: they are better at collecting the game score than agents trained on the game score itself. This is quite a surprising result when you see it, but at the same time, if you think about what Martin just said about bringing value from the future into the present, by essentially having a reward that behaves like a value function, then your agent doesn't have to explore as much and doesn't have to do as much credit assignment as it would normally. In other words, the resulting reward function is way, way easier to optimize compared to the reward function that comes from the environment. So the language model can do three things, essentially.

Pierluca:

The first thing is to replicate a little bit the score reward function that comes from the environment, because the language model knows what will give you score in the game, given its knowledge of NetHack and common sense. The second thing is that it behaves like a value function, so it helps the agent do credit assignment. And the third thing is that it's gonna guide the agent to do exploration, because the language model, as we mentioned, prefers actions which are explorative. So for instance, it will prefer a message that says "You open a door," just because when you open a door, there is a high probability for an agent that follows a reasonable policy to find new information. So it has these three levels, and the combination of the three levels leads to this kind of result.

Robin:

Okay, very cool. So it's kind of like the LLM is almost providing some pretraining, by bootstrapping with this common sense. Is that a way to look at it?

Pierluca:

Yeah, it is, a little bit. It's pretraining that doesn't come in the form of training the parametric policy or the value function itself. All that information is instead encoded into another neural network, which is the reward function.

Pierluca:

But then we use some form of PPO, and these policy optimization algorithms, if you give them a very good reward function, are very good at finding the solutions that lead to good behavior, given that nice, easy-to-optimize reward function.

Martin:

Yeah. Perhaps another thing with respect to this is that the kind of pretraining we see can also be interpreted in terms of how you align the agent. The kind of alignment that you see is very human oriented: the behavior that the resulting agent gets from this reward is a behavior that you would imagine a human exhibiting when they play the game. Maybe to give slightly more detail about that: usually, the RL baseline that maximizes the score is very greedy with respect to going down the dungeon.

Martin:

And that, in NetHack, is essentially super dangerous. As soon as you go down a few levels, things start to be very complicated and you die very easily if you haven't increased your experience level, if you don't have good armor class and those kinds of things, if you haven't improved your skills. But the reward from the environment tells you that you get rewards by going down. Now, the Motif agent that learns from language model feedback is much more conservative.

Martin:

NetHack is a survival game, and it acts much more like it's trying to survive. So it doesn't just go down blindly. It tries to stay on a level, tries to fight monsters much more, and because of that, it survives for much longer periods of time.

Robin:

And why is that? Why does the other agent wanna go down to the difficult levels right away, while this agent is happy to slowly work through the levels?

Martin:

That's a great question. Intuitively, I felt that it's a matter of the language model giving a more dense reward: a reward to find the objects, to do a lot more things in each part of the world.

Robin:

For it to stay on the first level? Because it knows there's more reward up there, or it knows the value function has lots of Easter eggs for it up there.

Martin:

Exactly. Yep, that's right.

Martin:

In the case of the RL baselines, the reward function is much more limited in terms of what it rewards, and one of the biggest rewards is to go down a level. So, naturally, you can expect the agent to just wanna go down levels once it sees a staircase.

Robin:

Can you talk about some of the unexpected things? Were there other unexpected things that you found? I saw an interesting anecdote about hallucinations, and not in the usual sense of LLMs hallucinating, but something else.

Pierluca:

One of the tasks that we use in our paper is a task called the Oracle task. In this task, the agent has to find a character that is called the Oracle, and it's in one of the levels that are a little bit deeper into the dungeon, usually in levels that agents trained with reinforcement learning using the score as a reward function cannot even reach. And we discovered that our agents trained with the combination of intrinsic and extrinsic reward can find it. In practice, we wanted to understand and dive deeper into the behavior of the agent, because we were surprised that it had a pretty high success rate of 30%.

Pierluca:

So we wanted to know, how is that possible? How can it reach it so easily? The Oracle is a very, very difficult character to reach. And so we looked at it, and we found out that it wasn't going deep into the dungeon at all. What it was doing, basically, was exploiting one of the features of NetHack, which is the fact that the agent can hallucinate when it eats a particular substance that comes from a monster.

Pierluca:

It was exploiting this feature for hacking the reward. We use the NetHack Learning Environment, and the environment simply says the agent completes the task when the character that is the Oracle is near the symbol that denotes the agent. So the task is completed when this situation is encountered. But that doesn't account for the fact that when you are in the hallucinating state in the game, you start seeing all the other monsters and objects as random objects and monsters. So the policy that our agent learned is basically to get on drugs, to hallucinate as fast as possible, and to wait for something to appear to become the Oracle, which is the goal.

Pierluca:

And so instead of going deep into the dungeon, which is a pretty difficult task to achieve, it found this solution that leads to a pretty high success rate. We also fixed the task so that the agent could solve the original intended Oracle task, but we were very surprised by finding this case of, let's say, misalignment. And we tried to think about the general phenomenon that is behind this misalignment, and we call it misalignment by composition. Because, as Martin has discussed, the intrinsic reward actually generates behaviors that are pretty aligned with how a human would play the game, and, of course, if you train the agent with just the extrinsic reward on this Oracle task, it doesn't get any useful behavior, but you could say it's still aligned: it's not good, but nothing surprising. But then you combine the two reward functions, and the optimization process, the RL algorithm, finds a behavior that is misaligned. So by just composing two reward functions that, when optimized by themselves, lead to aligned behavior, you get a misaligned behavior.

Pierluca:

And we believe that this kind of misalignment, this kind of dynamic, could also be present in other contexts, like when you train chat agents with RLHF with multiple criteria, for instance.

Robin:

So it looks like kind of a classic instance of reward hacking, or how RL just kind of cuts to the chase or finds a bug or some shortcut, you know, skipping the hard work of the real world and just getting enlightened from some magic mushrooms to go straight to the goal. That's pretty funny, actually. So with a big project like this, how do you guys split the work between the two of you? I think you were two first authors.

Robin:

Is that right? And then you had other co authors?

Pierluca:

So we are really believers in collaborative research. In the current research environment, it's very, very important to do research in a way that lets people easily share ideas and work really together, to really understand things deeply and build things that are a little bit more creative. In this particular case, Martin joined as an intern in June. And I have to say it was a moment of confusion for reinforcement learning researchers, because, you know, ChatGPT and powerful language models were out there, and it wasn't clear what the relationship between reinforcement learning research, or research in decision making, should be with these kinds of systems.

Pierluca:

And so what we did with Martin initially was simply to brainstorm all the possible ways you could use a language model for decision making. There were a lot of papers at that time about, you know, if you have a text game you can use the language model as a policy, or you can build a curriculum with the language model, and all of these things. So we read all of these papers, and we brainstormed together for a few days. And basically, the idea for Motif came out of these brainstorming sessions as, in a way, the most natural way to connect the high-level knowledge from the language model and the low-level skills that you wanna learn using reinforcement learning.

Pierluca:

So it started from just brainstorming and thinking about things together. And then in practice, in the initial phase of the project, I don't know if you noticed, but the algorithm is pretty modular. We have this phase in which we use the language model to annotate the dataset, then we have this phase in which, with the annotated dataset, you train the reward model, and then we just do online reinforcement learning training with the resulting reward model. And this is a pretty nice property from the engineering standpoint, because then we had the possibility, at some specific moments early in the project, to split the work on these parts. I was working a little bit more on the language model side at the beginning, and Martin was working a little bit more on the reward training side.

Pierluca:

So we were able to split this work. But eventually, the design of the experiments that went into the paper, and the final experiments, we simply designed all together. We tried to take all these decisions together, and the writing was done together.

Robin:

Speaking of teamwork, for the listeners, I just want to note that one of the co-authors of this Motif paper, Amy Zhang, we interviewed in episode 29. We've also been lucky to feature a couple of PhD advisors of you both, including Professor Bellemare in episode 22 and Professor Machado in episode 20. Listeners might wanna check out those episodes as well.

Martin:

We were starting this project that seemed a little bit crazy at the beginning. And it was kinda crazy for us because it's so different from what we had done before. I just remember, you know, after the first couple of weeks, we were trying stuff, and we spent like a month coding the thing before we could even run anything. And on some of those evenings we were just walking out of the office going, are we crazy? Like, do you feel uncomfortable?

Martin:

This seems like a big risk.

Robin:

So what you're saying is you weren't sure this would work? What did you think would not work? Like, you didn't know if the reward function would be sufficient to make a good agent?

Martin:

Yeah, exactly. I mean, the vision was there. We definitely believed that this would probably be the most natural way to distill abstract knowledge for decision-making agents.

Martin:

But the specifics of it are definitely another set of difficulties. The first few things that we tried actually didn't work, for details that are maybe too deep to get into at the moment. There are some choices that we had to make, some things we had to learn, that you won't necessarily learn by reading RLHF or RLAIF papers, because even though they're very related, the setting is very different and presents a lot of challenges.

Robin:

So you mentioned the gap between the world of LLMs and the world of RL agents, especially in non-text settings, and the idea that it's not very clear how to combine these two things. And I guess here, because NetHack is text based, it was easier to make direct progress. That made it a bit easier to incorporate the LLM, and the LLM did know something about NetHack. But what about other settings where that might not be the case, where either it's not text, or the LLM has no idea about the environment? Or maybe the environment is somehow similar to NetHack, but not the same. Like, I realize in NetHack there's some sense of optimism.

Robin:

Right? When you find a hidden door, you are not afraid that there's a mine behind it that will explode so you die. That's just not how NetHack works. Usually, when there's a hidden door, there's something nice behind it. So there's some kind of bias in there, with the model assuming that surprises are maybe more positive, or something.

Robin:

There's some of that there, where the environment is somehow similar to what the expectations of the language model are. But if those things aren't the case, can we talk about other ways that you've seen people bridge that gap between the world of RL and the world of LLMs? I can say that on the show we've had the SayCan authors, Karol Hausman and Fei Xia, on the podcast, and at the time they were designing hand-selected value functions for certain things as their way of bridging the gap. But I wonder if you could talk about other ways of bridging that gap that you've seen, or what you think is promising for bridging that gap.

Pierluca:

In NetHack, you can have very, very bad surprises behind doors, but we are still not at the level of agents that play the game like a human expert. A human expert would know that something incredibly bad can be behind a door, or something crazy bad can happen if you attack a specific type of monster. And, yeah, this is a little bit of a drawback of the current LLM's knowledge. But I think, on this point about domain knowledge: what do you do if you have an LLM that is pretrained, and you have a task that you want to perform, and maybe you have some data on that task? Well, you just fine-tune your language model. So we believe that if you wanna apply Motif to another environment, or even just on NetHack, if you wanna have a more accurate reward model, you can just fine-tune the language model on the information of interest.

Pierluca:

So that's a way to bridge this gap in terms of knowledge, and we think it's basically the most natural way you could have to bridge it. And, also, if I can make a comment on the text-based nature of NetHack: it's true that it's text based also in the observation space, but as we discussed, it's not a feature we use in Motif. And, actually, if you try to give that to the language model, it's not that good at interpreting that kind of input. So what we use is really an event caption.

Pierluca:

And we believe that this kind of event caption is actually not so uncommon to obtain in many environments. It's a high-level happening. Imagine, in the real world, it would be something like "you open a door" or "you're going down the stairs." There are very, very good captioning models right now that could produce such event captions without any problems. And also, if you take the current Motif architecture and you use a vision-language model, so a model that can give you a preference on images, for instance, you could apply exactly the same algorithm with visual observations.

Robin:

So I see what you mean. Cool. Okay.

Pierluca:

So we are not the ones creating the vision-language models right now, but, you know, there are probably thousands of people training such models. So even if we don't do it, by next year there's gonna be a model that is good enough to try something like this in a different environment that is just based on images.

Robin:

But then, on that one point about fine-tuning the language model, I'm trying to imagine how that works, because the fine-tuning would have to include some notion of what is better or worse, right? Like, how would fine-tuning on a bunch of text help it change its value function or its reward model? That's not clear to me.

Pierluca:

Yeah, that's a good question. The fine-tuning I was talking about is a fine-tuning just of the language model. The thing you wanna do is just give the knowledge about the task to the language model; you don't necessarily have to tell it what is good and what is bad. The model will figure out what's good and what's bad with its, let's say, not incredible but reasonable reasoning abilities, just by having the knowledge about the task.

Pierluca:

It's not like you have to explain to the model: okay, this is good, this is bad.

Pierluca:

The model will learn the high-level knowledge about the new task, and then it will be able to say: okay, given my high-level knowledge, what's the goal in this task, and what are the typical happenings in this task, then I can tell you. Like, if you're playing soccer, it's kind of good if you're scoring a goal. It doesn't need to see cases of scoring versus not scoring; it will learn that from just reading about the game or about the new environment. Which, you could say, is what humans do many times. Imagine you're about to play a game you don't know anything about, or you're about to do a sport you don't know anything about.

Pierluca:

Probably the first thing you're gonna do is at least read the rules of the game, or the rules of whatever sport you're doing, and then figure out ways to integrate that knowledge into behaviors.

Martin:

I guess one of the main motivations behind Motif is that the language model knows things at a high level. And we wouldn't want to take these large models and fine-tune them for step-by-step actions, because by definition that is contrary to the way they're trained, at least language models as we know them. They're trained to predict next words, and these things are quite high level. Whereas when you actually do reinforcement learning, you have to act in a step-by-step kind of setting. So if we do fine-tuning of these large models, I guess we can see it, as humans, by putting ourselves in the place of the language model.

Martin:

Let's say you have your prior knowledge about how the world works, more or less, and then you hear about a certain game. You're gonna try things that would perhaps make sense in the other games that you know about. You're gonna try them in this one, you're gonna try different variations. And that's the hope here: you fine-tune these large models, but you don't necessarily fine-tune them at a very low level. You try to keep it as high-level kind of knowledge on which you can build.

Martin:

And once you have that high-level knowledge, then what we propose with Motif is to just let the reinforcement learning agent take that reward function and run with it for billions of iterations. Try to find all the cracks, try to find all the possible things that it can do. And that's the strength of reinforcement learning, really. It's able to find stuff that we couldn't otherwise.

Robin:

So you guys definitely have some interesting ideas for follow-up work. Do you plan to do follow-up work on this, or are you planning to switch channels?

Martin:

So we're currently working on a follow-up to this, around NetHack. We definitely wanna focus a bit more on performance at the moment. In this Motif paper, there's, like, one page about performance, where we show that it does particularly well, but we focused most of it on trying to understand the alignment and analyze the behavior of the agent. We would like to do a follow-up on NetHack to really seal the deal and try to optimize most of the components that we haven't in this first paper.

Martin:

So hopefully, in the next few months, we're gonna find something that works even better.

Robin:

Awesome, looking forward to it. There has been some tension, as you mentioned, around the role of RL in this world where artificial intelligence is increasingly being defined in terms of these new LLMs, even if LLMs themselves are trained with RL. And Pierluca, you wrote an interesting blog post about this topic, where I gather you're suggesting that RL people kind of reframe themselves a bit, or reframe the problem a bit. Can you speak to that?

Pierluca:

Yeah. The blog post was sort of extreme, in a sense. Superficially, it was just saying, okay, we shouldn't use the name reinforcement learning, because the name reinforcement learning will be associated with the old techniques that we used to use in reinforcement learning. So people would think, oh, these people are just training policy gradient algorithms on Atari games. It was a little bit of a provocation, but the underlying truth that I was trying to convey is that a researcher in reinforcement learning is actually not just a researcher in reinforcement learning, but a scientist who tries to understand the science of sequential decision making.

Pierluca:

You're trying to understand what a decision-making agent should do to learn how to interact with an environment to achieve some kind of goal. And the fact that now we use large language models to bootstrap the knowledge of an agent, or to think about things, doesn't mean that all we learned with reinforcement learning research is lost. We still learned a lot of things about credit assignment, about exploration. When I wrote that post, it was at the beginning of the summer, and it was actually part of the inspiration for why we started working on Motif, because we wanted to show something about the two worlds of, let's say, language models and reinforcement learning as traditionally conceived.

Pierluca:

They are connected, and we found this connection through the reward function, which we think is one of the most natural ones. And I also want to add something. There's a lot of work, and a lot of people are thinking about these AI agents. Talking about AI agents usually means you have a large language model, and this large language model interacts with an environment, which is usually on a computer. And a lot of the work that is being done right now is a little bit rough, or exploratory, in the sense of: I wanna show that maybe GPT-4 can do this, or GPT-4 can do that, or if I join these pieces together, something like this happens. But what I see as a scientist is that we need a science of AI agents right now.

Pierluca:

We need to not lose all the progress that we made in the science of decision making in the last few years, and to use that progress, that rigor, and that understanding of decision making to analyze these systems. In other words, we need to build a modern science of AI agents that incorporates language models into the mix.

Robin:

Yeah, interesting take. I think for a while there, when people were hearing intelligent agents, or just AI agents, they started thinking about LangChain and BabyAGI, things which to me had nothing to do with that. And so I was so confused when I would talk to people in the mainstream. They'd say, oh, have you heard about agents?

Robin:

And I was like, what do you mean? And they would point at BabyAGI that's just spewing out all this text. And I was like, oh, wow, I guess the words are kinda preceding the understanding of what's happening here. So I was excited to read your blog post, and I encourage people to read it, and we will link to it in the episode notes.

Robin:

There's one other paper we wanna touch on, "Policy Optimization in a Noisy Neighborhood: On Return Landscapes in Continuous Control." Pierluca, can you give us a high-level overview of this paper? What is happening in this paper?

Pierluca:

Yeah. First of all, I wanna thank Nate Rahn, because he co-led this project with me, and he also worked on a lot of those incredible visuals that you've seen in the Twitter thread. So thanks, Nate. This paper is, again, you can think of it as about this science of AI agents. It's about the empirical science of deep reinforcement learning.

Pierluca:

So apart from building new algorithms that have good performance, it's good for science to also do something else, which is to build understanding of the algorithms we have and of the behaviors that they generate. In order to build this understanding, the approach that we take in this paper is to look at the return landscape: the mapping between a behavior, or a policy, and the performance of this policy in a particular environment. You can use this particular lens on the landscape to understand, when you use a deep reinforcement learning algorithm like PPO, Soft Actor-Critic, or TD3, what part of this landscape is visited. And we also take a kind of distributional view on this landscape, and we are able to characterize each policy, each behavior, with a distribution.

Pierluca:

And then, if you use the statistics of this distribution as metrics, you can describe the behavior. So you can build a map, and we have these colored plots: we literally build a map of the behaviors that are visited by deep reinforcement learning algorithms. That's one type of finding that we have, and we also show that you can move between neighborhoods. Some neighborhoods are very noisy.

Pierluca:

Sometimes you move a little bit and you have a completely different policy that is failing, but some other neighborhoods are pretty smooth, and we build simple algorithms to move from one type of neighborhood to the other. We also study this return landscape from a global perspective, in which we show that, basically, if you run your policy optimization algorithm, you can interpolate between the neural network parameters that you obtain, just linearly interpolate, and the resulting return never drops in expectation. So we show that a phenomenon like linear mode connectivity also exists in reinforcement learning. You can simply connect the policies with linear paths, and a large part of the optimization that is done by deep RL algorithms happens along these paths, which are very, very smooth, which was a bit surprising for us.
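To make the interpolation experiment concrete, here is a small sketch: evaluate the return at evenly spaced points on the straight line between two sets of policy parameters and check whether it ever drops. The `evaluate_return` callable stands in for rolling the policy out in the environment (done with many parallel episodes in the paper); the toy quadratic below is only a placeholder.

```python
import numpy as np

def returns_along_path(theta_a, theta_b, evaluate_return, num_points=11):
    """Return estimates at evenly spaced points on the segment theta_a -> theta_b."""
    alphas = np.linspace(0.0, 1.0, num_points)
    return [evaluate_return((1.0 - a) * theta_a + a * theta_b) for a in alphas]

# Toy usage: a smooth synthetic "return" in place of real environment rollouts.
theta_a, theta_b = np.zeros(10), np.ones(10)
toy_return = lambda theta: -float(np.sum((theta - 0.5) ** 2))
print(returns_along_path(theta_a, theta_b, toy_return))
```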

Robin:

Earlier, I misspoke. I think I said loss landscapes, and really you're mapping the return landscapes. Can you just clarify: when you say return landscape, is that the value function?

Pierluca:

The loss landscape in reinforcement learning is typically the value function, because you're trying to maximize the expected return as approximated by the value function. Instead, the return landscape is the actual return that your policy gets in the environment. So it's indirectly related to the optimization process, but it's not that; it's just the cumulative reward that you get in the environment. And so the two are related.

Pierluca:

And, actually, on the relationship between the two, I can make a comment, again, on the mode connectivity result. When you go from one policy to another and you used a deep reinforcement learning algorithm to do so, well, you optimized the policy using a non-stationary objective, which is the value function that you were learning during the policy optimization process. So it's even more surprising that, with a non-stationary loss landscape, you get something that is very, very smooth in terms of optimization.

Robin:

So what you're saying is, for the return landscape, you're evaluating the policy at every single point. That means you're actually running the policy to see what the total return is for that point?

Pierluca:

Yes.

Robin:

And so it's a very expensive landscape to compute. Is that right?

Pierluca:

Yes, exactly. It is theoretically very expensive.

Pierluca:

In practice, we wrote code that is super parallel. I don't know if you've ever heard of this simulator called Brax that runs on the GPU. We use it to do all the evaluations of the policies in parallel, and that allows us to evaluate many policies, at many moments in training, with many episodes. So that's the secret.

Robin:

And your theta is your policy parameters. So you're saying that if it's bumpy, a slight change in policy parameters results in a big change in performance. Is that what a bumpy landscape means?

Pierluca:

Exactly.
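As a sketch of that idea: perturb the parameters slightly many times and look at the spread of the resulting returns; a noisy neighborhood shows a large spread, with some perturbations failing badly. Note that the paper characterizes neighborhoods through the returns reached after policy updates, so the random Gaussian perturbations here are a simplification, and `evaluate_return` again stands in for real rollouts.

```python
import numpy as np

def neighborhood_returns(theta, evaluate_return, radius=0.01, num_samples=100, seed=0):
    """Returns obtained from small random perturbations of the parameters theta."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal((num_samples, theta.shape[0]))
    return np.array([evaluate_return(theta + radius * n) for n in noise])

# Toy usage: a large std or a low min signals a "noisy" neighborhood.
theta = np.zeros(10)
toy_return = lambda t: -float(np.sum(t ** 2))
rets = neighborhood_returns(theta, toy_return)
print(rets.mean(), rets.std(), rets.min())
```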

Robin:

All the magic in deep RL seems to be in these black boxes of what the neural networks are doing, and it's so hard to know what's happening inside. So you have a new way of looking into them and figuring out how to produce stability from these otherwise very unstable black boxes. That sounds like it could be very important. We will be linking to this paper as well in the show notes. So does this paper bring us a little bit further towards understanding how to get stable agents in RL?

Pierluca:

I think we are still far from having, let's say, a very, very good solution for that. But one thing we do is build understanding about what the underlying phenomenon might be that is going on and might prevent agents from having stable policies.

Robin:

What else is happening in RL, aside from the great work that you both are doing? What's happening in the world of RL lately that you find quite interesting?

Martin:

I think for me, one of the most interesting things is the idea of diversity. Going back to quality diversity algorithms, I think these kinds of ideas are extremely promising these days, given that we have language models that can provide some kind of prior about what is a good measure of diversity. So, how do you learn diverse skills? How do you select diverse goals? I think this is extremely promising. And I think through diversity we can really reach surprising behavior.

Pierluca:

In the last few years, I've been excited by this direction that is called decision awareness, or decision-aware model learning. It's the idea that in model-based reinforcement learning, you learn some kind of model that, instead of simply being learned by maximum likelihood, is oriented toward the ultimate objective in reinforcement learning, which is maximizing the reward. I had some work on this a few years ago, and recently I've been working on some policy-gradient-related work in this space. It's gonna be out in the next month, probably, but it's quite exciting. It's about how you could use neural networks, their shape and the way they are conditioned, to get an estimate of the policy gradient that can even be better than the one that comes from the real world.

Pierluca:

So even if you assume you have the real policy gradient, the resulting policy gradient that you get from the method can be even better. So I'm quite excited about this.

Robin:

Pierluca D'Oro and Martin Klissarov, I wanna thank you both for joining us today and sharing your insight with the TalkRL audience. Thank you both.

Pierluca:

Thanks to you again for having us.

Martin:

Thank you so much for having us. It was great to chat with you.

Robin:

Yeah. We look forward to seeing your work at NeurIPS in New Orleans.
