Jakob Foerster

Talk RL podcast is all reinforcement learning all the time, featuring brilliant guests,
both research and applied.
Join the conversation on Twitter at Talk RL podcast.
I'm your host, Robin Chauhan.

Today we are lucky to have with us Jakob Foerster, a major name in multi-agent research.
Jakob is an associate professor at the University of Oxford. Thank you for joining us today,
Jakob Foerster.
Well, thanks so much for having me. I'm excited to be here.

How do you like to describe your research focus?
My research focus at a high level is about finding the blind spots in the research landscape
and then trying to fill them in.
And I know this sounds awfully general, but I can give you examples of what this used
to be in the past.
And maybe I can also talk about what I think this will mean in the future.
So in the past, when I started my PhD, this was all about multi-agent learning.
When I started my PhD, at the time, deep reinforcement learning was just becoming the popular thing to do.
But folks hadn't realized that putting learning agents together was an important problem to
study.
So that was a big gap in the research landscape.
And in my PhD, I started making progress on this.
And what might this mean going forward?
It essentially means understanding what the boundaries of current methods are, where the
limitations are either already arising or will be arising in the future, and then trying
to address those limitations.
For example, being able to utilize large-scale simulations to address real-world problems,
as opposed to relying just on supervised learning methods or self-supervised learning methods.
That's at a very, very high level, what I'm excited about.
Fundamentally, it's about painting in the big gaps in the research landscape.

Most of your work I've encountered is on multi-agent RL.
And on your lab's website, it mentions open-endedness as well.
Can you tell us about what type of work you do in open-endedness?
Yes, this is a fantastic question.
There's a really sort of up-and-coming research area called unsupervised environment design,
UED for short, and I believe you had somebody on your podcast as well who's leading in
that area.
And I've been fascinated by this question of how we can discover environment distributions
that lead to specific generalization results or allow agents to generalize to sort of the
corner cases of the distribution.
And as it turns out, this is a multi-agent problem, because it's often formulated as
having a teacher and a student, and immediately you're in the space of general sum learning
or zero sum learning in this case, whereby we have to consider the interactions of learning
systems.
So that's one thing that makes it very fascinating for me.
The other aspect is it's clearly a crucial step when wanting to bring methods to the
real world.
So being able to bridge the sim to real gap effectively is one of the key questions in
bringing multi-agent learning systems to the real world.
And then lastly, we have papers now where we're actually using insights from multi-agent
learning and bringing them to unsupervised environment design.
So for example, there's a method called SAMPLR that we published last year at NeurIPS.
And here it's essentially a transfer of a method I developed, off-belief learning, from
multi-agent learning, where the same problem also appears in this very different
problem setting of unsupervised environment design.
So for me, overall, this has been a fascinating process.
And I would say now probably almost a majority of the papers coming out of FLAIR in the future
I expect to have some type of element of open-endedness or unsupervised environment
design.

And I'm sure you get this question a lot or have to answer this a lot.
But for the listeners, can you remind us of what are some of the main challenges with
multi-agent RL and multi-agent in general?
Like why do the methods designed for single agents not work very well for multi-agent
problems?
And what makes it hard?
Yeah.
So in a nutshell, supervised learning is easy.
And this sounds a little flippant, it's supposed to be flippant.
But if you have a data set and you have supervised or self-supervised loss, ultimately, this is
a stationary problem.
And what that means is, as long as the learning algorithm is going to converge to an approximate
global optimum, you will get a model that works well.
And we've really gotten to a point now where large-scale supervised learning, even if it's
a GPT-17 or whatever we have, can converge stably to good solutions.
And effectively, we don't care about the exact weights or the exact solution found, because
we can simply look for generalization performance, make sure we don't overfit.
And that's really the only concern that we have.
We can look at the training loss, test loss, validation loss, and make sure that we're
in good hands.
In contrast, when we have multiple learning agents, then suddenly all of these guarantees
break down.
And that's because the other agents in the environment continuously change the problem
that we're addressing.
So, for example, if you're looking at the world from the perspective of one agent, then
suddenly the actions that other agents take will change the environment that's being faced
by that agent.
And that's called non-stationarity.
To make matters worse, when these agents are learning, we get extremely hard credit assignment
problems, whereby suddenly the actions that an agent takes in an episode can change the
data that enters the replay buffer or the training code of another agent, and therefore
will change the future policies.
And suddenly doing extremely sensible things, like each agent maximizing their own rewards,
can lead to rather drastically unexpected phenomena.
And one example is playing the iterated prisoner's dilemma, whereby there are a lot of different
possible Nash equilibria that could be reached during training.
But naive learning methods have a strong bias towards solutions that lead to radically bad
outcomes for all agents in the environment, such as defecting unconditionally in all situations.
So to put this into one sentence, the problem of multi-agent learning is non-stationarity
and equilibrium selection.
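To make the equilibrium-selection point concrete, here is a minimal sketch in Python (my own illustration with assumed payoff values, not code from any paper discussed here) showing why unconditional defection is the dominant choice in the one-shot prisoner's dilemma, even though mutual cooperation is better for both agents:

    # One-shot prisoner's dilemma with conventional illustrative payoffs.
    # Actions: 0 = cooperate, 1 = defect; payoffs[(a1, a2)] = (reward_1, reward_2).
    payoffs = {
        (0, 0): (-1, -1),  # both cooperate
        (0, 1): (-3,  0),  # I cooperate, you defect: I get exploited
        (1, 0): ( 0, -3),  # I defect, you cooperate
        (1, 1): (-2, -2),  # both defect: bad for everyone
    }

    def best_response(opponent_action):
        # Whatever the opponent does, defecting (1) gives me the higher payoff.
        return max([0, 1], key=lambda a: payoffs[(a, opponent_action)][0])

    assert best_response(0) == 1 and best_response(1) == 1
    # Defection is dominant, so naive self-interested learners drift to (defect, defect),
    # even though (cooperate, cooperate) pays more for both: the equilibrium-selection problem.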

So I think right now with GPT and ChatGPT, a lot of people are associating that
approach, and LLMs in general, with AI, and maybe as the path to
AGI potentially.
And then there's people who have said that the reason human brains are so powerful has
to do with our social learning and the fact that we had to deal with social situations
and basically that the multi-agent setting was central to the evolution of really powerful
intelligence in the human brain.
So I wonder if you have any comment on that in terms of what the role of multi-agent learning
might be in the path towards really powerful AI and potentially AGI.

Yeah, this is a fantastic question.
So when I started my PhD, I had that exact same intuition that indeed the interaction
of intelligent agents is what has driven intelligence and that the epitome of that interaction of
intelligent agents is language.
So that was my intuition for studying emergent communication at the time.
It was essentially sort of my take on bringing agents from playing Atari games to being able
to discuss things and ultimately get to abstraction and intelligence.
And this intuition, looking back, was good in some sense, in that language is indeed
crucial for abstract reasoning and for sort of social learning.
It now turns out, looking back, that simply training supervised models, large-scale
language models, on large amounts of human data is a faster way of bringing
agents to current or approximate levels of human abilities in terms of simple reasoning
tasks.
And probably that makes sense looking back.
Now that doesn't mean that these methods will also allow us to radically surpass human abilities
because again these abilities emerged in the human case through multi-agent interactions
through a mix of biological and cultural evolution and ultimately led to that corpus of cultural
knowledge that we've currently sort of codified in the existing text and other media.
So I'd like to distinguish a little bit between being able to get to broadly speaking something
that matches human level at a lot of tasks give or take versus something that can radically
surpass human abilities.
And I can imagine that to radically surpass human abilities, we will need systems that
can train on their own data, train in simulation, and also might need systems
that can interact with each other and train in multi-agent settings to sort of drive
meta-evolution, something like that.
Does that make sense?
So I think there's a bit of nuance about getting up to human level, which I think current systems
can do, and it's an open question how much further we can push it without going to multi-agent
learning, and also whether you want to push it that way.

Okay, and this definitely brings up your paper from years ago,
which I consider a classic in deep RL:
Learning to Communicate with Deep Multi-Agent Reinforcement Learning.
I believe that was 2016, and in that paper you mention end-to-end learning of
protocols in complex environments. I think that was a groundbreaking paper in
terms of figuring out how to get deep RL agents to communicate and to invent languages.
Is that correct?
Yeah, so as I said, this was driven by my desire to get agents to be less singleton, and actually have them interact with each other, talk to each other, and ultimately get to intelligent systems.
And it was fantastic in many ways, this paper, in that it really started the community and started showing people what is possible with modern techniques. And I do believe that, looking back, the currently more successful approach is, rather than relying on emergent communication, to actually seed these systems with large-scale language models, and I'll be very excited to see how we can combine these two now.
So, standing on the shoulders of giants, starting with large language models, can we now ask those same questions again about having agents that can develop protocols, starting from what is already seeded with all of human knowledge from a large-scale language model, but then applying that same rationale that multi-agent interactions can lead to novel skills and the emergence of novel capabilities?
But again, not starting from scratch anymore like we did in 2016, but instead starting with already quite a lot of, let's call it, reasoning abilities just within these models.
So I think there's an exciting line of work here, which is picking up those initial ideas again but combining them with what we have in 2023.
That does sound exciting.
I still wonder, though: is there not still a role for learning from scratch? Like if you had maybe small IoT devices or something that have to talk, they might not want an LLM, but they might have some very specific thing they need to collaborate on.
Do you think that there's still a role for that very simple level of learning from scratch?
Oh, absolutely. So the earlier discussion is about the path to AGI, basically, or to really intelligent systems; that's one axis.
From a practical point of view, I always joke that I don't actually want my Roomba device to be able to build a dirty bomb.
I want it to be able to clean the floor reliably, maybe certifiably, and I want it to be able to do almost nothing else.
And so there are many instances where, even if it was possible to prompt some large language model to do the task, I'd much rather not be deploying the large language model, for safety reasons.
Mm-hmm, and performance, like...
Yeah, absolutely correct. Safety, cost, performance, guarantees, right? So I think reinforcement learning, from a practical point of view, is still going to be required, especially if you think about robotics and so on. But I think in the training process of these policies, we will be using large language models to generate data, to help with the exploration task, to perhaps generate environments, to come up with high-level plans, and so on.
And that's broadly speaking how I think about it. And obviously, in that same context, learning communication protocols from scratch, when it's required as part of the task specification, is still going to be a necessity in many situations where communication is costly, communication is noisy, and it's not obvious what these devices should be communicating.
But that's a very different sort of motivation from what was, back in the day, guiding my decision-making to work on communication protocols in the first place.
Obviously, you know, if you read the paper, it will be all about the down-to-earth things, like actually solving practical problems.
That makes sense.
So we had Karol Hausman and Fei Xia, co-authors of the SayCan paper, on earlier, and I was asking them: do we really want our kitchen appliances having read all of Reddit, and things like that? Although they had a good answer for how they kept things safe: they didn't let the models just generate whatever; they were just evaluating probabilities.
But on the other hand, they couldn't run the LLM locally; the robot had to talk to the LLM in the cloud. So there are all sorts of trade-offs there.
So let's move on to cooperation versus competition. You'd kind of talked about these as two axes, sort of. But is that the whole story? How are they different when it comes down to learning problems? Are cooperation and competition very, very different, and what makes them different?
Yeah, it's interesting.
So the grand challenge I mentioned of multi-agent learning, or one of the grand challenges, being what policy you play, because these problems aren't specified and it all depends on what others are doing, that disappears in some settings. In particular, if you have a fully cooperative setting where you're able to control the policies, so the way that other agents act, for every single player in the team, then that effectively reduces to a single-agent problem. Not in terms of the hardness of solving the problem, but in terms of the problem specification, right? I can say the weights theta should just be the argmax of the joint policy in terms of maximizing the return of this team, and that looks a lot like any single-agent RL problem in its mathematical specification, even though the environment is now parameterized differently and the policy is parameterized differently. But it's quite simple: this is fully cooperative, all agents under the same controller.
And there's only one other problem setting that I can think of that is also easily specified, and that is two-player zero-sum, because in two-player zero-sum I can specify the requirement of having a Nash equilibrium, and if I find any Nash equilibrium, then I'm guaranteed not to be beaten by any partner I meet at test time. That's why sort of competition and fully cooperative self-play are special cases of multi-agent learning where it's really easy to specify the problem setting and make sure that whatever we train in simulation with our train-time partners actually works well at test time. In other words, the equilibrium selection of finding the exact policy is trivial, beyond the computational cost of finding this actual equilibrium.
In contrast, there are two problem settings where things get extremely complicated. On the one hand, general-sum learning, whereby, for example in the iterated prisoner's dilemma, I have all these different equilibria, my training algorithms have biases towards one or the other, and the interaction of learning systems can lead to disastrous outcomes that are not desirable for any party. So that's general-sum learning, tragedy of the commons. And the other aspect is cooperation when we're not able to specify all of the policies for all agents at test time in the environment. That's something I've worked on a lot under the banner of zero-shot coordination, which is essentially: how do we train in simulation such that we can expect our policies to generalize to novel partners at test time? And clearly, here the inability to agree on a specific policy makes it much harder to specify what the problem setting is and also how we should train for it.
So that's what I'm trying to say: really, the competitive part, two-player zero-sum, is quite unique in multi-agent learning, because all it takes is to find the Nash equilibrium. Now, this can be hard in complex settings like poker, but we in principle know what this solution should formally be.
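As a rough formal sketch of the two easy cases (my own notation, not taken from any specific paper): the fully cooperative self-play setting is specified by a single joint objective, and two-player zero-sum by the minimax (Nash) requirement.

    \theta^* = \arg\max_{\theta} \; \mathbb{E}_{\tau \sim \pi_{\theta}} \left[ R(\tau) \right]
    \qquad \text{(fully cooperative: one joint policy, one shared return)}

    \pi_1^* \in \arg\max_{\pi_1} \min_{\pi_2} \; \mathbb{E} \left[ R_1(\pi_1, \pi_2) \right]
    \qquad \text{(two-player zero-sum: any minimax/Nash policy is unexploitable at test time)}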
So I saw a lecture of yours online where you talked about the different games that were being tackled in deep RL, and most of the games that we associate with deep RL agents, like Go and chess and Dota and StarCraft, you showed how they're all zero-sum games. And at the time there was not nearly as much work in the cooperative quadrant, the partially observable and cooperative quadrant. And you have some really interesting work in that quadrant. Can you tell us about that?
Yeah. So basically, I put it under the tagline of being able to use computers not to beat humans at these competitive games, but instead being able to support and help humans, right, using large-scale compute, and in particular using compute in simulation.
Now, the challenge with this is that you can have a solution that does perfectly in that team of AI agents in simulation, but the moment you replace any of these agents with a human at test time, everything breaks, because these policies are incompatible with settings whereby the equilibrium can't be jointly chosen for everyone in the team.
This is really abstract, so let's try and make it a little bit more specific so maybe our listeners can visualize this. Imagine that we're playing a game, because I like toy examples, as you must have noticed, and there are 10 levers. Nine of these levers pay a dollar if we both pick the same one, and one of these levers pays 0.9 dollars if you and I pick it, and the reward that's being paid by these levers is written on each of them. Obviously, if we pick different levers, we don't get any points. Does that make sense?
Yeah, that makes sense.
And the question now is: what would a standard machine learning algorithm, say model-free reinforcement learning, learn in this setting? Well, it would learn a joint policy, so a policy for you and I that maximizes the reward in expectation for the team, and that policy can effectively pick any of the 1.0 levers and get one point in expectation.
Now, in contrast, if you and I were to play this game and it's common knowledge that we cannot agree on a policy, so there's no numbering to the levers, we can't agree to pick lever one or five or whatever, then it's fairly obvious that we should pick the lever that we can independently coordinate on, which is the unique 0.9 lever. And that highlights the difference between Nash equilibria which are well suited for self-play, where we can control the entire team at test time, and a completely different set of Nash equilibria that is well suited when we cannot do this.
And understanding this is important, because when machines meet humans, often the problem setting will be known, will be understood, the task is clear, but the ability to specify a policy isn't there, because that would be quite costly. And what is worse, the space of possible Nash equilibria that these algorithms can consider is often exponentially large, and only very few of them are actually suitable for coordination.
So to illustrate this, imagine that we're playing this lever game now but repeatedly, where you can observe which lever I played and which lever you played. In this setting, the space of all possible optimal policies is actually joint policies that pick an arbitrary lever at every time step from the 1.0 set, and you can imagine, if you run 100 time steps, then there is something like 9 to the 100 possible optimal trajectories. But it would be quite hard to explain to a human that this is the policy you're playing. Instead, what a human would likely do is something like: either I copy your move, or you could copy my move. Well, that requires tie-breaking, so we're going to randomly decide who copies whom and who sticks.
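Here is a minimal Python sketch of the one-shot lever game (my own illustration, with payoffs assumed from the example above), contrasting the self-play optimum with the zero-shot-coordination choice:

    import random

    # Nine levers pay 1.0 if both players pick the same one; a single odd lever pays 0.9.
    payouts = [1.0] * 9 + [0.9]

    def joint_return(a1, a2):
        # Coordination game: reward only if both players pick the same lever.
        return payouts[a1] if a1 == a2 else 0.0

    # Self-play / centralized training: any of the 1.0 levers is optimal,
    # but *which* one is arbitrary, so two independent training runs can disagree.
    selfplay_choice = random.choice([i for i, p in enumerate(payouts) if p == 1.0])

    # Zero-shot coordination intuition: pick the lever that is uniquely identifiable
    # without agreeing on an arbitrary labelling, i.e. the single 0.9 lever.
    zsc_choice = payouts.index(0.9)

    print(joint_return(zsc_choice, zsc_choice))            # 0.9, guaranteed with a like-minded partner
    print(joint_return(selfplay_choice, selfplay_choice))  # 1.0, but only if both runs happened to match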
And again, figuring this out is important, because if I want to help and support humans, that can be formulated as a partially observable, fully cooperative multi-agent problem. Partially observable because no robot can look into the human's head, so you don't know the exact reward function of the human. Fully cooperative because you're trying to help the human, so there's only one reward, even though it's unobserved by the robot in general. And multi-agent because of the robot and the human. But it's also a coordination problem, because we can't know the exact weights of the brain that is actually controlling the human, so we have to be able to work with another agent in the absence of the ability to agree on a policy.
So I've been sort of trying to push the field more to think about these fully cooperative, partially observable coordination problems, and in particular I've used the card game Hanabi here for the last few years, really, to develop novel methods in that space.
Now, I heard about your Hanabi work, I guess last year, and so I got the game with my family and we did play Hanabi. And, I don't know, for anyone who has not tried it, it's a very strange sensation playing that game, because it's like the opposite of most games: you cannot see your own cards, and you're trying to work as a team with the rest of the players instead of competing with them, which is quite refreshing. Can you talk a little bit about how you approached Hanabi and what agent design you used for that problem?
Yes. So it's been a journey, really. We started out a while ago, during my DeepMind internship, doing the very first self-play experiments. So this was basically about: if you can control the entire team, what happens? And the good news is you can get really good performance in this team. The bad news, which I found pretty soon, is that these agents that you train in simulation on the game from scratch are quite brittle in terms of independent training runs. So if you run the same training algorithm twice, you can get out teams that are completely incompatible, and also they are very different from sort of natural human gameplay, as humans would play the game.
And at this point I had a choice. I could have said, well, let me try and solve Hanabi by using large-scale human data to regularize the learning process. But I didn't do that. What I said instead is: can we use this as a platform for understanding what type of algorithms are suitable for coordination? What I mean by coordination is, again, this idea of independently trained algorithms being able to cooperate or coordinate at test time. And that's what I did, and this was a really challenging journey, because it turns out that the human ability to coordinate is quite amazing. Like, when you played the Hanabi game, most likely what happened is someone explained the rules, you then started playing, and within a few games you had a sensible strategy, and you didn't have to agree precisely on what this game was and how you're going to act in every situation.
And that's what I wanted. I wanted to develop algorithms that get to this more sensible way of playing, of acting, in these Dec-POMDPs, these partially observable, fully cooperative systems, without requiring vast amounts of human data. And the latest instance of this is called off-belief learning, which, looking back, was one of the papers that I've really enjoyed working on, that to me addressed a lot of open questions. But it's also a paper that's notoriously hard to make sense of, so I leave it up to you if you want to risk boring your readers or listeners with off-belief learning, in which case I'm happy to talk about it at any length.
Well, I mean, this show is explicitly aimed at people who don't usually get bored hearing about deep RL, so in that sense you're welcome to go into more depth if you'd like. But we actually have quite a few topics and limited time, so maybe if you could just give us one level deeper on OBL: what is the general strategy you took with OBL?
Okay. So off-belief learning, at the most basic level, tries to prevent agents from developing their own communication protocols. So it's in some sense...
Because, from what you learned before, that's what they do, right? And then all these problems come with that.
Correct. So it's exactly the opposite of Learning to Communicate. I started my PhD by asking how agents can learn communication protocols, made some progress, fantastic, and then years later these communication protocols are a real issue, because they're quite arbitrary, and if you're now going to encounter a novel test-time partner in the real world and you haven't agreed on a communication protocol, then suddenly everything is going to break, right? So imagine, in Hanabi, if I told you the third card is red, and this didn't mean play your third card but meant that you should throw away your fifth card, then that could be quite confusing.
Yeah...
So the whole question of OBL is...
Yeah, please go ahead. Oh, I just wanted to add... you know what, I shouldn't have interrupted you; you were about to get to the core question. I'm sorry, my timing isn't always perfect.
No, this was actually a great instance of a coordination problem. All right, so people say, oh, language can resolve coordination, and, well, have you tried having a conversation with a robot? It's actually really hard to get the coordination right of who should speak when, and when to interrupt.
Right.
So coordination problems exist everywhere, even in the usage of language. And actually that's not what I'm interested in here, but let me get back to this. So the core problem was: how can we train a policy that learns to play Hanabi from scratch, but is not able to develop any communication protocols at all? What that means is, when you're playing Hanabi, this policy should only interpret "red" to mean that this card is red, and if I say this card is a 3, it should only mean that this is a 3. It shouldn't be able to assign any higher-order meanings to this, such as: if I say this is a one, it's playable, and so on. And that's quite foundational, because if you think about it, often we don't want agents to communicate arbitrary information. It would be quite bizarre if you had a fleet of self-driving cars that you're training in simulation, and then one day we realized that these cars are gossiping about us through their indicators or small nuances of movement.
It's like a TMI problem, right? It's too much information.
Yes, absolutely. And it's brittle, right? Because suddenly, if you have other partners who don't understand it, it might fall apart. And also it might be an AI safety issue if AI systems are exchanging obscure information amongst themselves.
Is the fact that deep learning systems generally have everything entangled, is that the same problem here, or is that only part of the issue?
That is part of the issue. I mean, the problem is that the coexistence of agents in an environment you're training in leads to correlations which can be exploited, and once you do that, you get a communication protocol, right? The moment any agent does something in a situation where there's partial information, another agent can start making inferences, and some information is passing through the environment.
In off-belief learning, we fundamentally and provably address this, and the main insight is that the agents never train on what actually happened. Think about reinforcement learning: we calculate target values, in Q-learning for example, where we ask, given a current action-observation history and a given action, what is the effect of that action? Well, the target is R plus gamma Q of a-star given tau prime, the new trajectory. But this effect depends on the true state of the environment, and the true state of the environment is correlated with the past actions of other agents, and suddenly there is this correlation between past actions and future outcomes, which leads to conventions.
In off-belief learning, we never learn from what happened in the environment, but only from what would have happened had the other agents in the past been playing according to a random policy, because a random policy doesn't introduce correlations. If I play all actions uniformly in all possible states of the world, there's nothing you can learn from my actions themselves. Because, obviously, if you knew you were playing Hanabi with me and I say this card is red, but you know that I said that randomly without looking at your hand, then all you know is that this card is red, because that's revealed by the environment. But you have no idea about why I said this, because I would have said it randomly anyway. And that's the main idea of off-belief learning: mathematically, this method takes away the risk, or the ability, of having emergent protocols in multi-agent systems.
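Loosely, and in my own notation rather than the paper's: a standard Q-learning target conditions on the trajectory that actually happened, whereas an off-belief-learning-style target evaluates the action on a state resampled from the belief induced by assuming past actions came from a fixed base policy \pi_0 (for example, uniform random), so past actions carry no hidden meaning.

    \text{standard target:}\quad G(\tau_t, a_t) = r_t + \gamma \, Q(a^* \mid \tau_{t+1})

    \text{OBL-style target:}\quad Q_{\pi_0 \to \pi_1}(\tau_t, a_t) = \mathbb{E}_{s_t \sim B_{\pi_0}(\cdot \mid \tau_t)} \left[ r(s_t, a_t) + \gamma \, V_{\pi_1}(\tau_{t+1}) \right]

where B_{\pi_0} is the belief over hidden states given the observed history, under the assumption that earlier actions were drawn from \pi_0.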
But don't we need some type of conventions in Hanabi? For example, if you hint this card is a one, I should probably assume that you're telling me this because the card is playable, even if it's not obvious from the current state of the board.
And you can get this out by iterating off-belief learning in hierarchies, and then you get extremely human-compatible gameplay out of OBL. At this point, if you're listening to the podcast, I highly recommend checking out our demo of off-belief learning if you're interested. This is at bit.ly/OBL-demo, and you can actually play with these OBL bots. They are really interesting to play with; at least for me, it was quite fascinating to see how Hanabi looks without conventions, and then how conventions emerge gradually at the higher levels of the hierarchy.
So we will have a link to that in our show notes. But I'm really obsessed with this entanglement issue with deep learning. So does that mean, what you're saying made me think, that the policies that OBL comes up with do not entangle everything in the same way that traditional deep learning would, in the sense that it can actually point to one thing at a time?
Yeah, so OBL policies disentangle something very specific, which is the correlation between the past actions of other players and the state of the world, and what this prevents is secret protocols, emergent protocols, between the different agents in the same environment. And I think an interesting question now is, if we're using language models in this world for different agents, different interacting systems, how do we prevent them from doing clandestine message passing between each other? I think here something like off-belief learning could be used down the line to make sure that the messages that are being passed between language models are being used in the literal sense, as opposed to, you know, scheming a plot to take over the world. So you can imagine, if we have a lot of decentralized AI systems, we might want to make sure that they're not scheming in the background through their messages, but that actually, when they're saying we should increase GDP by five percent, they actually mean that, and they don't mean, I've realized there is a hack that we can use to bring down humanity.
That would be good. You brought up the prisoner's dilemma in your talk at NeurIPS, and I believe you're referring to your LOLA algorithm, which is actually from a few years back, but I think I just first encountered it at your talk at the NeurIPS 2022 Deep RL workshop, and I definitely recommend listeners check that out, as well as the other great lectures that Jakob has online. But I've always found the prisoner's dilemma very depressing. I've read that it was first analyzed by the RAND Corporation in the 50s, in the context of strategy for nuclear war, and it depicts this tragedy of the commons where it seems like the sensible thing to do is always to betray your counterpart, and you both suffer. But you had some really interesting results on this old game. Can you tell us about your results with the prisoner's dilemma and what you learned there?
Yeah. So, I mean, I think the prisoner's dilemma, yeah, it is a little frustrating, because obviously, if we're only ever playing one single prisoner's dilemma, a sensible agent should defect, and that is what makes it a tragedy of the commons.
But the good news is that humanity has mostly managed to turn single-shot games into iterated games. That means we're playing the prisoner's dilemma over and over again, often with the same partners, and often with transparency about what was done in the previous rounds and what the outcomes were. And that completely changes what type of outcomes are possible amongst rational agents, amongst self-interested agents.
So in particular, there was this tournament by Axelrod, in the 1980s I think it was, where he invited people to submit algorithms to play the prisoner's dilemma in this competition, and scientists spent many, many hours and many tens of thousands of lines of code coming up with complicated strategies. In the end, the strategy that won was a few lines of code, and it was tit for tat: I will cooperate on the first move, and then I will cooperate again if you cooperated with me on the last move, otherwise I'll defect. And this strategy was extremely successful in the tournament, and obviously, if you put tit for tat against tit for tat, you actually get mutual cooperation, because nobody wants to be punished.
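Since tit for tat really is only a few lines, here is a sketch of it in Python (my own rendering, not Axelrod's original submission):

    def tit_for_tat(my_history, opponent_history):
        # Cooperate (0) on the first move, then mirror the opponent's last move.
        if not opponent_history:
            return 0
        return opponent_history[-1]  # 0 = cooperate, 1 = defect

    # Two tit-for-tat players cooperate forever: neither ever gives the other
    # a defection to retaliate against.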
So the single-shot game is frustrating; the iterated game isn't frustrating, unless you do standard naive reinforcement learning, independent learning, of initializing some set of agents and training them together. Because what you'll find is that these agents invariably will learn to defect unconditionally, even in the iterated game. And that's obviously bad news, because if you imagine deploying these agents in the real world, where we have iterated games, we don't want agents to unconditionally defect. We would like to have agents that can actually account for the fact that other players are there and are learning, and realize that by reward and punishment they can shape them into cooperation.
And that was the key insight behind LOLA, where we don't take a gradient step towards increasing my current return assuming the other agent's policy is fixed, but we differentiate through the learning step of the opponent in the environment, anticipating that our actions will go into their training data. And I will never forget when we first implemented this method. This was during my internship at OpenAI: first implementation, first run, and we get this policy out that cooperates, but it doesn't cooperate blindly, it plays tit for tat. This moment is always going to be something that I remember in my research career, because it was a hard problem, we had come up with a theory of what is driving the failure of current methods, and we managed to fix it.
Now, LOLA has obvious issues. It's asymmetric, it assumes that the other agents are naive learners, it's myopic, it only shapes one time step, and it requires these higher-order derivatives. And ever since, in particular here in my group at Oxford, at FLAIR, we've done follow-up work to address these issues. And a paper that I really like out of that line of work is Model-Free Opponent Shaping, which I highly recommend to any of the listeners who are interested to take a look at.
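Roughly, the LOLA idea can be written as follows (a simplified sketch in my notation; see the paper for the exact formulation): instead of the naive gradient, which treats the opponent's parameters as fixed, agent 1 takes a gradient through the opponent's anticipated learning step.

    \text{naive:}\quad \theta_1 \leftarrow \theta_1 + \alpha \, \nabla_{\theta_1} V_1(\theta_1, \theta_2)

    \text{LOLA:}\quad \theta_1 \leftarrow \theta_1 + \alpha \, \nabla_{\theta_1} V_1\!\left(\theta_1,\; \theta_2 + \Delta\theta_2(\theta_1, \theta_2)\right),
    \qquad \Delta\theta_2 = \beta \, \nabla_{\theta_2} V_2(\theta_1, \theta_2)

Differentiating through \Delta\theta_2 is what lets agent 1 account for how its own behaviour shapes the opponent's update.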
Okay, so you used this phrase, opponent shaping. First of all, I want to say that the moment you shared about running LOLA and seeing the results, that's the kind of stuff we're here for on the show. That's what I think a lot of people love about machine learning and about deep RL, these magical results. So thank you for sharing that. What do you mean when you say opponent shaping? Can you say more about that concept?
Yeah. So this is maybe at the very, very core of what I'm currently excited about in multi-agent learning, and that is the coexistence of learning systems within a given environment. If you think about it, this fundamentally and radically changes what the relevant state of the environment is, because even in something as simple as the prisoner's dilemma, where on paper the state is the last action of the two agents, in reality, as soon as agents are learning, the state becomes augmented with the policy of the other player. Because if, after an episode, the opponent is going to learn from the data generated in the interaction with me before we interact again, then suddenly my actions will influence their learning process.
And if we're doing this naively, then we're forgetting the fact that we can actually shape the learning process. And this is ubiquitous: whenever we have learning systems that are interacting, I believe we should be considering the fact that there is a chance to shape the learning process, and that if we're not doing it, extremely undesired long-term outcomes, such as mutual defection, become quite likely. And this has been a focus of this line of work: how do we do machine learning when our decisions are influencing other learning systems in the long run? This all falls under the banner of opponent shaping, and I'm happy to talk more about any of the recent papers or methods in that space.
So you're talking about policies that are choosing actions based on how they will affect the other player's policy in future? Is that what you're getting at?
How they will affect the other player's learning step.
Okay, learning step.
Learning step. This is the crucial part, because how they affect the current policies, that's really done by RL, right? Reinforcement learning gives me time horizons: I can just play out samples until the end of the episode, and I see how my actions change and impact your future actions, given your policy. But the big thing is, if you and I are learning agents, then the trajectories generated will go into your learning algorithm. So imagine you have Waymo, you have Tesla; these cars are on the road, they generate data, the training data goes to the training center, and suddenly, tomorrow, Tesla is going to drive slightly differently because it has learned from the interaction scenario with Waymo.
And what this means is, if you think this through, suddenly we have to consider the fact that if self-driving cars don't honk, or are too passive, they will encourage other cars or other participants on the road to take the right of way. So for example, as a cyclist, when I was living in San Francisco, I knew that Waymo cars are extremely passive, therefore I can be more aggressive with them. On the other hand, if Waymo was accounting for the fact that I'm a learning agent, they would naturally honk, they would have to be slightly more aggressive, to prevent being bullied into this type of very passive situation where they end up blocking the roads of the city and needing to be taken off the roads.
So this is the crucial difference between an agent influencing the future action choices within the episode, and the consequences my actions are going to cause by going into your training data. And this happens the moment that we have interacting learning systems. So this is the future: the future is language models are everywhere, we're generating tons of data in that interaction, and that data will be used to train more AI systems. So when people are asking what this is good for, well, it turns out that, if we look under the hood, every deployed machine learning system becomes a multi-agent learning problem, because these language models exist in the same environment with humans and other language models, and systems will be trained on the data that they generate.
So you had one paper, I think you mentioned it, Model-Free Opponent Shaping, which was first-authored by Chris Lu with yourself as a co-author, and in that paper you talk about this basically as a meta-game. Is this meta-learning, and do you consider this a meta-RL problem? How do you frame things in terms of meta-RL here?
Yeah. So just to give some more background for the listeners: what we're doing is we're defining a meta-game, whereby the state is augmented with the policy of the other player, each time step consists of an entire episode, and my action is to choose a policy for the next episode. And why is the policy of the other player part of the meta-state? Because, based on my and your last policies, the learning algorithm in the other agent is going to induce a state transition, a new policy that comes out of the learning process. So this is meta-reinforcement learning in a specific setting.
And more interestingly, we get to meta self-play, which is when we combine two M-FOS agents that learn to shape another shaper. So shaping is, again, opponent shaping here, right? You have a learning agent, something like a PPO agent, that's maximizing its own return, doing essentially independent learning in the prisoner's dilemma, and then we're meta-learning another PPO agent that can learn to optimally influence the learning dynamics of this naive learner, to maximize the returns of the shaper. And then, as the next step, we can now train two meta-agents that learn to optimally influence each other's learning process in this iterated game.
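A minimal sketch of the meta-game loop described above (my own illustration; helpers like inner_episode_return and naive_update are assumptions, not the paper's API):

    # Meta-game for opponent shaping, in the spirit of what is described here (illustrative only).
    # Meta-state: the opponent's current policy parameters.
    # Meta-action: the policy we commit to for the next inner episode.
    # Meta-transition: the opponent's own learning update on the data that episode generates.

    def meta_rollout(meta_policy, opponent_params, naive_update, inner_episode_return, T=100):
        total = 0.0
        for t in range(T):
            my_policy = meta_policy(opponent_params)              # choose a policy for this episode
            ret, episode_data = inner_episode_return(my_policy, opponent_params)
            total += ret                                          # shaper's return this episode
            opponent_params = naive_update(opponent_params, episode_data)  # opponent learns from us
        return total  # maximizing this over meta_policy shapes the opponent's learning, not just its actions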
Okay, thanks for doing that. We had Jacob Beck and Risto Vuorio on recently, presenting their meta-RL survey paper, and so I wanted to hear how you relate this to meta-RL. So, meta self-play, that is a phrase that I have not heard before. Is that a new thing? Are you coining a phrase here in your line of work, or is that an established idea? I haven't come across it before.
I mean, I don't think we've made a huge claim to novelty, but I do think it is new, and it's nice that it addresses some of the issues with M-FOS. Maybe just one more difference between standard meta-RL that people talk about and what we do in M-FOS is that we are truly model-free, and we really have a meta-state. So often what's missing in meta-RL is the meta-state, and in our case we have the policy of the other agent, which is that meta-state that is commonly missing in meta-RL, where you then get into the question of how to estimate gradients through finite time, through short unrolls with minimal bias, or how to differentiate through long unrolls of trajectories. Here we've got rid of all of these issues by being able to actually learn in the meta-state quickly.
Back to meta self-play: this is, I believe, new, I haven't come across it before, and it allows us to have shapers that are consistent. What I mean by that is, I don't have to model the other side as a naive learner like in LOLA. I can instead have a shaping policy that is cognizant of the fact that the other agent is also shaping at the same time. It's almost like that infinite theory of mind, where I model you modeling me, and so on to infinity, which is really hard to wrap your head around, but works fine in practice.
And what it means is that M-FOS doesn't just extort naive learners. Yes, it would do so: if you have a population of naive learners and you introduce M-FOS, it will exploit these naive learners and push them into cooperating with it even though it's defecting. But that would actually incentivize others to use M-FOS, and if you have a population of M-FOS agents, these model-free opponent shapers, then in meta self-play they would actually end up cooperating, because they would stabilize each other's cooperation, they would mutually shape each other into cooperation, which is a nice finding. Now, this is an empirical finding so far; we do not have theoretical results, which I think is a really nice frontier of our work.
Just to see if I understand this correctly: are you saying that an M-FOS agent could play another M-FOS agent, and it would account for the fact that that other agent is trying to shape it? Is that right?
Correct, yes. So the way that meta self-play works is that, in the end, we get these agents that are fully shaping-aware: they're approximately optimally shaping another shaper, right? So there is this recursion twist, in a sense. But because we have a training process where we can anneal the probability of playing with a naive learner versus with this M-FOS agent, that actually gives us a specific equilibrium, and that ends up being a shaping-aware equilibrium that stabilizes cooperation between these M-FOS agents.
So it would be like, in the case of a negotiation, we'd have to think, oh, why are they taking action X? Are they taking it because they think that that's going to make me think... down that whole rabbit hole. And, I mean, you talked about how that is recursive; the theory of mind could be any number of levels, and I guess people generally just cut it off at some point for practicality. But are you saying there's something deeper going on here, where you don't need any more levels, and you kind of cover all the levels? Can you say a little bit more about that? Because that seems kind of amazing.
Yeah. So we're effectively trying to find an approximate equilibrium of this meta self-play. Now, again, we don't have theoretical guarantees, but empirically we have good reasons to believe that we're close to a Nash equilibrium of meta self-play. And obviously, once you are at a Nash equilibrium of that meta-game, then this is equivalent to having infinite awareness of the other agent's policy, right? So this is a nice result: if you can solve for the equilibrium of the meta-game, then you're fully shaping-aware, and you're no longer affected by these issues of k-level theory of mind, or higher levels of LOLA for example, because we can just solve for the fixed point.
Okay, let's move on to some other recent papers you have on communication. There was one, Adversarial Cheap Talk, and cheap talk has been a theme that's come up in your work before. Can you tell us about that?
Yeah. So basically, what I liked as a question here is: is there a really counterintuitive, minimal setting where we can still influence the learning dynamics of another learning agent? And what we came up with is so constrained that it was mind-boggling to me that this is indeed still possible. So the rules of the game are: the learning agent, say the victim agent, observes the true state of the environment, s, and all the adversary can control is bits that are appended to that true state. So you can think, for example, about a setting where you have some noise features in the data set that nobody cares about, but also nobody checks. For example, orders far away from the mid price in the order book, which are cheap to do, or data on Reddit, or whatever else somebody might be able to manipulate, but which you don't think you need to care about if it's in your training data, because we throw everything in there.
And what we found is that we can meta-learn an adversary strategy that can not just disrupt the learning process of the victim agent; it can also vastly accelerate that learning process, and it can actually introduce a backdoor during the training of the victim agent, whereby it can at test time remote-control that agent to carry out a completely different policy. So, you know, in the extreme case, you could imagine, if I do this in a stock market environment, I might be able to use orders far away from the mid price in the order book to influence the training process of other participants, to then make them maximize my financial return rather than their financial return. Obviously we would never do such a thing, but it highlights the potential ability of agents to backdoor training processes, and I think it's quite interesting to ask how this can be defended against in real-world settings.
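A minimal sketch of the setup as described (my own illustration; the environment and the bit-appending interface are assumptions, not the paper's code):

    import numpy as np

    def victim_observation(true_state, adversary_bits):
        # The victim sees the true state with a few adversary-controlled bits appended.
        # The bits never affect rewards or dynamics, so they are pure "cheap talk",
        # yet with function approximation they can still steer the victim's learning.
        return np.concatenate([true_state, adversary_bits])

    true_state = np.array([0.3, -1.2, 0.7])
    adversary_bits = np.array([1.0, 0.0])  # chosen by a meta-learned adversary policy
    obs = victim_observation(true_state, adversary_bits)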
Cool, that actually sounds really interesting and kind of important, so I'm a little surprised that paper wasn't picked up more. But I encourage people to check that out, as well as your previous paper on cheap talk discovery and utilization in multi-agent reinforcement learning. I guess by cheap talk you mean those extra bits that you'd think would be ignored, but they're not. Is that what that means?
Correct. So we're just saying there are bits that don't influence the dynamics. Cheap talk generally means bits that can be modified by one agent and observed by another, but that do not influence rewards or environment dynamics. Because we know that I can shape the learning process of another agent if I interact in the environment, for example in the prisoner's dilemma or other multi-agent learning problems. But what if I cannot actually change environment dynamics or payouts? Because paying people is expensive, right, I don't want to pay people, and messing with the environment would be expensive because I have to change the world. What if I can't do any of this, and all I can do is set random bits that nobody cares about, but maybe we should care about? Can I still influence the learning process?
Now, this comes back to deep learning often having this unwanted entangling between things, I think, right? It seems similar to what we discussed earlier.
Exactly. This is exactly a feature of the fact that we use function approximators. In fact, we also prove that if you use a tabular policy, in the limit of enough samples, you cannot be adversarially attacked like this.
Zooming out a little bit: the topics we've been discussing, multi-agent systems, communication, cooperation, competition, all of these have of course been major features of life and society, social life, from the beginning, in humans and other animals, and I guess even in the plant kingdom too. Or every kingdom, I guess, all of life. But does your work tell us something about these aspects of life in more general terms, outside of machine learning? Do you think that the ideas behind some of these algorithms could give us insight into tragedy of the commons scenarios that we face in real life?
I think it's a really good question. I mean, I can speak for myself: certainly, having worked on this range of problems for some time now, that has certainly changed how I think about interactions with humans, how I think about conflict, how I think about alignment between different humans. So for example, I think it's quite easy to underappreciate what a precious gift common knowledge is and how hard it is to obtain. This was a foreign concept to me until I started working on multi-agent learning, and I've ever since actually used it in group situations where teams haven't been working well together: to be very explicit about establishing some proximity of common knowledge, making sure that people are on the same page, because it's really hard to coordinate if it's unknown what other people know. What that means, for example, practically for FLAIR, is that the group puts a heavy emphasis on having meetings where people are in the same room, where groups of people understand what we're doing and why we're doing it. And I think these insights are really important from a coordination point of view. So, you know, I started out with my human intuition about coordination, I went to work with machines, I tried to develop machine learning algorithms that allow these machines to coordinate, but then immediately those insights and the tools that come up within there have helped me both work with others and understand conflict.
The same holds for opponent shaping. So I think really emphasizing that humans are learning, and that there is a natural tendency for trial and error and understanding, but also that, you know, the feedback we provide will help others develop, right, and being very clear about what the goals are, creating alignment around this. Again, these are all ideas that have come back from my work on multi-agent learning and that have really helped me deal with coordination problems or incentive alignment problems in my research work, but also in my personal life. And I think those ideas will probably also have use cases in other areas of science.
I think at some point I'd like to get to a point where we can really understand how these algorithms could be discovered by an evolutionary process, right? Because we had to discover these things mathematically or through intuition, but it would be great to at some point see that these types of reasoning abilities, and culture and rules and so on, can really emerge out of an evolutionary process, like it happened for humans. I think then we will be much closer to truly understanding the genesis of these abilities, in terms of theory of mind and multi-agent reasoning.
So is there other work in machine learning and reinforcement learning that you find interesting, outside of what's happening at your lab? What kind of things do you find fascinating these days?
Oh, I mean, I obviously find GPT-4 to be absolutely mind-boggling. You know, as I said, I think Ilya was right and I was wrong. We had these conversations years ago, and he said, have you tried using a bigger LSTM? And I said, well, you know, I'm computing exact gradients and this is like an infinite batch size. Ultimately he was right that this was a faster way of getting to approximately intelligent systems than relying on emergent capabilities. So that's amazing. And in that context, I think RL from human feedback, everyone's talking about it, I'd love to understand that space better, and to think about how we can learn better algorithms for it, what the limitations are, and so on. Putting language models together to have a conversation, I think that's cool. It's unclear what we're solving right now, but I think trying to make that into a really scientific approach, which I think works well in terms of trying to merge it with agent-based modeling and so on, is cool. And lastly, I talked about it before: unsupervised environment design, open-endedness, having algorithms that can utilize ever larger-scale compute to really discover new things in the broader sense. I think that's a fascinating area, because ultimately the core hypothesis I have been operating under is that data is finite but compute will be near infinite, and the big question is how we can replace the need for data with simulation and methods that improve themselves in simulation. And this is, again, going back to one of the blind spots, right: if we're thinking this through to the end, the current hype train, what is the final stop of this hype train, and what happens next?
So I heard some of your comments on working in academia versus industry, and I think that was at your 2022 Deep RL workshop talk, and our past guest Taylor Killian asked about that on Twitter. He said: ask him how he really feels about industry labs fronting as academic departments, wink. So, any comment on that, Jakob?
Well, if you want to know how I really feel, find me at a conference and let's have a chat. No, but I will give you the version that I think will be interesting to the listeners, hopefully, which is that it's kind of easy to forget that everything we're seeing right now, the revolution of deep learning, large models and so on, is seeded by academic research, and more importantly, that this is a branch of academic research, deep learning, that was vastly unpopular for decades. Okay? This is really easy to forget. It's easy to see the large models coming out now from companies and to think that innovation is happening at these large company labs, when the reality is that, almost by definition, the groundbreaking long-term innovation comes out of academia, has always come out of academia, and, I want to say, will always come out of academia.
And the reason is simple: the reason is the cost of exploration versus exploitation. What industry labs are really, really good at is throwing large amounts of money at relatively safe projects that are going to yield a return one way or another, and I include in return Nature papers and Science papers. What they cannot do in the long term is open, exploratory work, and that is because the investments don't make sense with the incentives of the institutions. So the time scale on this is different. Obviously, we'll see, as people squeeze more and more juice out of current methods, there are going to be more flashy, interesting results. But I'm sure that we'll see huge breakthroughs that give us orders of magnitude of improvement in efficiency and in understanding, and that will come out of academia; these operate on a very different time scale. It's important, if we're now in this field as PhD students, as professors, as academics, to take a step back, to zoom out, and to see the big picture, which is that breakthrough innovations on the large time scale, not the squeezing of current methods, have come out of academia and will be coming out of academia. We're not looking for the one or two percent improvement, or the ten percent; we're looking for orders of magnitude, zero-to-one changes, by having fundamentally novel approaches. That's a beautiful thing.
So is there anything else that we should have covered today that I didn't mention?
Yeah, just a piece of advice for anyone who's thinking about joining the field at the moment: get off Twitter. I think this is one of the huge advantages I had as a PhD student: I didn't have Twitter until I had to promote my Learning to Communicate paper. And it's really important to not be blindsided by looking at the same problems and results and approaches as everybody else, but instead to understand hard problems deeply and try and solve them. That's a really fascinating and rewarding exercise, and it gets us out of this competition of trying to do the next obvious paper fast. That's not what science is about, at least in my understanding, which is, I guess, almost romantic. It's about getting to the bottom of hard problems and then addressing them. And even rediscovering solutions is fine, because we'll get at something from a different angle, which will get us somewhere else. And it's amazing how many open problems there are everywhere once we stop just looking at the same things everyone's looking at on Twitter. So that's my one note of caution: stop using Twitter, even if that means I'm going to lose followers.
Okay, but do come to the TalkRL podcast Twitter, because that's where we're going to post this interview, guys.

Jakob, this has been a real treat. Thank you so much for joining us today and for sharing your insight with our TalkRL audience. Thank you, Jakob Foerster.
Well, thanks so much for having me. It's been great talking to you.
