Stefano Albrecht on Multi-Agent RL @ RLDM 2025

Speaker 1:

Talk RL.

Speaker 2:

Talk RL podcast is all reinforcement learning, all the time. Featuring brilliant guests, both research and applied. Join the conversation on Twitter at @TalkRLPodcast. I'm your host, Robin Chauhan. Today, I'm very glad to be joined by Professor Stefano Albrecht.

Speaker 2:

He was previously an associate professor at the University of Edinburgh. He's currently serving as director of AI at the startup Deepflow. He's the program chair of the RLDM conference, and co-author of the MIT Press textbook Multi-Agent Reinforcement Learning: Foundations and Modern Approaches. Welcome, Professor Albrecht.

Speaker 3:

Thank you very much for having me.

Speaker 2:

So how do you like to describe your focus area?

Speaker 1:

In one line, I focus on developing machine learning algorithms for autonomous systems control and decision-making, and also multi-agent interaction.

Speaker 2:

Can you talk a little bit about the evolution of multi-agent RL over the past few years, at a high level? What, from your point of view, have been some of the most important recent advances in that field?

Speaker 1:

Multi-agent reinforcement learning has been around in the AI field since at least the early 1990s, where you can trace some of the early papers. But it received a huge amount of renewed attention in the mid-2010s, when people brought in deep learning techniques and integrated them with multi-agent reinforcement learning algorithms. And then there was a huge resurgence of the field, many people came into it, and it grew very much. But you can trace it back even before the AI and machine learning type of research, to areas such as game theory, where in the early 1950s people already considered this idea of having multiple agents or players learn together in a game they were playing together. And one of the most commonly cited algorithms here is fictitious play, which I think was from 1951.

Speaker 1:

That's sort of the canonical example, one of the very first multi-agent learning algorithms. When deep learning came into multi-agent reinforcement learning, some really interesting and important new ideas came out of this work. One of the most influential ideas, which was then developed fairly rigorously, is centralized training with decentralized execution. The idea is that during the training process, we are operating in conditions where the agents can share a lot of information, such as their observations; they can communicate everything they know and see about the world. And typically this would be the case if we train in simulation.

Speaker 1:

Right? Because we control the entire environment, and agents can share all kinds of information. So this simplifies the training process. We can deal with issues like non-stationarity in a much more effective way, and the agents can share a lot of information which helps them coordinate their actions. But then the other part of this acronym, DE, stands for decentralized execution.

Speaker 1:

And here the idea is that we want to train policies that, after the training is done, can still be executed in a fully decentralized way. So we have the centralized training, we share all this information to improve the training process, but the policies that we train during this process can still be executed in a decentralized way. And that's important because these agents will typically be localized entities that exist in a world where they can only see their own surroundings and they have their own information. They may be physically separate entities as well, and they may not be able to share this kind of information anymore after the training process.
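To make the CTDE idea concrete, here is a minimal, illustrative actor-critic sketch in PyTorch. This is not code from the interview or the textbook; all network sizes and names are assumptions. Decentralized actors condition only on local observations, while a centralized critic uses the joint observation, which is only needed during training:

```python
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions = 3, 8, 4

# Decentralized actors: one small policy network per agent, each seeing only
# its own local observation.
actors = nn.ModuleList([
    nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
    for _ in range(n_agents)
])

# Centralized critic: sees the concatenated joint observation, which is
# available during (simulated) training but not at deployment.
critic = nn.Sequential(nn.Linear(n_agents * obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def decentralized_act(local_obs):
    # Execution phase: each agent samples an action from its own policy only.
    dists = [torch.distributions.Categorical(logits=actor(o))
             for actor, o in zip(actors, local_obs)]
    actions = [d.sample() for d in dists]
    return actions, dists

def centralized_training_loss(local_obs, shared_reward):
    # Training phase: the centralized critic provides a value baseline computed
    # from the joint observation; the actors remain decentralized.
    actions, dists = decentralized_act(local_obs)
    value = critic(torch.cat(local_obs))
    advantage = shared_reward - value.detach()
    policy_loss = -sum(d.log_prob(a) for d, a in zip(dists, actions)) * advantage
    value_loss = (shared_reward - value).pow(2)
    return (policy_loss + value_loss).squeeze()

local_obs = [torch.randn(obs_dim) for _ in range(n_agents)]
loss = centralized_training_loss(local_obs, shared_reward=torch.tensor(1.0))
loss.backward()  # gradients reach both the actors and the centralized critic
```

At deployment time only `decentralized_act` is needed, so the agents no longer have to share observations with each other.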

Speaker 1:

So within this category of centralized training and decentralized execution, quite a few methods have been developed. For example, there are now a lot of methods based on this idea of a multi-agent policy gradient. Policy gradient already existed in single-agent reinforcement learning, where it was one of the most influential ideas, and this was basically extended into the multi-agent space, where we now have policy gradient algorithms that can share information such as a value function that gives estimates of values for states, for example, or shared observation spaces. Many other methods that have been developed were very influential subsequently. For example, self-play: the idea that a policy is trained against a copy of itself or against copies of previous versions of itself, and you scale in that way.

Speaker 1:

You become increasingly competent at the task you're doing, you begin to understand the limitations of your own policy, and then you close those gaps. This was one of the influential techniques used in the AlphaGo and AlphaZero works, where they played Go and chess and shogi. Value decomposition is another important type of technique that was developed in more recent years when deep learning came into the field: the idea that agents can actually understand their own contributions to a shared reward signal, and then use these estimated decomposed value functions in order to make optimal local decisions. So this goes back to one of the core problems in multi-agent reinforcement learning, which is multi-agent credit assignment.
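As a rough illustration of the value-decomposition idea just described, here is a VDN-style additive decomposition sketch, assuming discrete actions and a shared team reward; all names and sizes are placeholders, not taken from the interview:

```python
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions, gamma = 3, 8, 4, 0.99

# One utility network per agent: Q_i(o_i, a_i), defined over that agent's own actions.
utilities = nn.ModuleList([
    nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
    for _ in range(n_agents)
])

def joint_q(local_obs, actions):
    # The team value is decomposed as a sum of per-agent utilities, which makes
    # each agent's contribution to the shared reward signal explicit.
    return sum(utilities[i](local_obs[i])[actions[i]] for i in range(n_agents))

def greedy_actions(local_obs):
    # Decentralized execution: each agent greedily maximizes its own utility, and
    # because the decomposition is additive, these local choices also maximize
    # the joint value.
    return [int(utilities[i](local_obs[i]).argmax()) for i in range(n_agents)]

def td_loss(local_obs, actions, team_reward, next_local_obs):
    # The decomposed utilities are trained end-to-end from the single shared
    # team reward via a standard temporal-difference target.
    with torch.no_grad():
        target = team_reward + gamma * joint_q(next_local_obs, greedy_actions(next_local_obs))
    return (joint_q(local_obs, actions) - target).pow(2)
```

More expressive decompositions follow the same pattern but replace the plain sum with a learned mixing function.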

Speaker 1:

The idea is that all the agents need to understand the impact of their own actions on the performance of the whole team. Partial observability is another big aspect of multi-agent reinforcement learning: the agents are separate and different, and they only observe their own localized, incomplete information about the environment, so they need to be able to coordinate despite the fact that they may see different things about the environment, and they may have to share information in certain ways to facilitate that. One or two more things I can mention here. One is population-based training, which is basically an extension of the self-play I mentioned earlier, but now you have entire populations that are trained, and you're training policies against distributions over entire populations of policies.

Speaker 1:

And this was very effective in, for example, StarCraft II when they used it in AlphaStar, one of the methods used there, where they reached human expert play level. So very interesting ideas here. One more thing to mention is communication learning, which of course is a very exciting aspect of multi-agent reinforcement learning: the idea that the agents can learn a shared language or a shared communication protocol. So you're exposing the agents to a particular task they have to solve, and in order to achieve this task, they may come up with a very specialized language or communication protocol that facilitates coordination in a very effective way.

Speaker 1:

So this language may not at all be natural language like human language. It may be a very narrow, specialized language that is highly specific to the particular task they're executing. And it's very interesting, this idea that they can come up with their own language, basically, to describe what they're up to and how they want to coordinate their actions with one another. So I think that's a quick tour of the techniques that have come out, and all of this is still highly active research.
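To give a flavor of what learned communication can look like, here is a toy differentiable-communication sketch, in the spirit of approaches like DIAL but not any specific published method; all dimensions are illustrative. During centralized training, the task loss backpropagates through the message, so the "language" the speaker emits is shaped entirely by what helps the listener act well:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, msg_dim, n_actions = 8, 4, 3

# Speaker agent: maps its private observation to a continuous message vector.
speaker = nn.Sequential(nn.Linear(obs_dim, 16), nn.ReLU(), nn.Linear(16, msg_dim))
# Listener agent: chooses an action from its own observation plus the received message.
listener = nn.Sequential(nn.Linear(obs_dim + msg_dim, 16), nn.ReLU(),
                         nn.Linear(16, n_actions))

obs_speaker, obs_listener = torch.randn(obs_dim), torch.randn(obs_dim)
message = torch.tanh(speaker(obs_speaker))             # the learned "utterance"
logits = listener(torch.cat([obs_listener, message]))  # action conditioned on the message

# Placeholder task loss: pretend action 0 was the correct coordinated choice.
task_loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([0]))
task_loss.backward()  # gradients flow through `message` back into the speaker
```

Because the protocol is shaped only by task success, nothing constrains it to be human-interpretable, which relates directly to the practicality question that comes next.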

Speaker 2:

So in terms of these learned inter-agent communication protocols, which we see some research about: can you talk about whether they are practical to use today, or are they more of theoretical interest? Are hand-designed communication protocols still more practical at this point in practice?

Speaker 1:

I think this is mostly a theoretical idea at this point. It's a beautiful idea if you think about it. People get excited about this; clearly it's a really cool idea, but it's not easy to do. There are a number of reasons why agents may struggle to come up with a shared language.

Speaker 1:

There are some well-known problems, such as the fact that the environment becomes non-stationary when multiple agents concurrently learn and change over time, and they need to track each other and adapt to one another over time. That makes it very complicated to learn in a stable way. And in particular, if you want to agree on a shared language or shared norms and conventions, it's difficult to do that if the environment changes all the time. But in addition, another practical reason is that for many applications we want to build, we still want to understand what's going on. If we allow the agents to come up with their own language, we may not understand anymore what kind of information they actually end up exchanging with one another.

Speaker 1:

And so I think there's a need to still have communication protocols and languages that we can actually trace and understand, where we can build in things like privacy protection and information security, and make sure we still understand what's going on. Often for particular applications, we also have strong priors in terms of what we think the agents should be communicating with one another. For example, if agents operate in a shared commercial warehouse, there are a few things we already know they probably want to communicate, like where they are going to go in the warehouse, what their current location is, and what their current assigned task is. So there are only so many things they probably want to communicate, and if we have a strong prior like that, it probably makes sense to engineer that into the communication protocol already.

Speaker 2:

Okay. So if we imagine this kind of sci-fi vision of giant swarms of tiny robots, which people talk about sometimes, communicating and collaborating on some difficult task: we've seen, I guess, some examples getting towards that, but could you describe to us what you think are the main bottlenecks today for achieving that vision?

Speaker 2:

Why is that still very challenging today? Do we need better algorithms? Is communication bandwidth the difficulty, or is it really the hardware, or something else, or maybe many of these? What are the constraints to getting to this very sci-fi vision of massive swarms collaborating?

Speaker 1:

Historically, and even still to this day, research in swarms and swarm robotics has been mostly separate from the mainstream research in multi-agent reinforcement learning. People in robotics who work in swarm systems typically look at machines that are very simple in their makeup, with very low compute resources and low battery usage.

Speaker 1:

They wouldn't be able to pull off compute-intense policy inference computations. So they're already constrained by this idea that they have low resources to work with, basically. Whereas in multi-agent reinforcement learning, we're interested in training policies that do interesting, complicated things that require, you know, deep learning inference processes that are maybe not so easy to pull off in low-resource environments. Typically, when we do swarms, we're talking about hundreds of robots; that's the aspiration in terms of the scaling.

Speaker 1:

Whereas in mainstream multi-agent reinforcement learning, most of the work you see tends to be somewhere between two and ten or twenty agents, but those agents are supposed to be more capable in terms of what they're doing in the task. So I think that's one of the important differences here, and one of the reasons why multi-agent reinforcement learning has not had as big of an impact on swarm robotics as you might expect to see, at least not to my knowledge. I think the kinds of control algorithms that end up being used in swarm robotics tend to be much more simplistic: for example, finite state machines and other hardcoded heuristics.

Speaker 1:

In addition to that, I already mentioned some issues with multi-agent reinforcement learning. One of the core challenges is this idea of non-stationarity, which is the fact that every agent is changing, and that means everyone has to track each other all the time and adapt to one another. Now you can imagine that if you scale it up to hundreds of agents that are all changing constantly, it really makes it even worse. The scaling makes this non-stationarity aspect even more prominent, in a way. And the question is how to deal with that.

Speaker 1:

In addition, there are other core challenges that define multi-agent reinforcement learning. Another one is referred to as multi-agent credit assignment. This is the idea that agents need to figure out the impact of their own actions on the performance of the team. And if you scale it up to many hundreds of agents, that problem becomes even more complicated, because you now have to reference against so many more other agents and understand whose actions were actually important at what point in time.

Speaker 1:

So from my perspective, I think those are all reasons why a lot of the research in multi-agent reinforcement learning has not found a direct path into the more robotics-style swarm robotics research.

Speaker 2:

Your work also involves AI-human collaboration. Can you talk about some of the ways to approach that interaction?

Speaker 1:

Yeah, so that's a very interesting part of the work. One of the most important things I like to think about in multi-agent interaction in general, but in particular when humans are also part of the picture, is this idea that we need to model the beliefs and the intent of these other agents, and humans in particular. So we want to understand what they currently think the state of the world is. What are their beliefs about the world?

Speaker 1:

What information do they have about the world in terms of how they think the world works? And also, what do they want to achieve in the world? You know, based on the things they have been doing so far, what do they actually want to achieve longer term? Both of these types of inference are very important for a machine to understand how to interact with a human or with these other agents. Right?

Speaker 1:

So if I have a better understanding of what you might think the world is like and what you might want to achieve in this environment, it puts me in a position to better adjust my own actions to help you achieve that task, or to coordinate with you. So goal recognition is an important part of the picture here. We've done a lot of work on goal recognition algorithms. One particular technique I can tell you about briefly is this idea of rational inverse planning. We developed this in the context of self-driving technologies.

Speaker 1:

The idea is that a machine, in this case a self-driving vehicle, is able to understand your longer-term goals, such as whether you want to turn left or right at the junction in front of us, based on your past actions. And the way we do this in rational inverse planning is basically by putting ourselves into the perspective of this other agent whose goals we want to recognize, and asking: given what this other vehicle has done so far, how close to optimal are those actions if we assume it wants to achieve a particular goal longer term, such as turning left or turning right? There are a number of ways you can do this. We developed one where we basically use A*, which is a kind of search algorithm, to complete planning trajectories from where the vehicle currently is, combine that with the previous observations of the vehicle's past driving trajectory, and then pass it through a Bayesian inference process. And so this type of reasoning is really quite robust.
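As a simplified, illustrative sketch of this kind of Bayesian goal recognition (not the exact method from the published work; the cost function, the rationality coefficient `beta`, and all names here are assumptions): the posterior over candidate goals weights each goal by how near-optimal the observed behavior would be if the agent were pursuing it.

```python
import math

def goal_posterior(observed_cost, start_state, current_state,
                   candidate_goals, plan_cost, prior=None, beta=1.0):
    """
    observed_cost: cost of the trajectory observed so far
    plan_cost(a, b): cost of an optimal plan from a to b (e.g. computed with A*)
    Goals for which the observed behavior looks near-optimal receive higher
    posterior probability (a Boltzmann-rational likelihood).
    """
    scores = {}
    for g in candidate_goals:
        best = plan_cost(start_state, g)                          # best possible cost to g
        via_observed = observed_cost + plan_cost(current_state, g)  # keep going to g from here
        likelihood = math.exp(-beta * (via_observed - best))      # penalize detours
        scores[g] = likelihood * (prior[g] if prior else 1.0 / len(candidate_goals))
    z = sum(scores.values())
    return {g: s / z for g, s in scores.items()}

# Toy usage on a grid with a Manhattan-distance "planner": the vehicle started at
# (0, 0), has driven to (-1, 2) at cost 3, and the candidate goals are a left turn
# at (-2, 5) and a right turn at (2, 5). Drifting left makes "left" more probable.
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
print(goal_posterior(observed_cost=3, start_state=(0, 0), current_state=(-1, 2),
                     candidate_goals=[(-2, 5), (2, 5)], plan_cost=manhattan))
```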

Speaker 1:

You can still make it work even if you have only limited observation of the environment or if there are missing parts of your observed trajectories; we experimented with this a lot, and it turns out to be a really robust approach. And I would even guess that humans probably do a version of this rational inverse planning as well. You know, if I'm trying to understand your behavior and make sense of it, trying to rationalize what you've been doing, I probably want to match up your observed behavior against a particular goal you may want to achieve. And the way I do this is by thinking about whether or not what you've done would make sense if you wanted to achieve that goal. Right?

Speaker 1:

So if it doesn't make sense for a particular goal, you probably don't want to achieve that particular goal. And I'm thinking humans might also have these kinds of processes. The other thing that's also important when you deal with humans, much more so than when you deal with artificial agents, is that it's well understood that human decision-making is affected by a number of hard-coded decision-making heuristics, and these things have been studied well in areas like behavioral economics and psychology and other fields. So various heuristics have evolved over a long period of time that bias us towards making certain decisions, which may not look like rational decisions. This is a problem for AI techniques, which often assume a certain sense of rationality in other actors.

Speaker 1:

But if the other agents, the humans for example, are not entirely rational, this can make things a bit more complicated when we try to make sense of their actions. So baking these kinds of priors and inductive biases into your reasoning processes, when you make sense of someone else's actions and decide how you want to interact with them, is also an important part of the whole machinery here.

Speaker 2:

Okay. Let's move on to your startup, Deepflow, where you are director of AI. And for listeners, this is deepflow.com; that's how it shows up in a few different places. Can you tell us more about Deepflow? What's the elevator pitch?

Speaker 1:

Yeah. So it's an AI startup based in London. It's still a new startup, but it's rapidly growing. We're trying to build a system where, in a nutshell, you can tell the system about a project or a business you want to build. For example: I want to create a company for self-driving vehicles.

Speaker 1:

The system will take that prompt, ask you a few more questions about your resources, your experience, and what you want to achieve, and then it will go ahead and produce an entire plan to realize that vision. It will create the various tasks that need to be executed and the interdependencies between the tasks in terms of inputs, outputs, and requirements. It will be able to put autonomous agents on some of those tasks in order to complete them, and it can also put human workers on some of those tasks. And in the end, it's all about coordinating these AI agents and human workers as they begin to work through these workflows.

Speaker 1:

In addition, sometimes some of the tasks the system created may be a little bit too complex. So you can ask the system to break a task down into a series of subtasks, and it will do that for you, breaking it down to make it more feasible. And then, again, you can assign AI workers or human workers to those various subtasks. You can imagine building up layers of complexity in your business model like this, which you can break down successively into more feasible chunks. So this is the big vision here, and we've made a huge amount of progress in this space.

Speaker 1:

We already have a number of clients we're working with. One of the interesting projects we're currently doing is with the Center for Entrepreneurship in England. This is a charity that wants to help people who want to do startups and businesses, getting them off the ground, basically, providing them with the systems and the training to achieve their business goals. And we've teamed up with them to provide a support mechanism based on our platform to their trainees. So they can use our Deepflow platform and basically learn from it how to plan out their business and how to break it down into tasks.

Speaker 1:

And our platform will actually teach those aspiring business people how to create such a business plan effectively and how to execute it, and it will help as much as possible. So there's also this conversational and teaching aspect. You can actually talk with the system and ask questions, and the system will explain certain processes and help you get to your goals. So this is one of the exciting projects we're currently doing, and it's scaling up to thousands of users.

Speaker 1:

We have a number of other use cases coming up as well, and we're very excited for you to see them when these things get deployed soon.

Speaker 2:

So you talked about startups and business plans. Is this more targeted at brand new companies that don't exist yet, or is it also targeting existing companies?

Speaker 1:

Yeah, everything. The canonical example is if you start from scratch and you might not have a clue about what you're doing. And so you can work with the system.

Speaker 1:

The system builds up a whole plan for you. It shows you the steps, it can already execute some of the tasks to some extent, and it will help you orchestrate the whole workflow and the workers in your team. But even if you have experience, you can still benefit from the system, because it has a really strong backing: it is using state-of-the-art LLMs, which have a lot of knowledge about how the world works, and they are surprisingly effective at building up business plans and understanding the interdependencies between tasks. They can already complete a bunch of tasks for you.

Speaker 1:

For example, marketing research, understanding requirements for a particular product you want to develop; even coding has now become really feasible with the recent iterations of large-scale LLMs. It's very exciting to see the degree of automation that we can achieve. One of the next big things we're working on right now is this vision that we want to automate basically the entire process, including the management of this workflow. So the idea is that we also have automation where the system takes care of the orchestration and the coordination between the worker agents. It will monitor the progress of the individual workers.

Speaker 1:

It will prompt the workers in case there are bottlenecks or delays. It will help the individual workers eliminate those delays, continue to break down the task, continue to assign and reassign the workers, or reallocate resources. And we're trying to really push to an extreme the degree of automation we can achieve using recent multi-agent LLM approaches.

Speaker 2:

Okay. So clearly leveraging LLMs. You mentioned multi-agent. Is there a multi-agent RL aspect here? Are you able to disclose that?

Speaker 1:

So I can say that we're looking at a range, a spectrum of techniques, including fine-tuning methods using reinforcement learning and multi-agent reinforcement learning. We also engage in research, so we will be pushing out some papers, hopefully in the coming months, where we're trying to develop some of these techniques.

Speaker 1:

There are certain considerations here. One is that if we do use reinforcement learning to do fine-tuning, it changes our business approach a little bit, because then we also have to service the inference, basically running the models ourselves. We're thinking about that as well; we're not sure yet whether this is something we want to do. An alternative is that we use reinforcement learning not to directly fine-tune the model, because, as I said, this would require that we also host the models and run the inference, but instead to guide an existing model without necessarily rewiring it internally.

Speaker 1:

So we're thinking there will also be opportunities for reinforcement learning to have a positive benefit here without necessarily changing the makeup of the model. And the benefit of that is that we can still use whatever the most recent state-of-the-art LLMs are without touching them, basically. We can continue to benefit from the rapid improvements happening in this field.

Speaker 2:

Okay. Let's move on to your recent book on multi-agent RL, which has received accolades from luminaries in the field. Can you talk about your reasons for writing a textbook? It's not an easy endeavor. Why did you put all this time into writing this book?

Speaker 1:

Yeah, so this is an important question, right, as an academic, when you start this kind of process. One of the reasons for me personally was that, as I said earlier, in the mid-2010s deep learning came into this field and created a huge amount of attention and a lot of activity, which was great, a lot of excitement, and I was also very excited about this. But at the same time, when a lot of people come into a field, many of them don't necessarily have a background in important areas that they should know about, like previous work in multi-agent reinforcement learning or work in game theory, for example, which is very closely related. I basically started to see certain highly cited papers that had issues with some basics. For example, some of the highly cited papers did not rigorously define the problem they were trying to solve, or they did not link back to a very important body of literature, such as various relevant results in game theory.

Speaker 1:

And so you begin to reinvent the wheel in a way that's maybe not optimal. What I began to see is that there were all these new papers that came out subsequently, and they cited these early high-impact papers, and there was this social propagation going on: there's this academic network of papers citing papers, and some of these issues, such as not defining the problem rigorously, propagated into newer and newer papers. And my feeling was that if we don't stop this social propagation, it's going to hurt the reputation of the field. So I really had the sense that we needed to create a resource to bring everyone onto the same page, where everyone knows all the important ideas and concepts, we all speak the same language, and we know how to rigorously define the kinds of optimization problems that we want to solve with multi-agent reinforcement learning.

Speaker 1:

One more thing I can say: before I even started writing the book, the first thing I did was actually to write a blog post, which I called "Common inaccuracies in multi-agent reinforcement learning", and I think you should still be able to find it on the Internet. In this blog post I pointed out, I think, three inaccuracies that I noticed in some of these papers, including this idea that people weren't rigorously defining the problem, which is to learn a joint policy where the policies maximize a return objective that is actually a function of the joint policy, not just of their own individual policies. That was one of the issues I noticed in papers. Another issue I noticed: one of the most highly cited papers was the 1994 paper by Michael Littman where he talked about Markov games, as he called them in that paper, and he combined minimax with Q-learning and called it minimax-Q; it was a very influential paper. But I started to see papers that cited this paper by Michael Littman for completely wrong reasons.
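The first inaccuracy mentioned above, that the problem is often not rigorously defined, comes down to the following point; as a rough sketch in my own notation (an illustration, not a quote from the book or the blog post): in a game with n agents, each agent's expected return depends on the joint policy, so the learning objective cannot be stated in terms of one agent's policy alone.

```latex
% Expected return of agent i under the joint policy \pi = (\pi_1, \ldots, \pi_n):
% each J_i is a function of ALL agents' policies, not only of \pi_i.
J_i(\pi_1, \ldots, \pi_n) \;=\;
  \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r^{i}_{t}
  \;\middle|\; a^{1}_{t} \sim \pi_1, \ \ldots, \ a^{n}_{t} \sim \pi_n \right]
```

A solution concept (for example, an equilibrium or a Pareto-optimal joint policy) then specifies what it means for a joint policy to be optimal with respect to these coupled objectives.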

Speaker 1:

For example, they talked about partially observable games, which he never discussed in that paper, or they made other kinds of claims and then cited his work for them. So clearly these people hadn't read his paper, and that's very easy to detect if you actually end up reading the paper. And then you start to think: okay, do these people actually know what they're doing? In this case, obviously not. So, coming back to this blog post.

Speaker 1:

I shared it with Michael Littman. He then tweeted about it and basically endorsed it. He said we need to nip bad habits in the bud; he was saying as much, like, don't cite papers if you don't understand what they're talking about or you haven't even read them. And this was a prelude to the book, basically. With a similar motivation, I then got into the full book-writing process.

Speaker 2:

Can you talk about the experience of writing a textbook? I'm sure that's a massive undertaking. Tell us what it was like to write that book.

Speaker 1:

Before I started writing this book, I had conversations with colleagues who had written pretty substantial books, and they would all tell me things like, oh, well, it took me five or six or seven years to write this book. That didn't sound very appealing to me, and I didn't want to do it that way. If I was going to do this, I would rip through it hard and get it done in two years, which is what I did. So I discussed it with my wife: do you want to do this?

Speaker 1:

Yes or no? Because if I was going to do this, I would have to write basically every night, which I did end up doing for almost three years. And it was just a process of coming up with the structure and the ideas you want to convey, what the core ideas are. But maybe the most important thing we were doing with this book was to come up with a universal language and representation that allowed us to pull all these different strands together and describe them in a consistent way to the reader, in a way that would actually be helpful. We had a big focus on understanding how to define the problem in the first place, in terms of the game mechanics and in terms of the optimization objectives, which are given by the solution concepts, and we had a big focus on some of the very basic learning techniques as well, before we even get into any of the deep learning techniques.

Speaker 1:

So the focus on the basics was very strong there, just to make sure that people have the same understanding of how this field operates. And, of course, we have the second part of the book, which has a strong focus on the more recent deep learning based approaches, including all the techniques I mentioned at the beginning of the interview. And being heavily involved in this type of research with my lab, we had a lot of experience in building these kinds of algorithms and models, so we felt we had a lot to say in terms of the theoretical background, but also in terms of the practicalities of how to build these kinds of algorithms and learning pipelines.

Speaker 2:

While we have you, is there anything else you'd like to share with the audience?

Speaker 1:

Yes. So if you want to get into multi-agent reinforcement learning, obviously a good starting point would be the book. You can get the PDF for free from the website; just go ahead and search for the multi-agent reinforcement learning book, and you'll find it. If you want to support the work, you can buy the book on Amazon or eBay or any of these other shops, and you can also help us by leaving a good review, five stars, for example. And in addition, in the book, there's also a codebase.

Speaker 1:

I mentioned it earlier. The codebase is designed to be easy to use, but if you want to do more serious, research-level work in multi-agent reinforcement learning, there's another codebase that we built even before the book. It's called Extended PyMARL, EPyMARL. It's based on earlier work from a lab at Oxford University, and it's basically a codebase that comes with a number of prebuilt algorithms you can just plug and play. It comes with a standardized interface for environments, it has a number of parameters you can use to tweak how the algorithms work, and it's widely used in the community. It's still widely cited, and it's probably a good starting point if you want to do more serious, research-style work in multi-agent reinforcement learning.

Speaker 2:

This has been fantastic. Thank you, Professor Albrecht.

Speaker 3:

Thank you very much for having me.
