Martin Riedmiller

Robin:

TalkRL Podcast is all reinforcement learning, all the time. Featuring brilliant guests, both research and applied. Join the conversation on Twitter at talkRL podcast. I'm your host, Robin Chauhan. I'm super excited for our guest today.

Robin:

Martin Riedmiller is a research scientist and team lead at DeepMind. Welcome, Martin. Thanks so much for being here today.

Martin:

Thanks for inviting me, Robin. Happy to be here.

Robin:

We are gonna focus on your tokamak work as well as some of your other work. You've had a long and storied career in RL research, but let's start with that exciting work from 2022. That was the paper Magnetic Control of Tokamak Plasmas Through Deep Reinforcement Learning, Degrave et al., 2022. This made quite a splash when it came out. It made the news.

Robin:

We all heard about it on Twitter, and I've been looking forward to this chat for about a year, so I'm really excited that we were able to make this happen. I understand that in this project you used reinforcement learning to control the magnetic field in a fusion reactor, and I read that tokamak is from the Russian phrase describing the shape of this toroidal chamber with magnetic coils. So you're using RL to control the magnetic coils to shape the plasma. Is that what's happening here?

Martin:

That's exactly what's happening. So the plasma per se is not stable. If you didn't have any control from the outside, it would just touch the vessel, and this would be very bad because it would lose its energy. There's a lot of temperature in the plasma, and it also would damage the vessel. So you need active control to keep the plasma somehow in shape without any mechanical interaction with the environment.

Martin:

And that's basically what the magnetic control, the magnetic coils, do. They keep the plasma in a certain position and in a certain shape.

Robin:

I understand there was some type of controller before, and I saw some diagrams of quite scary, nested PID controllers. I did one course in control systems in my computer engineering degree, and I found it quite a difficult course, so it was scary for me to see those diagrams. What was the control system like before? How were they doing it before?

Martin:

So as you said, they were using classical control: basically having an observer that mapped the current observations within the tokamak to an estimated state, and then using a classical PID controller for each coil separately, so that the state of the plasma matches the desired configuration. And this uses a lot of prior knowledge, of course. You have to design the observer, which was quite an effort that they did at that time, and also a careful calibration of the individual PID controllers, so that the shapes they were able to control until then could actually be controlled. And what we did, basically, was to ask: is there a better control method where we can control all the coils more directly, so that we only have to input the observations directly to our controller and map them to control signals? That way we can do deliberate shape control without going through an observer and then designing the PID controllers. So a much more direct method.
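For readers less familiar with the classical setup, a PID loop is a very small piece of code. Below is a minimal, generic sketch of one such per-coil loop; the gains, time step, and setpoints are made up for illustration, and the real TCV control system is far more involved, with a hand-designed observer estimating the plasma state first.

```python
# Minimal textbook PID controller of the kind the previous system composed
# per coil (illustrative only; gains and time step are arbitrary).
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# One such loop per coil, each tracking its own reference derived from the
# desired plasma shape via the hand-designed observer.
coil_controller = PID(kp=1.0, ki=0.1, kd=0.01, dt=1e-4)
print(coil_controller.update(setpoint=0.5, measurement=0.48))
```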

Robin:

And can you talk about how you perceived this problem when you started on the project? Were you pretty confident that this could be accomplished with the RL tools you had at the time, or did you consider it quite a risky project and you weren't sure how it would turn out?

Martin:

Yeah, it was actually a very risky project because the tokamak was not in our lab, of course. It is in Switzerland, in Lausanne, and we had great colleagues from the Swiss Plasma Center there. So we didn't have regular access to it. We had to rely on their input in order to make progress.

Martin:

We had a simulator first, where we had some initial results which looked quite promising. But when we first saw the actual device and how complicated it actually is, we were really aware of how risky this project might be in different respects. In one respect, it's a 19-dimensional input space, which is quite a challenge for reinforcement learning in general. So the reinforcement learning per se is a challenge. But then the biggest question was: is the simulation accurate enough?

Martin:

Can we make it accurate enough so that this transfer from learning in simulation will actually work on the real tokamak? And the signal that you usually get is just that it works or it doesn't, and then you probably have two or three other trials on the same day, and then basically the week is gone and you have to wait for another week. So this was quite risky and quite a challenge for us.

Robin:

Can you tell us more about the simulator, and the pros and cons of the simulator that you had access to?

Martin:

Yeah. So first of all, I'm not an expert in the simulator. This simulator was provided to us by Federico Felici and colleagues at EPFL, so they already had a version. And what I understand is that the insights that went into the simulator were also used to develop their PID controller. So we knew that a lot of the physics of this tokamak was somehow already captured, because otherwise they wouldn't have been able to derive a PID controller and all the stuff that they did from that knowledge.

Martin:

So that was the positive side. On the other side, we were also aware that a lot of effects that happen in physics are not necessarily built into this simulator; they were left out. And at the beginning of the project, we were not sure whether this was captured correctly, whether it would be sufficient to still deliver a learning controller, or whether we would hit a boundary where some effects are just not modeled and the reinforcement learning controller would find a solution that would never work in reality, because the simulator pretended it would work but the real world would behave a little bit differently. So this was kind of risky. But we had a couple of experts on our side knowing a lot about simulation, and knowing about the problems that RL might run into, for example delays in the signals and so on.

Martin:

So with all this effort, we finally got the first signs that the transfer is actually possible. And from that point on, we were very confident that we could also make it work for different shapes and for various more interesting controls, as we have shown in the paper.

Robin:

I guess you didn't have a model, right? You just had the simulator. But did you think about whether it would make sense to learn a model from the data that you observed on the real device, or was there simply not enough data?

Martin:

Yeah, I think the latter was true. There was simply not enough data. The experiments usually run for 1 to 2 seconds, so they are very short, and the data that they have is recorded, but it is in very different formats. It is usually also taken under various conditions, so not necessarily the conditions that were relevant for us.

Martin:

And we didn't even try, because in this 19-dimensional input space, with more than 100 outputs, sensor signals, learning a neural model, or any model, was not very promising from the beginning. That is something that we probably want to tackle in the future because, of course, if you can learn from data, then the hope is that the simulation gets better and better, and then you can also do more novel things, or other things more accurately. But this seemed to us a path that was not possible at the point in time when we started the project.

Robin:

Can you talk about how you handled exploration in the system?

Martin:

Yeah, this is a very good question. Exploration is always a bit of an unsolved issue in reinforcement learning. In this case, we used the exploration that comes with MPO. MPO basically has a stochastic policy baked in, and we do the exploration according to the uncertainty measure that is within MPO.

Martin:

So that shrinks during learning: as the agent becomes more and more certain that the Q function is correct, the stochasticity in the policy shrinks. At the beginning it's pretty wide, and we used that kind of exploration that naturally comes with MPO in our experiments, which does not necessarily mean that this is the best exploration method. We were actually thinking of also trying other exploration schemes, but it worked, and in the end that was what mattered for that work at that point. But we are still working on improving and looking for better exploration methods for the future.
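As a rough illustration of the "exploration for free" idea described here, the sketch below samples actions from a learned Gaussian policy whose standard deviation is itself an output of the policy, so the noise can shrink as training makes the policy more certain. The toy linear policy, weight names, and dimensions are all made up; this is not MPO itself, just the stochastic-policy mechanism it relies on.

```python
# Minimal sketch of exploration via a stochastic (Gaussian) policy: actions
# are sampled around the policy mean, and the learned stddev typically
# shrinks during training. Network and names here are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def policy(observation, weights):
    """Toy linear policy head returning mean and stddev of a Gaussian."""
    mean = weights["W_mu"] @ observation + weights["b_mu"]
    log_std = weights["W_std"] @ observation + weights["b_std"]
    std = np.log1p(np.exp(log_std))   # softplus keeps the stddev positive
    return mean, std

def act(observation, weights, explore=True):
    mean, std = policy(observation, weights)
    if explore:
        return rng.normal(mean, std)  # stochastic action = built-in exploration
    return mean                       # deterministic action for evaluation

obs_dim, act_dim = 4, 2
weights = {
    "W_mu": rng.normal(size=(act_dim, obs_dim)) * 0.1,
    "b_mu": np.zeros(act_dim),
    "W_std": np.zeros((act_dim, obs_dim)),
    "b_std": np.zeros(act_dim),       # softplus(0) ~ 0.69: fairly wide early on
}
print(act(rng.normal(size=obs_dim), weights))
```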

Robin:

Oh, interesting. So it's just exploration out of the box. That's amazing. And did you ever think, you know, to try behavior cloning of the existing PID as a starting point, or did that just not seem necessary?

Martin:

No, we never thought of that, because it's in the philosophy of my team that we want to learn with as little prior knowledge as possible, and taking into account an expert policy that has been derived over years by human insight is not necessarily in the spirit of what we finally want to achieve. So we decided to start from scratch here, and that also worked, because in simulation it's not so important whether you take a couple of hours or 1 or 2 days to learn the policy. If the final policy works, then this was good enough for our case. For moving on, however, and we have a paper already on this, reusing a previous policy will actually become interesting, because if you want, for example, to learn a different shape of the plasma, using a prior policy that has been learned before can help you reduce the training time.

Martin:

And so it might be much more efficient for practical purposes to try out different shapes very quickly and to iterate very quickly. If this really goes more and more into production and into daily use, this reuse of previous policies is an interesting aspect. But then we would use trained policies, and not necessarily the one that was previously derived with PID control and the observer structure.

Robin:

Okay. And so you mentioned the action space is 19 dimensions, and I saw an observation dimension of about 90. Can you talk about the reward design in this project?

Martin:

Yeah. So the reward design was more or less straightforward given what we wanted to achieve. There's basically this outer flux line which defines the shape of the plasma, and we had points on this outer flux line that we wanted to be at certain positions, to define the shape. And the reward was basically derived from the error between the current flux line and the target flux line. So it's basically like classical control, where you want to reduce the error towards a certain reference point.

Martin:

And this was the inspiration for the reward. The reward also had some shaping elements, so that it's not only: if you reach the target point at a certain position you get a reward of 1, and otherwise you get a reward of 0. Instead, already as you're getting close to it, you're getting more and more reward, so that you have a bit of shaping and the reinforcement learning problem becomes a bit more manageable for the learning controller.
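To make the shaping idea concrete, here is a minimal sketch of an error-based, smoothly shaped reward over tracked boundary points. The target coordinates, the exponential form, and the scale are invented for illustration; this is not the exact reward used in the paper.

```python
# Sketch of a shaped, error-based reward of the kind described above:
# instead of 1 only when every boundary point is exactly on target (and 0
# otherwise), the reward increases smoothly as the tracked points on the
# plasma boundary approach their targets.
import numpy as np

def shaped_reward(actual_points, target_points, scale=0.05):
    """actual_points, target_points: (N, 2) arrays of boundary positions."""
    errors = np.linalg.norm(actual_points - target_points, axis=1)
    # Map each point's tracking error to (0, 1]; 1 means the point is on target.
    per_point = np.exp(-errors / scale)
    return per_point.mean()

targets = np.array([[0.88, 0.0], [0.70, 0.45], [0.70, -0.45]])
actuals = targets + np.random.default_rng(1).normal(scale=0.02, size=targets.shape)
print(shaped_reward(actuals, targets))  # close to 1 when tracking is good
```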

Robin:

And was that something you were doing a lot of iteration on, or was that pretty straightforward?

Martin:

For the final solution, of course, I think a lot of experience has gone into that. But in general, to show for the first time that it actually is learnable and that it works, that was pretty much done with the experience that we had from all the control problems that we did in all the years before that. So the structure is not a particular surprise. It's just the final parameters and how you want to tune them in order to get the best possible performance on the final task that you wanna solve. Yeah.

Martin:

So, basically, to summarize: the general structure is something that is kind of known by now, so we know how to deal with these kinds of problems in a more or less standard way. And this is where we want to head, to have a standard tool to derive all sorts of controllers. But then, of course, the final fine tuning, where you say, oh, probably I trade off a bit of accuracy on that point against accuracy on another point because this gives me better behavior in practice, that still needs a bit of expert insight and hand tuning, and some effort has gone into these kinds of questions for producing the final results in the paper.

Robin:

I read that the system had quite a high control frequency, at 10 kilohertz, so your actor needed to be really fast to be able to execute within that high frequency. And it was interesting to read about the critic being a larger network, having many more parameters than the actor, and also the critic having recurrent units so it has some kind of memory, which I understand was to help it track state over time. Was it obvious to use that asymmetric design, with the critic being very different from the actor and the recurrent units only in the critic? Or did you have to explore the problem more to come to that design?

Martin:

Yeah, it was not completely obvious, in particular this asymmetric design where we have a different choice for the critic and the actor. This was something that only developed during the project. We are aware that for reinforcement learning, for the dynamic programming to work at all, we are relying on state information, and an important property of a state is that from the state and the action you can infer the successor state. And usually this is not the case if you only have the observations, because observations are usually not the full state image.

Martin:

So the default, in principle, is that you have some kind of memory, either explicitly by also having inputs from the past, or by actually using a recurrent network, to make up for the deficiency of working from observations, to give the actor and the critic a chance to come up with a state by themselves. So we were sure that we also needed something like that in the architecture. I would have also liked to have more recurrency in the controller, but that would have made the controller much slower. So we were happy that in the end it worked out to have the recurrency and all the heavyweight computation only in the critic, which does not play a role in the real-world, real-time control, and the basic controller itself could go directly from the observations to a reasonable control signal and still control the plant.

Martin:

So that was the reasoning for that particular choice. On the one side, the default would have been recurrency in both critic and actor. But practical constraints, like these 10,000 control actions per second, required the controller to run very fast, so we ended up with a controller having no recurrency in the end.
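A minimal sketch of this asymmetric idea is below: a small feedforward actor that maps the latest observations to coil commands fast enough for a real-time loop, and a much larger recurrent critic that only runs during training. The layer sizes, LSTM choice, and class names are assumptions for illustration, not the published architecture; the dimensions roughly follow the numbers mentioned in the interview.

```python
# Sketch of the asymmetric actor-critic design discussed above: memoryless,
# lightweight actor for deployment; larger recurrent critic for training only.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM = 92, 19   # roughly the dimensions discussed in the interview

class Actor(nn.Module):
    """Small, memoryless network: cheap to evaluate at 10 kHz."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * ACT_DIM),  # mean and log-std per coil command
        )

    def forward(self, obs):
        mean, log_std = self.net(obs).chunk(2, dim=-1)
        return mean, log_std.exp()

class RecurrentCritic(nn.Module):
    """Larger LSTM critic: sees observation-action sequences, training only."""
    def __init__(self, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(OBS_DIM + ACT_DIM, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, obs_seq, act_seq):
        x = torch.cat([obs_seq, act_seq], dim=-1)   # (batch, time, obs + act)
        h, _ = self.lstm(x)
        return self.head(h).squeeze(-1)             # Q estimate per time step

actor, critic = Actor(), RecurrentCritic()
obs = torch.randn(1, OBS_DIM)
print(actor(obs)[0].shape)  # torch.Size([1, 19])
```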

Robin:

Interesting. So really, only the critic has a much more complete picture of the state, but the actor is able to get away with its very simplistic notion of state somehow. Is that what's happening here?

Martin:

Yeah. Exactly. That's exactly what's happening.

Robin:

Interesting. Okay. Wow, that's so cool. Can you tell us about how the people at EPFL, the controls people there, responded to the work you were doing and the RL-based controller?

Martin:

Yeah. One thing that I have to say is that they are very knowledgeable people, so we were very happy to really meet experts in both plasma physics and control there, and of course that they were open to a kind of new approach to this. At the beginning, I think they were also a bit skeptical whether this was possible, because they had put a lot of effort into designing their control system, and could a neural network then do the same thing that they had derived over years? So that was a big question.

Martin:

But in the end, when the first controller worked, we were all really happy that it actually could maintain the plasma for these 2 seconds and keep it stable. And there's a quote that we had in some of the papers: they were looking at the results in awe, because they thought it wasn't possible. What happened in that experiment in particular is that the controller used coils that were not meant to keep the plasma in place, but that have a different purpose. Using these coils led to the result that it worked, which was good, which was the task of the reinforcement learning controller, but on the other side it put a lot of mechanical strain on the overall system because they were not meant to do that. And a human would never use those coils in a PID kind of approach, because they knew that this was not a good idea from a mechanical point of view.

Martin:

But since our reinforcement learning controller didn't have this knowledge built in from the beginning, because we as the designers didn't know it, it just used those coils, and they were very surprised that this worked at all. So the controller found a new control strategy, but they also asked us please not to use it again in further experiments, because of the mechanical strain; they were afraid that at some point this would also break their mechanical system, which would be very bad for all sides, of course.

Robin:

Once again, RL exploiting everything it can to just get that reward, without the notion of whether it's a bug or whether it's intended or any of that. That's really cool.

Martin:

Exactly. And on the one side, the positive thing is it led to an interesting physical insight that they weren't aware of before. And for the future, it was just us saying, okay, there's a constraint: do not use the coils in that way.

Martin:

And then the controller found a different control policy. And this is exactly why I find these reinforcement learning controllers so cool and also so promising for practical applications.

Robin:

So how far along is the controller you ended up with? Is it something that could be used in production, or is there still a gap? Are there open concerns after the second version? Or what would you do to improve it if you did another version?

Martin:

So we just continued the work, and there's actually a new shape that we were able to produce with our controller that was used in an experiment by an experimenter at EPFL. So it's actually already practically used, which is a good thing. And then we also improved on the precision with which we can now hold the shapes, and also the precision with which we can control the currents that are flowing, which were some of the things that were not as good as what they could achieve in the best case. But we worked on these and showed that if we really push on them, we can get this to be a very practical tool. And we are still continuing to work with the people at EPFL, and I hope that we see more use of these controllers for different shapes that they want to do experiments with in the future.

Martin:

But from our side, it's kind of ready to be, or at least another step closer to, broad practical use at EPFL.

Robin:

So let's move on to your other work involving RL in the real world. I noticed you've been working on RL for real-world applications for a long time. For example, I saw one project where you taught a car controller to drive with just 20 minutes of real experience, and this is way back in 2006, long before deep RL was really a thing. How does it feel to see this vision of intelligent control come so far in such a relatively short time? Back then in 2006, looking forward, did you predict that RL might turn out as it has?

Robin:

Did you dream that this would happen? Or was it quite shocking to you how things have unfolded since then?

Martin:

I think seeing it now, being so important and solving so many interesting real-world problems, that is like a dream come true for me. Since I was doing my PhD, I was always hoping that one day a reinforcement learning controller would run in a car, for example, to save fuel or to be more efficient, to have better engine control or something like that. So I was really pushing this boundary, and the work that you mentioned in 2006 was done at Stanford, in Sebastian Thrun's lab. At that time they were doing autonomous driving, and I had the chance, within a short 3-week sabbatical, to get a controller running on their overall autonomous car control system. So this was a really exciting challenge for me, to prove that these reinforcement learning controllers could do something reasonable in practice. And at that time, it was not deep reinforcement learning.

Martin:

It was just reinforcement learning with a neural network. Everything ran basically on the laptop that I had literally on my lap, sitting in the back of the car. And we were trying to control the car such that it followed a certain trajectory that was given. At the beginning, the car could do whatever it liked, so it was exploring very heavily, and you could sense the exploration by being bounced back and forth in the car. But then after a couple of minutes it got better, and it didn't deviate too far off the track.

Martin:

And after 20 minutes, the car really smoothly followed the trajectory and had learned to steer the wheel. The input space was between 8 and 10 dimensional, and the output space was 1 dimensional, so this was something that was doable at that time. But I was still pretty proud that this could be learned in such a short time and with such quality, comparable to the nonlinear controller they had in place of this learned controller. This was a pretty encouraging result for me at that point in time.

Robin:

That is a really cool story, and the fact that you were in the car while it was running exploration in real time. I think we've all seen Yann LeCun's videos of a car exploring, running off a cliff and exploding, so I know there's some risk in real-world, real-time exploration. And you also mentioned, or I read, that you worked on RoboCup back then as well. Could you share a little bit about RoboCup? What was happening with RoboCup?

Martin:

Yeah. So RoboCup for me came after my PhD thesis, where I started to explore reinforcement learning and control. I thought this might be a nice area, and we started in the simulation league, where we actually had the chance to bring reinforcement learning into a real-world competition where others were trying different methods, and to prove that reinforcement learning could be a very powerful tool and not just something that occasionally works on mazes in discrete spaces, but really works on real-world problems where others try different methods and probably don't get these results. One of the first successes that we had in the simulation league was to learn a very powerful kicking routine, where you have to take the ball and give it a certain number of kicks in order to kick it very hard, so that it leaves the player with very high momentum. This was one of the best kicking routines that existed at that time, much better than everything that had been programmed so far. And then we also learned things like tackling the opponent, or dribbling the ball around the opponent, with reinforcement learning.

Martin:

And we were also using all these techniques in our competition teams. So it was not just something to write a paper about; we really wanted to show that with all these learned methods you could also be pretty successful. We were also able to do multi-agent reinforcement learning: our complete attack play was learned with multi-agent reinforcement learning based on neural networks, which, in hindsight, was quite unique at that time. Unfortunately, not so many people cared for reinforcement learning, so it was very difficult to get papers into conferences.

Martin:

That was just not the question at that point in time, and nobody actually cared. But where it got really interesting is when we figured out how to make reinforcement learning more efficient with this neural fitted Q iteration algorithm. We were then also able to learn policies on our real robot, our midsize robot that we had at that time, which for example learned to dribble with a dribbling routine that was much better than anything the human students had come up with so far. And that showed basically the first sign that this can also be a very powerful method to solve real-world problems directly: take these reinforcement learning controllers, take a difficult real-world problem, and then just let it run, instead of putting lots of hours into design and thinking about how to solve the problem. Just let the agent figure it out by itself.

Martin:

So these were the first signs that reinforcement learning might actually work. But coming back to your question, at that time I was basically hoping that during my lifetime I could probably get one controller into a real-world device that matters to more people, but that reinforcement learning would actually become so successful, like in Go later on or in Atari, I hadn't dreamt of. And this was really a dream coming true for myself later on in my career.

Robin:

So with RoboCup, were the robots just learning off of real experience then, without a simulator?

Martin:

Yeah, exactly. So the dribbling routine for the midsize robot was literally done in our lab. It started from scratch, having no idea how to keep the ball close to the robot, because that can only be done by actively controlling the robot, moving forward or turning a bit, since the robot was not able to actually grasp the ball; it could only actively push the ball within very limited boundaries.

Martin:

And figuring this out is really a difficult task, because it actually depends on the physics of the robot. The way it was done is basically: we started with a random policy, collected data, did off-policy learning, came up with a better version which was already a bit more successful in dribbling, then used that data again. And in 5 to 10 iterations of this kind of learning a policy, applying it, collecting more data, and learning the policy again with all the data, we were able to come up with a control policy that was actually very powerful, and that was used during the competition in 2007, which ended with us becoming world champion. So that was a real-world success story of reinforcement learning very early on.

Robin:

That is incredible to me. Back in 2007, and not just doing RL but doing RL on a real robot with no simulator, and also with the multi-agent aspect, and before the bulk of these RL algorithm families had even been invented. So it's kind of incredible that you were able to make that happen.

Martin:

Yeah. Looking back, I'm still kind of proud that we did all this, and I'm also happy that I could still be a part of it later on by having joined DeepMind and having this opportunity to continue to work on this dream of my life and to bring it to more relevant applications, like, for example, fusion, at later stages, which of course wouldn't have been possible if I had continued at the pace that we had at the university in those days.

Robin:

Can you say more about that? Continuing at that pace, do you mean that it was slower there, or that the direction was different?

Martin:

Yeah, it was slower. For example, being part of this Atari project, where we actually said, okay, reinforcement learning should work directly from pixels to actions, and this is possible, and not just possible in principle for solving Pong or so, but possible for doing 50 games. This was a spirit that I got to know at DeepMind, and it was a big acceleration for my research and also for research in reinforcement learning in general.

Martin:

And I think in all the years before, we were going at a slow pace. We made some good progress, but now it was becoming more serious, with more people thinking about this and really dedicated to solving big problems, like Atari was in those days.

Robin:

So you were a co-author on the original DQN papers back in 2013 and 2015. Now, we've been lucky to feature one other author from those papers, Marc Bellemare. But this was really the paper that caught my attention. I remember sitting in my room upstairs by the ocean, just down the road, and seeing this Nature paper and the hair standing up on my neck and thinking, what is going on? What are they doing over there at DeepMind?

Robin:

And that was really one of the big moments that led me to start the show, to get more insight into what exactly is happening in RL and to talk to the people who make these magical things happen. So it means a lot to me to speak to one of the original authors of this DQN work. You also did this NFQ, neural fitted Q iteration, quite a lot earlier than DQN. Can you also touch on NFQ? Do you consider NFQ a direct predecessor to DQN?

Robin:

It has a lot of things in common with DQN. Can you tell us a little bit about NFQ?

Martin:

Yeah. So NFQ was basically my personal breakthrough to make reinforcement learning really data efficient. Up to that point I knew that neural networks are good for function approximation, for learning a Q function, but I was always concerned that if you do it online, then you have two difficult processes: one is gradient descent in neural networks to estimate the Q function, and the other is that the Q function is continuously evolving. And bringing these two parts together was always challenging.

Martin:

So the idea of the Neural Fitted Q algorithm was basically to separate these processes a bit. The major idea is to save all the experience that you have so far and then train a Q function on this experience with supervised learning until it converges, and only after you have this Q function stable, you make the next step by re-evaluating all the transitions, doing the next value iteration step, and then doing the next iteration on the value function. The advantage was that the supervised learning part, namely estimating the current Q function, was more or less stable. We also had this supervised learning method called Rprop, resilient propagation, which I proposed during my master thesis, which I knew was very reliable at doing supervised learning without any parameter tuning, and it learned in the batch context, so the supervised learning part was super stable. And then it turned out that doing this on a stable set of experience and iterating on this Q function was a very data efficient training process, which then enabled all the successes, like being applied to cart-pole, for example, in less than 200 episodes.
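A compressed sketch of the fitted Q iteration loop described here is below: keep all transitions, repeatedly recompute bootstrap targets with the current Q, and refit Q to those fixed targets with batch supervised learning. The `fit` and `q_init` arguments are placeholder hooks for the supervised learner; the original work used a multilayer perceptron trained with Rprop, which is not reproduced here.

```python
# Sketch of the Neural Fitted Q Iteration idea: value-iteration steps over a
# fixed, stored set of transitions, with stable batch regression in between.
import numpy as np

def nfq(transitions, q_init, fit, n_actions, gamma=0.99, iterations=20):
    """transitions: list of (obs, action, reward, next_obs, done) tuples.
    fit(inputs, targets) -> new Q function mapping (obs, action) -> value."""
    q = q_init
    for _ in range(iterations):
        inputs, targets = [], []
        for obs, a, r, next_obs, done in transitions:
            # Value-iteration step on the stored experience.
            best_next = 0.0 if done else max(
                q(next_obs, a2) for a2 in range(n_actions))
            inputs.append((obs, a))
            targets.append(r + gamma * best_next)
        # Supervised regression on a *fixed* target set: the stable inner loop.
        q = fit(inputs, np.array(targets))
    return q
```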

Martin:

Later on, we also applied it to the real midsize robot, and to the autonomous car. And this was basically for us the moment where we said, okay, we can make data-efficient reinforcement learning work if we work on this memory of experience and iterate over it over and over again. So this was the basis. Then, at that time, deep learning also became interesting, so I convinced one of my PhD students at the time, Sascha Lange, that he should try to combine the two. And we were basically thinking that learning directly from vision to actions could be done by first learning from vision to features, and then taking the features and, with NFQ, learning the control policy.

Martin:

That is something that we did at the university back in 2009 and applied to slot car racing, a typical toy domain, where we actually got from vision directly to actions, but with the intermediate steps: first learn a module that gets from vision to features, and then in a second step learn something that goes from features to control. What we then did at DeepMind was to say, okay, what if we don't do this intermediate step but go directly from a reasonable vision stack to the control. For me, at that time, this was a really big jump, because it involved convolutional neural networks and then a really big network. And the big difference was also that you probably couldn't get along with, let's say, a training set of a couple of hundred episodes; it was rather in the range of 10,000, 100,000, or a million episodes.

Martin:

And the original NFQ wasn't made for such big batches of data, because then the supervised learning would become very, very slow. So one of the main innovations of DQN was figuring out how, in the case where you have a very large set of data, you can still learn the supervised learning part stably, while at the same time accounting for the fact that the dataset is much larger and you cannot do batch learning anymore, but have to do a kind of pattern-by-pattern learning, with a more stochastic gradient descent method involved in this overall process again.

Robin:

So in DQN, I guess, that would be the target network that's periodically getting updated, as opposed to doing the whole thing in one batch like you were doing with NFQ. Is that right?

Martin:

Exactly. That was the big innovation that was needed there to make it actually work.
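For comparison with the NFQ sketch above, here is a schematic of the change being discussed: keep a frozen copy of the Q network for computing bootstrap targets, sync it only periodically, and learn from sampled minibatches rather than refitting to convergence on the whole batch. The function names, `optimize` hook, and hyperparameters are placeholders; this is not the original DQN code.

```python
# Schematic of a DQN-style training loop with a periodically-synced target
# network and minibatch updates from a replay memory.
import copy
import random

def train_step(q_net, target_net, replay, optimize, gamma=0.99, batch_size=32):
    batch = random.sample(replay, batch_size)
    targets = []
    for obs, a, r, next_obs, done in batch:
        bootstrap = 0.0 if done else max(target_net(next_obs))  # frozen targets
        targets.append(r + gamma * bootstrap)
    optimize(q_net, batch, targets)   # one gradient step on this minibatch only

def train(q_net, replay, optimize, steps=100_000, sync_every=10_000):
    target_net = copy.deepcopy(q_net)
    for t in range(steps):
        train_step(q_net, target_net, replay, optimize)
        if t % sync_every == 0:
            target_net = copy.deepcopy(q_net)  # periodic target network update
    return q_net
```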

Robin:

It's just fascinating to me that so many of the parts of DQN are already present in NFQ. And I think maybe some people may not be fully aware of that fact, from what I can tell. So that was very fascinating.

Martin:

Yeah, that was the input that I could provide at that time. I was basically taking a sabbatical from my professorship at Freiburg, and I could bring the experience that we had with NFQ, on making it data efficient and making it work, into that DQN project. And then having this amazing crowd of people like Vlad and Koray and Dave Silver around, who really believed in doing it at a big scale and that it should be possible, and then figuring out how this could work with the target network, how these ideas could be brought from NFQ to DQN, that was an amazing experience for me at that time.

Robin:

I bet. So the journey from cart-pole to tokamak, it was quite a journey, right? From, like, the hello world of real-world RL to really sophisticated applications. I'm honored to have you on the show and to hear about all of this.

Robin:

But we're not done yet. I wanna hear about your current work. You have a paper called Collect and Infer, and it's largely about data efficiency. There are some phrases I found interesting in this paper: explicitly modeling RL as two separate but interconnected processes, concerned with data collection and knowledge inference, and also interpolation between pure offline batch and more conventional online learning. Can you tell us about Collect and Infer?

Robin:

Is it more a philosophy? An algorithm? Is it an exploration method? What is Collect and Infer?

Martin:

Thanks for raising this question. I think it's more a philosophy, or kind of a change in viewpoint. Classical RL is very much online: you make an experience, you update your policy, you have the next experience, you update the policy. NFQ was already deviating from that, saying okay, no, the experience that you have is actually very precious information.

Martin:

You shouldn't throw it away; you should reuse it over and over again to become data efficient. What's still done in work coming from classical reinforcement learning is that you basically always follow a policy, and then exploration is triggered by this policy, by doing epsilon-greedy exploration as one example, or by having another scheme that is added to that policy. With Collect and Infer, I wanted to change the viewpoint a bit and say: look, if you wanna get data efficient, then one thing is really important: if you have a set of data, you should really squeeze everything out of it. It's not just making one gradient step, but making as many gradient steps as you wish, because the important thing for me is not the computation that goes into it, but the experience with the real world; that is what is costly in the data-efficient framework. For example, if you want to control a robot, then every single step that you do in the real world costs you something with respect to time.

Martin:

So you really want to minimize that kind of interaction. So inference, getting everything out of the data, is very important. And then the orthogonal process to it, or the dual process to it, is that if you accept this, then you shouldn't necessarily just do exploration by exploiting the current policy. You should really ask yourself: I have my data set, where in this distribution of experience should I collect more data in order to make the extraction of the policy even more successful the next time? So you should actively look for holes in your experience set and then actively try to capture that kind of experience. And I know, if I say it like this, it sounds more or less trivial, but I don't see a lot of work in exactly that direction. With Collect and Infer, I just wanted to make people aware that exploration is the crucial other part: once we get the inference process right, going from data to a policy, then collecting the most important data, the most relevant data, to improve the policy is the next question, and having this viewpoint will probably change the way we do exploration in the future.

Martin:

So Collect and Infer is rather a philosophy that we try to orient our work towards, rather than a set of solution methods that already solve the problem in the best way. But hopefully with our current and future research in this Collect and Infer philosophy, we can help make reinforcement learning controllers more and more data efficient, even in very complex reinforcement learning control problems.

Robin:

So in this paradigm, do you try to interleave exploration with exploitation, or are you trying to do both at the same time? And I'm seeing some parallel here to, say, maximum entropy RL, where there's this constant drive to push the envelope of what we've seen while still trying to perform.

Martin:

I think, from the philosophical standpoint, it's really about doing it in different stages: to say, okay, you currently have a set of experience. Take your time. Do inference on this. Probably not only learn a policy but also try to get some auxiliary rewards out of it. Take your time, learn a model, take your time, think of different representations, and do everything that you want.

Martin:

And once you're ready, then carefully decide where to collect your next 1,000 data points. Then you do the collection, and then you go back to the inference step. So they are really two separate stages. There's no reason to rush through this process, at least in theory. That's the mindset of Collect and Infer.
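As a minimal sketch of this two-stage viewpoint, the skeleton below alternates a deliberate inference stage over one growing memory with a deliberately chosen collection stage. Every interface here (`infer`, `choose_collection_policy`, the environment methods, the round sizes) is a placeholder assumption, not a specific published algorithm.

```python
# Skeleton of the collect-and-infer viewpoint: two separated stages over a
# single growing memory of transitions.
def collect_and_infer(env, infer, choose_collection_policy,
                      rounds=10, steps_per_round=1000):
    memory = []
    for _ in range(rounds):
        # INFER: take your time, squeeze everything out of the stored data
        # (policy, models, auxiliary skills, representations, ...).
        knowledge = infer(memory)
        # COLLECT: decide deliberately where the data set has holes and
        # gather the next batch of experience there.
        behavior = choose_collection_policy(knowledge, memory)
        obs = env.reset()
        for _ in range(steps_per_round):
            action = behavior(obs)
            next_obs, reward, done = env.step(action)
            memory.append((obs, action, reward, next_obs, done))
            obs = env.reset() if done else next_obs
    return infer(memory)
```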

Robin:

It's always struck me in RL that there's been so much emphasis, like you say here clearly, on the knowledge inference part, like computing that policy or the Q function, and the exploration part has been kind of weak. In a way, it takes much more intelligence to do the exploration correctly than to do the inference. It might be a much deeper question in terms of intelligence than the actual inference part.

Martin:

Absolutely, I couldn't agree more. If you look at us humans, we are very much aware of what we already can do, and we wouldn't really explore if we want to grasp a pen or something like that; there's no reason to try all kinds of different strategies. But, for example, if I want to throw the pen for the first time and hit a basket or so, then there are probably different ways that I wanna explore, and we are completely aware of this, so we don't just do epsilon-greedy exploration of what, just by chance, might work.

Martin:

And this is what I mean: from the experience that we have, we probably also want to derive some models so that we get these ideas of where to explore next. Because, as I said, for me the notion of intelligence is very difficult to grasp, but I came to the conclusion that intelligence is the ability to efficiently come to a solution. And if you look at humans, I think this is a measure that we also apply to humans: if someone learns to read a book at 4, then it's probably more surprising to us than if someone needs 40 years to learn to read a book. So intelligence really has something to do with efficiency.

Martin:

And the more efficient we are in learning something new, the more likely it is that we have some novel insights during our lifetime, or invent something new during our lifetime. So efficiency, really exploiting all the knowledge that we have to be efficient in exploration and in the things we spend our lifetime on, that is really crucial for us, and it should be just as crucial for reinforcement learning agents.

Robin:

And I've heard you talk about generalized Collect and Infer, and this seems like a more ambitious and more abstract vision. Can you share a bit more about your vision there?

Martin:

Yeah. So, generalized in the sense that the basic Collect and Infer would be interested in refining the policy and deriving the policy from the experience. Generalized Collect and Infer would at the same time also, for example, learn a model of the environment to figure out what the next steps could be, to do some kind of planning. It would also, for example, learn new reward functions based on some curiosity measures, being aware of where it hasn't been in state space and where it's probably interesting to go next. It could also be something that learns a representation and learns the same policy that it has learned before, but with a completely different representation as input, because, for example, a different representation might generalize in completely novel ways.

Martin:

It might not be as good at solving the original task, but it might at least be useful for learning new tasks. So generalized Collect and Infer, for me, would mainly be about what kinds of knowledge we can also derive from the experience that, in the end, is important for the agent to behave better in the environment. The policy is the part that we see, so the policy would be more and more sophisticated, but the knowledge inside the agent is much richer, probably comparable to a human who can, for example, shoot a ball with their foot, but at the same time can also explain why they are passing exactly to that teammate and not to the other. So there's a lot of additional knowledge generated from the experience, and generalized Collect and Infer would mean exactly that: we are not just directly optimizing policies, but optimizing policies with the help of other forms of knowledge that are useful to make the agent more and more efficient during its lifetime.

Robin:

Are you interested in explainability of these real-world systems? Do you feel like explainability is very important, or is there a way to make these systems safe without explainability? How do you feel about explainability and its role in RL, especially in the real world?

Martin:

Yeah, I think explainability in general is very important. I'm personally not so much interested in explainability per se, but I think if agents are actually able to reflect on what they are doing, they will get better at exploration, and that's the aspect of explainability I would be interested in. So for me, explainability is probably something that would emerge from an agent that is more and more aware of its environment and of how the environment connects, rather than something that I would build in in the first place for the reason of making AI systems safer.

Martin:

Again, I think explainability per se is a very important research topic. It's just that my interest in explainability comes again from this perspective: it is probably something that will emerge once we require our agents to be more efficient. One concrete example would be: if an agent learns to push something from images, then there's no reason for it to understand that what it pushes is an object, or to have the idea of an object. But if an agent needs to push several things in the environment, then it's probably very useful to develop the idea of an object in an internal representation, to generalize what it has learned before, and then it could also explain what an object means to it.

Martin:

So it comes basically through the backdoor of efficiency.

Robin:

So explainability leading to better efficiency through understanding, something like that?

Martin:

Yeah, exactly. Yeah.

Robin:

Interesting. Okay. But then in general, aside from explainability, just talking about safety in general, how do you think about the process of deciding that a controller could be used in the real world? For example, with the tokamak controller, is it enough that it worked in the experiments so far? I guess the dimensionality of these controllers is so high that being sure it's going to do something sensible across all the possible state space seems like a challenge. So how do you think about that challenge, or is that a separate challenge from what you're focused on?

Martin:

I think it's a very important challenge, and if we want reinforcement learning to be applicable in the real world, we need to solve that problem. It's currently not the inner focus of what we are doing, because we are still fighting to get these reinforcement learning controllers to do something reasonable, data efficiently, in more and more complex tasks. There are different answers for safety. In the tokamak example, we were sure that nothing bad could happen because there are explicit safety measures built in around it. So the regime in which the agent could act was a safe regime, and there was no way that the agent could come up with an action that was not safe.

Martin:

So that was one thing. But on the other side, it happened that the agent did something surprising that was not caught by the safety system. This was not safety critical, but it was something the engineers didn't want because it caused mechanical stress. And this can only be seen once you actually apply it, or at least once you look at the simulation and see, oh, this is something that we don't want; this only arises through testing. So one way to provide general safety is to have an envelope where you say, okay, the agent is only allowed to optimize its behavior within that envelope, and I can guarantee with classical control theory that this envelope is always safe because it stays within certain bounds.

Martin:

And this is completely imaginable for a couple of applications. But generally, what I would really like to see more of is reinforcement learning and classical control engineering getting closer connections, because there are also numerical approaches in classical control theory, and they also have to deal with exactly the same safety questions, the same robustness questions. And I think we can learn a lot from each other once these areas get closer together.

Robin:

It definitely seems like there's a giant chasm between those areas still, or if not, I have not seen what the bridge really is. I'm not into classical controls, but for the people I know who are, the terminology is so very different. All the assumptions are so very different, and they might wonder, when they look over their shoulder at the RL side, wow, how does anything work on this side? How are we sure of anything on this side? There are so few assumptions and so few constraints.

Robin:

And then when you look at their side, we say, wow, how are you able to do anything, because your tools seem in a sense so simplistic and you have to model everything in such great detail in advance. So, yeah, is there a middle ground right now? Or where is that middle ground?

Robin:

Or maybe you're building that middle ground?

Martin:

I think one way to bridge this gap is to do applications that are interesting. Like the fusion work: I think people from control theory will look at it and say, oh, this might be interesting, there is something in reinforcement learning that could help us; it's probably an interesting area to do research in. So I think by doing more and more applications, and people becoming aware of them, there will also be interest from the theoretical side to bring their stability methods, their robustness methods, closer to this control domain.

Martin:

I think the area where it's probably already happening a bit is in model predictive control, where you learn models by machine learning, and they have their models that they already trust. This is an area where model predictive control is already applied in practical applications, and there the gap is probably smaller than in this completely model-free reinforcement learning scenario that we are usually in. But I think the more successful reinforcement learning is, and the more impressive demonstrations we do that this can actually work, the more interest there will also be from the classical control people to say, okay, there's really a chance for us to bring our tools to these domains, let's try to do it. And I think there are a lot of things that can be done, like linearization of our controllers, or learning our controllers with neural networks but then bringing them into a certain linearized form that they are familiar with.

Martin:

So I think there are some obvious contact points, and once more people enter this area, I think there will be more and more sophisticated techniques to actually tackle this very, very important point of safety and reliability for these reinforcement learning controllers.

Robin:

Just going back to Collect and Infer, I noticed in one of your talks you referred to Sutton's Horde architecture, which is something that comes up every once in a while, and I wonder if you could just briefly make the connection to Rich Sutton's Horde architecture. What is the connection there?

Martin:

So I think the connection is that Horde was also based on these ideas of making a lot of predictions about what might happen, and this is very close to how we see it too. If you have collected a lot of experience, then you are probably not only able to solve the one task for which you collected the experience; there are also a lot of other tasks that are basically buried in the data, to which you can immediately apply offline reinforcement learning and derive useful controllers. And these controllers might then help you to explore better. So in that sense, it's very related to the spirit of Horde, that you can learn much more from the data. Where I don't necessarily agree: Rich Sutton is very much a fan of online reinforcement learning, so he doesn't like explicit memory in the loop a lot, and I can see where this comes from, because the brain probably doesn't have an explicit memory of all of our experience in the form of observation, action, next observation.

Martin:

It also has to deal with this kind of online thing. But I think, unless we understand completely how the brain works and how we can put all the data that we've collected into big neural networks, this explicit memory idea allows us to do a lot of things very data efficiently, and in a different way than the original Horde architecture did it, because we are able to always revisit the data that we have experienced. We can derive new skills from the old data, put these skills into a skill library, and then much more quickly build a very capable agent. But the original idea of Horde, to learn a lot of predictions at the same time, was very much an inspiration for the Scheduled Auxiliary Control work that we came up with later, which is built on this Collect and Infer idea.

Robin:

Yeah, that makes total sense. I did a project based on Horde back in 2019 for one of these, and I had a predictive model that fed its output into the RL. That was the relation to Horde here, having the separate predictive model. And I didn't know if it was a good idea.

Robin:

I didn't have enough time to work out all the issues, but the model wasn't perfect and that caused problems downstream. The whole thing worked, but it didn't have great performance. I learned a lot from doing it, but I've always wondered how far you could take that concept. And it seems like with collect and infer you've generalized and taken that vision much further, which is exciting.

Martin:

Yeah, I fully agree. That's a very interesting area to explore further.

Robin:

So you talk about artificial general control intelligence. This is a phrase I haven't heard anywhere else. What do you mean by AGCI, and how does it relate to AGI?

Martin:

Yeah. So I think it was basically a way to position my team at DeepMind, the control team. The idea of DeepMind from the beginning was to solve artificial general intelligence, or to solve intelligence, to understand intelligence. I thought that for a small team like mine it would be a good idea to be a bit more focused, and therefore I added this term control intelligence. I see it as something that works on the direct interaction with the real world, so this closed-loop control is very important, and we aim at understanding how to build these closed-loop controllers. The generality comes from the fact that we want to bring in as little prior domain knowledge as possible. So we are really interested in concepts that can work over a wide range of applications and are not just tailored to one single application.

Martin:

And that's basically the mission of the control team: to come up with controller designs that are general, that rely only on general principles, that can work directly from interacting with the real world, from their own experience, and that still achieve very powerful final control performance. So data efficiency is one aspect, so that it can actually be applied to robots and collect the experience there. The other important aspect is that we want to bring in as little prior domain knowledge and engineering as possible, and this is what the generality in the AGCI term expresses.

Robin:

And do you foresee this as a collection of methods and insights, or do you foresee it as an artifact, one controller to rule them all that you're constructing? How do you see that playing out?

Martin:

Yeah, ideally. That is the final dream: to have an agent where you just plug in the sensors, tell it what the actions are, tell it the goal, and it just runs and solves the task. That is the final vision we are working towards, in 10 years, 20 years, 30 years, 50 years, I don't know. But at least that's the north star we're heading towards.

Robin:

But looking back to the years before GPT-3 came out, I think it's fair to say that at least some people thought the first powerful AIs would be RL agents trained in the way we were seeing at the time, tabula rasa, without prior knowledge. But with the advent of these powerful LLMs in recent years, their focus on pre-training, and the current crop of LLMs doing supervised fine-tuning, there have been comments along the lines of: how important is RL really? How do you feel about how central RL is on this whole journey to AGI?

Martin:

Yeah. So for me personally, I think RL will come back, and it will play an important role at least at this lower level of interaction with the real world. At the moment I can't see another way to learn these controllers than by RL, but I completely admit that this may be because I have been working on RL for all my research life, and therefore I might just be blind to other approaches. I'm also completely blown away by the performance of these large language models. I never thought that this kind of generalization we see, this understanding of concepts, was possible at all.

Martin:

So this is absolutely amazing, and this is one of the biggest learnings I take from these large language models. However, if you look at it, there are humans who provide this whole corpus of language, who provide all the texts, so there is a lot of intelligent pre-work already going in before you have a dataset for a large language model. Humans are still in the loop, and these humans learned from experience, formed concepts, acted in the real world, and all of that was needed in order to arrive at the ability to have language and to produce all these texts. And I think this process, how you get from experience to something that you can formulate as text, is still not understood. Humans are also still in the loop for the interpretation of the outcome of a language model, to bring it back to the world.

Martin:

So, at least from my philosophical standpoint, the original mission of AI, the original promise, was that every single aspect of intelligence could be described so precisely that it can be done by a machine. And therefore, for me at least, the conclusion at the moment is: as long as we have humans in the loop, we haven't solved AI as a whole. I currently see large language models as something that is absolutely amazing and that is probably the solution for these higher cognitive levels, but that still falls short of solving the whole AI picture. And it might be that in the future we extend the knowledge starting from these large language models and solve all the problems of AI.

Martin:

I can imagine this happening. But it might also be that in order to solve this low-level interaction with the world, doing something reasonable, building on this experience, doing it efficiently, RL will still be an important component. And that is my current modus operandi, my philosophy of work: unless I'm proven wrong, I will try to push this reinforcement learning viewpoint as far as possible.

Robin:

We see a lot of work connecting LLMs and robotics. How much do you think LLMs could help with your vision of artificial general control intelligence, if at all?

Martin:

So we also did a bit of work, together with an intern, to bring in the knowledge of LLMs, for example for doing better data collection, for improving this exploration process. My personal view is that it is basically a possibility to bring common-sense knowledge directly into the agent. I see it a bit as the part that is currently done by the reinforcement learning engineer, the one who says, oh, we need a shaping function here, or we use that kind of representation to learn that reinforcement learning task. I think that part could probably be done completely by a large language model, and then we have basically made the agent much more autonomous, because this is no longer a human but knowledge that is automatically available, and that would be a big advantage. I think there's also another bet: that these large language models are so good because they have so much prior knowledge that they could actually bring their generalization capabilities even down to the lower level of control.
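
As a very speculative sketch of that idea, one could ask an LLM to take over part of the reinforcement learning engineer's job, such as proposing a shaping-reward function for a new task. `query_llm` below is a hypothetical placeholder for whatever LLM API is available; nothing here reflects the actual internship project mentioned in the interview.

```python
# Hypothetical sketch: using an LLM to propose a shaping reward for an RL task.

def query_llm(prompt: str) -> str:
    """Placeholder: send the prompt to some LLM and return its text response."""
    raise NotImplementedError("plug in your LLM client here")

def propose_shaping_reward(task_description: str, observation_spec: str) -> str:
    """Ask the LLM for Python source of a shaping-reward function for this task."""
    prompt = (
        "You are helping design a reinforcement learning agent.\n"
        f"Task: {task_description}\n"
        f"Observation fields: {observation_spec}\n"
        "Write a Python function shaping_reward(obs) returning a float that "
        "gives dense feedback toward solving the task."
    )
    return query_llm(prompt)  # any returned code would be reviewed before use
```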

Martin:

With that, I think we just have to see what happens in the future. I'm a bit skeptical about this direction, to be honest, but on the other side, it could also turn out to be very powerful: we just have to provide enough examples of previous successful control, and then all of a sudden we see similar generalization effects as we have seen in language. The Gato work from DeepMind is one example in this direction, RoboCat is another. So there are a lot of very interesting works exploiting what we have learned there: that scaling can lead to amazing generalization capabilities, beyond the typical picture where in-distribution generalization works but out-of-distribution generalization is not possible. That I find very exciting, and I think it is still orthogonal to the work on basic RL concepts.

Martin:

But how this exciting area develops, I think we will see in the near and medium-term future.

Robin:

Martin Riedmiller, this has been an absolute pleasure and a total honor. Thank you so much for sharing your time and your insight with our audience today. Thank you, Martin Riedmiller.

Martin:

Thanks, Robin, for giving me the opportunity. It was a pleasure for me too, I really enjoyed the conversation. I hope it was interesting for everyone, and thanks again.

Creators and Guests

Robin Ranjit Singh Chauhan
Host

Martin Riedmiller
Guest