Jacob Beck and Risto Vuorio

[00:00.000 - 00:12.240] TalkRL Podcast is all reinforcement learning, all the time, featuring brilliant guests, both
[00:12.240 - 00:18.280] research and applied. Join the conversation on Twitter at TalkRL Podcast. I'm your host,
[00:18.280 - 00:24.000] Robin Chauhan.
[00:24.000 - 00:29.240] Today we're joined by Jacob Beck and Risto Vuorio, PhD students at the Whiteson Research
[00:29.240 - 00:33.680] Lab at the University of Oxford. Thank you so much for joining us today, Jacob and
[00:33.680 - 00:34.680] Risto.
[00:34.680 - 00:36.520] Hey, thanks very much. Yeah, great to be here.
[00:36.520 - 00:41.440] We're here to talk about your new paper, A Survey of Meta-Reinforcement Learning. So
[00:41.440 - 00:47.320] we have featured meta-RL on the show in the past, including the head of your lab, Professor
[00:47.320 - 00:53.040] Shimon Whiteson, who covered VariBAD and more. That was episode 15. Sam Ritter in episode
[00:53.040 - 00:59.520] 24, Aleksandra Faust in episode 25, Robert Lange in episode 31. Hope I'm not missing
[00:59.520 - 01:04.160] any. So we have touched on this before, but never in this comprehensive way. And this
[01:04.160 - 01:10.160] paper is just really a tour de force. Really excited to get into this. So to start us off,
[01:10.160 - 01:14.560] can you tell us how do you define meta-RL? How do you like to define it?
[01:14.560 - 01:21.280] Yeah, so meta-RL is learning to reinforcement learn, at the most basic level. So reinforcement
[01:21.280 - 01:26.200] learning is really slow and sample-inefficient, as we all know. And meta-reinforcement learning
[01:26.200 - 01:30.800] uses this slow reinforcement learning algorithm to learn a fast reinforcement learning algorithm
[01:30.800 - 01:33.480] for a particular domain of problems.
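As a rough formalization of that "slow outer loop learns a fast inner loop" idea (standard notation rather than an equation quoted from the survey): tasks, i.e. MDPs, are drawn from a distribution, and the outer loop maximizes expected return of the learned inner-loop algorithm across that distribution.

```latex
% Sketch of the meta-RL objective (notation is a common convention, not the survey's exact equation):
% M ~ p(M) is a sampled task (MDP); f_theta is the learned inner-loop (fast) RL algorithm,
% which interacts with M over a trial of H steps.
J(\theta) \;=\; \mathbb{E}_{\mathcal{M} \sim p(\mathcal{M})}\;
\mathbb{E}_{\tau \sim f_\theta,\,\mathcal{M}}
\left[\, \sum_{t=0}^{H-1} \gamma^{t}\, R_{\mathcal{M}}(s_t, a_t) \,\right]
```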
[01:33.480 - 01:39.520] And why is meta-RL so important to you guys that you're willing to put all this work into
[01:39.520 - 01:42.520] producing this giant paper? Why is it important to you and your lab?
[01:42.520 - 01:49.680] Yeah, so as Jake hinted at there, sample efficiency is a big issue in reinforcement
[01:49.680 - 01:57.280] learning. And meta-RL is then a pretty direct way to try to tackle that question.
[01:57.280 - 02:04.680] So you can train an RL algorithm that then will be more sample efficient in the sort
[02:04.680 - 02:10.000] of test tasks you're interested in, if that makes sense. So I think that's the big motivation.
[02:10.000 - 02:17.740] And then also, meta-RL as a problem comes up a lot in subtle ways when you're
[02:17.740 - 02:24.600] working in more complicated settings otherwise, but maybe we'll get to that as we talk more.
[02:24.600 - 02:30.120] And how does meta-RL relate to, say, auto-RL? Are those two things related?
[02:30.120 - 02:36.080] So we were just talking about this. Auto-RL is any way you can automate a piece of the
[02:36.080 - 02:41.720] RL pipeline. And it could be, you know, learning, it could be heuristics, it could be other
[02:41.720 - 02:49.400] methods. Meta-RL specifically is when you learn to replace a particular component in
[02:49.400 - 02:55.180] the RL algorithm. So it's learning an RL algorithm as opposed to selecting a particular heuristic
[02:55.180 - 03:01.400] to do that. So in that sense, you can view meta-RL as a subset of auto-RL. But meta-RL
[03:01.400 - 03:06.120] is also a problem setting. So as we mentioned in the paper, like a distribution of MDPs
[03:06.120 - 03:10.720] kind of defines the meta-RL problem. And I think auto-RL isn't really a problem setting
[03:10.720 - 03:11.880] in that same sense.
[03:11.880 - 03:17.240] The meta-RL problem setting is really central. And that's kind of where most of this work
[03:17.240 - 03:23.760] comes from as well. So yeah, and I feel like auto-RL can handle
[03:23.760 - 03:29.920] any task. It doesn't have to be a particular problem setting where you
[03:29.920 - 03:30.920] would use it.
[03:30.920 - 03:36.360] Now, to help ground this a little bit, you pointed in your paper to two classic meta-RL
[03:36.360 - 03:41.880] algorithms from back in the early days of deep RL, which is really when I started reading
[03:41.880 - 03:49.760] RL papers. And these two illustrate some really important points that maybe can help us understand
[03:49.760 - 03:53.960] these concepts going forward and for the audience. So you mentioned MAML, that was from Berkeley
[03:53.960 - 04:01.040] back in 2017, and RL squared from back in 2016. And that was a funny one,
[04:01.040 - 04:04.360] because there were two papers that came out at almost the same time, from OpenAI and DeepMind,
[04:04.360 - 04:09.640] with very similar ideas. But can you briefly describe these two, just to
[04:09.640 - 04:14.560] get us grounded in this? What do these algorithms do? How do
[04:14.560 - 04:16.840] they work? And how are they different as well?
[04:16.840 - 04:23.040] Yeah, so let me start with MAML, and maybe Jake can then explain RL squared. So MAML is
[04:23.040 - 04:29.080] a very iconic meta-RL algorithm from, as you said, the early days of
[04:29.080 - 04:34.440] meta-RL in the deep RL wave. There are, of course, earlier works that
[04:34.440 - 04:41.080] do meta-RL from the 90s, but there's been a big jump in popularity
[04:41.080 - 04:46.840] more recently. So where MAML starts from, and I think the intuition there is really the
[04:46.840 - 04:54.720] key to it, is that in deep learning, pre-training is a big thing
[04:54.720 - 05:01.160] people do. You train your convolutional neural net on ImageNet. And then you have
[05:01.160 - 05:06.280] some application where you want to classify produce in the
[05:06.280 - 05:09.880] supermarket or something like that, for which you have way less data. So what
[05:09.880 - 05:14.320] you can do is use the pre-trained model and then fine-tune it on the task you're
[05:14.320 - 05:20.320] interested in. And RL still doesn't have a lot of that, and especially in 2016 didn't
[05:20.320 - 05:28.280] have any of that. So what MAML does is take on this question very explicitly:
[05:28.280 - 05:35.400] whether we can use meta-learning to produce a better initialization, like a pre-trained
[05:35.400 - 05:42.000] network, that's then quick to fine-tune for other tasks. So essentially you
[05:42.000 - 05:49.640] take a big distribution of tasks, and then you train a network using an algorithm of
[05:49.640 - 05:55.040] your choice on those tasks. And then you backpropagate through that learning algorithm
[05:55.040 - 05:59.760] to the initialization, such that the learning algorithm in the middle can make
[05:59.760 - 06:03.080] as fast progress as possible, if that makes sense.
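To make that "backpropagate through the learning algorithm to the initialization" idea concrete, here is a minimal MAML-style sketch in PyTorch. The task objects and their loss functions are hypothetical placeholders; in meta-RL the inner loss would typically be a policy-gradient surrogate built from trajectories collected with the current parameters.

```python
# Minimal second-order MAML sketch (illustrative, not the original implementation).
# `task.adaptation_loss` / `task.evaluation_loss` are hypothetical callables that map
# a list of parameter tensors to a scalar loss (e.g. a REINFORCE surrogate).
import torch

def inner_update(params, task_loss_fn, inner_lr=0.1):
    """One inner-loop gradient step; create_graph=True keeps the graph so the
    outer loop can differentiate through the update."""
    loss = task_loss_fn(params)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return [p - inner_lr * g for p, g in zip(params, grads)]

def maml_outer_step(params, tasks, meta_optimizer):
    """Adapt to each sampled task, evaluate the adapted parameters on fresh data,
    and update the shared initialization."""
    meta_loss = 0.0
    for task in tasks:
        adapted = inner_update(params, task.adaptation_loss)
        meta_loss = meta_loss + task.evaluation_loss(adapted)
    meta_optimizer.zero_grad()
    meta_loss.backward()  # gradient flows back through the inner updates
    meta_optimizer.step()
```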
[06:03.080 - 06:08.640] So that sounds a bit like a foundation model, but for your specific setting. Is
[06:08.640 - 06:09.640] that similar?
[06:09.640 - 06:14.480] Yeah, I think so. I mean, I think the mechanics work out quite differently,
[06:14.480 - 06:17.920] but the motivation is definitely there.
[06:17.920 - 06:22.760] Yeah, I think Risto did a good job of summarizing MAML. I guess most simply put, it's just meta-learning
[06:22.760 - 06:28.200] the initialization, and on the spectrum we talk about in the paper, of generalization
[06:28.200 - 06:33.200] versus specialization, MAML is at one end of the spectrum. So it's just learning an
[06:33.200 - 06:39.880] initialization, and the inner loop, the fast
[06:39.880 - 06:43.440] reinforcement learning algorithm that's actually learned, is all hard-coded other than
[06:43.440 - 06:48.360] the initialization. And so from that, you can get certain properties like generalization
[06:48.360 - 06:51.920] to new tasks that you haven't seen during training. And at the other end of the spectrum,
[06:51.920 - 06:57.160] we have RL squared and L2RL, which are both papers that came out around the same time
[06:57.160 - 07:03.040] doing roughly the same thing. So RL squared, I think, was Duan et al. and L2RL was Wang
[07:03.040 - 07:09.320] et al. And the idea, more or less, in these papers is that the inner loop, the reinforcement
[07:09.320 - 07:13.480] learning algorithm that you're learning, is entirely a black box. It's just a general
[07:13.480 - 07:18.200] function approximator. So, you know, it tends to be a GRU or LSTM. That's kind of at the extreme
[07:18.200 - 07:20.760] other end of the spectrum from MAML.
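For contrast, a rough sketch of the RL-squared-style black box: a recurrent policy whose inputs include the previous action, reward, and done flag, and whose hidden state is carried across episodes within a trial, so all of the "fast learning" lives in that hidden state. The dimensions and architecture here are assumptions, not the papers' exact setups.

```python
# Rough RL^2 / L2RL-style black-box sketch (architecture details are assumptions).
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    def __init__(self, obs_dim, num_actions, hidden_dim=128):
        super().__init__()
        # Input = observation, one-hot previous action, previous reward, done flag.
        self.gru = nn.GRU(obs_dim + num_actions + 2, hidden_dim, batch_first=True)
        self.logits = nn.Linear(hidden_dim, num_actions)

    def forward(self, obs, prev_action_onehot, prev_reward, prev_done, hidden=None):
        x = torch.cat([obs, prev_action_onehot, prev_reward, prev_done], dim=-1)
        out, hidden = self.gru(x, hidden)  # hidden is NOT reset between episodes in a trial
        return torch.distributions.Categorical(logits=self.logits(out)), hidden
```

The outer loop then trains this network end to end with an ordinary (slow) RL algorithm on the meta-training task distribution.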
[07:20.760 - 07:25.440] And so where would you apply these two different approaches? Or can you talk
[07:25.440 - 07:29.040] about the pros and cons of the spectrum?
[07:29.040 - 07:33.840] MAML has really found more popularity, I feel, in the few-shot classification
[07:33.840 - 07:40.160] setting more recently. It turns out actually doing MAML for reinforcement learning is really
[07:40.160 - 07:46.160] challenging, so I don't know if it's a big baseline or anything anymore. But
[07:46.160 - 07:52.560] the nice thing about the basic algorithm in MAML is that since the inner loop, the algorithm
[07:52.560 - 07:58.160] you're learning, is usually just a policy gradient reinforcement learning
[07:58.160 - 08:04.880] algorithm, there are some guarantees that it'll do reasonably well on any task you throw
[08:04.880 - 08:11.640] at it. So even if your initialization was kind of off, even if the task distribution
[08:11.640 - 08:17.480] you trained it for isn't the one where you eventually deploy the initialization,
[08:17.480 - 08:23.800] there's still hope that the policy gradient algorithm will recover from that. So
[08:23.800 - 08:31.120] it has somewhat better generalization performance than RL squared
[08:31.120 - 08:32.200] would have.
[08:32.200 - 08:36.560] And I would add to that the horizon also matters, right? So in MAML, you actually are computing
[08:36.560 - 08:40.200] a gradient, you're computing a policy gradient, so you need to collect some data for that.
[08:40.200 - 08:44.960] If your performance matters, you know, from the first time step of training, then RL squared
[08:44.960 - 08:48.400] is kind of more the algorithm you would use.
[08:48.400 - 08:54.800] Did you have a clear idea of how you would do this categorization and how things would
[08:54.800 - 08:59.600] be organized before you did the paper? Or is that something that really came about through
[08:59.600 - 09:03.720] a lot of careful thought and analyzing what you learned through your readings?
[09:03.720 - 09:08.600] Yeah, I mean, we had a few false starts. It was kind of a mess at the beginning. We
[09:08.600 - 09:13.240] started by proposing a taxonomy for the survey of meta-learning papers. And then we
[09:13.240 - 09:16.800] quickly realized that the literature just didn't really reflect the taxonomy we had
[09:16.800 - 09:20.360] just sat down and thought of.
[09:20.360 - 09:26.080] So we had to kind of reorganize based on what the literature reflected, so the main clusters
[09:26.080 - 09:30.560] of literature, and then we had to be pretty careful about divvying up work within each
[09:30.560 - 09:31.560] of those.
[09:31.560 - 09:37.240] In retrospect, I think the structure is kind of what you would find from the literature,
[09:37.240 - 09:39.400] but we definitely didn't start from this one.
[09:39.400 - 09:44.640] Cool. So let's get into the different settings that your paper mentions, where meta-RL
[09:44.640 - 09:50.000] can be applied. Do you want to cover the main settings? Can you give us a brief description
[09:50.000 - 09:53.960] or an example of what each setting would look like?
[09:53.960 - 10:02.520] Yeah, so let me just get started here. So we have two axes along which we distinguish
[10:02.520 - 10:10.640] these meta-RL problems. So there's zero or few shot versus many shot. That has
[10:10.640 - 10:15.640] to do with the horizon of the task in the inner loop. So if you have something where
[10:15.640 - 10:21.800] you, as Jake mentioned earlier, if you have something where you want your agent to make
[10:21.800 - 10:27.360] as much progress as possible from the first time step it's deployed in the environment,
[10:27.360 - 10:32.040] then you're kind of in this zero or few shot regime. And usually those are tasks where
[10:32.040 - 10:39.940] you then are also expected to do really well after, you know, a small number of steps.
[10:39.940 - 10:45.480] So originally, these were the kinds of things where you have maybe a
[10:45.480 - 10:50.480] MuJoCo environment where you have a cheetah robot running around and you need to decide
[10:50.480 - 10:55.960] which way to run with the cheetah; that would be sort of a canonical early task from
[10:55.960 - 10:59.800] there. And they're more complicated now, but that's roughly the order of
[10:59.800 - 11:08.480] magnitude we're thinking of here. And then many shot is more about
[11:08.480 - 11:13.360] learning a sort of long-running RL algorithm in the inner loop. So something you can think
[11:13.360 - 11:18.840] of like, you would want to meta-learn an algorithm that you can then use to update your policies
[11:18.840 - 11:24.080] like 10,000 times. So it could be 10,000 episodes, it could be, you
[11:24.080 - 11:32.120] know, an hours- or days-long training run, using the learned
[11:32.120 - 11:36.200] reinforcement learning algorithm, the inner loop of the meta-learning algorithm.
[11:36.200 - 11:40.000] So in that case, you're not worried about performance at the beginning?
[11:40.000 - 11:43.760] Yeah, basically, right. You would assume that you essentially start from
[11:43.760 - 11:47.280] a really random policy. Of course, you still try to make
[11:47.280 - 11:53.920] as fast progress as possible, but if the inner loop is modeled after, let's
[11:53.920 - 11:59.400] say, a policy gradient algorithm, then you're going to need some amount of samples just
[11:59.400 - 12:06.240] to get reasonable gradient estimates for the update. So it won't get started
[12:06.240 - 12:08.880] in any kind of zero-shot manner, for sure.
[12:08.880 - 12:13.360] Okay, so you're not evaluating it right away; test time doesn't start right away.
[12:13.360 - 12:14.360] Is that what you're saying?
[12:14.360 - 12:22.600] Yeah, usually you would evaluate the policy after, you know, hundreds up to thousands
[12:22.600 - 12:25.400] or tens of thousands of updates.
[12:25.400 - 12:28.760] The goal of that setting can be stated as learning a traditional RL algorithm.
[12:28.760 - 12:32.480] And the other axis of the setting is whether we're dealing with a single-task
[12:32.480 - 12:37.640] or a multi-task setting. And this is kind of a trippy thing, I
[12:37.640 - 12:43.520] guess. It isn't something that is discussed super often, especially in
[12:43.520 - 12:49.680] some parts of the meta-RL literature, but the single-task case is still very interesting.
[12:49.680 - 12:56.960] And the methods are actually very similar to the many-shot multi-task setting,
[12:56.960 - 13:02.240] where you would have a big distribution of tasks, and then you're
[13:02.240 - 13:07.640] trying to learn that traditional RL algorithm in the inner loop. It turns out you can actually
[13:07.640 - 13:14.840] just grab that meta-learning algorithm and run it on a single RL task,
[13:14.840 - 13:19.280] essentially, and still get reasonably good performance. So you can, you know,
[13:19.280 - 13:26.840] train agents on Atari where you actually meta-learn the objective function for
[13:26.840 - 13:33.080] the policy gradient that's then updating your policy, just on that single task.
[13:33.080 - 13:37.280] Oh, and I guess one important thing here is that there really isn't a few-
[13:37.280 - 13:44.800] shot single-task setting, because there needs to be some source of transfer.
[13:44.800 - 13:50.320] If you have multiple tasks, then what you do is
[13:50.320 - 13:55.120] train on the distribution of tasks, and then you maybe have a held-out set
[13:55.120 - 14:00.320] of test tasks where you test whether your learning algorithm works
[14:00.320 - 14:07.680] really well. If you're in the long-horizon, many-shot setting, then you
[14:07.680 - 14:15.120] can test the single-task case by comparing it to the kind of vanilla RL
[14:15.120 - 14:20.680] algorithm you would run over that. But in a zero-shot single-task setting, there
[14:20.680 - 14:25.440] isn't anything you can really test it on,
[14:25.440 - 14:30.760] and there's not enough room for meta-learning.
[14:30.760 - 14:35.600] No, it's a pretty difficult concept to explain. So I think you did a good job. But what you said basically is that,
[14:35.600 - 14:42.160] right, in the multi-task setting, you're transferring from one set of MDPs to a new MDP at test time;
[14:42.160 - 14:46.160] in the single-task setting, what you're doing is transferring from one part of a single
[14:46.160 - 14:50.840] lifetime in one MDP to another part of that same lifetime in the same MDP. So you have
[14:50.840 - 14:53.680] to have some notion of time for that to occur over.
[14:53.680 - 14:58.680] Awesome, I actually was going to ask you about that, like a missing square in the quadrant,
[14:58.680 - 15:04.560] right? And so that totally makes sense. So then you talked about different approaches
[15:04.560 - 15:11.120] for the different settings. Do you want to touch on some of the most important
[15:11.120 - 15:17.080] of those? So I guess, as we mentioned, MAML was kind of the prototypical algorithm in
[15:17.080 - 15:22.960] the PPG setting. But you can also imagine you can add additional parameters to tune
[15:22.960 - 15:28.040] other than just the initialization. So you can learn the learning rate, you can learn some
[15:28.040 - 15:33.120] kind of a curvature matrix that modifies your gradient, you can learn the whole distribution
[15:33.120 - 15:36.320] for your initialization instead of just a single point estimate for the initialization.
[15:36.320 - 15:40.560] And there's kind of a whole family of things that build on MAML. And the
[15:40.560 - 15:44.200] only thing that's consistent between them is that the inner loop involves a policy gradient.
[15:44.200 - 15:50.960] So we call those PPG methods or PPG algorithms, for parameterized policy gradient. And that's
[15:50.960 - 15:54.560] kind of the first category of methods we talked about in the few-shot setting.
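As one concrete illustration of those extra meta-parameters, here is a hypothetical variation on the MAML sketch above, in the spirit of methods that also meta-learn the inner-loop learning rate: the per-parameter learning rates are themselves trained by the outer loop alongside the initialization.

```python
# PPG flavour: the inner loop is still a (policy) gradient step, but the per-parameter
# learning rates `meta_lrs` are meta-learned by the outer loop (illustrative sketch only).
import torch

def ppg_inner_update(params, meta_lrs, task_loss_fn):
    loss = task_loss_fn(params)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return [p - lr * g for p, lr, g in zip(params, meta_lrs, grads)]
```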
[15:54.560 - 15:58.240] And did you guys coin that phrase, PPG?
[15:58.240 - 15:59.240] We're trying to.
[15:59.240 - 16:00.240] Cool, I like it.
[16:00.240 - 16:05.000] Thank you. Thank you. So yeah, that's the PPG setting. And then there's
[16:05.000 - 16:10.680] two other main types of methods in the few-shot setting. There's black box, and the main
[16:10.680 - 16:16.000] example, or prototypical algorithm, in that category would be RL squared. But you can also
[16:16.000 - 16:20.040] replace that black box with many different architectures: transformers, other forms of
[16:20.040 - 16:26.640] attention, and other memory mechanisms. And so there's a whole category of black box algorithms.
[16:26.640 - 16:32.800] And then I guess the only one we haven't really touched on yet is task inference methods.
[16:32.800 - 16:38.800] So the idea here is a little more nuanced. But meta-learning, as we mentioned, considers
[16:38.800 - 16:43.920] a distribution of MDPs, also known as tasks. What's different about the meta-learning
[16:43.920 - 16:49.080] setting from the multi-task setting is you don't know what task you're in. So you actually
[16:49.080 - 16:53.680] have to explore to gather data to figure out, hey, I'm supposed to
[16:53.680 - 16:56.920] run 10 miles per hour instead of five miles per hour, I'm supposed to navigate to this
[16:56.920 - 17:03.280] part of the maze as opposed to that part of the maze. And you can frame the
[17:03.280 - 17:07.280] setting as task inference. So I think Humplik et al. was one of the early papers that pointed
[17:07.280 - 17:10.480] this out: if you can figure out what task you're in, you've reduced your setting to
[17:10.480 - 17:13.800] the multi-task setting, and you've made the problem much easier. And so that kind of gave
[17:13.800 - 17:16.000] rise to a whole bunch of methods around task inference.
[17:16.000 - 17:21.880] So is that the scenario where you may have a simulator that has a bunch of knobs on it,
[17:21.880 - 17:24.800] and then your agent just has to figure out what the settings on the
[17:24.800 - 17:25.800] knobs are for the environment?
[17:25.800 - 17:28.960] Yeah, I mean, more or less, right, you can consider your environment parameterized by
[17:28.960 - 17:33.480] some context vector, you have to figure out what those parameters are. And then you can
[17:33.480 - 17:36.720] assume maybe that once you know those parameters, you have a reasonable policy already ready
[17:36.720 - 17:41.440] to go for that set of parameters. I think Shimon, maybe even on this podcast at one
[17:41.440 - 17:46.680] point, pointed out that like, if you can figure out what MDP you're in, you don't even need
[17:46.680 - 17:51.000] to do learning anymore, right? You can just do planning. If you know that MDP, you don't
[17:51.000 - 17:52.600] need to do any more learning at that point.
[17:52.600 - 17:55.160] Right, that's a really important observation.
[17:55.160 - 18:03.040] Right. So then we also have the many-shot setting, where the two
[18:03.040 - 18:09.320] major categories of things to think about here are the single-task and multi-task
[18:09.320 - 18:20.100] many-shot problems. The methods for both single-task and multi-task end up being quite similar,
[18:20.100 - 18:24.440] in terms of the kinds of things that people learn in the inner loop. So okay, let me try to
[18:24.440 - 18:31.820] be clearer about the many-shot setting one more time here. So basically the structure
[18:31.820 - 18:39.440] is that you take your policy gradient algorithm, A2C or whatever, and then you put some
[18:39.440 - 18:44.360] parameters into the loss function there. So maybe you have an intrinsic reward function
[18:44.360 - 18:52.840] or an auxiliary task or something of that flavor, and then you change
[18:52.840 - 18:58.920] the parameters of that loss function with the meta-learner. So there's this sort of
[18:58.920 - 19:04.320] outer-loop meta-learner that computes the gradient through the updated policy into the
[19:04.320 - 19:09.800] loss function parameters and changes those so that you get better performance from the
[19:09.800 - 19:20.240] agent. And this idea applies to both the single-task and multi-task settings. And
[19:20.240 - 19:26.480] I think one of the important topics there would then be, you know, what is the
[19:26.480 - 19:32.840] algorithm you're building on top of? What is the inner-loop base
[19:32.840 - 19:37.360] algorithm? And what is the way you're optimizing it, those kinds of things. And then the
[19:37.360 - 19:44.600] sorts of things that you learn there: intrinsic rewards are pretty big,
[19:44.600 - 19:50.640] auxiliary tasks, and you could have a more general parameterization of the
[19:50.640 - 19:57.480] RL objective function in the inner loop. So there are algorithms that just parameterize
[19:57.480 - 20:04.620] that very generally. And then one other thing people have considered is learning hierarchies.
[20:04.620 - 20:10.240] So hierarchical RL, maybe option discovery, for example, could be done in this
[20:10.240 - 20:15.920] long, many-shot meta-RL setting. When I think of one item on this list,
[20:15.920 - 20:23.120] intrinsic rewards, I remember when Pathak, you know, came up with this curiosity intrinsic
[20:23.120 - 20:28.320] reward and did that study. And I think his agent had like billions of
[20:28.320 - 20:34.440] steps for the curiosity to really do its thing. And that was not in meta-RL, that was just
[20:34.440 - 20:40.840] straight RL. So when I think about doing this in a loop, it seems like it could be
[20:40.840 - 20:45.200] maybe massively expensive. How do you think about the cost of these algorithms,
[20:45.200 - 20:52.120] and when it actually makes sense economically, or just when it makes sense to use
[20:52.120 - 20:55.680] these methods, and how expensive are they? Do you have a sense of that?
[20:55.680 - 21:01.360] Yeah, that's a great question. So for the few-shot learning setting, it's not really
[21:01.360 - 21:13.360] hugely different from just training an agent that can generally solve tasks
[21:13.360 - 21:17.800] of that flavor. There's of
[21:17.800 - 21:26.120] course an upfront cost to training the meta-learner, but then at test time
[21:26.120 - 21:30.880] it should be very efficient. I think the big costs come out in the many-shot setting,
[21:30.880 - 21:39.800] where you're trying to train a full RL algorithm in the inner loop, and then just
[21:39.800 - 21:45.800] being able to optimize that can take a whole lot of samples,
[21:45.800 - 21:53.480] for sure. The trick there is that these algorithms can generalize quite a bit.
[21:53.480 - 22:00.240] So there's a paper, I think it's by Junhyuk Oh and others from DeepMind, where they
[22:00.240 - 22:05.320] train an inner-loop algorithm on essentially grid worlds and
[22:05.320 - 22:11.200] bandits and those kinds of things. So they're training the inner-loop objective
[22:11.200 - 22:15.640] on very simple environments. And it still takes a whole lot of samples, it takes,
[22:15.640 - 22:21.580] I think, billions of frames, but in very simple environments, at least. And then they
[22:21.580 - 22:27.840] transfer that and show that it actually can generalize to Atari and produce roughly
[22:27.840 - 22:32.320] original-DQN-level performance there, which is pretty impressive to me. But I mean,
[22:32.320 - 22:40.360] yeah, it's the most expensive Atari agent at that performance level,
[22:40.360 - 22:44.880] for sure. One thing, I don't know if the question was intended to be this specific, but you
[22:44.880 - 22:49.320] mentioned it takes a while for the curiosity-based rewards to do their thing.
[22:49.320 - 22:53.760] Risto knows a lot more about this setting than I do, but my understanding is that
[22:53.760 - 22:57.480] generally for the intrinsic rewards, you don't actually try and meta-learn the propagation
[22:57.480 - 23:02.600] through the critic. So, you know, the meta-learned reward would be useful
[23:02.600 - 23:07.040] in the n-step return or the TD-lambda estimate, but I don't think you're actually
[23:07.040 - 23:10.960] meta-learning how to propagate that information through the critic. Is that right, Risto?
[23:10.960 - 23:14.880] Would that change the cost too much? I feel like they probably don't. It's just
[23:14.880 - 23:19.560] a little bit more memory cost in the backward pass,
[23:19.560 - 23:21.840] but it doesn't seem critical. I'm not sure.
[23:21.840 - 23:25.360] Sure, sure. But you don't need to do many steps of value iteration to try
[23:25.360 - 23:28.280] and figure out the effects of that through that process. Oh, yeah, of course, no.
[23:28.280 - 23:34.400] It's a huge approximation, in all kinds of ways, to compute the
[23:34.400 - 23:40.320] update for your intrinsic rewards. And one critical thing that the algorithms in
[23:40.320 - 23:47.000] that setting often do is that, in some sense, if you're in the many-shot multi-task setting,
[23:47.000 - 23:55.160] you want the intrinsic reward, or whatever you're training, to produce the best
[23:55.160 - 24:00.440] end performance of the agent when you train a new agent from scratch using that
[24:00.440 - 24:06.120] learned objective function. However long your training
[24:06.120 - 24:12.080] horizon is, you want the loss function that
[24:12.080 - 24:17.520] produces the best agent at convergence. But of course, backpropagating through the
[24:17.520 - 24:25.060] whole long optimization horizon in the inner loop would be extremely costly. So then people
[24:25.060 - 24:29.080] often truncate the optimization; this is truncated backpropagation
[24:29.080 - 24:34.120] through time, essentially. So you just consider a tiny window of updates within
[24:34.120 - 24:40.560] that inner loop, and then backpropagate within there to keep the memory costs reasonable.
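A hedged sketch of what that looks like in code: the inner loop makes differentiable policy updates using the environment reward plus a learned intrinsic reward, and the outer loop backpropagates an outer objective through only a short truncated window of those updates into the intrinsic reward's parameters. `policy_loss` is a hypothetical policy-gradient surrogate, and none of this is any specific paper's implementation.

```python
# Truncated meta-gradient sketch for a learned intrinsic reward (illustrative only).
import torch

def inner_step(params, batch, intrinsic_reward, inner_lr=1e-2):
    """One differentiable policy update on environment reward + learned bonus."""
    shaped = batch.rewards + intrinsic_reward(batch.observations, batch.actions)
    loss = policy_loss(params, batch, shaped)               # hypothetical surrogate
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return [p - inner_lr * g for p, g in zip(params, grads)]

def meta_update(params, inner_batches, eval_batch, intrinsic_reward, meta_opt,
                truncation=5):
    """Backprop through only `truncation` inner updates to keep memory bounded."""
    params = [p.detach().requires_grad_() for p in params]  # cut the older graph
    for batch in inner_batches[:truncation]:
        params = inner_step(params, batch, intrinsic_reward)
    # Outer objective: plain task return of the updated policy (no bonus here).
    outer_loss = policy_loss(params, eval_batch, eval_batch.rewards)
    meta_opt.zero_grad()            # meta_opt holds intrinsic_reward's parameters
    outer_loss.backward()           # gradient reaches the learned reward function
    meta_opt.step()
    return [p.detach() for p in params]
```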
[24:40.560 - 24:46.080] Okay, then you mentioned in your paper that exploration is clearly a central issue,
[24:46.080 - 24:52.760] especially for few-shot meta-RL. Can you talk about the importance of exploration
[24:52.760 - 24:57.600] in meta-RL and the main methods used for exploration in meta-RL? So exploration is
[24:57.600 - 25:03.960] kind of a central concept that makes meta RL distinct from just meta learning in general.
[25:03.960 - 25:07.680] So in meta learning, you might be given a new data set, you have to rapidly adapt to
[25:07.680 - 25:13.360] that new data set. In meta RL, you actually have to collect the new data yourself. And
[25:13.360 - 25:17.120] it might not be clear, you know how to do that or what data you need. So you have to
[25:17.120 - 25:21.240] explore to figure out what task you're in and what data you need to identify the task
[25:21.240 - 25:26.760] itself. And that's kind of one of the central challenges in the few shot meta RL setting.
[25:26.760 - 25:30.800] And here you're talking about exploration at the task level, not at the meta level,
[25:30.800 - 25:34.560] right, meta-exploration? Is this something you mentioned in a separate part of the paper?
[25:34.560 - 25:39.080] Yeah, meta-exploration was a bit distinct. So that's exploration in the space of exploration
[25:39.080 - 25:43.760] strategies. Yeah, I don't know if we want to unpack that statement more,
[25:43.760 - 25:49.000] but I guess first, some of the methods that are used for exploration:
[25:49.000 - 25:55.480] end-to-end learning is common, but it's difficult for really challenging exploration problems.
[25:55.480 - 25:59.560] So you can just do RL squared, where your inner loop, or, I
[25:59.560 - 26:03.940] guess, I'm not sure we actually have defined inner loop and outer loop so far in this discussion,
[26:03.940 - 26:07.480] but when we say inner loop, we mean the reinforcement learning algorithm that you are learning,
[26:07.480 - 26:10.600] and when we say outer loop, we mean the reinforcement learning algorithm, the slow one that you're
[26:10.600 - 26:16.500] using to learn the inner loop, which can just be PPO or something along those lines.
[26:16.500 - 26:20.680] And so you can just use the inner loop as a black box. And that can solve some exploration
[26:20.680 - 26:25.040] problems, but generally more challenging exploration problems won't be solved by RL squared and
[26:25.040 - 26:28.960] things that are just a complete black box. So people have tried building in more structure
[26:28.960 - 26:36.280] for exploration, ranging from posterior sampling to more complicated methods, often using task
[26:36.280 - 26:42.740] inference actually. So we mentioned task inference being this idea that you want to take actions
[26:42.740 - 26:45.700] or you know, identify what task you're in. And often you might need to take actions to
[26:45.700 - 26:50.280] figure out what task you're in. And if you need to take actions to figure out what task
[26:50.280 - 26:54.560] you're in, one way to do that is by saying, okay, we're going to give the agent a reward
[26:54.560 - 27:00.160] for being able to infer the task. There's some drawbacks for doing that directly, right?
[27:00.160 - 27:03.960] So you might be trying to infer the MDP, which is the transition function and the reward
[27:03.960 - 27:08.600] function. And there might be completely irrelevant information in that exploration process that
[27:08.600 - 27:12.440] you don't need. So if you're trying to figure out what goal to navigate to, let's say in
[27:12.440 - 27:15.800] a kitchen, you're trying to figure out like where the robot should be. And maybe there's
[27:15.800 - 27:18.920] some paintings on the wall. And the paintings on the wall are completely irrelevant to where
[27:18.920 - 27:24.160] you're supposed to be navigating right now to make someone's food. And in that case,
[27:24.160 - 27:27.320] there are also algorithms to tackle that. So one that we like to highlight is DREAM,
[27:27.320 - 27:32.960] from one of our co-authors on the paper, Evan. And there you actually learn what information
[27:32.960 - 27:37.360] is relevant first by doing some pre-training in the multitask setting, you figure out what
[27:37.360 - 27:41.840] information would an optimal policy need, an informed policy, and then you separately
[27:41.840 - 27:48.360] learn an exploration policy to try and uncover the data that allows you to execute the informed
[27:48.360 - 27:49.360] policy.
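A very rough sketch of that decoupled recipe as described here, a paraphrase of the high-level idea rather than DREAM's actual objectives: a task representation is learned in the informed, multi-task phase, and the exploration policy is then rewarded for gathering data that makes that representation recoverable.

```python
# Sketch of "reward exploration for uncovering task-relevant information" (paraphrase only).
import torch
import torch.nn as nn

num_tasks, latent_dim, transition_dim = 10, 8, 6
task_embedding = nn.Embedding(num_tasks, latent_dim)       # from the informed/multi-task phase
traj_encoder = nn.GRU(transition_dim, latent_dim, batch_first=True)

def exploration_reward(exploration_traj, task_id):
    """Reward exploration trajectories that make the (frozen) informed task
    representation recoverable from the data gathered so far."""
    with torch.no_grad():
        target = task_embedding(torch.tensor([task_id]))    # (1, latent_dim)
    _, h = traj_encoder(exploration_traj)                   # h: (layers, batch, latent_dim)
    return -torch.nn.functional.mse_loss(h[0], target)      # less error = more reward
```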
[27:49.360 - 27:52.400] There's a lot of concepts in this paper, I got to say, compared to the average paper.
[27:52.400 - 27:56.960] I guess that's the nature of survey papers. So I'm really glad you're here to help us
[27:56.960 - 27:57.960] make sense of it.
[27:57.960 - 28:04.720] Yeah. So we talked about a few different exploration methods. One that came from our lab is the
[28:04.720 - 28:08.960] VariBAD paper, which I think you already had Shimon on to talk about as well. It's
[28:08.960 - 28:13.440] a really cool method that allows you to quantify uncertainty in this task inference
[28:13.440 - 28:19.360] that we just mentioned. So what VariBAD does is train a VAE separately to reconstruct
[28:19.360 - 28:23.960] transitions, and it trains a latent variable, a mean and variance, in order to do that. And
[28:23.960 - 28:29.240] then you condition a policy on the inferred mean and variance. So you're explicitly conditioning
[28:29.240 - 28:33.360] on your uncertainty in the distribution of tasks. And you can actually frame that entire
[28:33.360 - 28:37.720] problem as what's called a BAMDP, or Bayes-Adaptive MDP.
[28:37.720 - 28:43.560] Yeah. VariBAD's treatment of uncertainty is so cool. That makes it really special to me.
[28:43.560 - 28:48.160] And I guess that's the magic of variational inference. Is that right?
[28:48.160 - 28:51.640] Yeah. I mean, it's variational inference, plus conditioning on that uncertainty for the meta-
[28:51.640 - 28:55.440] learning, that allows you to learn actually optimal exploration pretty easily.
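A very rough sketch of that structure, a simplification of the VariBAD idea rather than the paper's code: a recurrent encoder maps the transitions seen so far to a Gaussian belief over a task latent, a decoder trained to reconstruct rewards and transitions provides the VAE objective, and the policy conditions on the belief mean and variance, i.e. on its own task uncertainty.

```python
# VariBAD-flavoured sketch (illustrative; shapes, sizes, and wiring are assumptions).
import torch
import torch.nn as nn

class BeliefEncoder(nn.Module):
    """Maps the trajectory so far to a Gaussian belief over a task latent."""
    def __init__(self, transition_dim, latent_dim=5, hidden_dim=64):
        super().__init__()
        self.gru = nn.GRU(transition_dim, hidden_dim, batch_first=True)
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)

    def forward(self, transitions, hidden=None):
        out, hidden = self.gru(transitions, hidden)
        return self.mu(out), self.logvar(out), hidden  # belief after every step

class BeliefConditionedPolicy(nn.Module):
    """Policy that sees the state plus the belief mean AND (log-)variance."""
    def __init__(self, obs_dim, latent_dim, num_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2 * latent_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, obs, belief_mu, belief_logvar):
        x = torch.cat([obs, belief_mu, belief_logvar], dim=-1)
        return torch.distributions.Categorical(logits=self.net(x))
```

The decoder is omitted here; as described above, the latent is trained by the separate reconstruction objective, while the policy is trained with ordinary RL on top of the inferred belief.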
[28:55.440 - 28:58.160] Cool. Should we move on to supervision?
[28:58.160 - 29:10.080] Yeah, sounds good. Mostly we focus on the case where we have reinforcement learning
[29:10.080 - 29:15.240] in the inner loop and reinforcement learning in the outer loop. And I mean, most of the
[29:15.240 - 29:20.520] meta RL research is also in that setting. But similarly, as has happened with the kind
[29:20.520 - 29:28.560] of term RL or especially deep RL, it sort of subsumed a lot of other topics that also
[29:28.560 - 29:33.240] are doing some kind of machine learning for control. So like imitation learning, for example,
[29:33.240 - 29:37.680] like often people just say RL and sometimes they mean something that's more like imitation
[29:37.680 - 29:42.120] learning. So we also have a similar thing happening in meta RL, where there's a lot
[29:42.120 - 29:51.800] of meta imitation learning. I guess
[29:51.800 - 29:57.440] the most direct approach for that would be doing something like MAML for imitation learning.
[29:57.440 - 30:01.680] Imitation learning is sort of just supervised learning of control, and then
[30:01.680 - 30:07.760] you could just take the supervised learning version of MAML to learn a fast imitation
[30:07.760 - 30:12.240] learning initialization. But then you could have all these other variants as well, where,
[30:12.240 - 30:21.600] let's say, you're trying to learn an imitation learning algorithm which,
[30:21.600 - 30:28.400] when it's shown a demonstration of a new task, can do that as quickly as possible. You
[30:28.400 - 30:35.360] could meta-train that so that the meta-learning algorithm
[30:35.360 - 30:40.240] is still optimizing the reward of the task somehow; if you have access to
[30:40.240 - 30:43.440] the rewards, the outer loop could still be reinforcement learning. So now you have this
[30:43.440 - 30:47.880] setting where you have imitation learning in the inner loop and reinforcement learning
[30:47.880 - 30:54.080] in the outer loop. And then you would test it as an imitation learning algorithm:
[30:54.080 - 31:00.760] you show it a new demonstration, and you're expecting it to adapt to that as quickly
[31:00.760 - 31:08.320] as possible. And then of course, all the other permutations of that same setting
[31:08.320 - 31:15.600] apply, and people have done research on those. Then unsupervised learning
[31:15.600 - 31:25.480] is also a big topic. So people in meta-RL have looked into doing sort of unsupervised
[31:25.480 - 31:27.360] learning algorithms in the inner loop.
[31:27.360 - 31:32.200] Right, so unsupervised in the inner loop could be useful if you just don't have
[31:32.200 - 31:37.840] access to rewards at test time, right. And some algorithms that do this are Hebbian
[31:37.840 - 31:41.280] learning algorithms. There are a lot of Hebbian learning algorithms that just don't
[31:41.280 - 31:46.200] condition on reward; they're local, and they're unsupervised in their inner loop. But the
[31:46.200 - 31:49.520] outer loop, as we mentioned, still uses reward. So you're still meta-learning this with rewards,
[31:49.520 - 31:53.840] end to end. I think there are a bunch of other papers, you know, aside from
[31:53.840 - 31:57.640] Hebbian learning as well. But the idea there is that you might not have access to rewards
[31:57.640 - 32:02.400] when you actually go to test. There's also unsupervised in the outer loop. So if you're
[32:02.400 - 32:05.340] given one environment, it's kind of like a sandbox you can play with, but you don't
[32:05.340 - 32:09.200] really have any known rewards, you can do some clever things to get a distribution of
[32:09.200 - 32:12.720] reward functions that might prepare you for a reward function you're going to encounter
[32:12.720 - 32:17.760] at test time. So there, during meta-training, you create your own distribution of tasks
[32:17.760 - 32:26.160] or your own distribution of reward functions. And then, so I guess that's
[32:26.160 - 32:30.160] unsupervised outer loop, unsupervised inner loop; you can also have a supervised outer
[32:30.160 - 32:34.560] loop where your inner loop is reinforcement learning. And there, the idea is just that
[32:34.560 - 32:41.440] reinforcement learning in the outer loop is very slow, and it's very weak supervision,
[32:41.440 - 32:45.960] and the cost of meta-training is huge. Right. So we're learning very simple, efficient algorithms
[32:45.960 - 32:50.080] for test time through meta-learning, but that blows up the cost of meta-training. And if
[32:50.080 - 32:53.920] we can use stronger supervision during meta-training, then that can get us huge wins in
[32:53.920 - 32:58.440] terms of sample efficiency. Okay, that part I think I followed. It's kind of like, how
[32:58.440 - 33:06.420] many ways can you put together a Lego kit? There's a lot of ways, right? So can we talk
[33:06.420 - 33:12.560] about some of the application areas where meta-RL has been important or looks promising
[33:12.560 - 33:18.360] in the future? Yeah, for sure. So I mean, there's, I think, a pretty recent paper
[33:18.360 - 33:28.880] by Evan again, where they do meta-RL for this really cool code feedback thing. So this
[33:28.880 - 33:33.880] is a very specific thing, but just because it's at the top of
[33:33.880 - 33:42.840] my memory: you have an online coding platform where you go to learn programming.
[33:42.840 - 33:47.520] And if there's an interactive program you're trying to code there, it's really
[33:47.520 - 33:51.720] hard for the automated toolkit to give you good feedback on that. So what they do is
[33:51.720 - 33:57.840] actually train a meta reinforcement learning agent that provides good
[33:57.840 - 34:06.280] feedback there, because the students' programs make a task distribution,
[34:06.280 - 34:11.680] which you then need to explore efficiently to find what kinds of bugs the students
[34:11.680 - 34:17.840] have managed to implement there. They actually got pretty promising results
[34:17.840 - 34:26.440] on the benchmark there, and it seems like it could tentatively be deployed
[34:26.440 - 34:31.800] in the real world as well. And maybe they can talk about the other applications we cover
[34:31.800 - 34:32.800] in the paper.
[34:32.800 - 34:38.080] Yeah, we cover a bunch of other ones, but I guess to highlight here, like robot locomotion
[34:38.080 - 34:43.160] is a big one. So there, it's pretty common to try and train in simulation over distribution
[34:43.160 - 34:47.480] of tasks, and then try and do sim to real transfer to a real robot in the real world,
[34:47.480 - 34:53.400] as opposed to trying to do meta learning on a robot from scratch. And there's some pretty
[34:53.400 - 34:58.600] cool algorithms that have been applied in order to do that, IMPORT by Kamienny et al.,
[34:58.600 - 35:01.720] in particular, being one of them, where you actually do this kind of multi-task training
[35:01.720 - 35:06.000] I talked about before and the task inference that I mentioned before, but you do it simultaneously
[35:06.000 - 35:10.280] while doing meta-learning. So you'd have some known information about the environment that
[35:10.280 - 35:17.120] the robot's trying to walk in, in your simulator. And maybe we assume that at test time, this
[35:17.120 - 35:22.600] information wouldn't be known, like the exact location of all the rocks and steps. And some
[35:22.600 - 35:26.720] sensory information isn't available to the actual robot in the real world. So what you
[35:26.720 - 35:31.480] can try and do is have the known representation, some encoding of that, then you have your
[35:31.480 - 35:34.560] inferred representation, you have some encoding of that, and you can try and make these two
[35:34.560 - 35:39.000] things look very similar. And that's been used in a number of robotics papers at the
[35:39.000 - 35:44.000] moment for some pretty cool effects. So I guess in addition to the robot locomotion
[35:44.000 - 35:49.680] problem, one application area we go into in the paper in some detail is the meta learning
[35:49.680 - 35:56.760] for multi agent RL problem. And there kind of just to summarize, concisely, you can view
[35:56.760 - 36:01.040] other agents as defining the task. So if you have a distribution of other agents, that
[36:01.040 - 36:04.520] pretty clearly creates for you a distribution of tasks, and you can directly apply meta
[36:04.520 - 36:09.280] learning. And that enables you to deal both with adapting to novel agents at test
[36:09.280 - 36:14.960] time, and with maybe the non-stationarity introduced by the adaptation
[36:14.960 - 36:18.200] of other agents. So all the learning other agents are doing can be taken into account
[36:18.200 - 36:25.520] by your meta learning. Your paper also discusses using meta RL with offline data. Can you say
[36:25.520 - 36:32.760] a couple things about that? Yeah, so as I mentioned earlier, meta reinforcement
[36:32.760 - 36:36.520] learning tries to create a sample-efficient adaptation algorithm in the few-shot setting
[36:36.520 - 36:43.080] anyway. And that shifts a huge amount of the data burden to meta-training. So you can
[36:43.080 - 36:49.000] imagine having an offline outer loop. Right. So the meta training, if you're having such
[36:49.000 - 36:52.560] a large meta training burden, you can't really do that directly in the real world. So one
[36:52.560 - 36:56.920] thing you might want to do is have some safe data collection policy to gather a lot of
[36:56.920 - 37:02.160] data for you. And then you can immediately use offline meta RL in the outer loop to try
[37:02.160 - 37:06.440] and train your meta learning algorithm having not actually taken any dangerous actions in
[37:06.440 - 37:13.120] the real world yourself. So that's kind of the offline outer loop idea in meta-RL. We also
[37:13.120 - 37:18.040] go into the offline inner loop and different combinations of offline/online inner loop and offline/online
[37:18.040 - 37:24.280] outer loop. But the idea with the offline inner loop is we're already trying to do, you
[37:24.280 - 37:28.120] know, few-shot learning. So at the limit of this, you're given some
[37:28.120 - 37:31.680] data up front, and you actually never have to do any sort of exploration in your environment,
[37:31.680 - 37:36.160] you can adapt immediately to some data someone hands you at test time without doing any sort
[37:36.160 - 37:41.680] of exploration or any sort of data gathering. So of course, RL is generally framed
[37:41.680 - 37:48.040] in terms of MDPs, the Markov decision process. And in the case of meta-RL, can we talk about
[37:48.040 - 37:56.720] the MDP for the outer loop, or the POMDP? What does that MDP look like in terms of the traditional
[37:56.720 - 38:03.520] components of state, action, and reward? As we mentioned before, meta-RL defines a problem
[38:03.520 - 38:07.960] setting. And in this problem setting, there's a distribution of MDPs, which could also be
[38:07.960 - 38:13.480] considered a distribution of tasks. So your outer-loop objective, for example
[38:13.480 - 38:19.360] your return, is computed in expectation over this distribution. Instead, you can actually
[38:19.360 - 38:24.220] view this distribution as a single object. In that case, it's a partially observable
[38:24.220 - 38:29.400] Markov decision process, also known as a POMDP. And what's different between a POMDP and an MDP
[38:29.400 - 38:35.400] is that in a POMDP there's a latent state, something the agent can't observe. And in this case, the latent
[38:35.400 - 38:39.800] state is exactly the MDP the agent is inhabiting at the moment. So your latent
[38:39.800 - 38:44.120] state would include the task identity. And so if you actually were to try and write out
[38:44.120 - 38:49.240] this POMDP, then the transition function would condition on this latent variable, your reward
[38:49.240 - 38:53.800] function would condition on this latent variable. And then there's just kind of the action space
[38:53.800 - 39:00.960] left to define. The action space is usually assumed to be the same across all these different
[39:00.960 - 39:05.540] MDPs. And so that's usually just the same for the POMDP. But there's also work trying
[39:05.540 - 39:12.280] to loosen that restriction. So someone from our lab, Zhang, has a recent paper trying to generalize
[39:12.280 - 39:17.600] across different robot morphologies with different action spaces. And there he's using hyper-
[39:17.600 - 39:22.080] networks, which is also other work we've done in our lab, hypernetworks in meta-RL. So in terms of
[39:22.080 - 39:26.000] what is held constant and what changes between these, usually the action space is
[39:26.000 - 39:29.640] held constant, the state space is held constant. And then the reward function and the transition
[39:29.640 - 39:33.640] function depend on this latent variable. But you can also try and relax the action space
[39:33.640 - 39:38.880] assumption as well. How practical is this stuff? Like, where is meta RL today? I mean,
[39:38.880 - 39:45.040] you mentioned some application areas, but for let's say a practitioner, an RL practitioner,
[39:45.040 - 39:49.160] is meta-RL something you really need to understand to do this stuff well? Or is it
[39:49.160 - 39:54.240] still kind of exotic, more of a forward-looking research type thing?
[39:54.240 - 40:01.880] It's definitely more on the forward-looking edge of deep RL research, I would say.
[40:01.880 - 40:10.040] The whole idea that you learn these adaptive agents and cut the computational
[40:10.040 - 40:15.800] cost at test time using that is very appealing. And it is actually
[40:15.800 - 40:23.480] rooted in a very practical consideration: what if your
[40:23.480 - 40:28.360] robot is deployed in a slightly different environment? You would still want it to
[40:28.360 - 40:37.320] be able to handle that well. But in practice, I think mostly this is still a little bit
[40:37.320 - 40:43.760] speculative. And then there's also the aspect that with meta-RL algorithms, to some
[40:43.760 - 40:49.520] extent, if you're dealing with some new environments where you
[40:49.520 - 40:56.720] need to adapt to get a good policy, oftentimes what you end up doing is just taking a policy
[40:56.720 - 41:03.080] that has memory, let's say an RNN. So if it doesn't
[41:03.080 - 41:09.960] observe the full state of the environment, it can retain its observations
[41:09.960 - 41:15.360] in the memory, and then figure out the details of the environment as
[41:15.360 - 41:20.040] it goes. And that's essentially RL squared, that's the essence of what
[41:20.040 - 41:26.900] RL squared does. Whether you want to call that meta-RL in each instance,
[41:26.900 - 41:30.880] maybe not. And do you really need to know everything about meta-RL to actually
[41:30.880 - 41:36.640] do that? Again, maybe not. But in this kind of sense, the ideas
[41:36.640 - 41:44.840] are still fairly pragmatic. And actually, you can often find that the algorithm
[41:44.840 - 41:50.560] ends up behaving in a way that's essentially learned adaptive behavior, which
[41:50.560 - 41:55.080] is what the meta-RL agents would do. Yeah, I guess to add on to what Risto said, I think
[41:55.080 - 41:59.780] the practicality also depends on which of these kinds of clusters you're in that we discussed.
[41:59.780 - 42:05.000] So in the few-shot setting, I think, whether or not you call it meta-RL, if you're trying
[42:05.000 - 42:09.720] to do sim-to-real transfer over a distribution of tasks, which generally is meta-RL, it's an extremely
[42:09.720 - 42:15.440] practical tool, right? It's very cleanly and directly addressing the sample inefficiency
[42:15.440 - 42:22.440] of reinforcement learning and shifting the entire burden to simulation and meta-training.
[42:22.440 - 42:28.440] In the long-horizon setting, I'm not so sure there's a practical use at the moment to the
[42:28.440 - 42:33.000] multi-task long-horizon setting. But the single-task long-horizon setting seems to have some
[42:33.000 - 42:37.560] practical uses, like, you know, hyperparameter tuning; it's a particular way to do auto-RL,
[42:37.560 - 42:41.480] right? Where instead of just using a manually designed algorithm, you're doing it end to
[42:41.480 - 42:45.420] end on the outer-loop objective function. And so from that perspective, if you're trying
[42:45.420 - 42:49.280] to tune some hyperparameters of an RL algorithm, it's pretty practical, whenever you're trying to
[42:49.280 - 42:56.880] run any RL algorithm. It's also kind of, as Risto said, this emergent thing that a
[42:56.880 - 43:00.560] lot of systems, a lot of generally capable systems, will just have in them, whether or not
[43:00.560 - 43:06.800] you're trying to do meta-RL. A lot of systems, like large language models, have
[43:06.800 - 43:12.160] this emergent in-context learning that occurs, even if that wasn't directly trained for.
[43:12.160 - 43:16.320] So in some ways it's very practical, and in other ways it's not very practical, but
[43:16.320 - 43:20.000] it will arise regardless of what we try and do. Do you guys have, I know you've already mentioned
[43:20.000 - 43:24.880] a couple, but are there any other specific examples of meta RL algorithms that you're
[43:24.880 - 43:32.040] specifically excited about, or are your favorites? We talked about DREAM and VariBAD.
[43:32.040 - 43:40.280] Yeah, those are definitely really thought-provoking. And also, they
[43:40.280 - 43:46.240] use DREAM in that code feedback thing, so it turns out it's practical as
[43:46.240 - 43:52.240] well. One algorithm that, for me personally, has been especially thought-provoking and
[43:52.240 - 44:00.480] kind of impacted my own interests a lot is the learned policy gradient, the Oh et al. paper
[44:00.480 - 44:06.720] that I hinted at earlier, where they learn the objective function completely in the inner
[44:06.720 - 44:11.760] loop. This is one of the few papers in meta-RL that shows this
[44:11.760 - 44:18.680] quite impressive form of transfer, where you train the inner loop on tasks
[44:18.680 - 44:23.760] that don't look anything like the tasks that you see at test time. So in their particular
[44:23.760 - 44:30.360] case, it's grid worlds to Atari. And I find that sort of thought-provoking,
[44:30.360 - 44:35.880] even if the algorithm ends up not being super practical; the idea that a
[44:35.880 - 44:42.960] meta-learned system can really transfer this way. And I think that's an exciting capability
[44:42.960 - 44:51.120] that would be fun to see appear even more in meta-RL, and elsewhere as well,
[44:51.120 - 44:52.120] of course.
[44:52.120 - 44:56.760] Yeah, I think that's a great example of a paper for the many shot setting. And kind
[44:56.760 - 45:01.560] of in the few-shot setting, as I mentioned, I'm pretty fond of this IMPORT idea, which I
[45:01.560 - 45:07.000] think has been pretty useful in robotics as well, where you try and simultaneously learn
[45:07.000 - 45:10.360] the task representation and how to infer the task at the same time.
[45:10.360 - 45:17.040] So in regular deep RL, we've seen an explosion of algorithms. But recently, we've seen
[45:17.040 - 45:23.840] Dreamer, and recently DreamerV3 from Danijar Hafner et al., which beats a lot
[45:23.840 - 45:30.480] of algorithms without tuning. And that suggests some pruning or convergence
[45:30.480 - 45:36.680] of the state-of-the-art family tree of RL algorithms is maybe in order. I mean, maybe
[45:36.680 - 45:42.560] there's some things we don't have to worry about as much. Maybe
[45:42.560 - 45:47.160] we can mentally trim the tree of algorithms we have to keep track of, because Dreamer
[45:47.160 - 45:54.200] is kind of covering a lot of space. Do you see any similar thing being possible
[45:54.200 - 46:01.600] in meta-RL, in terms of algorithms being discovered that cover a lot of space? Or is meta-RL
[46:01.600 - 46:05.880] somehow different? I mean, it sounds like one of the main things I've gotten from this discussion
[46:05.880 - 46:10.440] is there's just so damn many combinations of things, so many different variations and
[46:10.440 - 46:17.560] settings. Is it different in that way, such
[46:17.560 - 46:21.880] that we should not expect to find some unifying algorithm? Or do you think that may be possible?
[46:21.880 - 46:26.680] I guess Dreamer v3 will already solve a huge chunk of meta-RL problems. That's already
[46:26.680 - 46:33.760] a good start. But I think that really there are different problem settings with
[46:33.760 - 46:37.920] pretty unique demands in the meta-learning setting, right? So if you have a very narrow
[46:37.920 - 46:42.280] distribution of tasks, and you don't have to do much extrapolation at test time, it's
[46:42.280 - 46:48.240] kind of hard to beat a pretty simple task inference method. And on the flip side, if
[46:48.240 - 46:52.440] you need a huge amount of generalization, I'm not sure Dreamer is going to do any
[46:52.440 - 46:57.080] better than actually building some policy gradient into the inner loop to guarantee
[46:57.080 - 47:02.520] that you have some generalization. So in that sense, it is kind of hard to say there's
[47:02.520 - 47:06.240] going to be one algorithm for all of meta-learning, because of the different demands
[47:06.240 - 47:08.600] of each of the different problem settings discussed.
[47:08.600 - 47:17.560] Yeah, but I just want to tack on to that a little bit: the sort of black-box style,
[47:17.560 - 47:25.920] RL-squared-inspired, very general pattern of algorithm seems quite powerful,
[47:25.920 - 47:31.160] and it has recently been demonstrated to do well in quite complicated task distributions
[47:31.160 - 47:38.000] as well. So there's definitely some convergence there. But maybe a big
[47:38.000 - 47:43.640] reason why we see so many different kinds of algorithms in meta-RL is that it's also
[47:43.640 - 47:49.520] just about learning about the problem and its different features.
[47:49.520 - 47:56.400] You're trying to understand more and uncover the critical bits
[47:56.400 - 48:01.320] of where the challenge actually is and how we can address it.
[48:01.320 - 48:08.360] So I feel like, of the many, many algorithms that we see, some of
[48:08.360 - 48:12.720] them are just trying to answer a smaller question rather than
[48:12.720 - 48:18.120] whether this is really a state-of-the-art contender for meta-RL. So
[48:18.120 - 48:22.760] naturally, some of them will fall by the wayside as we go on.
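For reference, the black-box, RL-squared-style pattern Risto is pointing at can be sketched in a few lines of PyTorch. The class below is an illustrative stand-in rather than code from any particular paper: the recurrent policy sees the previous action, reward, and done flag along with the observation, and its hidden state is reset only when a new task is sampled, so any fast adaptation has to happen inside the recurrent dynamics, which are trained end-to-end by an ordinary, slow RL algorithm in the outer loop.

import torch
import torch.nn as nn

class RL2StylePolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        in_dim = obs_dim + n_actions + 1 + 1      # obs, one-hot prev action, prev reward, prev done
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.pi = nn.Linear(hidden, n_actions)    # policy head
        self.v = nn.Linear(hidden, 1)             # value head

    def forward(self, obs, prev_action, prev_reward, prev_done, h):
        x = torch.cat([obs, prev_action, prev_reward, prev_done], dim=-1).unsqueeze(1)
        out, h = self.gru(x, h)                   # h carries across episodes within a trial
        out = out.squeeze(1)
        return self.pi(out), self.v(out), h

obs_dim, n_actions, hidden = 6, 3, 128
policy = RL2StylePolicy(obs_dim, n_actions, hidden)
h = torch.zeros(1, 1, hidden)                     # reset only at the start of a trial (new task)
obs = torch.randn(1, obs_dim)
prev_a = torch.zeros(1, n_actions)
prev_r = torch.zeros(1, 1)
prev_d = torch.zeros(1, 1)
logits, value, h = policy(obs, prev_a, prev_r, prev_d, h)

Scaled up, with attention-based memory in place of (or alongside) the GRU and a much richer task distribution, this is arguably the same recipe behind the large adaptive agents discussed later in the conversation.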
[48:22.760 - 48:29.800] That makes sense. That's all part of the research process, right? So in deep RL, we've seen
[48:29.800 - 48:36.760] that pretty minor changes to an MDP have to be considered a different task, and
[48:36.760 - 48:42.680] trained agents might no longer perform well with a slightly different MDP.
[48:42.680 - 48:47.120] For example, a robot having slightly longer or slightly shorter legs, or playing
[48:47.120 - 48:55.880] with a blue ball instead of a red ball. And my sense is that humans can generalize
[48:55.880 - 49:00.480] quite well naturally, so we might not really call that a different task; basically,
[49:00.480 - 49:08.120] we might not chop up tasks in such a fine way. And I always
[49:08.120 - 49:13.720] think of that as just a property of our current generation of function approximators.
[49:13.720 - 49:18.640] Deep neural networks are very finicky, and they generalize a little bit, but they don't
[49:18.640 - 49:23.440] really extrapolate; they mostly interpolate, as I understand it. So do you think
[49:23.440 - 49:28.840] that the fact that our current function approximators have limited generalization
[49:28.840 - 49:35.320] forces us to look more towards meta-RL? And if we were to somehow come up with
[49:35.320 - 49:39.320] improved function approximators that could generalize a bit better, then we wouldn't
[49:39.320 - 49:44.040] need as much meta-RL. Do you think there's any truth to that, or no?
[49:44.040 - 49:48.960] So I think this comes down to a distinction between whether we're talking about
[49:48.960 - 49:55.440] the meta-RL problem setting or the algorithms for meta-RL. If you think of the task
[49:55.440 - 50:03.920] distribution, it's just, you know, a complicated world where your agent
[50:03.920 - 50:10.000] can't know the expected behavior zero-shot. So it has to go and explore
[50:10.000 - 50:16.200] the environment somehow and then do the best it can with the information it has gathered.
[50:16.200 - 50:20.120] And I feel like that idea is not going to go away. That's sort of
[50:20.120 - 50:27.280] how a lot of the real world works as well. So in some sense, thinking about
[50:27.280 - 50:32.720] that problem setting seems very relevant going forward, whether or not we're
[50:32.720 - 50:36.520] going to use these specific methods we came up with. That's more of
[50:36.520 - 50:41.320] an open question. And I guess there are some hints that in many cases we can get away
[50:41.320 - 50:44.080] with fairly simple ideas there.
[50:44.080 - 50:47.900] But I don't think it's going to be that we come up with some new architecture and magically
[50:47.900 - 50:52.120] we don't need to train to generalize anymore. I think you're still going to have
[50:52.120 - 50:57.840] to train. If you want your universal function approximator to generalize,
[50:57.840 - 51:00.760] I think you're going to have to intentionally train over a distribution of tasks to try
[51:00.760 - 51:06.320] to get that generalization. Whether the task distribution is explicit or implicit, like
[51:06.320 - 51:12.320] in large language models, I don't think necessarily matters. But I think
[51:12.320 - 51:18.080] that expecting some machine learning model to generalize without being
[51:18.080 - 51:22.040] explicitly trained to generalize is kind of asking more than is feasible.
[51:22.040 - 51:27.440] All right, we're going to jump to some submitted questions now. These are three questions
[51:27.440 - 51:33.120] from Zohar Rimon, a researcher at Technion. Thank you so much, Zohar, for the questions.
[51:33.120 - 51:37.200] The first one is: what do you think are the barriers we'll need to tackle to make
[51:37.200 - 51:42.340] meta-RL work on realistic, high-dimensional task distributions?
[51:42.340 - 51:49.160] Great question, Zohar. So yeah, I think the answer is sort of in the question:
[51:49.160 - 51:56.280] I believe that the barrier keeping us from generalizing to
[51:56.280 - 52:03.120] these more complex task distributions is really that we don't quite have a good training
[52:03.120 - 52:07.880] task distribution on which we could train a meta-RL agent that would then generalize to
[52:07.880 - 52:13.400] the other tasks. There have been efforts in this direction, right? There was Meta-World,
[52:13.400 - 52:19.840] which proposed a fairly complicated robotics benchmark with a number of tasks
[52:19.840 - 52:25.000] and a lot of parametric variation within each of them, but it's still not quite there;
[52:25.000 - 52:31.400] I guess my intuitive answer is that it doesn't have enough categories of
[52:31.400 - 52:38.520] tasks. Then there's also Alchemy, which also didn't catch on, I don't quite remember why,
[52:38.520 - 52:43.760] but that was also trying to pose this kind of complicated task distribution
[52:43.760 - 52:50.800] and see if we can study meta-RL there. And now DeepMind has their XLand, I think it's
[52:50.800 - 52:57.680] called, which seems really cool and has a lot of variety between tasks. But
[52:57.680 - 53:01.800] I guess the drawback there is that it's closed, so nobody else gets to play around with it
[53:01.800 - 53:08.040] and evaluate whether you get reasonable
[53:08.040 - 53:13.320] generalization from those tasks. So I would say that we need better training
[53:13.320 - 53:15.280] task distributions for this.
[53:15.280 - 53:20.880] Okay. And then he asks: some meta-RL methods directly approximate the belief, like VariBAD
[53:20.880 - 53:26.240] and PEARL, and some don't, like RL-squared. Are there clear benefits to each approach?
[53:26.240 - 53:29.920] I think you guys touched on some of this. Is there anything you want to add to that?
[53:29.920 - 53:34.240] Yeah, I guess if you can directly quantify the uncertainty, it's pretty easy in a lot
[53:34.240 - 53:39.160] of cases to learn an optimal exploration policy, or at least easier. However, if you're doing
[53:39.160 - 53:42.560] task-inference-based methods and you're trying to infer the MDP,
[53:42.560 - 53:46.440] there might be irrelevant information in the MDP that you don't need to learn for the optimal
[53:46.440 - 53:51.720] control policy. So you might just waste your time learning things you don't need to learn.
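To illustrate the distinction, here is a small hand-built toy (not any specific published method): when the set of candidate tasks is finite, the belief in question is just an exact posterior over tasks, updated by Bayes' rule from observed rewards, and exploration can be driven directly off that uncertainty. Methods like VariBAD and PEARL learn to approximate this kind of posterior, while RL-squared-style methods never represent it explicitly.

import numpy as np

rng = np.random.default_rng(1)
n_arms = 4
true_good_arm = int(rng.integers(n_arms))       # the hidden task identity
belief = np.full(n_arms, 1.0 / n_arms)          # prior P(task = "arm k is the good one")

def likelihood(reward, arm, task):
    # In this toy, the good arm pays out Bernoulli(0.8) and every other arm Bernoulli(0.2).
    p = 0.8 if arm == task else 0.2
    return p if reward == 1 else 1.0 - p

for t in range(20):
    arm = int(rng.choice(n_arms, p=belief))     # Thompson-style: act on a sample from the belief
    p_reward = 0.8 if arm == true_good_arm else 0.2
    reward = int(rng.random() < p_reward)
    belief = belief * np.array([likelihood(reward, arm, k) for k in range(n_arms)])
    belief = belief / belief.sum()              # Bayes update: posterior over tasks

print("true task:", true_good_arm, "posterior:", np.round(belief, 2))

Jacob's caveat shows up when the task description is richer than what the policy needs: a method that tries to infer the whole MDP spends capacity modelling details (here, say, the exact payout of every bad arm) that the optimal controller never uses.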
[53:51.720 - 53:56.160] And then Zohar asks: would love to hear your thoughts about DeepMind's AdA, that's the
[53:56.160 - 54:00.360] Adaptive Agent. Do you think it will mark a turning point for meta-RL? And again,
[54:00.360 - 54:05.760] Risto, you just mentioned XLand; I think AdA is based on XLand. Is there anything more
[54:05.760 - 54:06.760] to add there?
[54:06.760 - 54:11.200] Yeah, yeah. I mean, it's really exciting work, actually. I think it's a really
[54:11.200 - 54:16.280] strong demonstration of the kinds of things that you can get a big black-box meta-learner
[54:16.280 - 54:21.980] to do. You take a big pool of tasks, and then you train this big memory
[54:21.980 - 54:30.040] network policy on it, and it can really generalize in quite impressive ways. But a
[54:30.040 - 54:34.080] turning point, I don't know. I think there's always been a contingent of
[54:34.080 - 54:41.640] meta-RL researchers who would have said that a big recurrent neural network and
[54:41.640 - 54:47.160] a complicated task distribution is kind of all we need. RL-squared, for example, kind
[54:47.160 - 54:53.120] of already starts from that idea. And now it feels a little bit like AdA is figuring
[54:53.120 - 55:00.680] out what you actually need to make that idea really work and scale it up. So
[55:00.680 - 55:04.960] I think it remains to be seen whether that's a turning point; to me, it feels like it's
[55:04.960 - 55:14.640] on the continuum, to a large extent, but I guess it is at least a
[55:14.640 - 55:19.600] very bright spot in the spectrum right now.
[55:19.600 - 55:24.720] Yeah, I'm not sure it proposed anything novel that we hadn't seen before. I
[55:24.720 - 55:28.640] think it was a huge distribution of tasks, it talked a lot about using attention
[55:28.640 - 55:32.800] or transformers, or at least some sort of attention in your memory. I think they did
[55:32.800 - 55:37.720] some sort of curriculum design, and I think they did a student-teacher distillation
[55:37.720 - 55:41.580] thing. There was kind of a hodgepodge of ideas, and I'm not sure it really added
[55:41.580 - 55:45.920] too much that's novel on the method side. But it was definitely a demonstration that, hey, we
[55:45.920 - 55:49.880] can do this really cool thing and
[55:49.880 - 55:56.320] get these really cool generalization capabilities out of a generally capable recurrent agent
[55:56.320 - 56:00.520] over a complex task distribution, as Risto said. So maybe more a synthesis of existing
[56:00.520 - 56:05.240] ideas than some very new concepts. Yeah, that sounds about right. So when I was preparing
[56:05.240 - 56:10.680] for this episode, I was looking back at RL-squared, and, you know, Ilya Sutskever
[56:10.680 - 56:18.640] was giving a talk about it. And at the time, there was OpenAI Universe, which
[56:18.640 - 56:24.880] was like the Arcade Learning Environment but with way more games. And that was something
[56:24.880 - 56:29.000] that kind of just fell by the wayside back in the day. I guess either we weren't
[56:29.000 - 56:33.880] ready for it, or the meta-learning didn't really happen. Do you guys have any
[56:33.880 - 56:38.440] comments about OpenAI Universe, or what happened back then? Was
[56:38.440 - 56:42.480] RL just not powerful enough for such a task distribution? Yeah, that's a
[56:42.480 - 56:49.080] great question. I remember Universe. I don't actually know what the specific issue was
[56:49.080 - 56:53.160] that they ran into with it. But I think what we're finding
[56:53.160 - 56:59.120] here is that designing a task distribution in which you can train
[56:59.120 - 57:03.520] these more capable agents is a really complicated problem. There have
[57:03.520 - 57:09.080] been multiple really high-profile efforts in this direction, and somehow we're still
[57:09.080 - 57:15.640] not really there, I feel. Or, I mean, maybe XLand 2.0 is that, but we
[57:15.640 - 57:21.040] don't get to play with it, so I don't know. But yeah, I think it's just a testament
[57:21.040 - 57:27.600] to the complexity of that particular problem: it's hard
[57:27.600 - 57:33.880] to come up with really good task distributions for meta-RL. So this was a very long, very
[57:33.880 - 57:42.080] detailed paper, 17 pages of references, actually more than 17. It was absolutely mind-bending,
[57:42.080 - 57:47.680] honestly, reading this and trying to keep track of all these ideas. I'm sure we've
[57:47.680 - 57:52.640] just scratched the surface of it today. But can you tell us a bit about the experience
[57:52.640 - 57:57.240] of writing this paper? I think you mentioned a little bit in the beginning about how
[57:57.240 - 58:01.200] some of your ideas changed as you went through it. But can you talk about the experience
[58:01.200 - 58:04.520] of writing? What's it like writing a survey paper? I can't imagine how much reading you
[58:04.520 - 58:10.160] had to do. Yeah, as I think we alluded to before, we kind of had a couple of false starts.
[58:10.160 - 58:13.120] We didn't really know what we were doing, right? This was a lot of trial and error on
[58:13.120 - 58:20.240] our part from the very beginning. We sat down and methodically
[58:20.240 - 58:26.320] proposed different ways in which meta-RL algorithms could differ: okay, how
[58:26.320 - 58:29.160] can the inner loop be different? How can the outer loop be different? How can the policy
[58:29.160 - 58:32.760] we're adapting be different? And it turned out that just wasn't at all how the literature
[58:32.760 - 58:37.840] was organized and didn't reflect anything out there in the world. So we had to completely
[58:37.840 - 58:44.840] redesign our framework, which was a big effort. And then, after redesigning the framework,
[58:44.840 - 58:48.440] actually keeping track of and organizing people on a project this large was something I'd
[58:48.440 - 58:53.680] never done before. I think we had to come up with processes just for
[58:53.680 - 58:58.800] that, and that was pretty difficult. So Risto has multiple spreadsheets where we keep
[58:58.800 - 59:04.920] track of who's assigned to what conference and which paper has been read by whom. And I
[59:04.920 - 59:10.240] think that was a pretty useful tool in and of itself. Yeah, definitely. It turned
[59:10.240 - 59:17.240] into a project management exercise to a large extent; as much as it was about
[59:17.240 - 59:24.040] writing, it was just about managing the complexity. So in the future, do you think
[59:24.040 - 59:30.400] we will all be using meta-RL algorithms only, or algorithms designed by meta-RL, maybe
[59:30.400 - 59:36.080] I should say? Right now they're generally all hand-designed, as you mentioned
[59:36.080 - 59:41.400] in the paper, hand-engineered. Do you think this is just an early phase, a
[59:41.400 - 59:46.560] pre-industrial-revolution type thing? Well, I wouldn't be surprised, I guess, if every algorithm
[59:46.560 - 59:51.480] used some automatically tuned component, whether that is directly using meta-RL or some other
[59:51.480 - 59:56.840] form of RL. And I would also be surprised if it turned out that the long-horizon multi-task
[59:56.840 - 01:00:00.300] setting wound up giving us something that could beat the state-of-the-art methods we're hand-designing,
[01:00:00.300 - 01:00:06.040] you know, as smart as we are as engineers ourselves. But that said, I think emergent meta-
[01:00:06.040 - 01:00:10.000] learning, whether explicitly designed as part of the
[01:00:10.000 - 01:00:13.360] problem in the few-shot setting or as an emergent capability, like in the LLMs we're
[01:00:13.360 - 01:00:18.760] seeing now, is going to be in a lot of products from now into the far future.
[01:00:18.760 - 01:00:24.160] Any comment on that one, Risto? Yeah, yeah, I kind of feel the same.
[01:00:24.160 - 01:00:30.280] There are definitely a lot of people who believe that you can do better; like, you know,
[01:00:30.280 - 01:00:34.560] learned optimizers and those kinds of things are very relevant here. And I think a lot
[01:00:34.560 - 01:00:41.160] of people are looking into how to actually make those things work.
[01:00:41.160 - 01:00:47.520] That said, I don't think we have anything like that deployed. So maybe we're missing
[01:00:47.520 - 01:00:53.760] some bigger piece of that puzzle: how do we actually get through
[01:00:53.760 - 01:00:57.900] the uncanny valley to learned optimizers and learned RL algorithms that
[01:00:57.900 - 01:01:02.760] are actually better than human-designed ones? So there's work to do there.
[01:01:02.760 - 01:01:09.240] But I don't really doubt it; I think it's an exciting problem to work on.
[01:01:09.240 - 01:01:13.840] This might be a bit of a tangent, but even if, you know, LPG didn't create a
[01:01:13.840 - 01:01:19.560] state-of-the-art inner loop (it was state of the art circa, like, 2016 or 2013 or something,
[01:01:19.560 - 01:01:24.080] using state-of-the-art algorithms in the outer loop), even if the inner loop wasn't better
[01:01:24.080 - 01:01:28.960] than anything we have lying around, it might be the case that for particular types of problems,
[01:01:28.960 - 01:01:33.520] like if we're trying to meta-learn an offline inner loop, which is a pretty difficult thing
[01:01:33.520 - 01:01:40.640] to manually hand-engineer, or we're trying to meta-learn an inner loop that can deal with
[01:01:40.640 - 01:01:45.640] non-stationarity, so for instance for continual learning, it might be the case that
[01:01:45.640 - 01:01:48.960] meta-learning in the outer loop can produce better learning algorithms there than humans
[01:01:48.960 - 01:01:52.800] can hand-engineer. I think that's kind of yet to be seen.
[01:01:52.800 - 01:01:56.000] Is there anything else I should have asked you two today?
[01:01:56.000 - 01:02:03.160] Is ChatGPT conscious? I'm kidding. Yeah, I'm kidding. No, I think we
[01:02:03.160 - 01:02:10.640] covered this pretty extensively. Okay, well, since we're going there, let's
[01:02:10.640 - 01:02:14.600] just take a moment, because I've started to ask people this. What do you guys think
[01:02:14.600 - 01:02:19.800] of AGI? Is meta-RL going to be a key step to getting to AGI?
[01:02:19.800 - 01:02:30.920] Oh, this is risky. Carefully now, Risto. But yeah, I think
[01:02:30.920 - 01:02:36.960] if you want to train agents that can tackle problems in the real
[01:02:36.960 - 01:02:42.440] world, they're most likely going to require some level of adaptive behavior.
[01:02:42.440 - 01:02:46.980] Well, I mean, maybe you can get around that by doing a really careful
[01:02:46.980 - 01:02:51.320] design of the agent itself and that kind of thing. But probably it's better if you
[01:02:51.320 - 01:02:56.880] can adapt to the environment. So in that sense, this idea of learning to adapt, of learning
[01:02:56.880 - 01:03:04.640] agents that can take cues from the environment and act upon them, is really central
[01:03:04.640 - 01:03:14.520] to just deploying stuff in the real world. So again, meta-RL and emergent
[01:03:14.520 - 01:03:20.560] meta-learning seem important. And on the other hand, we kind of see these
[01:03:20.560 - 01:03:27.440] kinds of meta-learning behaviors come out of things like ChatGPT: it can do in-context
[01:03:27.440 - 01:03:31.240] learning and stuff like that, even though it hasn't actually been explicitly
[01:03:31.240 - 01:03:40.680] trained on an explicit meta-learning objective. So I would say that
[01:03:40.680 - 01:03:46.760] we're definitely going to see at least emergent meta-reinforcement learning
[01:03:46.760 - 01:03:51.200] in the generally capable agents we're going to be looking at in the future.
[01:03:51.200 - 01:03:56.040] Yeah, I agree with what Risto said, and I should also tread carefully here. And to be clear,
[01:03:56.040 - 01:04:03.880] I was joking about the, you know, consciousness of ChatGPT, lest I be misquoted. But
[01:04:03.880 - 01:04:08.440] I do think that fast learning is kind of one of the major hallmarks of intelligence. And
[01:04:08.440 - 01:04:12.880] so regardless of whether we design it manually or it's an emergent property of our systems,
[01:04:12.880 - 01:04:18.000] fast adaptation, fast learning will be a property of the generally capable systems going forward.
[01:04:18.000 - 01:04:23.560] So there's a meme that's been popping up every once in a while more recently, which
[01:04:23.560 - 01:04:29.680] is: do we even need RL? Like, Yann LeCun had a slide not long ago saying, basically,
[01:04:29.680 - 01:04:37.720] let's try to either minimize RL, or use more like shooting methods and learn better
[01:04:37.720 - 01:04:42.960] models, and just not need RL. And we had Aravind Srinivas from OpenAI saying
[01:04:42.960 - 01:04:48.800] something like maybe what you were saying, Jake: that emergent RL was just
[01:04:48.800 - 01:04:54.480] happening inside the transformer in, say, Decision Transformer, which isn't really even
[01:04:54.480 - 01:04:59.280] doing RL, it's just supervised learning. So do you guys think that RL is always going
[01:04:59.280 - 01:05:04.960] to be an explicit thing that we do? Or is it just going to be vacuumed up as
[01:05:04.960 - 01:05:10.840] another emergent property of these bigger models, and we don't
[01:05:10.840 - 01:05:12.120] have to worry about RL anymore?
[01:05:12.120 - 01:05:16.680] Well, I guess to start, not to split hairs on definitions of terms, but, I guess, to
[01:05:16.680 - 01:05:22.040] kind of split hairs on definitions of terms: some people would say, when Yann says
[01:05:22.040 - 01:05:25.640] that RL isn't necessary and we should just do world models, a lot of people would
[01:05:25.640 - 01:05:31.480] call that model-based RL. I think, if I'm not misremembering, Michael Littman,
[01:05:31.480 - 01:05:35.760] our advisor, would definitely fall into that camp. So I think you could
[01:05:35.760 - 01:05:43.240] make a lot of people less upset by not saying that that's not RL. That's maybe comment number
[01:05:43.240 - 01:05:48.400] one. And comment number two is that we really shouldn't do RL at all if we can avoid it; it's a pretty
[01:05:48.400 - 01:05:53.480] weak form of supervision. And so, you know, we had a small section on supervision:
[01:05:53.480 - 01:05:58.680] if we can at all avoid RL in the outer loop, that's better. And we can still clearly wind
[01:05:58.680 - 01:06:01.040] up with reinforcement learning algorithms in the inner loop.
[01:06:01.040 - 01:06:08.440] Yeah, I'm along the same lines here as Jake, for sure: if you can get away
[01:06:08.440 - 01:06:13.840] without using RL, go do it; it's probably going to be better. But it's hard
[01:06:13.840 - 01:06:21.240] to imagine otherwise, at least for me. I don't know what's
[01:06:21.240 - 01:06:26.880] a concise description of a problem where you surely need RL, but
[01:06:26.880 - 01:06:31.440] there are problems where I have a hard time imagining that you can get around them without
[01:06:31.440 - 01:06:34.080] something like RL actually being deployed there.
[01:06:34.080 - 01:06:44.400] Well, ChatGPT uses it, as they say. I'm invested in it, right? Yeah, exactly. So,
[01:06:44.400 - 01:06:49.840] so I've got to thank you both. This has been fantastic. Jacob Beck and Risto Vuorio, thanks so much
[01:06:49.840 - 01:06:53.360] for sharing your time and your insight with the TalkRL audience today.
[01:06:53.360 - 01:07:00.720] Awesome. Thanks so much, Robin. Yeah, thank you.
