Talk RL podcast is all reinforcement learning all the time, featuring brilliant guests,
both research and applied. Join the conversation on Twitter at Talk RL podcast. I'm your host,
I'm super excited to have our guest today, Danijar Hafner, a PhD candidate at the University
of Toronto with Jimmy Ba. He's a visiting student at UC Berkeley with Pieter Abbeel and
an intern at DeepMind. And of course, Danijar was our guest before back in episode 11. Welcome
Thanks for having me, Robin. Yeah, our last interview together was honestly one of the
favorite interviews of mine that I've ever done. It was an audience favorite and I learned
a ton, so I'm super excited about today. You have done a lot of incredible work since the
last time we spoke. And let's jump right into it. We're going to start with Dreamer version
three. That's "Mastering Diverse Domains through World Models," with yourself as first author
et al. And so this is version three of your Dreamer series. We talked about version one
back in episode 11. But can you briefly describe, remind us, what is the idea with the Dreamer
So the idea is typically reinforcement learning would just try out a lot of different action
sequences in the environment. And over time through trial and error, it would figure out
which of them are better and more likely to lead to the goal than others. But that just
requires a lot of interaction with the environment and it's not really feasible or makes it really
hard for tasks where it's hard to get data. So for example, real robots. They run pretty
slowly. You can't just speed them up like a simulator. And so the more sample efficient,
the more data efficient you can be, the better. And the Dreamer line of work addresses that
problem with a pretty intuitive approach where instead of just running a ton of trial and
error in the real environment, you instead use the data that you get from interacting
with the environment to learn a model of the world. And then you can use that model to
run a bunch of trial and error in imagination without having to actually interact with the environment.
And so the idea of model-based reinforcement learning is old. And it's just been pretty
challenging to get it to work in practice, especially for complex environments that have
high dimensional inputs like images and complex tasks. And so Dreamer learns the world model
from images and actually from any input you give it pretty much. So yeah, that's part
of what we got to in Dreamer v3. The goal really being an algorithm that is very data
efficient and you can just throw it at any problem out of the box. The code is open source
and hopefully it'll make reinforcement learning a lot easier to use for people out there.
And you definitely succeeded in that. I see from the paper that it exceeds IMPALA performance
while using 130 times fewer environment steps. That is a really big difference. Can
you walk us through, just at a high level, the progression from version one to two and
then now to three? And were these different versions about refining your original
vision, or were your goals evolving as you went?
The line of work actually started with PlaNet, the Deep Planning Network. And then there
was Dreamer one, two and three. And the vision hasn't changed along the way, just the algorithmic
details to make it work better and better. And I think it's true with so much in AI
that the general idea has been out there for a long time. If you think about large language
models, for example, right, sequence to sequence has been a big research topic for a long time.
And then it just incrementally gets better over time. Sometimes there's a bigger jump
like transformer architectures and so on. And now with models like GPT-4, we've gotten
to really good capabilities. But I think it's easy for people to forget
how incremental and gradual the progress is and over how many years it spans. And
it's been similar with model-based reinforcement learning, or just data-efficient reinforcement
learning in general. So PlaNet was the first version where we trained these world models
on high-dimensional inputs, on videos. And to really make that feasible, you need two
things. So you need an environment model. Well, first of all, at a high level, what
do I mean by world model? I mean, something that gives the agent a rich perception of
the world, some representations that summarize its inputs more compactly, and then allow
you to predict forward. So by predicting forward, you can then do planning with it, right? It
could be a tree search like in AlphaGo, or it could be model predictive control, which is
common in robotics. You could use it to do decision-time planning as you interact with
the environment. You can also use it to do offline planning, where you just imagine
a lot of scenarios, a lot of sequences with your world model, and train a policy on that.
And then when you interact with the environment, you just sample from the policy.
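
Editor's sketch: decision-time planning as described above can be illustrated with a toy example. Everything here is an assumption made for illustration: the one-dimensional `model_step` stands in for a learned world model, and brute-force enumeration of action sequences stands in for a real planner such as tree search or a sampling-based method. It is not the method used in any of the papers discussed.

```python
import itertools

ACTIONS = (-1, 0, 1)  # hypothetical discrete action set


def model_step(state, action, target=3):
    """Toy stand-in for a learned world model.

    Returns (next_state, predicted_reward); reward is the negative
    distance to a fixed target position.
    """
    next_state = state + action
    return next_state, -abs(next_state - target)


def plan_first_action(state, horizon=5):
    """Score every action sequence in imagination; return the first action of the best one."""
    best_return, best_first = float("-inf"), None
    for seq in itertools.product(ACTIONS, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s, r = model_step(s, a)
            total += r
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first


def run_mpc(state=0, steps=6):
    """Replan from scratch at every step and execute only the first planned action."""
    visited = [state]
    for _ in range(steps):
        action = plan_first_action(state)
        # In this toy, the "real" environment is the same function as the model.
        state, _ = model_step(state, action)
        visited.append(state)
    return visited
```

Each call to `plan_first_action` evaluates 3^5 = 243 imagined rollouts, which hints at why replanning at every environment step gets expensive compared with sampling from a trained policy.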
So what do you need for this to work? Why has this been challenging in the past? Well,
you need a model, first of all, that's accurate enough to actually get successful planning
with it, right? If it doesn't approximate the environment well, then you're screwed
if you're just learning your behaviors from predictions of the model. And then the
second point is that it has to be computationally efficient. And that's another theme that reaches
throughout AI: we really care about maximizing the compute efficiency of our algorithms.
If an algorithm is twice as compute efficient as another one, you can
just run twice the size of the model, or train on twice as much data, and you'll get way
better performance by comparison. In PlaNet, we figured out how to learn this world model.
And we do that by encoding all your sensory inputs at each time step into a compact representation.
There's a stochastic bottleneck there as well. And then learning to predict the sequence
of these compact representations with a GRU conditioned on the sequence of actions. And
then you train that by reconstructing the images at each time step, and also by predicting
the rewards. And the rewards are important because later you can do planning just
in imagination, just in this learned representation space, without having to reconstruct images.
Now the model hasn't actually changed very much, at least at a high level, since
PlaNet. And I tried a lot of different designs. And I keep coming back to this because
it works really well empirically. And it's pretty simple. This is the RSSM component.
This is the RSSM. Yeah. And so yeah, recurrent state space model is what we called it. And
what has changed totally from PlaNet to the first Dreamer is how
we derive behaviors from this model. So in PlaNet, we did online planning
at each time step as you interact with the environment. And that's just very computationally
expensive because you have to do a bunch of rollouts with your model to find a good action.
And then you can't really reuse that computation much for the next time step, because then
you'd be like too narrowed in on your previous plan that didn't look far enough into the
future. So you just end up planning from scratch doing thousands of model rollouts at every
time step as you interact with the environment. And that's fine if you're training
on one robot, because it does run faster than real time. But it's not really fine if you
want to compete with other algorithms on simulated environments where collecting data is really
And so in Dreamer, we switched from online planning to doing offline rollouts, just starting
from states that come from the replay buffer and using these rollouts to train an actor-critic
policy that is fast to sample from. It also has a value function and therefore
can take into account rewards that are beyond the planning horizon, or beyond
the rollout horizon.
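
Editor's sketch: one common way for a value function to account for rewards beyond a finite imagined rollout is to bootstrap from the critic's estimate at the last imagined state, for example with lambda-returns. The function below is a generic illustration under that assumption; the argument names and default constants are hypothetical and not taken from the interview.

```python
def lambda_returns(rewards, values, discount=0.99, lam=0.95):
    """Compute lambda-returns over an imagined rollout.

    rewards[t] is the predicted reward for step t of the rollout.
    values[t] is the critic's estimate at imagined state t, so values
    has one more entry than rewards; values[-1] stands in for all
    rewards past the rollout horizon.
    """
    assert len(values) == len(rewards) + 1
    returns = [0.0] * len(rewards)
    next_return = values[-1]  # bootstrap from the critic at the horizon
    for t in reversed(range(len(rewards))):
        next_return = rewards[t] + discount * (
            (1 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns
```

With `lam=1.0` the targets are discounted sums of imagined rewards plus the bootstrapped final value; with `lam=0.0` they reduce to one-step targets. Even if every imagined reward is zero, a nonzero `values[-1]` still propagates back to earlier steps, which is how reward beyond the horizon can influence the policy.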
Yeah, in Dreamer v2. So for Dreamer v1, we focused on continuous control from pixels. And it was very data
efficient and got high final performance, but also pretty narrow in the sense that we
couldn't deal with discrete actions very well. And we weren't competitive on standard benchmarks
like Atari. And so that was the focus of the Dreamer v2 paper, really matching or exceeding
the performance on a very popular benchmark that people have tried to improve results
on for a long time. And so we did that by switching to discrete representations
in the algorithm that just better match the discrete nature of these games where you have
discrete actions. And we also improved the objective function in some ways. And then
Dreamer v3 is the next natural step, where people were starting to get pretty interested
in using Dreamer for all kinds of problems, especially where data efficiency matters.
And especially where you have high dimensional inputs. And so I ended up helping a lot of
people tuning the algorithm and getting it to work. And that ended up taking a lot of
time for me. And I developed all the intuitions of, you know, in this type of environment,
you need to set your maybe your entropy regularizer for the policy a bit higher because the rewards
are pretty sparse. And so you need to explore more. And then here, the visual complexity
of the environment is very low. And so maybe in Atari, you need to pay attention to individual
pixels to get really good performance. And so you need to change the world model objective
to get really good at reconstruction and not abstract too much away in the representations.
Whereas in complex 3d environments, you actually want to abstract quite a bit away so that
you can generalize more quickly, or you can generalize better, and forward prediction
becomes easier as well. You don't care about all the like texture details in the image.
My goal for Dreamer v3 was to automate all of that away, to have an algorithm where
the objective functions are robust enough that you can just run it out of the box and
it'll give you good performance.
So as an engineer, I love your attention to efficiency.
And the efficiency of this design is actually really amazing. And it's the compute efficiency
and also the sample efficiency, like it requires few samples in the environment. Do you think
that you're approaching some kind of limit with these efficiencies? Or
can you envision a future Dreamer that may be two or even 20 times more efficient in
these senses? Where do you think we are in the long term in terms of efficiency?
Yeah, that's a great question. In Dreamer v3, because the algorithm is so robust,
we observed very predictable scaling behavior of the algorithm. And so, obviously, if you
do more gradient steps, or if you replay more data from your replay buffer, then you
will be more data efficient. And at some point, you will start overfitting, and your algorithm
will degrade in performance. And that point is much further down the road than it
is for a lot of model free algorithms. So in a sense, the world model lets you just trade
off more compute to become more data efficient. And maybe more surprisingly, increasing the
size of the model has a very similar effect, where you will become more data efficient,
and reach higher final performance as well, which matches what we're seeing for large
language models, but hasn't really been demonstrated very much in RL. And it's pretty exciting
to be at a point now where we have very predictable scaling capabilities. And I think even today,
if we want to be 10 times more data efficient, we can actually do that by increasing the
model size and increasing the gradient steps and just waiting longer for the whole thing
So we featured Jacob Beck and Risto Vuorio recently, and their survey on meta-RL. And
I believe they mentioned Dreamer does a good job as a meta-RL agent as well. Does that
surprise you at all? Or have you used it that way, and do you see Dreamer being applied
more in a meta sense?
That's interesting. Do you know what kind of tasks they did with it?
I would have to get back to you on that one.
Yeah. So the model integrates information over time into Markovian states. And there's
actually a reason we're not using transformers in the world model, even though transformers
are everywhere now. And the reason is that Markovian states make it much easier to do
control, to do RL on top of the representations. It's easier to fit a sequence with a transformer
that doesn't have this recurrent bottleneck to squeeze everything through. But by forcing
the model to learn Markovian representations, we're actually offloading a lot of what's
challenging about RL to the unsupervised model learning objective. And so we don't need rewards
to learn which parts are relevant about the state. And so I'm not that surprised that
just because it's a sequence model, or it integrates information over time, if you feed
in rewards as well, it can do some sort of meta-learning. But yeah, this is almost just
an emergent property of using sequence models in RL, which we should have been doing for
a long time anyways. And yeah, if there are any specific capabilities where this model-based
approach works much better, other than just being more data efficient than model-free,
that'd be pretty interesting to know.
So I noticed you had results in Dreamer v3 for Minecraft and making diamonds in Minecraft,
which is, I understand, a very hard exploration problem. Can you tell us about that?
It's actually been an open challenge that's been posed by the research community for a
couple of years. There were these competitions at NeurIPS to find algorithms that can perform
in this complex environment, you know, Minecraft is 3d, every episode is procedurally generated.
So you never see the same thing twice. And the rewards are very sparse. So to get to
the diamond, you have to complete a bunch of tasks along the way, it takes 30,000 time
steps to get there roughly. And personally, I didn't think it was possible
to do without either human data to guide the agent along the way, like OpenAI did, or
at least having a very strong intrinsic exploration objective, which, you know, there are some
ideas for that out there, but there isn't any really good thing that will work
out of the box in something like Minecraft yet, I think. So in a sense,
it was a pretty long shot. But we had this algorithm, Dreamer v3, that works quite
well out of the box, right? And you can't really tune hyperparameters that well on real
robots. And it's similar in Minecraft, because it's a pretty complex task, and it will take
quite a while to train, it ended up taking 17 days for our training runs to finish. And
so you don't want to tune hyperparameters and fiddle with the algorithm at
those timescales. So we just ran it and waited. And we're like, okay, let's give it two weeks,
see what comes out of it. And then after two weeks, we already had a couple of diamonds,
so we let it run a little bit longer. But yeah, it was a test of the algorithm, in a
sense, where, yeah, it ended up working out great. And we didn't really expect that it
was possible just with the robust objective functions and the entropy-regularized policy
objective that we use in Dreamer v3.
So do you plan Dreamer version four and so on? What kind of issues do you think you might
tackle with future versions, if there are future versions?
Is there Dreamer v4? Um, unclear. So I think the biggest issue we're facing in RL now is
to do temporal abstraction really well. Another issue is that we want to leverage pre training
data to squeeze out more sample efficiency and learn tasks much faster, right? At the
end of the day, maybe some people might wonder why does sample efficiency matter so much?
Can't we just get a lot of data and solve everything that way? And I think it's important
to point out, there are two reasons that sample efficiency matters a lot. The first one is
we don't have web-scale data sets for how to make decisions for a lot of decision-making
problems, right? Like for example, in robotics, even for like language-based assistants and
so on, if we want to fine tune them to become goal oriented, we don't really have huge data
sets for that. OpenAI is collecting one, but I doubt they'll share it. And so you
have to be data efficient, because there just isn't that much data to supervise
from. And the second part is that we also want our algorithms to adapt quickly, and learning
from small amounts of data is the same as adapting quickly. And so to me, that's
the core of intelligence: how can you adapt very quickly? You know, you want to generalize
as far as you can, but then at some point, you'll be out of the distribution of what
you can generalize to, given the data you have. We'll reach that point even with
large language models. And then the tricky question is how do you adapt away
from there? Or how do you adapt to new stuff, right? We want these algorithms to discover
new things for us. And so, in terms of open challenges in RL, the biggest to me
seem to be learning abstract events and planning over them to do very long-horizon reasoning,
and using pretraining data that is available but not that easy to use, like unlabeled
videos without actions that you can get from YouTube. And so those are the two things I'm
focusing on going forward. And I don't know if there'll be another general
Dreamer algorithm, because it will have these new things built in. And yeah, it might deserve
a new name at that point.
So let's move on to DayDreamer. That's where you apply Dreamer
to physical robots, if I understand. And I admit, when I first encountered Dreamer
and PlaNet, I assumed something like, oh, you know, maybe that model is good for learning
in simulated worlds, but would that kind of model really make sense in the real
world? And I guess you answered that question here. So could you tell us about DayDreamer?
DayDreamer uses the Dreamer v3 algorithm, actually a slightly earlier version of that
algorithm. So it was published before the Dreamer v3 paper came out. And the question
is, can we actually run these algorithms on real robots? Do all the sample efficiency
improvements we see in simulated environments transfer to something real in
the physical world? And the whole project actually happened in just a few weeks.
Because really, we just ran the algorithm on the robot, and it worked out of the box.
And that was the ultimate test to see that, you know, you can't tune hyper parameters
on the real robot very well, because every training run takes multiple hours and things
break along the way, you need people to fix stuff all the time. So the focus of Dreamer
v3 on just running out of the box really paid off when we were running on real robots.
And I would say the tasks that we did there are still fairly simple tasks. So it'd be
very exciting to see how much we can scale this up. But yeah, we trained on visual pick
and place from sparse rewards with an arm that picks up balls and places them into a
different bin. We trained the quadruped, the dog robot, from scratch, with a manually
specified reward function that has three components, to roll over and then figure
out how to stand up and walk, without any simulators, in just one hour. And there were no resets.
I mean, sometimes the robot would like, get too close to the wall and like, start just
trying stuff. So we would pull it back into the middle of the room. But at least we were
making sure to not change the joint configuration of the robot. So if you had more space, then
you wouldn't have that issue. Yeah, and it worked amazingly well. It
walked in one hour, and then it's continuously learning in the real world, right? So you can
just start messing with it and see how it adapts. And initially, if you just perturb it a little
bit, it just falls and struggles to get back up. But then in 10 to 15 minutes, it actually learned to
withstand it when we tried to push it, or to roll over very quickly and get back up on its feet.
That's amazing. So this is really Dreamer out of the box, with no changes? Like did you come
away from this thinking, maybe there's some things I could tune to make it more
robot friendly, or really not even?
The only thing we had to do was to parallelize the gradient
updates for the neural net with the data collection on the robot. So run those in two different
processes, so that whenever you're doing a gradient step, that doesn't pause the policy.
So the policy can run continuously. And then you sync the parameters over every couple of seconds.
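
Editor's sketch: the decoupling described above (a learner that keeps taking gradient steps while the policy keeps acting, with periodic parameter syncs) might look roughly like this. Assumptions for illustration: threads are used instead of the separate processes mentioned in the interview, the "gradient step" and the "policy" are placeholders, and names such as `ParamStore` and `sync_every` are hypothetical.

```python
import threading
import time


class ParamStore:
    """Shared parameters guarded by a lock: the learner writes, the actor reads copies."""

    def __init__(self, params):
        self._lock = threading.Lock()
        self._params = dict(params)
        self.version = 0

    def publish(self, params):
        with self._lock:
            self._params = dict(params)
            self.version += 1

    def snapshot(self):
        with self._lock:
            return dict(self._params), self.version


def learner(store, num_updates, done):
    """Runs updates back to back without ever blocking the actor."""
    params, _ = store.snapshot()
    for _ in range(num_updates):
        params["w"] += 1.0  # placeholder for one gradient step
        store.publish(params)
    done.set()


def actor(store, done, sync_every=0.01):
    """Acts continuously on a local copy of the parameters, refreshed on a timer."""
    params, version = store.snapshot()
    steps, last_sync = 0, time.monotonic()
    while not done.is_set():
        _ = params["w"] * 0.5  # placeholder for policy(observation; params)
        steps += 1
        if time.monotonic() - last_sync >= sync_every:
            params, version = store.snapshot()
            last_sync = time.monotonic()
    params, version = store.snapshot()  # final sync after the learner finishes
    return params, version, steps


def run(num_updates=100):
    store = ParamStore({"w": 0.0})
    done = threading.Event()
    thread = threading.Thread(target=learner, args=(store, num_updates, done))
    thread.start()
    params, version, steps = actor(store, done)
    thread.join()
    return params, version, steps
```

The actor never waits on a gradient step; it only pays for a short locked copy whenever the sync timer fires, so it may act on slightly stale parameters between syncs.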
So nothing on the algorithmic side. And yeah, a little bit on the software infrastructure side,
and I think things are starting to become more general on that front as well. And yeah,
for Dreamer v3, I think we didn't talk about it yet. But yeah, the main result in terms of
capabilities was to solve the Minecraft diamond challenge from sparse rewards. And that also needed
a little bit of infrastructure setup and so on.
I played Minecraft a little bit, I can say I've
never made a diamond. We did talk to Jeff Clune in the last episode, who did VPT, which was the Video
PreTraining method, OpenAI's approach to learning from YouTube videos of humans
playing Minecraft. And I understand there was something like 24,000 actions required to
create a diamond. So this is quite an exploration challenge. Okay,
let's move to Director. That's "Deep Hierarchical Planning from Pixels." And that's yourself et al.
So what is happening with Director?
So I already mentioned briefly earlier that I think
one of the big challenges is to deal with temporal abstraction in RL now. I mean, it's crazy that our
RL algorithms are still doing basically
all their reasoning at the timescale of primitive actions. And if you think about it, you know,
humans can set very long term goals. And it doesn't even sound that long term to us if we say I just
want to go to the grocery store and buy some stuff. But if you think about the millions of muscle
commands that have to be executed along the way to get there, then it just seems completely
hopeless to learn that if you're assigning credit and planning in this low level action space only.
So somehow, we as humans have this ability to identify meaningful events from our raw sensory
inputs, high dimensional inputs. And then we can plan over those things like, okay, I have to,
you know, look up where the grocery store is, and I have to go through the door and then open the
door, blah, blah, blah, and then drive over there. And all these high level events, you know, our RL
algorithms aren't really able to identify those things and plan over them through long horizons.
So Director is the first step towards that by using world models, but also training a goal-conditioned
policy at the low level that learns to go from anywhere to anywhere in state space.
And then you can use a high level policy on top that just directs the low level policy around,
and that's where the name comes from. And so the high-level policy chooses goals that are either
exploratory or helpful for solving the task, goals that achieve high task reward. Whereas the low-level
policy is only trained to reach the goal. So it doesn't even know about the task. And
that's the ultimate test that this thing is actually working.
So I first encountered this idea
in the feudal networks paper a few years ago from DeepMind. And I guess they also
referenced an earlier feudal reinforcement learning paper from the past. And I wondered,
is there an intuitive reason why splitting the agent into this manager and worker in this way
is helpful for learning?
It lets you on a high level plan much further into the future, right?
Because it's only planning over the sequence of goals that change less frequently than the low
level primitive actions. And I think there's a lot more to be done. In Director, we're changing
the goal every 16 steps. So it's only 16 times further that it can look into the future at the
high level. And so there are multiple benefits to that. One is, if you plan further into the future,
you might just find a better strategy to get to the goal, especially if the reward is sparse,
it might just be out of your visible horizon. Otherwise, even with a value function, it still
has an effective horizon based on the discounting. And then, moreover, credit assignment becomes
easier. So maybe you got to the goal, and it took a sequence of thousands of actions. And now
which of these actions should I make more likely? And which of them should I make less likely?
Well, it's a lot easier to make that decision at the high level and say, okay, here are the five
goals that I chose. And those five goals seem to have been good. Another thing is that you're
offloading a lot of the learning of how to get from A to B, the goal-conditioned policy, to an unsupervised
objective function. You actually don't need any rewards to learn a good low-level policy that can
reach arbitrary goals in your environment. And so you can get away with learning from fewer rewards
if you only need them to train the high-level policy.
So I saw your Clockwork VAE, which has
more than two levels of temporal abstraction. And do you see future Directors having more levels
of management to expand to higher levels of temporal abstraction?
That's a very good question.
So I think there are two perspectives on this. One is, yes, you just want a deep hierarchy,
and you want that hierarchy to be explicit. And then you can do top-down planning, right? You
plan very far into the future at the high level, but only take, say, 10 steps. And then at the level
below, now you plan how to achieve the goals from above. And now you're not planning as far into
the future because the level is less temporally abstract. And so you still predict 10 time
steps forward. And so effectively, you end up having like a triangle where the highest level looks
the furthest into the future. And then the lowest level is quite reactive. Now, in that sense, you
definitely want more than just two levels. But there's another perspective on it, which is that maybe
these hierarchies can actually be a bit more implicit than that. Maybe we don't want to fix
the timescales to be, you know, powers of two or something. And maybe we don't want like a fixed
number of 10 levels. But what if we can train dynamics models such that some of the features
just change less often than others. And so implicitly, this becomes a hierarchy now, where
the slow features or the slow dimensions of your representation, because they change less often,
effectively, when you predict what they will change to, that'll give you a much further
prediction into the future. And then there will be some things that just model or represent the
high frequency information in the video that change all the time. And those will effectively be the
lower levels of your hierarchy. And I think it's a big open question how to actually learn that and
use that for abstract planning. But it seems like a pretty compelling idea.
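
Editor's sketch: the explicit two-level scheme discussed earlier (a high-level policy that picks a new goal every 16 steps and a low-level policy trained only to reach the current goal) can be illustrated with a toy control loop. Assumptions for illustration: states and goals are integers on a line rather than world-model representations, and `manager` and `worker` are hand-written hypothetical stand-ins for learned policies.

```python
GOAL_EVERY = 16  # goal update interval mentioned in the interview


def manager(state, target):
    """Hypothetical high-level policy: proposes a sub-goal at most 8 units toward the target.

    Only the manager sees the task (the target).
    """
    step = max(-8, min(8, target - state))
    return state + step


def worker(state, goal):
    """Hypothetical goal-conditioned low-level policy: sees only the goal, not the task."""
    if goal > state:
        return 1
    if goal < state:
        return -1
    return 0


def run_episode(start=0, target=20, num_steps=64):
    """Alternate between infrequent goal selection and per-step primitive actions."""
    state, goal, goals = start, None, []
    for t in range(num_steps):
        if t % GOAL_EVERY == 0:
            goal = manager(state, target)
            goals.append(goal)
        state += worker(state, goal)
    return state, goals
```

Over 64 primitive steps the manager makes only four decisions, so assigning credit at the high level means judging four goals rather than 64 actions.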
So I understand that in Director, goals are states from the environment, or observations.
Is that right?
Goals are representations or states of the world model. So they are,
yeah, they are recurrent states of the GRU. And because we have a decoder in the world model,
we can actually look at them in image space and interpret what the thing is doing,
interpret what sub goals it's choosing to decompose some really long horizon task.
So for humans, when I think of a goal, we think abstractly and partially kind of like when I
imagined getting an engineering degree, I didn't really think about the details of the scene when
I obtained my certificate. How might we bridge that difference from like a partial abstract goal
to these very concrete goals?
Yeah, you're bringing up a really important property
that we want in goal spaces for goal-conditioned RL.
And there isn't a great solution for that out there yet. I think it's one of the biggest
challenges for goal-conditioned RL and temporally abstract RL. How do you actually want to specify
your goals? And language can be a decent approximation to that,
and maybe it'll suffice for a couple of years. But really, I think what you would want is you
have a representation that's quite disentangled, and then your goal is to change some small aspect
of that representation, maybe change a couple of these dimensions. So the top-down goal could be
something like, here is the feature vector, and here is a mask, and the mask is very sparse,
and so most of the features in the representation I don't care about. But yeah, I'm sure there are
better ways to do it, and we don't really have working algorithms for it yet. So yeah, excited
to see what people come up with.
Okay, let's move to your next paper, "Action and Perception as
Divergence Minimization," again by yourself et al. I remember seeing this when it came out,
I believe it was in 2020, and I had a sense that you're telling us something
really important, but I admit I really didn't understand a lot of the details, so maybe you
can help us more today. It seems like a grand unified theory for designing these types of agents,
and the scope of all the different types of agents that you've explained in this framework
is really quite diverse and amazing.
The big question this is addressing is, what objective
function should your agent optimize? And the objective function shouldn't just be a reward,
because if you want to solve tasks just based on some task-specific reward, then now you have to
design the reward function. So you still have to basically know what's a good way to solve the task,
because your agent won't work just from sparse rewards. It will need very detailed
shaping rewards to actually solve a task, and so we're just offloading the problem,
or changing the problem from designing the strategy directly to designing the
detailed reward function, and then using RL to fill in the gaps and track that reward function.
So we can take a step back and think about, you know, fundamentally what should the reward
function be for a general agent? And, you know, there are some ideas out there that say, well,
maybe just if you have the right reward function, then everything will be fine,
but that's actually not true. So you can have objective functions that only care about the
inputs the agent is receiving, like a reward function, right? Reward is basically an input
from the environment. It's at least often thought about that way in traditional RL.
And you can also have objective functions that depend not just on what the agent is seeing
or receiving, but also on the agent's internal variables and its actions. And so now you have
a broader class of objective functions, and you can ask the question, well, what is the space of
all these possible objective functions? And it turns out we can actually categorize what the
unsupervised objective functions are that an agent can optimize. Rewards or task-specific
objectives are inherently narrow in the sense that they only work for a specific
domain, and if you're in a different world, they potentially wouldn't make sense there.
You can always find an environment where this reward function is the opposite of what you'd
want to do. So is there something more general? And the answer is yes, there are unsupervised
objective functions that make sense in any environment. And it's similar to how when you
do object classification, you can learn your representations through some unsupervised
objective like CPC or masked autoencoding, something like that. And then you can learn
your classifier on top of that very quickly from a small amount of supervision.
But in the embodied case, for embodied agents, we actually have three different classes of
unsupervised objective functions. And all unsupervised objective functions can be put
into one of these three categories. The first category is to learn representations
from your past data. And the most complete way of doing that is to learn a world model,
to just model your complete trajectories in the past. It doesn't have to be through reconstruction,
it could be through other ways that end up, given some inductive biases, your model architecture,
and so on, maximizing the information that's shared between the past inputs you've received,
let's say in your replay buffer, and the representations that are in your agent.
So you infer representations that are informative of past inputs. Now, that's all you can do if the
data set is fixed and given. But for an embodied agent, you can actually influence your future data
distribution. And so not just can you infer representations that are informative of past
inputs, you can also steer towards inputs in the future that you expect to be informative
of your representations. And so that explains the class of unsupervised exploration objectives,
right, get diverse data. To connect it to the information
maximization, the question basically is: here's a potential future trajectory, how much
information does that share with my representations? In other words, how much will it tell me about how
I should change my representations? And so by collecting a diverse data set, an agent
can set itself up to do better on tasks that it's given later on, because with a diverse
data distribution it can already learn a lot about the world.
This is similar to how, if you learn general representations from the data you have, they will also make
it faster to adapt to a new task later on and help you if the reward is sparse and so on. And then
finally, not just can you maximize mutual information between your representations and future
inputs, but also between your actions and future inputs. These could be either primitive actions or
more abstract actions like goals or skills and so on, latent variables. And now that
gives you a form of what's known as empowerment, where you're trying to choose actions that have
a certain measurable outcome in the environment that have high mutual information with what will
happen in the future with your future sensory inputs. And so that category basically means
your agent without any task specific rewards can learn how to influence the environment.
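As a rough illustration of the mutual information between actions and future inputs described here, the following is a minimal sketch for a tiny discrete case. The two toy environments, the function name, and the variable names are hypothetical and not from any paper discussed in this episode. Empowerment is usually defined with a maximization over the action distribution, so this sketch only evaluates the mutual information for one fixed action distribution.

```python
import math


def mutual_information(p_action, p_next_given_action):
    """Return I(A; S') in bits for a discrete action distribution p(a)
    and a transition table p(s' | a), given as a list of rows per action."""
    n_next = len(p_next_given_action[0])
    # Marginal over next observations: p(s') = sum_a p(a) * p(s' | a)
    p_next = [
        sum(p_action[a] * p_next_given_action[a][s] for a in range(len(p_action)))
        for s in range(n_next)
    ]
    mi = 0.0
    for a, p_a in enumerate(p_action):
        for s, p_s_given_a in enumerate(p_next_given_action[a]):
            # Zero-probability terms contribute nothing to the sum.
            if p_a > 0 and p_s_given_a > 0:
                mi += p_a * p_s_given_a * math.log2(p_s_given_a / p_next[s])
    return mi


# Hypothetical toy setup: two actions, two possible next observations.
uniform_actions = [0.5, 0.5]
controllable = [[1.0, 0.0], [0.0, 1.0]]    # each action fully determines the outcome
uncontrollable = [[0.5, 0.5], [0.5, 0.5]]  # the outcome ignores the action

mi_controllable = mutual_information(uniform_actions, controllable)
mi_uncontrollable = mutual_information(uniform_actions, uncontrollable)
print(f"controllable: {mi_controllable:.2f} bits")
print(f"uncontrollable: {mi_uncontrollable:.2f} bits")
```

In the first toy environment the actions carry one full bit of information about what the agent observes next, and in the second they carry none, which is the sense in which this kind of objective rewards actions that have a measurable outcome.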
And so that's the third category of things we can do as an embodied agent to set ourselves up for
solving new tasks quickly as they come later on. Yeah, so the three categories are representation
learning in the most complete sense learning world models, exploration, and learning how to
influence the environment. And they play together, one benefits the other, right, but they are
distinct objective functions. And they actually don't work that well by themselves. For example,
if you don't have a diverse data set, say if you just have random actions, you can't learn a very
good world model, right? It'll just not see interesting enough data to be valid in a wide
distribution of states. And if you're learning how to influence the environment, but you're doing
that based on narrow data and you're not exploring, that actually has the opposite effect of exploration.
And I think people are starting to realize that, especially for skill discovery methods, like
Variational Intrinsic Control and Diversity Is All You Need, but it's also true for goal-conditioned
policies. And it was an issue in Director that we had to address as well, where initially your
replay buffer is not very diverse, and you're training your low level policy to go to different
points in that distribution of states you've already seen, right? So now your low level policy
knows how to go back to the places that it's been to, and it doesn't know how to go to new places.
And so you really have to make an effort to get the thing to not lock in the data distribution,
but to actually explore new things outside of the distribution. Because otherwise, you'll just
collect more of the data you've already seen, your replay buffer fills up with that, and you get
locked in more and more into your narrow data distribution. So for example, in Director, there
was an issue where an earlier version didn't have the exploration bonus. And it just really liked
green walls, because in DMLab, initially it saw a bunch of green walls. And so the policy basically
just learned to go to green walls. And then the whole replay buffer would fill up with that, and
everything it would ever practice on is going to green walls. And so to get these
algorithms to work in practice, you need exploration. And I think people are starting to realize that,
just from working on these algorithms. But this framework actually explains it to you
from first principles, that you need to combine representation learning and exploration,
and learning how to influence the environment to get a general agent.
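The three categories described above can be written compactly as information objectives. This is only a sketch in assumed notation that does not appear in the conversation: Z stands for the agent's representations, A for its actions (primitive actions or more abstract ones like goals or skills), and X_past and X_future for past and future inputs. Roughly, the first objective would be optimized over the representations, and the other two over the agent's behavior.

```latex
\begin{align*}
\text{Representation learning:} \quad & \max \; I(Z;\, X_{\text{past}}) \\
\text{Exploration:} \quad & \max \; I(Z;\, X_{\text{future}}) \\
\text{Influencing the environment:} \quad & \max \; I(A;\, X_{\text{future}})
\end{align*}
```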
So, I guess you mentioned Director, but how does Dreamer fit
into this framework? And does this framework suggest
to you new types of agents that we haven't seen? What capabilities
will you have now that you have this framework? Can you mix and match agent designs
that no one's thought of using this framework?
Yes, yes, you can. So the way Dreamer fits in
is that it learns the world model, but it doesn't have any unsupervised exploration
or goal-reaching capabilities. Now, we have a paper called Plan2Explore that I think we
talked about during the previous episode.
Yes.
Yeah, that implements the unsupervised
exploration idea. And so that's a powerful combination. Now you can run your algorithm
on its own, it will learn a model and explore diverse data using that model, those two
objectives will reinforce each other. And eventually you'll end up with a world model
that's valid in a lot of different states. And you can very quickly solve new tasks with it,
which we showed in the paper, at least with fairly simple environments, and I think that's an
exciting direction to scale up. And Director builds in the unsupervised way of influencing
the future by having a goal-conditioned policy, which I think is the easiest way to implement
that part of the framework. And it also builds in exploration at the high level because the manager
chooses goals that are exploratory. So Director is the first algorithm that actually combines
all these three aspects in at least one form: it learns a world model, it does unsupervised
exploration, and it learns a goal-conditioned policy. And empirically, it works very well on
sparse reward tasks. So I think there's a lot of promise there, we're starting to try it out on
robots now. But the design space is much larger than that, right? There are a lot of details in
how to implement these different components. And the framework doesn't answer those questions,
those are just empirical questions, and we'll need RL researchers to not worry about missing
out on large language models and actually solve these really important long-term problems that
will bring us towards general AI. And yeah, so there's a lot to be done in that space.
The framework tells you those are the important things to focus on from first principles.
All the things that the agent can do are learning a world model, exploring, and learning how to
influence the environment, plus following domain-specific, task-specific preferences,
which could come from human feedback and demonstrations and so on.
Danijar Hafner, this has been fantastic, as usual. Thanks so much for sharing your time
and your insight with the Talk RL audience today, Danijar Hafner.
Thanks for having me, Robin. I had a great time and looking forward to the next time.