TalkRL podcast is all
reinforcement learning all the time.
Featuring brilliant guests,
both research and applied.
Join the conversation on
Twitter at @TalkRLpodcast.
I'm your host Robin Chauhan.
Rohin Shah is a research scientist
at DeepMind and the editor and main
contributor of the Alignment Newsletter.
Thanks so much for
joining us today, Rohin.
Thanks for having me, Robin.
Let's get started with, um, how
do you like to describe your
area of interest?
On my website, the thing that I say is that I'm
interested in sort of the long-term
trajectory of AI, because it seems like
AI is becoming more and more capable
over time with many people thinking that
someday we are going to get to artificial
general intelligence or AGI, uh, where
AI systems will be able to replace humans
at most economically valuable tasks.
And that just seems like such an important
event in the history of humanity.
Uh, it seems like it would
radically transform the world.
And so it seems
both important and interesting to
understand what is going to happen and
to see how we can make that important
stuff happen better so that we get
good outcomes instead of bad outcomes.
That's a sort of very general
statement, but I would say that
that's a pretty big area of interest.
And then I often spend most of my time
on a particular question within that,
uh, which is what are the chances that
these AGI systems will be misaligned
with humanity in the sense that
they will want something other than,
uh, they will want to do things other
than what humans want them to do.
So, A, what is the risk of that and
how can it arise, and B, how can we
prevent that problem from happening?
So we're going to talk, uh, about some
of this in more general terms later on.
And, but first let's, let's get
a little more specific about
some of your recent papers.
First we have the MineRL BASALT
competition on learning from human
feedback, and that's the Benchmark for
Agents that Solve Almost-Lifelike Tasks.
So I gather this is based on
MineRL, a Minecraft-based RL environment.
We saw some competitions using
that before, but here you're doing
something different with MineRL.
Can you tell us about BASALT
and what's the idea here?
So I think the basic idea is that a
reward function, which is a typical
tool that you use in RL,
I expect your listeners probably
know about reward functions,
if you have to write it down by hand,
is actually a pretty not-great way of
specifying what you want an AI system to
do. Like, reinforcement learning treats
that reward function as a specification
of exactly what the optimal behavior
is in every possible circumstance
that could possibly arise. When you
wrote down that reward function,
did you think of every possible
situation that could ever possibly
arise and check whether your reward
function was specifying the correct
behavior in that situation?
No, you did not do that.
And so we already have lots and lots
of examples of cases where people,
like, tried to write down
their reward function that they
thought would lead to good behavior,
and then they actually ran
reinforcement learning or some
other optimization algorithm with,
uh, with that reward function,
and the AI found some totally
unexpected solution that did get
high reward, but didn't do what
the designer wanted it to do.
And so this motivates the
question, like, all right, how can we
specify what we want the
agent to do without using
handwritten reward functions.
The general class of approaches that has
been developed in response to this is, uh,
what I call learning from human feedback,
or LfHF. The idea here is that you
consider some possible situations where
the AI could do things, and then you,
like, ask a human, hey, in these particular
situations, what should the AI system do?
So you're making more local queries,
um, and, uh, local specifications, rather
than having to reason about every possible
circumstance that can ever arise.
And then given all of this human, this,
like, uh, given a large data set of human
feedback on various situations, uh, you
can then train and, uh, train an agent to
meet that specification as best as it can.
So people have been developing these
techniques, and that includes things like
imitation learning, where you learn
from human demonstrations of how
to do the task, or learning from
comparisons, where humans,
uh, look at videos of two agents, two
videos of agent behavior, and then say,
you know, the left one is better than
the right one, or it includes corrections,
where the agent does something and the human is
like, at this point you should
have like taken this other action
instead that would have been better.
These are all the ways that you can
use human, uh, human feedback to train
an agent, to do the, do what you want.
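
To make the comparisons idea a bit more concrete, here is a minimal sketch. It is not taken from BASALT or any of the submissions. It assumes a linear reward over two made-up clip features (`waterfall_visible`, `fell_in_lava`) and fits it with a Bradley-Terry (logistic) preference model, which is one common way to learn from pairwise comparisons. All names, features, and hyperparameters here are hypothetical.

```python
import math

def train_reward_model(comparisons, n_features, lr=0.5, epochs=200):
    """Fit linear reward weights from pairwise human comparisons.

    comparisons: list of (features_of_preferred_clip, features_of_other_clip).
    Model (an assumption of this sketch): Bradley-Terry,
    P(a preferred over b) = sigmoid(r(a) - r(b)) with r(x) = w . x.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        grad = [0.0] * n_features
        for better, worse in comparisons:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient of log sigmoid(margin) with respect to w is (1 - p) * diff.
            for i, di in enumerate(diff):
                grad[i] += (1.0 - p) * di
        # Gradient ascent on the average log-likelihood of the human's choices.
        w = [wi + lr * g / len(comparisons) for wi, g in zip(w, grad)]
    return w

def reward(w, features):
    """Learned reward for a clip described by its feature vector."""
    return sum(wi * fi for wi, fi in zip(w, features))

# Hypothetical features per clip: [waterfall_visible, fell_in_lava].
# Each pair is (clip the human preferred, clip the human rejected).
comparisons = [
    ([1.0, 0.0], [0.0, 0.0]),  # seeing the waterfall beats not seeing it
    ([1.0, 0.0], [1.0, 1.0]),  # same view, but not falling in lava is better
    ([0.0, 0.0], [0.0, 1.0]),
]
weights = train_reward_model(comparisons, n_features=2)
print(weights)
```

The learned weights should come out positive for the first feature and negative for the second, and that learned reward could then be handed to an ordinary RL algorithm in place of a handwritten one.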
But so people have developed a lot
of algorithms like this, but the
evaluation of them is kind of ad hoc.
Um, people just sort of make up some, uh,
new environment to test their method on.
Uh, they don't really compare on
any like, uh, on, on a standard
benchmark that everyone is using.
So the big idea with BASALT was to,
um, was to change that, to actually
make a benchmark that could reasonably
fairly compare all of these, uh,
all of these different approaches.
So we like, we wanted it to mimic
the real-world situation as much as
possible. In the real-world situation,
you just have, like, some notion
in your head of what task you
want your AI system to do.
And then you have to, you have to take
a learning from human feedback algorithm
and give it the appropriate feedback.
So similarly, in this benchmark, we
instantiate the agent in a Minecraft
world, and then we just tell the
designer, Hey, you've got to train
your agent to say, make a waterfall.
That's one of our tasks, uh,
and then take a picture of it.
So we just tell the
designers, you have to.
So now the designer has in their
head a like notion of what the agent
is supposed to do, but there's no
formal specification, no reward,
function, nothing like that.
So they can then do whatever they want.
They can write down a reward function by
hand, if that seems like an approach they
want to do, they can use demonstrations.
They can use preferences, they
can use corrections, they can
do active learning and so on.
Uh, but their job is to like make an
agent that actually does the task.
Ideally they want to maximize,
uh, performance and minimize costs
both in terms of compute and in
terms of how much human feedback
it takes to train the agent.
So I watched, uh, the presentations
of the top two solutions and
it seemed like they were
very different approaches.
Uh, the first one, KAIROS, I would
say seemed like a lot of hand
engineering, and I think they used 80,000-plus
labeled images and built some
very specific components for this.
They kind of decomposed the problem, which
I think is a very sensible thing to do.
But then also, uh, the
second one was Obsidian.
They produced this inverse Q-learning
method, a new method, which seemed
like a more general theoretical solution.
I just wonder if you have any comments
on the different types of solutions
that came out of this or those kind of
two main classes that you saw or did
any classes of solutions surprise you?
Yeah, I think that's basically right.
I don't think they were
particularly surprising, in that
we spent a lot of time making sure
that the tasks couldn't trivially be
solved by just doing, um, hand
engineering, like classical programming.
So even the top team
did rely on a behavior-cloned
navigation policy, uh, that used
a neural network, but it's true
they did a bunch of engineering on
top of that, which, according
to me, is just a benefit of this setup.
It shows you like, Hey, if you're just
actually trying to get good performance,
do you train a neural network end to end?
Or do you put in a, or do you put
in domain knowledge and how much
domain knowledge do you put in
and uh, how, how do you do it?
And it turns out that in this particular
case, the domain knowledge, well, they
did end up getting first, but Team
Obsidian was quite close behind.
So I would say that the two experiences
were actually pretty comparable.
And I do agree that I would say one is
more of an engineering-y solution
than the other one.
So it seems to me like the goals here were
things that could be modeled and learned.
Like it seems feasible to learn the
concept, or to train a network to
learn the concept, of looking at a
waterfall if you had enough labels.
And I guess that's what
some contestants did.
But do you have any comments on if
we were to want goals that are
harder to model than these things?
I was trying to think of examples
and came up with, like, irony
or dance choreography scoring.
Like how would you even begin
to, to model those things?
Do we have to just continue improving
our modeling toolkit so that we can make
models of these, uh, reward functions?
Or is there some other strategy?
Uh, it depends exactly what you mean
by improving the modeling toolkit,
but basically I think the answer is
yes, but you know, the way that we
can improve our modeling toolkit, it
may not look like explicit modeling.
So for example, for irony, I
think you could probably get
a decent, well, maybe not.
Uh, it's plausible that you could get
a decent, uh, reward model out of a
large language model that, like, does
in fact have the concept of irony.
Um, if I remember correctly, large
language models are not actually that
great at humor, so I'm not sure
if they have the concept of irony,
but I wouldn't be surprised if
further scaling did, in fact, give
them a concept of irony, such that we
could use, uh, we could then use them
to have rewards that involve irony.
I think that's the same sort
of thing as like waterfall.
Like I agree that we can learn
the concept of a waterfall,
but it's not a trivial concept.
If you asked me to program it
by hand, I would have no idea.
Like the only input.
You get pixels as an input.
If you're like, here's
a rectangle of pixels.
Please write a program that
detects the waterfall in there.
I'm like, oh God, that
sounds really difficult.
I don't know how to do it, but we
can, if we apply machine learning,
then like turns out that we can
recognize these sorts of concepts.
And similarly, I think it's not
going to be like, I definitely
couldn't write the program, uh,
directly that can recognize irony.
But if you do machine learning, if
you use machine learning to model all
the texts on the internet, uh, the
resulting model does in fact have a
concept of irony that you can then
try to use in your reward functions.
And then there's a Twitter
thread related to disinformation.
And I shared a line from your paper
where you said learning from human
feedback offers the alternative
of training recommender systems to
promote content that humans would
predict would improve the user's
well-being. And I thought that
was a really cool insight.
Is that something you're interested
in pursuing, or do you
see that, uh, being a thing?
I don't know whether or not it
is actually feasible currently.
Uh, one thing that needs to be true of
recommender systems is they need to be
cheap to run because they are being run
so, so many times every day. I
don't actually know this for a fact,
I haven't actually done any Fermi
estimates, but my guess would be that
if you try to actually run GPT-3
on say, um, Facebook posts in order to
then, uh, to then rank them, I think
that would just be, that would probably
be prohibitively expensive for Facebook.
So there's a question of like, can
you get a model that actually makes
reasonable predictions about the user's
well-being that can also be run
cheaply enough that it's not a huge, uh,
expense to whoever is implementing
the recommendation system? And also,
does it take a, like, sufficiently small
amount of human feedback that you aren't
bottlenecked on cost, uh, from
the humans providing the feedback?
And also do we have algorithms
that are good enough to, uh, train
recommender systems this way?
I think the answer is plausibly yes
to all of these.
Uh, I haven't, it's just that I haven't
actually checked myself nor have I even
like, tried to do any feasibility studies.
The line that you're quoting
was more about like, okay,
why do this research at all?
And I'm like, well, someday in the
future, this should be possible.
And I stick by that. Like, someday
in the future, things will
become significantly cheaper,
learning from human feedback
algorithms will be a lot better, and so on.
And then, like, it will just totally
make sense to use recommender systems
trained with human feedback, unless we
found something even better by then.
It's just not obvious to me
that it is the right choice currently.
I look forward to that and, uh, uh,
I'm really concerned, like many people
are about the disinformation and the
divisiveness, uh, of social media.
So that sounds great.
I think everyone's used to
very cheap reward functions,
uh, pretty much across the board.
So I guess what you're kind of pointing
to with these reward functions is
potentially more expensive to evaluate
reward functions, which maybe
hasn't been a common thing until now.
Both more expensive reward functions,
and also the model that you train with
that reward function might be,
might still be very expensive to do
inference with. Presumably recommender
systems right now, like,
you know, run a few linear
time algorithms on the post in order
to, like, compute, like, a hundred or a
hundred thousand features, then do a dot
product with a hundred thousand weights,
and then, like,
rank things in order
by those numbers.
And that's like, you know, maybe a
million flops or something, which is
a tiny, tiny number of flops, whereas,
like, a forward pass through GPT-3 is
several hundred billion flops.
Uh, so that's a, like, uh, 10 to
the five X increase in the amount
of computation you have to do.
Uh, actually, no, that's one forward
pass through GPT-3, but there
are many words in a Facebook post.
So multiply the 10 to the five by the
number of words in the Facebook post.
Uh, and now we're at, like, maybe more
like 10 to the seven times cost increase
just to do inference, even assuming
you had successfully trained a
model that could do recommendations.
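
The back-of-the-envelope estimate here can be written out as a short calculation. This is only a sketch of the arithmetic in the conversation. The figure of roughly 2 FLOPs per parameter per token and the 100-token post length are assumptions added for illustration, and the million-FLOP ranker is the rough guess given above.

```python
# All numbers are rough orders of magnitude, not measurements.
linear_ranker_flops = 1e6   # "maybe a million flops" per post, from the conversation
gpt3_params = 175e9         # GPT-3 parameter count
flops_per_param = 2         # assumption: about 2 FLOPs per parameter per token (forward pass)
tokens_per_post = 100       # assumption: a post of roughly 100 tokens

# "Several hundred billion flops" for one forward pass over one token.
gpt3_flops_per_token = gpt3_params * flops_per_param

# Cost increase for a single forward pass, then for a whole post.
per_token_ratio = gpt3_flops_per_token / linear_ranker_flops
per_post_ratio = per_token_ratio * tokens_per_post

print(f"per-token cost ratio: {per_token_ratio:.1e}")
print(f"per-post cost ratio:  {per_post_ratio:.1e}")
```

Under these assumptions the two ratios land in the 10^5 and 10^7 ranges, which matches the rough figures quoted in the conversation.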
And the end result may be lowering
engagement for the benefit of less
divisive content, which is maybe not
in the interest of the, of the social
media companies in the first place.
There's also a question of, I
agree, whether the companies will
want to do this, but I think if,
I don't know, if we, like, showed that
this was feasible, uh, that would give
regulators a much more, like, I
think a common problem with regulation
is that you don't know what to regulate
because there's no alternative on the
table for what people are already doing.
And if we were to come to them and
say, look, there's this learning
from human feedback approach,
we've like, calculated it out.
This should, this should only increase
costs by two X or maybe, uh, uh,
yeah, this should, maybe this is
like just the same amount of costs.
Um, and it shouldn't be too hard for
companies to actually train such a model.
They've already got
all the infrastructure.
It should barely be like, I
don't know, a hundred thousand
dollars to train the model once.
And like, if you like lay out that
case, I think it's much, I would
hope at least that it would be a
lot easier for the regulators to be
like, yes, everyone, you must train
recommender systems to be optimizing
for what humans would predict is good, as
opposed to whatever you're doing right
now. That could really change the game.
And then the bots or the divisive
posters are now trying to game that,
that new reward function and then
probably find some different strategies.
Yeah, you might, you might
imagine that you have to like
keep retraining in order to
deal with new strategies that are,
uh, that people are finding in
response to like, we can't do this.
I don't have any special information
on this from working at
Google, but I'm told that Google is
actually, like, pretty good at
defeating spammers, for example. Like,
in fact, my Gmail spam filter works
quite well as far as I can tell,
uh, despite the fact that spammers are,
uh, constantly trying to evade
it, and, well, hopefully we
could do the same thing here.
Let's move on to your next
paper, Preferences Implicit
in the State of the World.
I understand this paper is closely
related to your dissertation.
We'll link to your dissertation
in the show notes as well.
I'm just going to read a quote and I
love how you distilled this key insight.
You said the key insight of this paper
is that when a robot is deployed in an
environment that humans have been acting
in, the state of the environment is
already optimized for what humans want.
Can you, um, tell us the general idea here
and what do you mean by that statement?
Maybe like put yourself in the
position of a robot or an AI system
that knows nothing about the world.
Maybe it's like, all right, sorry.
Like it knows the laws
of physics or something.
It knows that like there's gravity,
it knows that, like, there are solids,
liquids, and gases,
liquids, uh, tend to, you know,
take the shape of the container
that they're in, stuff like that.
Um, but it doesn't know anything
about humans or maybe like, you know,
it was, it was, we imagined that
it's sort of like off in other parts
of the solar system or whatever,
and it hasn't really seen it yet.
And then it comes to Earth and
it's like, whoa, Earth has these
like super regular structures.
There's like these like very,
uh, cuboidal, um, structures with
glass panes at regular intervals.
Um, that often seem to have lights inside
of them, even though, even at night
when there isn't light outside of, uh,
outside of them, this is kind of shocking.
You, you wouldn't expect this
from a random configuration of
atoms, um, or something like that.
There is some sense in which
the order of the world that
we humans have imposed upon
it is, like, extremely surprising,
um, if you don't know about humans
already being there and what they want.
So then you can imagine, uh,
asking your AI system, Hey,
you see a lot of order here.
Uh, can you like figure out an
explanation for why this order is there?
Um, perhaps, uh, and then you,
and maybe you get, and then you give
it the hint of like, look, it's, we're
going to give you the hint that it was
created by somebody optimizing the world.
What sort of things might
they have been optimizing for?
And then you, like, you know, you look
around and you see that, like, oh, liquids,
they tend to be in these, like, glasses.
It would be really easy to tip over the
glasses and have all the liquid spill out.
But, like, that mostly doesn't happen.
So people must want to have
their liquids in glasses,
and probably I shouldn't knock them over.
They're, like, kind of fragile.
You could, like, easily just, like, move them
a little bit to the left or right
and they would, like, fall down and break.
Um, and once they are broken,
you can't then reassemble them.
But nonetheless, they're still not broken.
So like probably someone like
actively doesn't want them to break
and is leaving them on the table.
So really I would say the idea is
the order in the world did not
just happen by random chance.
It happened because of human optimization.
And so from looking at the order of
the world, you can figure out what
the humans were optimizing for.
That's the basic idea
underlying the paper.
So there's some kind of relationship
here to inverse reinforcement
learning where we're trying to
recover the reward function from,
from observing an agent's behavior.
But here you're not observing
the agent's behavior.
So it's not quite inverse RL.
How would you describe the
relationship between what you're
doing here and a standard inverse RL?
So in terms of the formalism, um,
inverse RL says that you
observe the human's behavior over time.
So that's the sequence of
states and actions that the
human took within those states.
Whereas we're just saying no, no, no.
We're not watching the human's behavior.
We're just going to see only the,
the state, the current state.
That's the only thing that we see.
And so, in the framework of inverse RL,
you can think of this as
either the final state of the
trajectory or a state sampled from
the stationary distribution, from an
infinitely long trajectory, uh, either
of those would be reasonable to do, but
you're only observing that one thing
instead of observing the entire state
action history, um, starting from a
random initialization of the world.
But other than that, you just make
that one change and then you run
through all the same math and you
get a slightly different algorithm.
And that's basically what we,
uh, did to, to make this paper.
So with this approach, I guess
potentially you're opening up a huge
amount of kind of unsupervised learning
just from observing what's happening.
And you can kind of almost
do it instantaneously in
terms of observation, right?
You don't have to watch billions
of humans for thousands of years.
Um, it does require that your
AI system knows like the laws of
physics or as we would call it
in RL, the transition dynamics.
Or, well, it needs to either know
that, or have some sort of data from
which it can learn that because if
you're just, if you just look at the
state of the world and you have no
idea of what the laws of physics are
or how, how things work at all, you're
not going to be able to figure out
how it was optimized into this state.
Like if you want to infer that humans
don't want their vases to be broken,
it's an important fact, in order to
infer that, that if a vase is broken,
it's very hard to put it back together.
And that is a fact about the transition
dynamics, which we assumed by fiat
that the, that the agent knows.
But yes, if you had
enough data such that self-supervised
learning could teach
the agent a bunch of dynamics,
and then, like, also
the agent could go around
looking at the state of the world,
in theory it could then, uh, infer
a lot about what humans care about.
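
The vase reasoning can be illustrated with a toy Bayesian calculation. This is a deliberately simplified sketch and not the algorithm from the paper. It assumes a human who, at each step, chooses between walking around the vase (reward 0) and a shortcut that breaks it irreversibly, that the human is noisily (Boltzmann) rational about that one-step choice, and that the robot knows these dynamics. The penalty values, the step count, and the function names are all made up.

```python
import math

def p_break_per_step(vase_penalty, beta=1.0):
    """Chance per step that a noisily rational human breaks the vase.

    Assumed setup: two actions, 'walk around' (reward 0) and 'shortcut'
    (reward -vase_penalty, and the vase breaks for good). Action
    probabilities are proportional to exp(beta * reward).
    """
    shortcut = math.exp(-beta * vase_penalty)
    return shortcut / (1.0 + shortcut)

def posterior_given_intact(hypotheses, prior, steps):
    """Posterior over reward hypotheses after observing only 'vase is intact'.

    Because a broken vase cannot be put back together (known dynamics),
    an intact vase after `steps` steps means it was never broken.
    """
    likelihood = {
        name: (1.0 - p_break_per_step(penalty)) ** steps
        for name, penalty in hypotheses.items()
    }
    evidence = sum(prior[name] * likelihood[name] for name in hypotheses)
    return {name: prior[name] * likelihood[name] / evidence for name in hypotheses}

# Hypothetical reward hypotheses: penalty the human assigns to a broken vase.
hypotheses = {"indifferent": 0.0, "cares_about_vase": 3.0}
prior = {"indifferent": 0.5, "cares_about_vase": 0.5}
posterior = posterior_given_intact(hypotheses, prior, steps=20)
print(posterior)
```

Nothing about the human's actual trajectory is observed here. The single observation that the vase is still intact, combined with the irreversibility encoded in the dynamics, is what shifts almost all of the probability onto the hypothesis that the human cares about the vase.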
So I very clearly remember meeting you
at NeurIPS, uh, 2018 Deep RL workshop
in Montreal, the poster session.
And I remember your poster on
this, um, and you showed a dining
room that was all nicely arranged.
And, uh, and, and you were saying
how a robot could learn from
how things are arranged.
And, and I just want to say, I'll say
this publicly, I didn't understand,
uh, at that point what, what you
meant or why that could be important.
Um, and it was so different.
Your angle was just so different
than everything else that was
being presented, um, that day.
And I really didn't get it.
So I, I, and I'll own that.
Uh, it was, it was my loss.
And, uh, so thanks for your patience.
It only took me three and a half years
or something to come around.
Uh, sorry I didn't communicate
that clearer, I suppose. I
don't think it was no, I don't
think it was at all on you.
Um, but I, uh, maybe I just
lacked the background to see why
I like to understand, um, let,
let me, let me put it this way.
Like how often do you find people who
have some technical understanding of
AI, but still, maybe don't appreciate,
uh, some of this line of work, including
alignment and things like that.
Is that a common thing?
I think that's reasonably common.
And what do you attribute that to?
Like what's going on there
and is that changing at all?
Or I think it's pretty interesting.
I don't think that these people
would say that like, oh, this is
a boring paper at all, or this is
an incompetent paper.
I think they would say yes, the person
who wrote this paper has
in fact done something impressive by
the standards of, like, you
know, did you need to be intelligent and
like, do good math in order to do this?
I think they are more likely to say
something like, okay, but, so what,
and that's not entirely unfair.
Like, you know, it was the
deep RL workshop and here I
am talking about like, oh yes.
Imagine that you'd like,
know all the dynamics.
And also you're, like, only getting
to look at the state of the world.
Uh, and then you like, think about
how vases can be broken, but then
they can't be put back together.
And voila, you've learned that
humans don't like to break vases.
It's just something
so different from all of the things
that RL usually focuses on.
Like, it doesn't have any
of the buzzwords.
There's no like, you know, deep
learning, there's no exploration.
There's no, um, uh, there's
no catastrophic forgetting
no, nothing like that.
And to be clear, all of those seem
like important things to focus on.
And I think many of the people who were
at that workshop were focusing on
those and are doing good work on them.
Uh, and I'm just doing
something completely different.
That's like, not all that interesting
to them because they want to
work on reinforcement learning.
I think they're making a mistake
in the sense that like AI alignment
is important and more people should
work on it, but I don't think
they're making a mistake in that.
They're probably correct about what
does and doesn't interest them.
Just so I'm clear, I was not
critiquing your math or the
value of anything you were doing.
It was just my ability to understand
the importance of this type of work.
And I didn't think you were.
Okay, so I will say that that day, when I
first encountered your, your poster,
I was really hung up on edge cases.
Uh, like, um, you know, there's in the
world, the robot might observe there's
hunger and there's traffic accidents.
And there's things
like, not everything is perfect,
and we don't want the robot to
replicate all these, all these flaws
in the world. Or in the dining room,
there might be, you know,
dirty dishes or something.
And so the world is clearly not
exactly how we want it to be.
So how, how is that, is that an issue or
is that, is that, uh, is that not an issue
or is that just not the point of this?
Uh, not, not addressed here?
It depends a little bit.
I think in many cases it's not
an issue if you imagine that the
robot somehow sees the entire world.
Um, so for example,
you mentioned hunger.
Uh, I think the robot would notice
that we do in fact spend a lot of
effort, making sure that at least
a large number of people don't go hungry.
We've built these giant vehicles,
both trucks and cargo ships, and
so on, that move food around in a
way that seems at least somewhat
optimized to get food to people who
like that food and want to eat it.
So there's lots of
effort being put into it.
There's not like the maximum
amount of effort being put in.
Which I think reflects the fact
that there are things that we
care about other than food.
So, so I do think it would
be like, all right, humans
definitely care about having food.
I think it might then, like, if you
use the assumption that we have in
the paper, which is that
humans are noisily rational, then it might
conclude things like, uh, yes, Western
countries care about getting food to,
um, Western citizens, to
the citizens of their country,
and they care a little bit about, uh,
other people having food, but, like, not
that much. It's, like, a small portion
of their, uh, governments' aid budget.
So, like, there's a positive weight
on there, but a fairly small weight.
And that seems like maybe not the
thing that we wanted to learn, but like
also I think it is in some sense, an
accurate reflection of what Western
countries care about if you go by their
actions rather than what they say.
So I, uh, I'm going to move on to
Benefits of Assistance over Reward Learning.
And this one was absolutely fascinating
to me actually, mind blowing.
I highly recommend people read
all of these, but, but definitely
I can point to this one as,
um, something surprising to me.
So that was you as the first author.
And, uh, can you share, what
is the, what's the general
idea of this paper?
I should say that this general
idea was not novel to this paper,
it's been proposed previously.
I am not going to remember the
paper, but it's by Fern et al.
It's, like, towards a
decision-theoretic model of
assistance or something like that.
Um, and then there's also cooperative
inverse reinforcement learning
from CHAI, where I did my PhD.
The idea with this paper was just to
take the models that had already
been proposed in these papers and
explain why they were so nice, why
I was, like, particularly keen on
these models as opposed to, um, other
things that the field could be doing.
So the idea here
is that generally we want to build
AI systems that help us do stuff.
And you could imagine two different
ways that this could be done.
Uh, first you could imagine a system
that has two separate modules.
One module is
trying to figure out what
the humans want, or what the
humans want the system to do.
And the other module is
trying to then do the things that the first
module said the people wanted it to do.
And that's kind of like the, um, when we
talked about learning from human feedback
earlier on and modeling reward functions,
is that what that would be, exactly?
Um, I think that is,
that that's often what
people are thinking about.
I would make a distinction
between how you train the AI system
and what the AI system is doing.
This paper, I would say is more
about what the AI system is doing.
Whereas the learning from human
feedback stuff is more about,
um, how you train the system.
So in the, what the AI system is
doing framework, I would call this a
value learning or reward learning, and
then the alternative is assistance.
And so, although there's like some
surface similarities between learning
from human feedback and reward learning,
it is totally possible to use learning
from human feedback algorithms to train
an AI system that
then acts as though
it is in the assistance
paradigm. It is also possible to
use learning from human feedback
approaches to train an AI system
that then
acts as though it is in
the reward learning paradigm.
So that's one distinction.
To recap, the value learning or
reward learning, uh, side of the two,
two models is two separate modules.
One that like figures out what
the humans want and the other that
then acts to optimize those values.
And the other side, which, which we
might call assistance is where you
still have both of those functions, but
they're combined into a single module.
And the way that you do this is you
have the AI system posit that there
is some true unknown reward function
theta. Only the human, who
is a part of the environment, uh,
knows this theta, and their behavior
depends on what theta actually is.
And so now the AI system just has to
act in order to maximize
theta, but it doesn't know theta.
So it has to, like, look at how
the human is behaving within the
environment in order to, like, make some
inferences about what theta probably is.
Uh, and then as it gets more and more
information about theta, that allows
it to take more and more, like, uh,
actions in order to optimize theta.
But fundamentally this, like, uh,
learning about theta is an instrumental
action that the agent predicts
would be useful for helping it to
better optimize theta in the future.
So if I understand correctly, you're
saying assistance is superior because
it can, the agent can reason about
how to improve its model of, of
what the human wants, or how do you
describe why you get all
these benefits from assistance?
I think the benefits come
more from the fact that these
two functions are integrated.
There's the, uh,
reward learning or
value learning, and the control,
so, like, acting to optimize the value.
So we can think of these
two functions in assistance.
They're merged into a single
module that does like nice, good
basion reasoning about all of it.
Whereas in the value learning
paradigm, they're separated.
And it's this integration
that provides the benefits.
You can make plans, which is
generally the domain of control,
but those plans can then depend on
the agent believing that in
the future, it's going to learn
some more things about the reward
function theta, which would normally
be the domain of value learning.
So that's an example where control
is using future information
about value learning
in order to make its plans.
Whereas when those two modules
are separated, you can't do that.
Um, and so one example that we have
in the paper is, you imagine
that you've got a robot who
is asked to cook dinner for Alice.
Well, not
cook dinner, bake a pie for Alice.
Alice is currently at the office,
so the robot can't talk to her.
And unfortunately the robot
doesn't know what kind of pie she
wants, maybe apple, blueberry, or cherry.
The robot could guess, but
its guess is not that likely to be right.
However, it turns out
the steps to make the pie
crust are the same for all three pies.
So an assistive robot can reason:
hey, my plan is first, make the pie
crust, then wait for Alice to get home,
then ask her what filling she wants,
then put the filling in.
And that entire plan consists of both
taking actions in the environment,
like making the crust and putting
in the filling, and also includes
things like learning more about
theta by asking Alice a question.
And so it's integrating all
of these into a single plan, whereas
that plan cannot be expressed in
the value learning paradigm.
The query is an action in the action space.
So I, um, I really like the, uh,
you laid out some levels of task
complexity, and I'm just going to
go through them really briefly.
You mentioned traditional CS is,
uh, giving instructions to computer
on how to perform a task and then
using AI or ML for simpler tasks
would be specifying what the task is.
Um, and the machine
figures out how to do it.
I guess that's standard RL formulation.
And then, for harder tasks,
specifying the task is difficult.
So the agent may learn
a reward function from human feedback.
Um, and then
you mentioned the assistance paradigm
as the next level, where the human is
part of the environment and has latent
goals that the robot does not know.
How do you see this ladder?
Like, is this a
universal classification scheme?
Are we done?
Is that the highest level?
Good question.
I haven't really thought about it before.
You can imagine a different version of the
highest level. Here, we've
talked about the assistance framing where
you're like, there is some objective, but
you have to infer it from human feedback.
There is a different version that
maybe is more in line with the way
things are going with deep learning
right now, which is more like
specifying the task is difficult.
So we're only going to
evaluate behaviors that the AI
agent shows, and maybe also
try to find some hypothetical
behaviors and evaluate those as well.
So that's a different way that you
could talk about this highest level,
where you're evaluating specific
behaviors rather than trying to specify
the task across all possible behaviors.
And then maybe that would
have to be the highest.
And now you could just keep inventing
new kinds of human feedback inputs,
uh, and maybe those can be thought of
as higher levels beyond that as well.
Um, so then, one detail
I saw in the paper: you
mentioned that two-phase assistance
is equivalent to reward learning.
And I puzzled over that line
and I couldn't really quite
understand what you meant.
Can you say a little bit more about that?
What does that mean?
And how do you conclude that
those two things are equivalent?
So there are a fair number
of definitions here.
Maybe I won't go through
all of it, but just so that
listeners know,
we had formal definitions of
what counts as assistance and what
counts as reward learning.
And in the reward learning
case, we
imagined it as, first,
you have a system that asks the
human questions. Or actually, it doesn't
have to ask the human questions, but
first we have a system that interacts
with the human somehow and develops
a guess of what the reward function is.
And then that guess of what the
reward function is, which could be a
distribution over rewards, is passed
on to a system that then acts to
maximize the expected value of the,
sorry, the expected reward, according
to that distribution over rewards.
So once it's done its communication,
it's learned the reward, and in phase
two, it doesn't have
any query as an action at that point.
That's what you're saying.
Yes. And so then, in two-phase
communicative assistance,
the "two-phase" and the "communicative"
both have technical definitions, but they
roughly mean exactly what you would expect
them to mean in order to make this true.
Um, so you mentioned three
benefits of using assistance,
this assistance paradigm.
Can you briefly explain
what those benefits are?
The first one, which I already
talked about, is plans
conditional on future feedback.
So this is the example of where the
robot can make a plan that says,
Hey, first, I'll make the pie crust.
Then I'll wait for Alice to
get back from the office.
Then I'll ask her what filling she wants.
Then I'll put in the appropriate filling.
So there,
the plan was conditional on the answer
that Alice was going to give in the future,
that the robot predicted she would give.
But it couldn't actually
ask the question now.
So that's one thing that can
be done in the assistance paradigm,
but not in the value learning
or reward learning paradigm.
Uh, a second one is what we call
relevance-aware active learning.
So active learning is the idea
that, instead of
the human giving a bunch of information
to the robot and the robot passively
taking it and using it to update its
estimate of theta, the robot actively
asks the human questions
that seem most relevant to updating its
understanding of the reward theta, and
then the human answers those questions.
So that's active learning, and that
can be done in both paradigms.
The thing that assistance can do is to
have the robot only ask questions that
are actually relevant for the plans
it's going to have in the future.
So to make this point,
you might imagine that
you get a household robot,
and your household robot's booting up.
And if it was in the reward
learning paradigm, it has to
figure out theta, right?
And so it's like, all right,
at what
time do you tend to prefer dinner,
so I can cook that for you?
And that's a pretty reasonable
question, and you're like, yeah, I
usually eat around 7:00 PM.
And it's got a few more questions
like this, and later on, it's like,
well, if you ever wanted to paint your
house, what color should we paint it?
And you're like, kind of like
a blue, I guess, but
why are you asking me this?
And then it's like, if aliens
come and invade from Mars,
where would be your
preferred place to hide?
And you're like, why
are you asking me this?
But the thing is, all of these
questions are in fact relevant for
the reward function theta.
The reason that, if this were a human
instead of a robot, they wouldn't
ask these questions is because the
situations to which they're relevant
probably don't come up.
But in order to make that
prediction, you need to be talking more
to the control sub-module,
which is a thing that
the reward learning paradigm
doesn't do. The control module is
the one that's like, all right,
we're probably
going to take these sorts of actions.
That's going to lead to
those kinds of futures.
And so, you know, probably
aliens from Mars aren't
ever going to be relevant.
So if you have this one unified
system, then it can be like, well,
okay, I know that aliens from
Mars are probably not going to show
up anytime in the near future,
and I don't need to ask about
those preferences right now.
If I do find out that aliens
from Mars are likely to land soon,
then I will ask that question, but I
can leave that to later and not bother
Alice until that actually happens.
Um, so that's the second one.
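One hedged way to picture relevance-aware active learning is as a filter on questions: ask only when the chance the situation arises, times the benefit of knowing the answer, outweighs the cost of bothering the human. The probabilities, gains, and `ASK_COST` below are invented for illustration, and the rule itself is a simplification rather than the paper's formalism.

```python
# Each entry: (question, assumed probability that the situation comes up
# in the robot's likely futures, assumed gain from knowing the answer).
QUESTIONS = [
    ("What time do you prefer dinner?", 0.95, 5.0),
    ("What color should we paint the house?", 0.05, 8.0),
    ("Where would you hide if aliens from Mars invade?", 1e-9, 100.0),
]

ASK_COST = 0.5  # assumed cost of interrupting the human once


def worth_asking(prob_relevant, gain, cost=ASK_COST):
    """Ask only if the expected benefit exceeds the cost of asking."""
    return prob_relevant * gain > cost


def questions_to_ask(questions):
    """Keep the questions that matter for the futures the robot expects."""
    return [q for q, prob, gain in questions if worth_asking(prob, gain)]


print(questions_to_ask(QUESTIONS))
```

The `prob_relevant` term is the part that needs the control side of the system, since only the planner knows which futures are likely.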
And then the final one is that.
You know, so far, I've been talking
about cases where the robot is learning
by asking the human questions.
And the human just gives
answers that are informative
about the reward function theta.
Uh, the third one is that, you know, you
don't have to ask the human questions.
You can also learn from their
behavior just directly while
they're going about their day
and optimizing their environment.
A good example of this is like your robot
starts helping out around the kitchen.
It starts by doing some like very obvious
things like, okay, there is some dirty
dishes, just put them in the dishwasher.
Um, meanwhile the human's going around and
starting to collect the ingredients
for baking a pie. The robot can see
this, notice that that's the case,
and then go and get out the
mixing bowl and the egg beater and so on,
in order to help.
This sort of just
seeing what the human is up to and
then immediately starting to
help with that
is all
happening within a single episode,
rather than being across episodes.
The value learning or reward learning
paradigm could do it across episodes, where
first the robot looks and watches
the human act in the environment
to make an entire cake from scratch.
And then the next time, when the robot is
actually in the environment, it goes and
helps the human out. But in the assistance
paradigm, it can do that learning and help
out with making the cake within the episode
itself, as long as it has enough
understanding of how the world works and
what theta is likely to be, in order to
actually deduce with enough confidence
that those actions are good to take.
When you described the robot that
would ask all these irrelevant
questions, I couldn't help,
I'm a parent,
I couldn't help thinking,
you know, that's the kind of
thing a four-year-old would do:
ask you every random question
that's not relevant right then.
And it seems like you're
kind of pointing to a
more mature type of intelligence.
A lot of this is like, like this, the
entire paper, uh, has this assumption
of like, we're going to write down
math and then we're going to talk about
agents that are optimal for that math.
We're not going to bother thinking
of, we're not going to think
about like, okay, how do we in
practice get the optimal thing.
We're just like, is the optimal thing,
actually, the thing that we want.
Uh, and so one would hope that yes, uh,
if we're assuming the actual optimal
agent, it should in fact be, um, more
mature than four year olds, one hopes.
So, can you
relate this assistance paradigm
back to standard inverse RL?
What is the relationship
between these two paradigms?
So inverse RL is an
example of the reward learning paradigm.
It assumes that you get full
demonstrations of the entire task,
executed by the
human teleoperating the robot.
There are versions of it
that don't assume the teleoperation
part, but usually that's an assumption.
And then, given the
teleoperated robot demonstrations of how to
do the task, the robot is
supposed to infer what the task actually
was and then be able to do it itself in
the future without any teleoperation.
So without uncertainty? Is it true
that the inverse RL paradigm assumes
that we are not uncertain in the end?
It doesn't necessarily assume that. I
think in many deep IRL algorithms that
does end up being an assumption that
they use, but it's not a necessary one.
It can still be uncertain,
and then it would plan typically with
respect to maximizing the expectation of
the reward function, although you could
also try to be conservative or
risk-sensitive, and then you
wouldn't be maximizing
expected reward. Maybe you'd be
maximizing worst-case reward if
you wanted to be maximally conservative,
or fifth
percentile reward or something like that.
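To make the difference between these objectives concrete, here is a small sketch that scores two actions under sampled reward functions. The action names and reward numbers are hypothetical, and the nearest-rank percentile is just one simple convention.

```python
import math

# reward_samples[action] lists that action's reward under each of ten
# reward functions sampled from the posterior (all numbers invented).
reward_samples = {
    "bold": [10.0, 9.0, 8.0, 9.0, 10.0, 9.0, 8.0, 10.0, 9.0, -20.0],
    "cautious": [4.0, 5.0, 4.0, 5.0, 4.0, 5.0, 4.0, 5.0, 4.0, 3.0],
}


def expected(xs):
    """Risk-neutral objective: mean reward over the samples."""
    return sum(xs) / len(xs)


def worst_case(xs):
    """Maximally conservative objective: the worst sampled reward."""
    return min(xs)


def percentile(xs, q):
    """Nearest-rank q-th percentile (q in 0..100) of the samples."""
    ordered = sorted(xs)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def best_action(score):
    """Pick the action whose samples score highest under `score`."""
    return max(reward_samples, key=lambda a: score(reward_samples[a]))


print(best_action(expected))
print(best_action(worst_case))
print(best_action(lambda xs: percentile(xs, 5)))
```

With only ten samples the fifth percentile coincides with the worst sample; with more samples the two objectives would separate.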
So, so there can be uncertainty, but like
the human isn't in the environment and
there's this episodic assumption where
like the demonstration is one episode
and then when the robot is acting,
that's a totally different episode.
And that also isn't true
in the assistance case.
You talk about active reward learning
and interactive reward learning.
Can you help us understand those,
those two phrases and how they differ?
So active reward learning is just
when, in the reward learning paradigm,
the robot is given the ability to
ask questions, rather than
just getting to observe
what the human is doing.
So hopefully that one
should be relatively clear.
The interactive reward learning
setting is, uh, it's mostly just
a thing we made up because it was
a thing that people often brought
up as like, maybe this will work.
So we wanted to talk about it and show
why it doesn't in fact work.
But the idea there is that
you still have
your two modules, one reward
learning module and one control
module, and they don't talk to each
other, but instead of just doing
one reward learning phase
and then doing control forever,
you do, I don't know,
10 steps of reward learning, then 10
steps of control, then 10 steps of
reward learning, then 10 steps of control.
And you keep iterating
between the two stages.
So why is computational complexity
really high for algorithms that
try to optimize over assistance?
I think you mentioned that here.
So everything I've talked about has
just sort of assumed that the agents are
optimal by default, but if you think
about it, what the optimal agent has to
do is maintain a
probability distribution over all of the
possible reward functions that Alice could
have and then update it over time as it
sees more and more of Alice's behavior.
And as you probably know, full Bayesian
updating over a large list of hypotheses
is very computationally intractable.
Another way of seeing it is
that, if you take this
assistance paradigm, you can,
through a relatively simple reduction,
turn it into a partially observable
Markov decision process, or POMDP.
The basic idea there is to treat
the reward function theta as
some unobserved part of the state.
And then the reward
is whatever that unobserved
part of the state would say.
And then Alice's
behavior is thought of as part of the
transition dynamics, which depend
on the unobserved part of the state,
that is, on theta.
So that's the rough reduction for
how you phrase assistance as a POMDP.
And then POMDPs are known to be
very computationally intractable to solve,
again for basically the same reason that
I was just saying, which is that, to
actually solve them, you need to maintain
a Bayesian probability distribution
over all the ways that the
unobserved parts of the state could be.
And that's just
computationally intractable.
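The belief update at the heart of that reduction can be sketched in a few lines. Here theta is the hidden part of the state (which pie Alice wants), and Alice's observed action plays the role of theta-dependent dynamics. The 0.8/0.1 likelihoods and the helper names are assumptions for illustration, not the paper's model.

```python
THETAS = ["apple", "blueberry", "cherry"]  # hypothetical values of theta


def human_policy(theta, action):
    """Assumed P(Alice picks up `action` | theta).

    She reaches for the matching fruit 80% of the time and for each of
    the other two 10% of the time (made-up numbers).
    """
    return 0.8 if action == theta else 0.1


def update_belief(belief, observed_action):
    """One Bayesian update of the belief over the hidden theta."""
    unnormalized = {t: belief[t] * human_policy(t, observed_action) for t in belief}
    total = sum(unnormalized.values())
    return {t: p / total for t, p in unnormalized.items()}


def num_hypotheses(num_binary_preferences):
    """Size of the belief if theta is a vector of yes/no preferences."""
    return 2 ** num_binary_preferences


belief = {t: 1 / 3 for t in THETAS}
belief = update_belief(belief, "cherry")
print(belief)
print(num_hypotheses(30))
```

Each update touches every hypothesis once, so the cost is tied to the number of candidate reward functions, and that number grows exponentially if theta is made of many independent preferences.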
So do you plan to work on this, on
this particular line of work further?
I think I don't plan to do further
direct research on this myself.
I still basically agree with the
point of the paper, which is look
when you're building your AI systems,
they should be reasoning more
in the way
that the assistance paradigm suggests,
where there's this integrated
reward learning and control,
and they shouldn't be reasoning
in the way that the value
learning paradigm suggests, where
you first figure out what human
values are and then optimize for them.
And so I think that point is a
pretty important point and will
guide how we build AI systems in
the future, or it will guide
what we have our AI systems do.
And I think I will continue to push for
that point, including in projects
at DeepMind, but I probably
won't be doing more technical
research on the math in those papers
specifically, because I think I've
said the things that I wanted
to say.
There's still plenty of work that
one could do, such as trying to
come up with algorithms to directly
optimize the maths that we wrote down.
But that seems less
high-leverage to me.
Moving to the next paper, On the
Utility of Learning about Humans for
Human-AI Coordination. That was Carroll
et al. with yourself as a coauthor.
Um, can you tell us the
brief, uh, general idea here?
I think this paper was written
in the wake of some pretty
big successes of self-play.
So self-play, or
very similar variants, is the
algorithm underlying OpenAI Five,
which plays Dota; AlphaStar, which
plays StarCraft; and AlphaGo and AlphaZero,
which play, you know, Go, chess, shogi,
And so on at a superhuman level, these
were like some of the, yeah, some of the
biggest results in AI around that time
and sort of suggested that like self
play was going to be a really big thing.
And the point we were making in this
paper is that self-play works well when
you have a two-player
zero-sum game, which is a
perfectly competitive game, because
it's effectively going to cause you to
explore the full space of strategies.
If you're playing against
yourself in a competitive game, and there's
any flaw in your strategy, then gradient
descent is going to push you in
the direction of exploiting that
flaw, because you're
trying to beat the other copy of you.
So you're always going to get better.
In contrast, in common payoff
games, which are the most collaborative
games, each agent gets the
same payoff no matter what happens,
though that payoff can vary.
There, you don't have this
similar incentive.
You don't have any
incentive to be unexploitable.
All you want is to come up with
some policy that, if played against
yourself, will get the maximum reward,
but it doesn't really matter
if you would play badly with
somebody else, like a human.
If that were true, it wouldn't come
up in self-play. Self-play would be
like, nah, in every single game
you play, you got the maximum reward.
There's nothing to do here.
So there's no force that's
causing you to be robust to all of the
possible players that you could have.
Whereas in the competitive game, if
you weren't robust to all of the
players that could possibly arise,
then you're exploitable in some way,
and then gradient descent is
incentivized to find that exploit, after
which you have to become robust to it.
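A toy pair of matrix games shows the asymmetry described here. Both games, the action names, and the deterministic policies are hypothetical stand-ins chosen to keep the sketch tiny; they are not the Overcooked setup from the paper.

```python
ACTIONS = ["left", "right"]


def common_payoff(a, b):
    """Coordination game: both players get 1 if they match, else 0."""
    return 1.0 if a == b else 0.0


def zero_sum_payoff(a, b):
    """Matching pennies: payoff to player 1; player 2 gets the negative."""
    return 1.0 if a == b else -1.0


def self_play_score(policy):
    """Reward when a deterministic policy is paired with a copy of itself."""
    return common_payoff(policy, policy)


def cross_play_score(policy, partner):
    """Reward when the same policy meets a partner with another convention."""
    return common_payoff(policy, partner)


def exploitability(policy):
    """Best payoff player 2 can get against a fixed player 1 policy."""
    return max(-zero_sum_payoff(policy, b) for b in ACTIONS)


print(self_play_score("left"))            # already maximal in self-play
print(cross_play_score("left", "right"))  # fails with a different partner
print(exploitability("left"))             # zero-sum: the copy can exploit this
```

In the common payoff game the self-play score is already at its maximum, so training sees nothing to fix, while in the zero-sum game the opposing copy gains by finding the weakness.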
Is there any way to reformulate it so
that there is that competitive pressure?
You can actually do this.
And so I know you've had Michael
Dennis, and I think also Natasha
Jaques, on this podcast before,
and both of them are doing work
that's kind of like this,
with PAIRED, right?
That was Jaques and...
So the way you do it is you just
say, all right, we're going to make
the environment our competitor. The
environment is going to try and
make itself super complicated in a
way that defeats whatever policy
we were trying to use to coordinate.
And so then this makes sure that
you have to be robust to whichever
environment you find yourself in.
So that's one way to get
robustness. Well, it's getting
you robustness to environments.
It's not necessarily getting
robustness to your partners.
Um, if, for example,
you wanted to cooperate with a
human. But you could do a similar
thing there, where you say we're going
to also take the partner agent and
we're going to make it be adversarial.
Now this doesn't work great if you
literally make it adversarial,
because in many
interesting collaborative games,
like Overcooked, which is
the one that we were studying here,
if your partner is an adversary,
they can just guarantee
that you get minimum reward.
It's often
not difficult.
In Overcooked, you just
stand in front of the station where
you deliver the dishes that you've
cooked and you just stand there.
And that's what the adversary does.
And then the agent is just like,
well, okay, I can make a soup,
but I can never deliver it.
I guess I never get the reward.
So that
naive, simple approach doesn't quite
work, but instead you can
try to have a slightly more
sophisticated method where,
instead of being an adversarial
partner, it's a partner that is
trying to keep you on the
edge of your abilities.
And then,
once your agent learns how to
do well with your
current partner, the partner
tries to make itself a bit harder
to play with.
And so on.
So there are a few
papers like this that I'm currently failing
to remember, but there are papers
that try to do this sort of thing.
I think many of them did end up just
following both the self-play
work and this paper of ours.
And basically I think you're right.
You can in fact do some clever
tricks to make
things better and to get around this.
It's not quite as simple
and elegant as self-play.
And I don't think the results are quite
as good as you get with self-play.
Cause it's still not
exactly the thing that.
So now we have a contributed question,
which I'm very excited about, from Dr.
Natasha Jaques, senior research scientist
at Google AI and postdoc at Berkeley.
And we were lucky to have Natasha
as our guest on episode one.
So Natasha asks: the most
interesting questions are about why
interacting with humans is so much
harder, slash so different, than
interacting with simulated RL agents.
So Rohin, what is it about humans
that makes them harder?
Yeah, there are a bunch of factors here,
maybe the most obvious one and probably
the biggest one in practice is that you
can't just put humans in your environment
to do like a million steps of gradient
descent on, uh, which often we do in
fact do with our simulated RL agents.
And so, if you could just somehow
put a human in the loop
for a million episodes, maybe then the
resulting agent would in fact just be
really good at coordinating with humans.
In fact, I might take out
the "maybe" there, and I will
actually predict that that resulting
agent will be good with humans,
as long as you had
reasonable diversity in the humans
that you had to collaborate with.
So my first and biggest answer is.
You can't get a lot of data from
humans in the way that you can
get a lot of data from simulated
RL agents, uh, or equivalently.
You can't just put the human into the
training loop the way you can put a
simulated RL agent into the training loop.
Uh, so that's answer number one.
And then there is another answer,
which seems significantly less important,
which is that humans are,
sorry, are significantly more
diverse than simulated RL agents.
Typically humans don't
all act the same way.
Even an individual human
will act pretty differently
from one episode to the next.
Humans will learn over time.
And so not only is
their policy kind of
stochastic, but their policy isn't
even stationary. That policy changes
over time as they learn how to play
the game and become better at it.
Um, and that's another thing:
usually RL assumes that
episodes are drawn IID,
and that is not in fact true here. Because
of this non-stationarity and
stochasticity and diversity,
you would imagine that you have
to get a much more robust policy
in order to work with humans instead
of working with simulated RL agents.
And so that
ends up being harder to do.
Sometimes people try to like take
their simulated RL agents and
like make them more stochastic
to be more similar to humans.
Um, for example, by like maybe taking a
random action with some small probability.
And I think usually this
ends up still looking kind of
artificial and forced
when you look at the resulting
behavior, such that it still doesn't
require that robust a policy in
order to collaborate well with those
agents. And humans are just
more challenging than that.
Let's briefly move to the next
paper, Evaluating the Robustness
of Collaborative Agents.
That was Knott et al. with
yourself as a co-author.
Can you give us the short version
of what this paper is about?
Like we just talked about, in
order to get your agents to work well
with humans,
they need to learn a pretty robust policy.
And so one way of measuring how good your
agents are at collaborating with
humans is, well, you just have them
play with humans and see how well that
goes, which is a reasonable thing to do.
And people should definitely do
it, but this paper proposed a
maybe simpler and more reproducible
test that you can run more often,
which is just, I mean, it's
the basic idea from software
engineering: it's just a unit test.
And so it's a very simple idea.
The idea is just write some unit tests
for the robustness of your agents. Write
some cases in which you think the correct
action is unambiguously clear, in cases
that you maybe expect not to come up
during training, and then
just see whether your agent does in fact
do the right thing on those inputs.
And if your
agent passes all of those tests,
that's not a guarantee that it's robust.
But if it fails some of those
tests, then you've definitely
found some failures of robustness.
I think in practice, uh, the agents that
we tested all like failed many tests.
Yeah, I don't remember
the exact numbers off the
top of my head, but I think.
Some of the better agents were
getting scores of maybe 70%.
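The unit-test idea might look roughly like the following. The state fields, action names, and the deliberately brittle `agent_policy` are all hypothetical; a real suite would wrap a trained policy and an actual environment state.

```python
def agent_policy(state):
    """Stand-in for a trained agent (hypothetical and deliberately brittle).

    It delivers whenever it holds soup, even if its partner is standing
    in front of the delivery station.
    """
    if state["holding"] == "soup":
        return "deliver"
    return "cook"


# Each test: (name, hand-built state, the unambiguously correct action).
TESTS = [
    ("delivers when path is clear",
     {"holding": "soup", "partner_blocking": False}, "deliver"),
    ("reroutes when partner blocks",
     {"holding": "soup", "partner_blocking": True}, "go_around"),
    ("cooks when empty-handed",
     {"holding": None, "partner_blocking": False}, "cook"),
]


def run_robustness_tests(policy, tests):
    """Return the pass rate and the names of the failed tests."""
    failures = [name for name, state, expected in tests
                if policy(state) != expected]
    pass_rate = 1 - len(failures) / len(tests)
    return pass_rate, failures


pass_rate, failures = run_robustness_tests(agent_policy, TESTS)
print(pass_rate, failures)
```

As noted above, a perfect pass rate would not prove robustness, but every entry in `failures` is a concrete robustness failure.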
Could we kind of say that this
is related to the idea
of sampling from environments
outside of the train distribution,
samples that are related to the
distribution that the agent would
encounter after it's deployed?
Would you phrase it that
way, or is it going in a
different direction?
I think that's pretty close.
I would say basically everything
about that seems correct.
Except the part where you say
that it's probably going
to arise in the test distribution.
I think usually I just wouldn't
even try to check
whether or not it would
appear in the test distribution.
I guess
that's very hard to do.
If
you knew how the test distribution was
going to look and in what way it was
going to be different from the train
distribution, then you should just change
your train distribution to be the test
distribution. But the fundamental
challenge of robustness is usually that
you don't know what your test distribution
is going to look like.
So I would say it's more that
we try to deliberately find situations
that are outside the training distribution,
but where a human would agree that
there's one unambiguously correct
answer, and test it on those cases.
Maybe this will lead us to be too
conservative, because actually
the test was in a state that will never
actually come up in the test distribution.
But given that it seems very hard
to know that, I think it's still
a good idea to be writing these tests
and to take failures fairly seriously.
And this paper mentions
three types of robustness.
Can you, um, briefly touch
on, on the three types?
So this is basically a categorization
that we found helpful in generating
the tests, uh, and it's, uh, somewhat
specific to reinforcement learning agents.
So the three types were state robustness,
which is, um, a case where like, basically
these are test cases on which the
main thing that you've changed is the
state in which the agent is operating.
Then there's agent robustness, which
is when one of the other agents
in the environment exhibits
some behavior that's
unusual and not what you expected.
And then that can further be
decomposed into two types. There's
agent robustness without memory,
where
the test doesn't require the
AI system to have any memory.
There's a correct action
that seems determinable even if
the system doesn't have a memory.
So this might be what you want to use
if, for some reason, you're using
an MLP or a CNN as your architecture. And
then there's agent robustness with memory,
which is where the distribution shift
comes from a partner agent
in the environment doing something
where you have to actually
look at the behavior over time, notice
that something is violating what
you expected during training, and then
take some corrective action as a result.
So there you need memory
in order to understand
how the partner agent is doing
something that wasn't what you expected.
And then I guess when we're
dealing with a high dimensional
state, there's just a ridiculous
number of permutations of situations.
And we've seen in the past that
deep learning especially can be really
sensitive to small, seemingly meaningless
changes in this high dimensional state.
So how could we
possibly think about scaling this
up to a point where we don't
have to test every single thing?
I think that basically this particular
approach, you mostly just shouldn't
try to scale up in this way.
It's more meant to be a first
quick sanity check that is already
quite hard to pass for current systems,
where you're talking scores like 70%.
I think once you get to scores
like 95, 99%, then it's
like, okay, that's the point to
start thinking about scaling
up. But suppose we got there.
What do we then do?
I don't think we really want to scale up
the specific process of humans
think of tests, humans write down tests,
then we run
those on the AI system.
I think at that point, uh, we want to
migrate to a more like alignment flavored,
uh, viewpoint, which I think we were going
to talk about in the near future anyway.
Uh, but to give, uh, give some
advance, uh, to talk about
that a little bit in advance.
I think once we like scale up, we want
to try and find cases where the AI system
does something bad that it knew was bad.
It knew that it wasn't the thing
that its designers intended.
And the reason that this allows you
to scale up is because now you can.
Go and inspect the AI system and try to
find facts that it knows and like leverage
those in order to create your test cases.
And one hopes that the set
of things that the AI knows is
still plausibly a very large space, but
hopefully not an exponentially growing
space, the way the state space is. And
the intuition for why this is okay
is that, like, yes, the AI system
may end up having accidents
that wouldn't be caught if we were only
looking for cases where the AI system
made a mistake that it knew was a mistake.
But like, usually those
things aren't that bad.
Uh, they can be if your AI system is like
in a nuclear power plant, for example,
or, uh, in some like, uh, in a weapon
system, perhaps, but like in many cases,
it's not actually that bad for
your AI system to make an accidental error.
The really bad errors are the ones where
the system is, like, intentionally making
an error, uh, or doing something that is
bad from the perspective of the designers.
Those are, those are like
really bad situations and you
don't want to get into them.
And so I'm most interested in like
thinking of like how we can avoid that.
Uh, and so then you can, like, try
to leverage the agent's knowledge
to construct inputs that
you can then test the AI system on.
So this is a great segue
to the alignment section.
Um, so how do you define
alignment in AI?
Maybe I will give you two definitions,
uh, that are like slightly
different, but mostly the same.
So one is that an AI system is
misaligned, so not aligned, uh,
if it takes actions that it knew
were against the
wishes of its designers.
That's basically the definition that
I was just giving earlier. A different,
more positive definition of AI alignment
is that an AI system is aligned
if it is trying to do what its, uh,
designers intended for it to do.
And is there some, um, agreed
upon taxonomy of like top
level topics in alignment?
Um, like how does it relate to
concepts like AI safety and human
feedback, the different things
that we talked about today?
How do we, how would we, uh, arrange
these at a kind of high level?
There is definitely not a
canonical taxonomy of topics.
There's not even a canonical definition.
So like the one I gave doesn't include
the problem, for example, of how you
resolve disagreements between humans,
on what the AI system should do.
It just says, all right, there are
some designers, they wanted something.
That's what the AI system
is supposed to be doing.
Uh, and it doesn't talk about
like, all right, the process
by which those designers decide
what the AI system is intended to do.
That's like not, not a part of
the problem as I'm defining it.
It's obviously still an important problem.
Just like not part of this definition,
uh, as I gave it, but other people
would say, no, that's a bad definition.
You should include that problem.
So there's not even a canonical definition.
So I think I will just give you maybe
my taxonomy of alignment topics.
So in terms of how alignment
relates to AI safety, uh, there's
this sort of general big picture
question of, like, will AI
be beneficial for humanity,
which you might call AI safety or
AI beneficialness or something.
And that you can break down into a
few, uh, possible categories.
I quite like, I'm gonna forget
where this taxonomy comes
from, but I like the taxonomy into
accidents, misuse, and structural risks.
So accidents are exactly
what they sound like.
Accidents happen when an AI system
does something bad and nobody intended
for the AI system to do that thing.
Um, misuse is also exactly
what it sounds like.
It's when somebody gets an AI
system to do something, and the
thing that they got the AI system to do was
something that we didn't actually want.
So think of, like, terrorists,
um, using AI systems
to, like, assassinate people. Uh, and
structural risks are maybe less obvious
than the previous two, but structural
risks happen when, you know, as
we infuse AI systems into our economy,
do any new sorts of problems arise?
Do we get into, like, races
to the bottom on safety?
Do we have, like, a
whole bunch of increased economic
competition that causes us
to sacrifice many of our
values in the name of productivity?
Uh, stuff like that.
So that's like one starting categorization:
accidents, misuse, structural risk. And
within accidents, you
can then further separate into,
uh, accidents where the system knew that
the thing that it was doing was bad, and
accidents where the system didn't know
that the thing that it was doing was bad.
And the first one is AI alignment,
according to my definition, which
again is not a canonical definition. I
think it's maybe the most common
definition, but it's, like, not canonical.
So that was like how alignment relates
to AI safety. And then, like, how does the
stuff we've been talking about today
relate to alignment?
Again, people will disagree with me on this.
But according to me, the way to build
aligned AI systems, in the sense
of AI systems that don't
take bad actions that they knew were
bad, is that you use a lot of human
feedback to train your AI system,
where, like, the human feedback, you
know, it rewards the AI system when
it does stuff that humans want
and, uh, punishes the AI system
when the system does things that
the human doesn't want. This
doesn't solve the entire problem.
You basically then just want
to, like, make the people
providing your feedback as powerful as possible,
make them as competent as possible.
So maybe you could do some
interpretability with the model that
you're training, um, in order to
like, understand how exactly it's like
reasoning, how it's making decisions,
you can then feed that information to
the humans who are providing feedback.
And thus, this can then maybe allow them
to, uh, not just select AI systems that
get the right outcomes, but now they
can select AI systems that get the
right outcomes for the right reasons.
And that can help you get more robustness.
Uh, you could imagine that you have
some other AI systems that are in
charge of like finding new hypothetical
inputs on which the system that
you're training takes a bad action.
Um, so, like, these AI systems are
like, here's this hypothetical,
uh, here's this input on which your
AI system is doing a bad thing.
And then the humans are
like, oh, that's bad.
Let's put it in the training data set, um,
and give good feedback on it and so on.
So then I think BASALT would be
maybe the most obviously connected
here where it was about how do you just
train anything with human feedback,
which is obviously a core thing I've
been talking about in this plan.
Um, Preferences Implicit
in the State of the World,
it's less clear how that relates here.
I think that paper makes
more sense in a plan
that's more like traditional
value alignment, where your AI
system, like, has an
explicit distribution over theta
that it's updating by evidence.
So I think that one is less relevant
to this description.
The Benefits of Assistance
paper is, I think,
primarily a statement about
what the AI system should do.
And so like what we want our human
feedback providers to be doing is to be
seeing, hey, is this AI system, like,
thinking about what, uh, its users
will want? Um, if it's uncertain about
what the users will want, does it, like,
ask for clarification or does it just,
like, guess? Um, we probably want it to ask
for clarification rather than guessing
if it's a sufficiently important thing.
Uh, but if it's like some probably
insignificant thing, then it's
like, fine, it can guess.
And so through the human feedback
you can then, like, train a system that's
being very assistive. The Overcooked
paper, uh, On the
Utility of Learning about
Humans for Human-AI Coordination,
uh, that one is, I think, not that
relevant to this plan, unless you
happen to be building an AI system
that is playing a collaborative game.
The evaluating the robustness paper is,
uh, more relevant in that, like, part
of the thing that these human feedback
providers are going to be doing is to,
like, be constructing
inputs on which the
AI system, uh, behaves badly and then
training the AI system not to behave badly
on those inputs. Uh, so in that sense,
it, uh, also fits
into this overall story.
Can you mention a bit about
your alignment newsletter?
Um, like what, what, how do you, how
do you define that newsletter and
how did you, how did you start that?
And what's happening with the newsletter now?
The alignment newsletter is
supposed to be a weekly newsletter
that I write that summarizes
just recent content
relevant to AI alignment.
It has not been very weekly in the
last couple of months because I've
been busy, but I do intend to go back
to making it a weekly newsletter.
I mean, the origin story is kind of funny.
It was just, this was while I
was a PhD student at the Center for
Human-Compatible AI at UC Berkeley.
Uh, we were just discussing that, like,
there were a lot of papers that were
coming out all the time, uh, as people
will probably be familiar with and it
was hard to keep track of them all.
Um, and so someone suggested
that, Hey, maybe we should have
a rotation of people who just,
uh, search for all of the new papers
that have arrived in the past week
and just send an email out
to everyone, just, like,
giving links to those papers,
so other people don't have
to do the search themselves.
And I said, like, look, I, you
know, I just do this every week.
I'm just happy to take on this job.
Sending, uh, one email with a
bunch of links is not hard. Uh, we don't
need to have this rotation of people.
Um, so I did that internally to CHAI.
Uh, then like, you know, a couple of
weeks later, I like added a sentence
that was telling people, Hey, this is
what this is like the topic, um, here
is, you know, maybe you should read it
if you are interested in X, Y, and Z.
Uh, and so that happened for a while.
And then I think I started writing
slightly more extensive summaries
so that people didn't have to read the
paper, uh, unless it was something they
were particularly interested in. Uh, and
right around that point, people were
like, this is actually quite useful.
You should make it public.
Uh, and then I, like, tested it a bit
more, um, maybe for another, like,
three to four weeks internally to CHAI.
And then, um, after
that I released it publicly.
Uh, it still did undergo
a fair amount of improvement.
I think maybe after like 10 to 15
newsletters was when it felt more stable.
And now it's like, apart from the
fact that I've been too busy to do it
recently, it's been pretty stable for
the last, I don't know, two years or so.
Well, uh, to the audience, I
highly recommend the newsletter.
And, uh, like I mentioned, you know,
when I first met you and heard about
your alignment newsletter early
on, at that point, I really wasn't,
um, I didn't really appreciate the
importance of alignment, uh, issues.
And I gotta say that really
changed for me when I read the
book Human Compatible by Professor
Stuart Russell, who I gather is
one of your PhD advisors.
And so that book really helped
me appreciate the importance
of alignment related stuff.
And it was part of the reason that I
sought you out to interview you.
So I'm happy to recommend,
to plug that book to the audience.
Uh, Professor Russell's awesome.
And it's a very well-written book
and, uh, and full of great insight.
I also strongly recommend this book.
And since we're on the topic of the
alignment newsletter, you can read
my summary of, uh, Stuart Russell's
book in order to get a sense of
what it talks about, uh, before
you actually make the commitment of
actually reading the entire book.
Um, so you can find that on my
website under Alignment Newsletter;
there's a list of past issues.
I think this was newsletter edition 69.
Not totally sure, you can check that.
And what was your website again?
It's just my first name and last name.
I highly recommend doing
that, um, to the audience.
And so I wanted to ask you about how,
you know, how alignment work is done.
So a common pattern that, you know,
we might be familiar with in
many ML papers is to show a new
method and show some experiments.
Um, but is work in
alignment fundamentally different?
Like what does the work
entail in, in alignment?
Is there a lot of thought experiments
or how would you describe that?
Uh, there's a big variety of things.
So some alignment work, um, is
in fact pretty similar to, uh,
existing, uh, typical ML work.
Um, so for example, there's
a lot of alignment work
that's like, can we make
human feedback algorithms better?
Uh, and you know, you start with
some baseline and some task or
environment in which you want to
get an AI system to do something.
And then you like try to improve
upon the baseline, using some
ideas that you thought about.
Uh, and like, you know, maybe
it's somewhat different because
you're using human feedback.
Whereas typical ML research
doesn't involve human feedback, but
that's not that big a difference.
It's still like mostly the same skills.
Uh, so that's probably the kind that's
closest to existing ML research.
There's also like a lot of
interpretability work, which again is
just like working with actual machine
learning models and trying to figure
out what the heck they're doing.
That also seems pretty, it's like not the same
thing as, like, get better performance
on this task, but it's still, like,
pretty similar in general feel to, like,
some parts of machine learning.
So that's like one
type of alignment research.
And then there's, you know, on the
complete other side, there is a
bunch of stuff where
you think very abstractly about what
future AI systems are going to look like.
So like, maybe you're like, all right,
maybe you think about some story
by which AGI might arise.
Like, we run such and such algorithm,
maybe with some improvements
in various architectures,
with, like, such and such data,
and it turns out
you can get AGI out of this.
Uh, then you maybe like think
in this hypothetical, okay.
Uh, does this AGI end
up getting misaligned?
If so, how does it get misaligned?
Um, well, you tell that story and then you're
like, okay, now I have a story of, like,
how the, uh, AGI system was misaligned.
What would I need to do in order to
like, prevent this from happening?
Um, so you can do like pretty elaborate,
uh, conceptual thought experiments.
I think these are usually good as a
way of ensuring that the things that
you're working on are actually useful.
I think there are a few people
who do these sorts of conceptual
arguments, almost always.
And do them well, such that I'm
like, yeah, this stuff they're
producing, I think is probably
going to matter in the future.
But I think it's also very easy
to end up not very grounded in
what's actually going to happen.
Such that you end up saying things that
won't actually be true in the future.
And notably,
there is some reasonably easy to find
argument today that could convince
you that the things you're saying are
not going to matter in the future.
So it's pretty hard to do this
research because of the lack of
actual empirical feedback loops.
But I don't think it's doomed.
Um, I think people do in fact get, um,
some interesting results out of this.
And often
the best results out of this line
of work, uh, usually seem better to
me than the results that we get
out of the empirical line of work.
So you mentioned your newsletter,
and then there's the Alignment Forum.
If I understand, that
sprang out of LessWrong.
Is that right?
I don't know if I would say
it sprang out of LessWrong.
It was meant to be at least somewhat
separate from it, but
it's definitely affiliated with
LessWrong, and, like, everything on
it gets cross-posted to LessWrong.
And so these are pretty
I mean, from my point of view, um,
but to the audience who maybe is just
getting started with these ideas, can
you recommend, uh, you know, a couple
of resources that might be good for
them to get like an on-ramp for them?
Um, I guess including
Human Compatible, but anything
else you'd want to mention?
So Human Compatible is a
pretty good suggestion.
Um, there are other books as well.
Um, so Superintelligence is
more on the philosophy side.
Uh, The Alignment Problem by Brian
Christian
has a little bit less on, like,
what might solutions look like.
It has more of the, like, intellectual
history behind how these
concerns started arising. Life
3.0 by Max Tegmark,
I don't remember
how much it talks about alignment.
I assume it does a decent amount.
Uh, but that's another
option. Apart from books,
I think, so the Alignment Forum
has, um, sequences of blog posts
that don't require
quite as much, um, technical depth.
So for example, it's got the
value learning sequence, which
I, well, which I half wrote, half
curated other people's posts.
Um, so I think that's a good introduction
to some of the ideas in alignment.
Uh, there's the embedded agency
sequence, also on the Alignment
Forum, and the iterated amplification
sequence on the Alignment Forum.
Oh, there's an
AGI safety fundamentals course.
And you can just Google it.
It has a publicly available curriculum,
I believe. I think really, ignore all
the other suggestions, look at that
curriculum, and then read things on
there, is probably actually my advice.
Have you seen any, uh, depictions
of, of alignment issues in science
fiction, or, um, do these ideas
come up for you when
you watch or read sci-fi?
They definitely come up to some extent.
I think there are many ways in which
the depictions aren't realistic, but
like, they do come up. Or I guess even
outside, or just, uh, even mythology,
like the whole Midas touch thing seems
like a perfect example of misalignment.
The King Midas example is a good example.
Those are good examples.
If you expand to include
mythology in general, I feel
like it's probably everywhere.
Um, especially if you include things
like, you asked for something and got
what you literally asked for,
but not what you actually meant.
That's really common, isn't it?
I mean, we've got, like, I
could just take any story
about genies, and probably
this will feature.
Um, so they really started the, uh,
alignment, uh, literature back then.
I guess it's thousands of years old.
The problem of, there are two people,
one person wants the other person
to do something, that's just, like,
a very important, fundamental
problem that you need to deal with.
There's like tons of stuff also
in economics about this, right?
The principal-agent problem. And,
like, the alignment problem is
not literally the same thing.
In the principal-agent problem,
it's assumed that the agent already has
some motivation, some utility function,
and you are, like, trying to incentivize
them to do the things that you want.
Whereas in AI alignment,
you get to build the
agent that you're delegating to.
And so you have more control over it.
So there are differences, but, like,
fundamentally, the, like, entity A wants
entity B to do something for entity A
is, like, just a super common pattern that
human society has thought about a lot.
So we have some more questions.
Uh, this is one from Nathan Lambert,
a PhD student at UC Berkeley
doing research on robot learning.
And, uh, Nathan was our
guest for episode 19.
So Nathan says a lot of AI
alignment and AGI safety work
happens on blog posts and forums.
Uh, what's the right manner to draw more
attention from the academic community?
Any comment on that?
I think, um, I think that this is
basically a reasonable strategy where
like, by, by doing this work on blog posts
and forums, people can move a lot faster.
Uh, like, ML is pretty good in
that, uh, like, relative to other
academic fields, you know, it doesn't
take years to publish your paper.
It only takes some months
to publish your paper.
Uh, but with blog posts and forums, it can
be days to talk about your ideas.
Um, so you can move a lot faster if
you're trusting in everyone's ability
to like, understand which work is
good, um, and what to build on.
Uh, and so that's like, I think the
main benefit of blog posts and forums,
but then as a result, anyone who isn't
an expert, correctly, doesn't end up
reading the blog posts and forums,
because it's a little
hard if you're not an expert to extract
the signal and ignore the noise.
So I think then there's, like, a
separate group of people, or I wouldn't say
they're a separate group, but
there's a group of people who then
takes a bunch of these ideas
and then converts them into
more rigorous, uh, and correct,
and academically presented,
um, ideas and papers.
And that's the thing that you can like,
uh, show to the academic community
in order to draw more attention.
In fact, we've just been working
on a project along these lines
at DeepMind, which hopefully will
release soon talking about the
risks from, uh, inner misalignment.
So yeah, I think roughly my story is you
figure out conceptually what you want
to do via the blog posts and forums.
And then you, like, make it rigorous
and have experiments and, like, demonstrate
things with, um, actual examples
instead of hypothetical ones, uh,
in the format of an academic paper.
And that's how you then, like,
make it, um, credible enough and
convincing enough to draw attention
from the academic community.
And then Taylor Killian asks,
uh, Taylor's a PhD student at
U of T and the Vector Institute.
Taylor was our guest for episode 13.
And Taylor asks, how can we
approach the alignment problem when
faced with heterogeneous behavior
from possibly many human actors?
I think my interpretation of
this question is that, you know,
humans sometimes disagree on what
things to value and similarly
disagree on what behaviors they, they
exhibit and want the AI to exhibit.
Um, so how do you get the AI to decide on
one set of values or one set of behaviors?
And as I talked about a little bit
before, I mostly just take this question
as, like, outside of the scope of
the things that I usually think about.
I'm usually thinking
about: the designers have something
in mind that they want the AI system to do.
Did the AI system actually
do that thing, or at least,
is it trying to do that thing?
I do think that this problem is in
fact an important problem, but I think
the solutions are probably going
to be more, like, political, um, or, like,
societal rather than technical, where,
you know, you have to negotiate with
other people to figure out what exactly
you want your AI systems to be doing.
And then you, like, take
that, like, simple spec and you
hand it off to the AI designers.
And then the AI designers
say, all right,
now we will make an AI
system with the spec.
So, so I would say it's like, yeah,
there's a separate problem of like how
to go from human society to something
that we can put inside of an AI.
This is like the domain of a
significant portion of social science.
Uh, and it has technical aspects too.
So like social choice theory, for
example, I think has at least some
technical people trying to do a mechanism
design to, to solve these problems.
And that seems great.
And people should do that.
It's a good problem to solve.
Um, it's unfortunately not one
I have thought about very much,
but I do feel pretty strongly
about the factorization into
one part of, you know, one problem,
which is like, figure out what exactly
you want to put into the AI system.
And then the other part of the problem,
which I call the alignment problem,
which is then how do you take that thing
that you want to put into the system
and actually put it into the AI system.
And Taylor also asks, how do we
best handle bias when learning
from human expert demonstrations?
This is a good question.
And I would say it's an open
question in the field.
So I don't have a great answer to it,
but some approaches that people have
taken: one simple thing is to, uh,
get demonstrations from a wide variety
of humans and hope that, to the
extent that they're making mistakes,
some of those mistakes will cancel out.
You can invest additional effort.
Like you get a bunch of demonstrations
and then you invest a lot of
effort into evaluating the quality
of each of those demonstrations.
And then you can, like, label
each demonstration with,
like, how high quality it is.
And then you can design an algorithm that,
like, takes the quality into account when
learning. Or, I mean, the most simple
thing is you just, like, discard everything
that's too low quality and only
keep the high quality ones.
But, uh, there are some algorithms
that have been proposed that can
make use of the low quality ones
while still trying to get to the
performance of the high quality ones.
Another approach that people have,
um, tried to take is to like, try
and guess what sorts of biases, um,
are present and then try to build
algorithms that correct for those biases.
Uh, so in fact, one of my older papers
looks into an approach, uh, of this form.
Like we did get results that were
better than the baseline, but I don't
think it was all that promising.
Uh, so I mostly did not continue
working on that approach.
So it just seems kind of hard to
like, know exactly which biases,
uh, are going to happen and to
then correct for all of them.
So those are a few thoughts on
how you can try to handle bias.
I don't think we know the
best way to do it yet.
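As a rough illustration of the quality-labeling idea described above, here is a minimal Python sketch. It assumes each demonstration has already been given a quality score between 0 and 1 by human evaluators; the `Demonstration` class, its fields, and both helper functions are hypothetical names made up for this example. The weighting function is only a simple stand-in for "taking quality into account" and is not any of the specific algorithms mentioned in the conversation.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Demonstration:
    """One human demonstration plus a human-assigned quality label (hypothetical schema)."""
    trajectory: Sequence[Tuple[str, str]]  # (state, action) pairs; placeholder types
    quality: float  # assumed scale: 0.0 (poor) to 1.0 (excellent)


def filter_by_quality(demos: List[Demonstration], threshold: float) -> List[Demonstration]:
    """Simplest option: discard every demonstration below the quality threshold."""
    return [d for d in demos if d.quality >= threshold]


def quality_weights(demos: List[Demonstration]) -> List[float]:
    """Alternative: keep every demonstration but weight it by its quality label.

    Weights are normalized to sum to 1 so they could scale per-demonstration losses.
    """
    if not demos:
        return []
    total = sum(d.quality for d in demos)
    if total == 0:
        # Every label is zero: fall back to uniform weights.
        return [1.0 / len(demos)] * len(demos)
    return [d.quality / total for d in demos]


# Toy data, invented purely for illustration.
demos = [
    Demonstration([("s0", "left")], quality=0.9),
    Demonstration([("s0", "right")], quality=0.2),
    Demonstration([("s0", "left")], quality=0.6),
]
kept = filter_by_quality(demos, threshold=0.5)
weights = quality_weights(demos)
```

Filtering is the "discard everything that's too low quality" option; the weights are one way a learner could still draw on the lower-quality demonstrations while trusting them less.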
Thanks so much.
Uh, to Taylor and Nathan and
Natasha for contributing questions.
Um, you can also contribute questions
to our next, uh, interviews,
uh, if you show up on our
Twitter at @TalkRLpodcast.
So we're just about wrapping up here,
a few more questions for you today.
Rohin, what would you say is the
holy grail for your line of research?
I think the holy grail is to
have a procedure for training AI
systems at particular tasks,
um, where we can apply
arbitrary human-understandable constraints
to how the system achieves those tasks.
So for example, we can be like,
we can build an AI assistant that
schedules your meetings, but,
like, ensure
that it's always very respectful
when it's talking to other people
in order to schedule your meetings,
and it's never, like, you
know, discriminating based on
sex or something like that.
Or you can like build an agent that plays
Minecraft and you can just deploy it on
an entirely new multiplayer server that
includes both humans and AI systems.
And then you can say, Hey, you should
just go help such and such player
with whatever it is they want to do.
And the agent just does that.
And it, like, abides by the norms
on the multiplayer
server that it joined. Or
you can build a recommender system
that's just optimizing for what humans
think, uh, is good for recommender
systems to be doing, uh, rather
than optimizing for, say, engagement.
If we think that engagement is a
bad thing to be optimizing for.
So how do you see your, uh,
your research career plan?
Um, do you have a clear roadmap
in mind or are you, uh, doing
a lot of exploration as you go?
I think, I feel more like there's
maybe I wouldn't call it a roadmap.
But there's a clear plan.
Uh, and the plan is, we talked
a bit about it earlier,
the plan is roughly: train models
using human feedback, and then,
like, empower the humans
providing the feedback as much as we
can, um, ideally so that they can know
everything that the model knows and
select the models that are getting the
right outcomes for the right reasons.
I'd say like, that's the plan.
That's like an ideal to which we aspire.
Uh, we will probably not actually
reach it, knowing everything that
the model knows is a pretty high
bar and probably we won't get to it.
But there are like a bunch of
tricks that we can do that get
us closer and closer to it.
And the closer we get to it, the
better, the better we're doing.
Um, and so, like, let us find more
and more of those tricks, find which
ones are the best, see, like,
how costly they are, and so on.
Um, and ideally this just leads
to a significant improvement in our
ability to do these things over time.
Um, I will say though, that it took me
several years to get to this point.
Like, most of the previous
years of my career have in fact been
a significant amount of exploration,
uh, which is part of why, like, not all
of the papers, uh, that we've talked
about so far really fit into the story.
Is there anything else you want
to mention to our audience today?
Um, so I, I'm probably going to start a
hiring round at DeepMind for my own team.
Probably sometime in the next month from
the time of recording; today is March 22nd.
So yeah, please do apply
if you're interested in
working on AI alignment.
Rohin Shah, this has been an absolute
pleasure and a total honor. By
the way, I want to thank you on
behalf of myself and our audience.
Thanks for having me on.
It was really fun to actually
go through all of these papers,
uh, in a single session.
I don't think I've ever done that before.