Natasha Jaques 2

[00:00.000 - 00:11.120]
Talk RL podcast is all reinforcement learning all the time, featuring brilliant guests,
[00:11.120 - 00:17.320]
both research and applied. Join the conversation on Twitter at Talk RL podcast. I'm your host
[00:17.320 - 00:23.440]
Robin Chauhan.
[00:23.440 - 00:27.680]
Dr. Natasha Jaques is a senior research scientist at Google Brain, and she was our first guest
[00:27.680 - 00:32.440]
on the show three and a half years ago on Talk RL episode one. Natasha, I'm super honored
[00:32.440 - 00:36.440]
and also totally stoked to welcome you back for round two. Thanks for being here today.
[00:36.440 - 00:39.960]
Well, thank you so much for having me. I'm stoked to be back.
[00:39.960 - 00:44.200]
So when we did that first interview back in 2019, I remember you were just wrapping up your
[00:44.200 - 00:49.960]
PhD at MIT. And I can tell you've been super busy, and lots of things have been happening
[00:49.960 - 00:55.520]
in RL and AI in general since then. So can you start us off with, what do you feel
[00:55.520 - 01:00.240]
have been the big exciting advances and trends in your field since you completed your PhD?
[01:00.240 - 01:01.240]
[01:01.240 - 01:05.440]
Yeah, well, I think it's kind of obvious, right? I mean, everyone's obsessed with the
[01:05.440 - 01:11.360]
progress in large language models that have been happening, you know, ChatGPT, how the
[01:11.360 - 01:17.280]
API is getting deployed. I think that's kind of it. I mean, image and language models,
[01:17.280 - 01:19.440]
diffusion models, there's so much going on.
[01:19.440 - 01:23.840]
Yeah, like you said, all this buzz around ChatGPT, and reinforcement learning from
[01:23.840 - 01:28.000]
human feedback and the dialogue models in general. And of course, you were really early
[01:28.000 - 01:33.280]
in that space. And a lot of the key OpenAI papers actually cite your work.
[01:33.280 - 01:39.000]
And there's a few of them. Can you talk a bit about how your work in that area relates
[01:39.000 - 01:44.560]
to what OpenAI is doing today and what these models are doing today?
[01:44.560 - 01:53.040]
Sure, yeah. So I guess, like, let me take you back to 2016, when I was thinking about
[01:53.040 - 01:57.920]
how do you take a pre-trained language model, though in that case I was actually looking at
[01:57.920 - 02:04.640]
LSTMs, so early stuff, and fine-tune it with reinforcement learning. And at
[02:04.640 - 02:10.000]
that time, I was actually looking not at language, per se, but at like, music generation and
[02:10.000 - 02:15.600]
even generating molecules that might look like drugs. But I think the molecules
[02:15.600 - 02:20.280]
example is a really good way to see this. So basically, the idea was, we have a
[02:20.280 - 02:25.400]
data set of known molecules, so we could train a supervised model on it and have it generate
[02:25.400 - 02:30.240]
new molecules. But those molecules don't really have like the properties that we want, right?
[02:30.240 - 02:34.960]
We might want molecules that are more easily able to be synthesized as a drug. So we have
[02:34.960 - 02:42.000]
scores that are like the synthetic accessibility of the molecule. But neither thing
[02:42.000 - 02:45.960]
is perfect. If you just train on the data, you don't get optimized molecules. If you
[02:45.960 - 02:50.480]
just optimize for synthetic accessibility, then you would get molecules that are just
[02:50.480 - 02:57.280]
like long chains of carbon, right? So they're useless as a drug, for example. So what you
[02:57.280 - 03:01.040]
can see in this problem is you can use reinforcement learning to optimize
[03:01.040 - 03:05.400]
for drug likeness or synthetic accessibility, but it's not perfect. The data is not perfect.
[03:05.400 - 03:09.600]
So how do you combine both? What we ended up proposing was this approach where you pre-train
[03:09.600 - 03:14.220]
on data, and then you train with RL to optimize some reward, but you minimize
[03:14.220 - 03:18.600]
the KL divergence from your pre-trained policy that was trained on the data. So we call that
[03:18.600 - 03:23.080]
your pre-trained prior. And this approach lets you flexibly combine both supervised
[03:23.080 - 03:28.140]
learning, to get the benefit of the data, and RL, where you optimize within the
[03:28.140 - 03:35.080]
space of things that are probable under the data distribution for
[03:35.080 - 03:39.960]
sequences that have high reward. And so you can see how this is obviously related to what's
[03:39.960 - 03:45.040]
going on with RLHF right now, which is that they pre-train a large language model
[03:45.040 - 03:49.320]
on a dataset. And then they say, let's optimize for human feedback, but we're still going
[03:49.320 - 03:53.580]
to minimize that KL divergence from that pre-trained prior model. So they still end
[03:53.580 - 03:58.080]
up using that technique. And it turns out to be pretty important
[03:58.080 - 04:04.760]
to the RLHF framework. But I was also working on RLHF, the idea of learning
[04:04.760 - 04:10.800]
from human feedback. Around 2019, we took that same KL-control approach, and we actually
[04:10.800 - 04:15.880]
had dialogue models try to optimize for signals that they got from talking to humans in a
[04:15.880 - 04:24.600]
conversation. But what we were doing is, instead of having the humans rate which dialogue
[04:24.600 - 04:30.640]
entries were good or bad, or do the preference ranking that OpenAI is doing with RLHF, we
[04:30.640 - 04:34.720]
wanted to learn from implicit signals in the conversation with the humans. So they don't
[04:34.720 - 04:38.480]
have to go out of their way to provide any extra feedback. What can we get from just
[04:38.480 - 04:44.140]
the text that they're typing? So we did things like analyze the sentiment of the text. So
[04:44.140 - 04:49.320]
if the person sounded generally happy, then we would use that as a positive reward signal
[04:49.320 - 04:54.240]
to train the model. Whereas if they sounded frustrated or confused, that's probably a
[04:54.240 - 04:58.240]
sign that the model is saying something nonsensical, we can use that as a negative reward. And
[04:58.240 - 05:01.760]
so we worked on actually optimizing those kind of signals with the same technique.
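That KL-control objective can be sketched in a few lines. This is a toy numpy illustration, not her actual implementation: the logits, the reward, and the beta weight are all invented for the example.

```python
import numpy as np

def kl_control_loss(policy_logits, prior_logits, reward, beta=0.1):
    """Toy KL-control objective for one action/token position:
    maximize reward while staying close to the pretrained prior.
    Returns the scalar loss  -reward + beta * KL(policy || prior)."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    p = softmax(policy_logits)   # fine-tuned policy over a toy vocabulary
    q = softmax(prior_logits)    # frozen pre-trained prior over the same vocabulary
    kl = np.sum(p * (np.log(p) - np.log(q)))  # KL(policy || prior)
    return -reward + beta * kl

# If the policy hasn't moved from the prior, the KL term vanishes
# and the loss is just the negated reward.
logits = np.array([1.0, 2.0, 0.5])
print(kl_control_loss(logits, logits, reward=1.0))  # -1.0
```

The beta coefficient trades off reward maximization against staying close to the prior; the same KL-penalty term appears in today's RLHF objectives.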
[05:01.760 - 05:06.800]
I mean, it sounds so much like what ChatGPT is doing. Maybe the function approximator
[05:06.800 - 05:10.880]
was a bit different. Maybe the way you got the feedback was a bit different, but under
[05:10.880 - 05:12.680]
the hood, it was really RLHF.
[05:12.680 - 05:17.460]
Well, there's key differences. So OpenAI is taking a different approach than we did
[05:17.460 - 05:23.200]
in our 2019 paper on human feedback, where they train this reward model. So we don't
[05:23.200 - 05:25.800]
do that. So what they're doing is they're saying, we're going to get a bunch of humans
[05:25.800 - 05:32.280]
to rate, which of two outputs is better. And we're going to train a model to approximate
[05:32.280 - 05:37.580]
those human ratings. And that idea is coming from way earlier, like OpenAI's early work
[05:37.580 - 05:42.200]
on deep RL from human preferences, if you remember that paper. And in contrast, the
[05:42.200 - 05:49.960]
stuff I was doing in 2019 was offline RL. So I would use actual human ratings of a specific
[05:49.960 - 05:58.640]
output, and then train on that as one example of a reward. But I didn't have this generalizable
[05:58.640 - 06:02.720]
reward model that could be applied across more examples. So I think there's a good argument
[06:02.720 - 06:07.720]
to be made that the reward-model-training approach actually seems to scale pretty well,
[06:07.720 - 06:09.920]
because you can sample it so many times.
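The reward-model training described here fits a scalar scorer to pairwise human comparisons. A minimal sketch of the standard pairwise loss; the reward values below are invented stand-ins for a reward model's outputs on two candidate responses:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Pairwise (Bradley-Terry style) loss used to fit a reward model
    to human comparisons: -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the model scores the preferred output higher."""
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# Training pushes the preferred output's score above the rejected one's.
print(preference_loss(2.0, 0.0) < preference_loss(0.0, 2.0))  # True
```

Because the trained reward model generalizes across outputs, it can be queried as many times as RL training needs, which is the scaling advantage mentioned above.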
[06:09.920 - 06:15.600]
Can we talk about also the challenges and limits of this approach? So in the last episode,
[06:15.600 - 06:22.800]
38, we featured OpenAI co-founder and inventor of PPO, John Schulman, who did a lot of the
[06:22.800 - 06:28.680]
RLHF work at OpenAI. And he talked about InstructGPT, the sibling model to ChatGPT, because
[06:28.680 - 06:33.120]
ChatGPT wasn't released yet, and there is no ChatGPT paper yet. But the paper explained
[06:33.120 - 06:37.320]
that it required a lot of human feedback. And the instructions for the human raters
[06:37.320 - 06:41.320]
were really detailed and super long. And so there was a significant
[06:41.320 - 06:46.440]
cost in getting all of that human feedback. So I guess I wonder what you think
[06:46.440 - 06:53.280]
about that? Is that cost going to limit how useful RLHF can be? Or is that not
[06:53.280 - 06:55.480]
a big deal, because it's totally worth it?
[06:55.480 - 06:59.400]
Yeah, I mean, that's a great question. And going back and reading the history of papers
[06:59.400 - 07:05.160]
they've been doing on RLHF, even before InstructGPT, like in the summarization stuff, it seems
[07:05.160 - 07:11.680]
like one of the key enablers of getting RLHF to work effectively, is actually investing
[07:11.680 - 07:16.400]
a lot into getting quality human data. They have these two summarization
[07:16.400 - 07:20.000]
papers where one, I guess, wasn't working that well, then they have a follow-up where they
[07:20.000 - 07:23.840]
said one of the key differences was: we just did a better job recruiting raters that were
[07:23.840 - 07:28.400]
going to agree with the researchers, we were taking a high-touch approach of being
[07:28.400 - 07:32.520]
in a shared Slack group with the raters to answer their questions and make
[07:32.520 - 07:37.560]
sure they stay aligned. And that investment in the quality of the data they collected
[07:37.560 - 07:42.440]
from humans was key in getting this to work. So it is obviously expensive. But what I was
[07:42.440 - 07:48.000]
struck by in those papers, and also in InstructGPT, is that, as you'll notice in
[07:48.000 - 07:55.680]
InstructGPT, the 1.3 billion parameter model trained with RLHF is outperforming
[07:55.680 - 08:03.200]
the 175 billion parameter model trained with supervised learning. So a model 100x the
[08:03.200 - 08:08.640]
size is outperformed by just doing some of this RLHF. And obviously training
[08:08.640 - 08:13.520]
a 100x-size model with supervised learning is extremely expensive in terms of compute. So
[08:13.520 - 08:17.160]
I don't think OpenAI released the actual numbers and dollar
[08:17.160 - 08:21.320]
value that they spent on collecting human data versus training giant models. But
[08:21.320 - 08:26.560]
you could make a good case that RLHF actually is cost-effective, because it could reduce
[08:26.560 - 08:28.280]
the cost of training larger models.
[08:28.280 - 08:32.880]
Okay, that part makes sense to me. But then when I think about the, you know, this data
[08:32.880 - 08:40.000]
set that's been collected, they're using the data for on-policy training. From
[08:40.000 - 08:44.920]
what I understand, they're using PPO, which is an on-policy method. And on-policy methods,
[08:44.920 - 08:51.000]
generally, the way I see them, mean you can't reuse the data, because it depends on the
[08:51.000 - 08:56.280]
data being sampled from this model or from a very close-by model. So if you start training on
[08:56.280 - 09:00.960]
this data, and the model drifts away, then is that dataset still going to be useful?
[09:00.960 - 09:04.720]
Or could it ever be used for another model? Are these disposable
[09:04.720 - 09:08.240]
datasets that are only used for that model at one point in time?
[09:08.240 - 09:12.360]
I wouldn't say it's disposable, like I would still use that data, because the data they
[09:12.360 - 09:17.080]
actually use is like comparisons of summaries, and then they use it to train the reward model.
[09:17.080 - 09:20.920]
And so your reward model can be kind of like trained offline in that way and used for your
[09:20.920 - 09:27.480]
policy. But the actual comparisons they do, from my understanding, compare
[09:27.480 - 09:32.200]
not only their current RL model, but also the supervised baseline; they're
[09:32.200 - 09:36.360]
comparing the instructions from the dataset. So you kind of get this general property
[09:36.360 - 09:41.040]
of, is this summary better than another summary? Right. And I think that's kind of
[09:41.040 - 09:45.920]
a reusable truth about the data: you just look at these as general summaries,
[09:45.920 - 09:49.840]
and this is what makes a high quality summary, then why couldn't that apply across different
[09:49.840 - 09:53.800]
models? And that those data sets are totally reusable. And maybe we can cost effectively
[09:53.800 - 09:56.240]
build up these libraries of data sets that way.
[09:56.240 - 10:00.600]
Yeah, to put a finer point on it, the data that they use to train their reward
[10:00.600 - 10:06.040]
model comes from a bunch of models, not just their RL model. So they are using, quote
[10:06.040 - 10:10.680]
unquote, off policy data to train their reward model. And it's working.
[10:10.680 - 10:14.560]
The human feedback is like only valid for a limited amount of training. Like John was
[10:14.560 - 10:19.280]
saying, if you train with that same reward model for too long, your performance ends up
[10:19.280 - 10:23.720]
falling off at some point. So I guess the implication is that you would have to keep
[10:23.720 - 10:27.400]
collecting additional human feedback after every stage, like after you've trained to
[10:27.400 - 10:31.520]
a certain degree to improve it further might require a whole new data set. We don't really
[10:31.520 - 10:34.720]
get into that in the chat with John. But I wonder if you had any comment
[10:34.720 - 10:35.720]
about that part.
[10:35.720 - 10:39.400]
I can't speak as much to what's going on in OpenAI's work. But I can say I observed
[10:39.400 - 10:44.640]
this phenomenon in my own work trying to optimize for reward, but still do something probable
[10:44.640 - 10:49.320]
under the data. And you can definitely sort of over exploit the reward function. So like
[10:49.320 - 10:54.680]
when I was training dialogue models, we had this reward function that would reward the
[10:54.680 - 11:00.200]
dialogue model for having a conversation with a human such that the human
[11:00.200 - 11:03.880]
seemed to be responding positively, and also such that the dialogue model itself was outputting sort
[11:03.880 - 11:09.240]
of high-sentiment text and stuff like that. And we had a very limited amount of
[11:09.240 - 11:12.880]
data. So I think we might have like quickly overfit to the data and the rewards that were
[11:12.880 - 11:18.640]
in it. And what you see is the policy kind of collapse a little bit. So its
[11:18.640 - 11:22.680]
objective is to stay within something that's probable under the data distribution,
[11:22.680 - 11:27.560]
but maximize the reward. Ultimately, even though we're using maximum entropy RL,
[11:27.560 - 11:33.840]
RL is trying to find the optimal policy. So it doesn't really care; it ended up
[11:33.840 - 11:37.440]
having sort of a really restricted set of behaviors where it could get kind of repetitive
[11:37.440 - 11:42.720]
and sort of exploit the reward function. So our agent with those rewards kind of got overly
[11:42.720 - 11:47.520]
positive, polite and cheerful. So I always joke that it was like the most Canadian dialogue
[11:47.520 - 11:57.160]
agent you could train. We can say that because we're two Canadians. Exactly, exactly. But
[11:57.160 - 12:02.320]
yeah, it was kind of collapsing. Like the reward came at a cost of diversity
[12:02.320 - 12:06.560]
in the text that was output. So I wonder if there's something similar going on with their
[12:06.560 - 12:11.800]
results about like training too long on the reward model actually leads to diminishing
[12:11.800 - 12:19.080]
and then eventually like negative returns. And it seems that the reward model isn't perfect.
[12:19.080 - 12:22.280]
If you look at the accuracy of the reward model on the validation data, it's like in
[12:22.280 - 12:27.560]
the seventies or something. So it's not perfectly describing what is quality. So you really
[12:27.560 - 12:32.080]
overfit to that reward model. It's not clear that it's going to be comprehensive enough
[12:32.080 - 12:36.680]
to describe good outputs. I gather that some of your past work in this
[12:36.680 - 12:41.000]
area was like doing RL at the token level, like considering each token as a separate
[12:41.000 - 12:45.320]
action, maybe Sequence Tutor, and also your Way Off-Policy paper. Was that how
[12:45.320 - 12:50.960]
it worked? Was it individual token actions? Yes. But I would mention that so is InstructGPT,
[12:50.960 - 12:56.080]
if you dig into it. So what they end up doing is, it's a little easier
[12:56.080 - 12:59.640]
in policy gradients because you can get the log-probability of the whole sequence by just
[12:59.640 - 13:04.160]
summing the log-probabilities over the individual tokens. But at the end of the day, your loss
[13:04.160 - 13:08.640]
is still being propagated into your model at the token level by increasing or decreasing
[13:08.640 - 13:13.240]
token-level probabilities. Oh, so you're saying, because the paper says that
[13:13.240 - 13:18.160]
it framed it as a bandit. And to me, that meant the entire sample, all the tokens together
[13:18.160 - 13:22.840]
were taken as one action. But you're saying because of the way it's constructed, then
[13:22.840 - 13:28.240]
it still breaks down to the token-level probabilities. Yeah, you can write the math as the reward
[13:28.240 - 13:33.800]
of the entire output times the probability of the entire output.
[13:33.800 - 13:37.640]
But under the hood, the way you get the log-probability of the entire output is a sum of the
[13:33.800 - 13:37.640]
token-level log-probabilities. So the way that's going to actually change the model is to affect
[13:42.600 - 13:47.120]
token-level probabilities. This is why I like having this podcast, because I've had that question
[13:47.120 - 13:51.400]
for a while, like, who's gonna explain this to me? So thank you for clearing
[13:51.400 - 13:56.320]
that up for me, Natasha, that's really cool. No problem. So does that mean there's no benefit
[13:56.320 - 14:00.840]
to looking at a token level? Or like, is it always going to be this way? Because like,
[14:00.840 - 14:05.280]
I think John was saying that it's more tractable to do it this way, as a whole sample.
[14:05.280 - 14:08.920]
So what they're actually doing that might be a little bit different than token-level
[14:08.920 - 14:15.040]
RL normally is that their discount factor is one. So they apply the same reward to all
[14:15.040 - 14:20.800]
of the tokens in the sequence. And there's no discounting, where
[14:20.800 - 14:23.320]
earlier in the sequence you would be discounting the reward you're going to get at the end
[14:23.320 - 14:26.640]
of the sequence. So that
[14:26.640 - 14:30.000]
is a difference. That makes sense. It seems to be working well for them. Yeah, because
[14:30.000 - 14:33.560]
it matters just as much what you say at the end; like if you say "NOT" in capital letters,
[14:33.560 - 14:35.480]
then that's kind of important.
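The equivalence discussed here can be checked numerically. A toy sketch, with invented token probabilities, showing that the "bandit" sequence-level loss and the token-level view give the same number when the discount factor is 1:

```python
import numpy as np

# The log-probability of a whole sampled sequence is the sum of per-token
# log-probabilities, so a REINFORCE-style loss  -R * log pi(sequence)
# spreads the single terminal reward R equally over every token when
# gamma = 1. The token probabilities here are invented numbers.
token_probs = np.array([0.9, 0.5, 0.8])      # pi(token_t | context)
seq_logprob = np.sum(np.log(token_probs))    # log pi(whole sequence)
R = 2.0                                      # sequence-level reward, gamma = 1

loss_sequence_view = -R * seq_logprob                   # "bandit" framing
loss_token_view = np.sum(-R * np.log(token_probs))      # token-level framing
print(np.isclose(loss_sequence_view, loss_token_view))  # True: same loss
```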
[14:35.480 - 14:41.720]
Yeah, exactly. And I think in my work, if I recall correctly, we had experimented with this. So
[14:41.720 - 14:46.080]
we had rewards that were at the sequence level as well, even at the level
[14:46.080 - 14:51.240]
of the whole dialogue. So we had stuff about like, how long does the conversation go on,
[14:51.240 - 14:56.160]
which is of course, across many dialogue turns. And then we had sentence level rewards that
[14:56.160 - 15:00.640]
were spread equally over the tokens in the sentence. But for something like conversation
[15:00.640 - 15:05.240]
length, we did have a discount factor; at the beginning, you aren't sure the conversation is going to go
[15:05.240 - 15:09.160]
on as long as it does. So you discount that reward. But once you're already
[15:09.160 - 15:14.120]
having a long conversation, then the reward is higher. And it was very difficult to optimize
[15:14.120 - 15:17.120]
those discounted rewards across the whole conversation.
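A discounted conversation-length reward like the one described can be sketched as a standard discounted-return computation; the per-turn rewards and the gamma value below are invented for illustration:

```python
# Toy discounted-return computation across dialogue turns: with gamma < 1,
# a conversation-length reward far in the future is worth little at turn 0,
# but more once the conversation is already long.
def discounted_return(rewards, gamma=0.9):
    """Return G_t for each step t, where G_t = r_t + gamma * G_{t+1}."""
    g, out = 0.0, []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

# A single reward of 1.0 when the conversation reaches turn 5:
returns = discounted_return([0, 0, 0, 0, 0, 1.0], gamma=0.9)
print(returns[0] < returns[-1])  # True: the reward is discounted at early turns
```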
[15:17.120 - 15:19.720]
So you combined rewards at different levels?
[15:19.720 - 15:20.720]
Yeah, yeah.
[15:20.720 - 15:25.760]
Which kind of reminds me of this recursive reward modeling. There was a paper from Leike
[15:25.760 - 15:31.600]
et al. out of DeepMind, in 2018. It seems like the idea here is taking this whole
[15:31.600 - 15:37.400]
RLHF further for more complex domains, where we have models that help the
[15:37.400 - 15:43.000]
humans provide the human feedback, and stacking them up. Do you have any thoughts about recursive
[15:43.000 - 15:47.200]
reward models? Do you think that's a promising way forward? Or like, are we gonna need that
[15:47.200 - 15:48.200]
[15:48.200 - 15:52.040]
I mean, so my understanding of their example of like a recursive reward model is the user
[15:52.040 - 15:56.520]
wants to write a fantasy novel, but evaluate like writing a whole novel, and then having
[15:56.520 - 16:00.160]
that evaluated would be very expensive, and you get very little data. So you could have
[16:00.160 - 16:07.760]
a bunch of RLHF trained assistants that do things like check the grammar or summarize
[16:07.760 - 16:11.880]
the character development up to this point or something like that. And that can assist
[16:11.880 - 16:17.440]
the user in doing the task. So I think like, sure, that idea makes sense. If you want to,
[16:17.440 - 16:21.480]
if I were to make a company that's helping people write novels, I would do it at that
[16:21.480 - 16:27.680]
level rather than at the level of the whole novel, right? So so that's definitely cool.
[16:27.680 - 16:32.400]
But in terms of like, pushing forward the boundaries of RLHF, I think what I would bet
[16:32.400 - 16:36.180]
on, and maybe I'm just biased, because this is literally my own work, but I would still
[16:36.180 - 16:42.560]
bet on this idea of trying to get other forms of feedback than just like humans comparing
[16:42.560 - 16:47.040]
to answers and rate like ranking them. So I'm not saying my work is the perfect answer,
[16:47.040 - 16:52.200]
but we were trying to get this type of implicit signal that you're getting during the interaction
[16:52.200 - 16:57.080]
all the time. And so, you know, when you're speaking about, oh, RLHF is so expensive to
[16:57.080 - 17:03.480]
collect the human data. Well, what if you could be getting data for free in any way
[17:03.480 - 17:08.120]
that's pervasively in your interactions? And so it doesn't cost anything additional to
[17:08.120 - 17:12.720]
find it. So like, okay, imagine you're using open AI playground or something to play with
[17:12.720 - 17:19.560]
chat GPT. How many times did you like rephrase the same prompt until you got some behavior
[17:19.560 - 17:24.200]
and then stopped? Yeah, they must be like, could that be it? But not yet. Do you think
[17:24.200 - 17:28.880]
so? I don't know. You would hope so. Because otherwise, how are they going to scale this?
[17:28.880 - 17:32.400]
Like they, they also have thumbs up and thumbs down. But they don't, they kind of have the
[17:32.400 - 17:36.520]
limited feedback though, right? And it's not always about whether the sentiment is good.
[17:36.520 - 17:42.400]
Like you could be wanting to write something scary. Exactly. Yes. Sentiment isn't perfect.
[17:42.400 - 17:45.880]
You could also look at like, okay, I prompt GPT, I get some output. Like if they had a
[17:45.880 - 17:50.360]
way to like edit that output in the editor, which I don't actually know if they do in
[17:50.360 - 17:54.720]
playground, I have to, I have to look at that again. But any edits I made to the text would
[17:54.720 - 17:58.800]
be a signal that I didn't like it, like I need to fix this. So that could be a signal
[17:58.800 - 18:03.000]
you could be training on with RLHF. I feel like that's just going to be more scalable.
[18:03.000 - 18:06.840]
And ultimately, it's not the ground truth of the human rating of quality. But what we
[18:06.840 - 18:11.000]
show in our work, it's like even though sentiment is very and the other stuff, we didn't just
[18:11.000 - 18:14.720]
use sentiment, we use a bunch of stuff. But even though those are imperfect, and only
[18:14.720 - 18:19.720]
proxy measures, optimizing for those things still did better than optimizing for the thumbs
[18:19.720 - 18:24.160]
up thumbs down that we built into the interface, because just no one wants to bother providing
[18:24.160 - 18:28.000]
that. You have to go out of your way out of the normal interaction that you're trying
[18:28.000 - 18:32.880]
to use to like sort of altruistically provide this extra feedback and people just don't.
[18:32.880 - 18:39.060]
So yeah, I think more scalable signals is the right direction. That makes so much sense.
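The kind of implicit signal described above could be sketched like this. Everything here is a toy stand-in: the word lists are not a real sentiment model, and treating an edit as a fixed penalty is purely illustrative.

```python
# Toy sketch of turning implicit conversational signals into a scalar reward
# for RLHF-style training. The lexicons and weights are invented.
POSITIVE = {"great", "thanks", "perfect", "love"}
NEGATIVE = {"wrong", "confused", "nonsense", "no"}

def implicit_reward(user_reply, user_edited_output=False):
    """Score the user's next message as a free implicit reward signal."""
    words = user_reply.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if user_edited_output:  # an edit suggests the model's output needed fixing
        score -= 1
    return float(score)

print(implicit_reward("great thanks"))            # 2.0
print(implicit_reward("no that is wrong", True))  # -3.0
```

The point is that these scores come from interactions users were having anyway, so no extra rating effort is required, at the cost of the signal being a noisy proxy for quality.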
[18:39.060 - 18:43.380]
Are you up for talking about AGI?
[18:43.380 - 18:45.000]
Depends what the question is.
[18:45.000 - 18:48.680]
So first of all, do you think it's something we should be talking about and thinking
[18:48.680 - 18:52.960]
about these days? Or is it a distant fantasy that's just not really worth talking about?
[18:52.960 - 18:53.960]
[18:53.960 - 18:58.000]
Oh, man, I always get a little bit frustrated with like AGI conversations, because nobody
[18:58.000 - 19:02.380]
really knows what they're talking about when they say AGI. Like it's not clear what the
[19:02.380 - 19:07.800]
definition is. And if you try to pin people down, it can get a little bit circular. So
[19:07.800 - 19:12.560]
like, you know, I've had people tell me, oh, AGI is coming in five years, right? And I
[19:12.560 - 19:17.760]
say, okay, well, how do you reconcile that with the fact that CEOs of self-driving car
[19:17.760 - 19:22.880]
companies think that fully autonomous self-driving isn't coming for 20 years? Right?
[19:22.880 - 19:27.040]
So if AGI is in five years, and my definition of AGI might be that it can do everything a human
[19:27.040 - 19:33.560]
can do, but better. That doesn't make sense, right? If it can't drive a car, it's not AGI.
[19:33.560 - 19:37.440]
But then people will say, oh, but it doesn't have to be embodied. And it can still be AGI.
[19:37.440 - 19:42.000]
And okay, but then what is it doing? It's just such a muddy concept,
[19:42.000 - 19:43.000]
[19:43.000 - 19:46.840]
I've also been in these arguments or discussions. And then in the end, we just realized we have
[19:46.840 - 19:51.400]
different definitions. And then there's no point in arguing about two words that mean
[19:51.400 - 19:52.400]
different things.
[19:52.400 - 19:59.240]
All of that aside, I do think I have been really impressed and even a little bit concerned
[19:59.240 - 20:04.160]
about the pace of progress. Like stuff is happening so fast that if you want to just
[20:04.160 - 20:12.360]
define AGI as highly disruptive, fast advancements in AI technology, I think we're already there.
[20:12.360 - 20:18.440]
Right? Like, look at ChatGPT, right? Universities are having to revise their entire curriculum
[20:18.440 - 20:23.320]
around writing take-home essays, because you can just get ChatGPT to write you an essay
[20:23.320 - 20:28.280]
better than an undergrad can. So it's already super disruptive. Like where we are now is
[20:28.280 - 20:29.680]
already super disruptive.
[20:29.680 - 20:35.800]
Yeah, it might not be a do-all-the-jobs AGI. But it's general,
[20:35.800 - 20:40.160]
and to me, ChatGPT is the first thing I've seen that really is so general. Like nothing has
[20:40.160 - 20:45.280]
been that general before, so imagining where that generality could take us in a few years
[20:45.280 - 20:49.880]
does make me think. Your point about the self-driving vehicles is well taken. Like I think
[20:49.880 - 20:54.360]
everyone recognizes it's been a bit of a shit show with people predicting that it's going
[20:54.360 - 20:58.040]
to come in two years and three years and it just keeps getting pushed back and the timelines
[20:58.040 - 20:59.040]
just get longer.
[20:59.040 - 21:03.160]
I think embodiment is really hard. I think fitting the long tail of stuff in the real
[21:03.160 - 21:06.960]
world is really hard. So you might have seen this example. I think Andrej Karpathy
[21:06.960 - 21:14.400]
talked about it for Tesla, where they had an accident because the car couldn't
[21:14.400 - 21:19.480]
perceive this thing that happened, which was a semi truck carrying a semi truck carrying
[21:19.480 - 21:25.640]
a semi truck. So like a truck on a truck on a truck. And they were just like, I hadn't
[21:25.640 - 21:29.400]
even seen that before. It wasn't in the support of the training data. And of course we know
[21:29.400 - 21:34.280]
these models, like if they get off the support of the training data, don't do that well.
[21:34.280 - 21:39.280]
So how will you ever curate a dataset that's going to cover every single thing in the real
[21:39.280 - 21:44.200]
world? I would argue that you can't, especially because the real world is non-stationary.
[21:44.200 - 21:48.760]
It's always changing. So new things are always being introduced. So sort of definitionally,
[21:48.760 - 21:54.960]
you can't cover everything that might happen in the real world. And so, you know, that's
[21:54.960 - 21:58.400]
why I'm excited about some of these approaches. It sounds like you talked about this on a
[21:58.400 - 22:02.760]
previous episode, but I've been working on this adversarial environment
[22:02.760 - 22:07.320]
design stuff or unsupervised environment design stuff for RL agents, where you actually try
[22:07.320 - 22:13.400]
to search for things that can make your model fail, and generate those problems,
[22:13.400 - 22:18.360]
and train on them. And I think that could be an approach that is more tenable than just
[22:18.360 - 22:23.560]
supervised learning on a limited dataset. Totally. Yeah. We spoke with your colleague,
[22:23.560 - 22:28.960]
Michael Dennis, who was a co-author of yours on the PAIRED paper. Is that right? Yes. Yeah,
[22:28.960 - 22:33.600]
exactly. Yeah. And I met him at the poster session at, I think it was ICML. I loved that
[22:33.600 - 22:36.560]
right away. And then I wasn't surprised at all to find your name on it. I didn't know
[22:36.560 - 22:41.160]
that at first. That makes total sense. That's exactly the type of thing Natasha would come
[22:41.160 - 22:45.680]
up with. The idea of embodiment, basically robotics is super hard or anything that has
[22:45.680 - 22:51.280]
to touch real-world sensors. And it seems what ChatGPT has shown us is, if we can stay
[22:51.280 - 22:57.280]
in the abstract world of text, we actually have like magic powers even today in 2022,
[22:57.280 - 23:03.240]
2023. We could do a lot with the techniques we already have if we're staying in
[23:03.240 - 23:10.040]
the world of text and abstract thought, and code, and abstract symbols,
[23:10.040 - 23:14.920]
basically. So maybe it goes back to that point of the real world and robotics
[23:14.920 - 23:18.880]
turning out to be the really hard stuff, the animal intelligence being super
[23:18.880 - 23:23.080]
hard, and the abstract thought that we used to think made us so special turning
[23:23.080 - 23:27.640]
out to be maybe way easier. We've already solved Go, which we thought was impossible not
[23:27.640 - 23:33.480]
long ago. And, and, uh, Chad GPT is doing, showing us a level of generality we could
[23:33.480 - 23:39.800]
not expect from robotics, you know, maybe for ages. Yeah. And I mean, you probably remember
[23:39.800 - 23:43.120]
the name of this principle better than I do, but it's sort of the principle that things
[23:43.120 - 23:47.080]
that are really hard for us to solve, like chess and Go, are actually easy to get
[23:47.080 - 23:51.200]
AI to solve, maybe because we have more awareness of the process. But the most low level
[23:51.200 - 23:54.760]
stuff about, you know, manipulation, like how do you pick something up with your hand,
[23:54.760 - 23:59.760]
is a very challenging problem. [Editor's note: I forgot, so I looked it up afterwards. This
[23:59.760 - 24:03.840]
is Moravec's paradox.] I want to share like my favorite anecdote when thinking about why
[24:03.840 - 24:09.440]
embodiment is so hard. I've been working on this problem of language-conditioned
[24:03.840 - 24:09.440]
RL agents. So they take a natural language instruction, they try to follow it and do
[24:09.440 - 24:13.360]
something in the world, right? And so in that space, I was reading this
[24:13.360 - 24:18.720]
paper from DeepMind, which is Imitating Interactive Intelligence, and they have this
[24:23.080 - 24:27.440]
sort of simulated world where a robot can walk around and it's kind of like a video
[24:27.440 - 24:32.480]
game, like a low res video game kind of environment. So not super high res visuals, but it can
[24:32.480 - 24:36.880]
do things like, um, it'll get an instruction, like pick up the orange duck and put it on
[24:36.880 - 24:41.720]
the bed or pick up the cup and put it on the table or something like that. Right. And they
[24:41.720 - 24:46.400]
invested like two years. There's a team of 30 people. I heard they spent millions of
[24:46.400 - 24:52.960]
dollars on this project, right? They collect this massive human dataset of, um, people
[24:52.960 - 24:58.160]
giving instructions and then trying to follow those instructions in the environment. And
[24:58.160 - 25:02.280]
the dataset they collect is so massive that I think half of the instructions in the dataset
[25:02.280 - 25:06.600]
are exact duplicates of each other. So they'd have two copies of it, pick up the orange
[25:06.600 - 25:11.440]
duck and put it on the table or whatever. Um, and they train on this to the best of
[25:11.440 - 25:16.000]
their ability. And guess what, their success rate in actually following these instructions,
[25:16.000 - 25:19.920]
like guess what percentage of the time they can successfully follow the instructions in
[25:19.920 - 25:24.280]
this environment. I'm just trying to take a cue from you. I vaguely remember
[25:24.280 - 25:29.640]
this paper, but I'm going to guess it was terrible, like 5%. Not 5%, it's 50%. 50%.
[25:29.640 - 25:34.960]
Okay. What do you feel about that number? It is shockingly low, or low for that much
[25:34.960 - 25:40.480]
investment and for a pretty simple problem. It's surprising that they can't
[25:40.480 - 25:45.960]
do better. And I think that just illustrates how hard this is. You know, we've seen that
[25:45.960 - 25:49.840]
you can tie a text and images together pretty effectively. Like we're seeing all of these
[25:49.840 - 25:53.000]
texts to image generation models that are compositional. They're beautiful. They're
[25:53.000 - 25:58.440]
working really well. Um, so I don't think that's the problem, but just like adding this
[25:58.440 - 26:04.200]
idea of navigating a physical body in the environment to carry out the task while perceiving
[26:04.200 - 26:09.520]
vision and linking it to the text just becomes so hard, and it's very hard to get anything
[26:09.520 - 26:10.520]
to work.
[26:10.520 - 26:15.840]
Yeah. 50%. I don't know. It's higher than I thought. But if we look at, say, SayCan:
[26:15.840 - 26:20.240]
we talked to Karol Hausman here a few episodes back, and he was working on
[26:20.240 - 26:27.360]
SayCan, which is the kitchen robot that you can give verbal, which become textual, instructions,
[26:27.360 - 26:31.800]
and it is using RL, and it is actually doing things in a real kitchen, you know, in
[26:31.800 - 26:36.600]
the real world, sponging things up. And, um, a few things struck
[26:36.600 - 26:40.800]
me about that. They were doing something that sounds kind of similar to what you're
[26:40.800 - 26:46.720]
describing, but I was amazed by how much they had to divide up the problem and how
[26:46.720 - 26:51.240]
much work it was to build all the parts, because they had to make separate value functions
[26:51.240 - 26:56.400]
for all their skills. But I think connecting it to the text seemed to be kind
[26:56.400 - 26:57.400]
of the easier part.
[26:57.400 - 27:03.000]
Well, I would argue they actually don't connect text to embodiment.
[27:03.000 - 27:08.520]
So first let me say Karol's an amazing person. He's great. SayCan is such a great paper
[27:08.520 - 27:13.000]
that Google is amazingly excited about it. So I'm actually doing some work that's like
[27:13.000 - 27:18.040]
a follow-up to SayCan, and it's literally the most crowded research area I've ever been
[27:18.040 - 27:22.700]
in. There's so many Google interns working on follow-ups to SayCan; everyone's excited.
[27:22.700 - 27:28.240]
So it's great work, and I'm not trashing it at all, but they actually do separate the
[27:28.240 - 27:33.760]
problem of understanding the language and doing the embodied tasks almost completely,
[27:33.760 - 27:38.080]
because the understanding of the language is entirely offloaded to a pre-trained large
[27:38.080 - 27:44.240]
language model. And then the executing of tasks is trained separately: you train a bunch of
[27:44.240 - 27:49.480]
low level robotic policies that are able to, like, pick something up, and you just select
[27:49.480 - 27:55.120]
which low level robotic policy to execute based on what looks probable under the language
[27:55.120 - 28:00.560]
model and what has the highest value estimate for those different policies. But there's
[28:00.560 - 28:08.080]
no network that's really doing high level language understanding and embodied manipulation
[28:08.080 - 28:13.320]
at the same time. Yeah. I thought it was innovative how they separated that so they didn't really
[28:13.320 - 28:18.560]
have to worry about that. They kind of like offloaded that whole problem to the LLM without
[28:18.560 - 28:21.940]
having the LLM know anything about robotics. It's definitely innovative and it works super
[28:21.940 - 28:27.320]
well and I think that's why the paper is exciting. But it's kind of, to me, like I was really
[28:27.320 - 28:31.920]
excited about this idea of an embodied agent that could really understand language and
[28:31.920 - 28:36.440]
do embodied stuff at the same time because if you think, okay, talking about what is
[28:36.440 - 28:42.440]
AGI, if we just use a definition of something that's like the maximally general representation
[28:42.440 - 28:48.920]
of knowledge, then you should have something that can not only understand text, but understand
[28:48.920 - 28:52.520]
how the text is mapped to images in the world because that's already going to expand your
[28:52.520 - 28:58.400]
representation, but understand how that maps to physics and how to navigate the world.
[28:58.400 - 29:01.880]
And so it'd be so cool if we could have an agent that actually like in the same network
[29:01.880 - 29:06.920]
is encoding all of those things. This is also just really reminding me of why I really like
[29:06.920 - 29:10.640]
talking with you, Tasha, because you're so passionate about this stuff. And also you
[29:10.640 - 29:17.660]
don't pull any punches. You will call a spade a spade no matter what. And you see the big
[29:17.660 - 29:24.480]
picture and you're so critical and sharp. And that's honestly the spirit that I was
[29:24.480 - 29:26.760]
looking for with this whole show.
[29:26.760 - 29:33.440]
I hope I'm not sounding too critical. I mean, I love this work, so.
[29:33.440 - 29:38.500]
I think my feedback on SayCan on a very high level is that they're depending on the language
[29:38.500 - 29:44.200]
model to already know what makes sense in that kitchen. But if they were in an untraditional
[29:44.200 - 29:47.600]
kitchen or they invented a new type of kitchen or they were in some kind of space where the
[29:47.600 - 29:52.980]
language model didn't really get it, then none of that would work. They're depending
[29:52.980 - 29:57.760]
on common sense of the language model to know what order to do things in the kitchen. And
[29:57.760 - 29:59.800]
they're assuming that common sense is common.
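[Editor's note: here is a rough sketch of the SayCan-style selection rule discussed above, as I understand it: score each low-level skill by the product of its likelihood under the language model and its value estimate, i.e. how feasible it looks from the current state. The skill names and numbers below are hypothetical, not from the actual system.]

```python
# Hedged sketch of SayCan-style skill selection: combine the language model's
# probability that a skill is a useful next step for the instruction with the
# skill's learned value estimate (roughly, probability of success from the
# current state). All skills and numbers are made up for illustration.

def select_skill(lm_probs, value_estimates):
    """Pick the skill maximizing p_LM(skill | instruction) * V(skill, state)."""
    scores = {s: lm_probs[s] * value_estimates[s] for s in lm_probs}
    return max(scores, key=scores.get)

# Instruction: "I spilled my drink, can you help?"
lm_probs = {"find sponge": 0.6, "pick up sponge": 0.3, "go to table": 0.1}
# No sponge in the gripper yet, so "pick up sponge" gets a low value estimate.
value_estimates = {"find sponge": 0.9, "pick up sponge": 0.2, "go to table": 0.8}

print(select_skill(lm_probs, value_estimates))  # -> find sponge
```

The product is exactly the separation being discussed: the language model never sees the robot's state, and the value functions never see the instruction; the two only meet in the score.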
[29:59.800 - 30:03.400]
Yeah. And it's hard because they're kind of missing this like pragmatics thing too.
[30:03.400 - 30:07.680]
So humans could give you ambiguous instructions about what to do in the kitchen that could
[30:07.680 - 30:14.240]
only be resolved by looking around the kitchen. Like if they just said, get me that plate.
[30:14.240 - 30:19.240]
And there's multiple plates. How do you resolve that? Well, now you might want to use pragmatics
[30:19.240 - 30:24.640]
about like the plate that's closer to the human or something about like visually assessing
[30:24.640 - 30:27.920]
the environment, and SayCan's not going to be able to do that, right?
[30:27.920 - 30:33.140]
Well, they had the Inner Monologue addition, which added this idea of having other voices.
[30:33.140 - 30:37.200]
And so that might be able to, if they had another voice that was like describing what
[30:37.200 - 30:42.320]
the person's doing or looking at, inject it into the conversation. And inner monologue
[30:42.320 - 30:46.720]
to me seemed very promising. That was the second part of our conversation with Karol
[30:46.720 - 30:51.180]
and Fei. And that was fascinating to me, because this robot has an
[30:51.180 - 30:57.800]
inner monologue going. But that let them leverage the language model and have a lot more
[30:57.800 - 30:59.400]
input into it.
[30:59.400 - 31:00.400]
That's cool.
[31:00.400 - 31:01.520]
And it seemed like an extensible approach.
[31:01.520 - 31:05.320]
That's cool. That can be quite promising. I don't know. I still just want to see a model
[31:05.320 - 31:10.160]
that does vision, text, and embodiment. I'm excited for that when that comes.
[31:10.160 - 31:14.760]
I see that you're planning to return to academia as an assistant professor at U Washington,
[31:14.760 - 31:15.760]
is that right?
[31:15.760 - 31:16.760]
That's right.
[31:16.760 - 31:22.100]
Cool. So that's an interesting choice to me after working at leading labs in the industry.
[31:22.100 - 31:26.760]
And I bet some people might be looking to move the opposite direction, especially a
[31:26.760 - 31:32.040]
lot of people have talked about the challenges of doing cutting edge AI on academic budgets
[31:32.040 - 31:36.960]
when more and more of this AI depends on scale. That becomes very expensive. So can you tell
[31:36.960 - 31:41.880]
us more about the decision? What drew you back to academia? What's your thought process
[31:41.880 - 31:42.880]
[31:42.880 - 31:47.080]
Yeah. I mean, so you might think like, if I want to contribute to AI, I need a massive
[31:47.080 - 31:52.440]
compute budget and I need to be training these large models and how can academics afford
[31:52.440 - 31:56.880]
that? But what I actually see happening as a result of this is that what's going on in
[31:56.880 - 32:02.440]
industry is that more and more people and researchers in industry are being encouraged
[32:02.440 - 32:08.680]
to sort of amalgamate into these large, large teams of 30 or 50 authors where they're all
[32:08.680 - 32:14.320]
just working on what looks more like a large scale engineering effort to scale up a research
[32:14.320 - 32:19.640]
idea that's kind of already been proven out. Right? So you'll see like, there's big teams
[32:19.640 - 32:25.060]
at Google that are now trying to work on RLHF and the RLHF they're doing is very similar
[32:25.060 - 32:28.240]
to what OpenAI is doing. They're just trying to actually scale it up and write their own
[32:28.240 - 32:33.440]
version of infrastructure and stuff like that. And I hear the same thing is going on. It
[32:33.440 - 32:39.080]
already was the case at OpenAI, where they're a little less focused on publishing, a little
[32:39.080 - 32:44.640]
more focused on scaling up in big teams. Apparently there's similar pressure at DeepMind,
[32:44.640 - 32:49.960]
where if you're pursuing your own little creative research direction, that's going to be less
[32:49.960 - 32:55.240]
tenable than actually jumping onto a big team and kind of contributing in that way. So if
[32:55.240 - 33:01.720]
you're interested in doing creative research, novel research that sort of hasn't been proven
[33:01.720 - 33:06.160]
out already and coming up with new ideas and testing them out, I think there's less room
[33:06.160 - 33:11.440]
for that in industry right now. And I actually care a lot about research freedom and the
[33:11.440 - 33:15.760]
ability to kind of like think of a clever idea and test it out myself and see if it's
[33:15.760 - 33:20.360]
going to work. And I think there's a real role for that. Obviously scaling this stuff
[33:20.360 - 33:25.960]
up in industry works really well, but what actually works is they do end up using ideas
[33:25.960 - 33:31.360]
that were innovated in academia and incorporating that into what they're scaling up. So we were
[33:31.360 - 33:36.000]
talking at the beginning of this podcast about that idea of doing KL control from your
[33:36.000 - 33:40.340]
prior; that's something that I did on a very, very small scale in academia that ends up being
[33:40.340 - 33:45.800]
useful in the system eventually, right? In the system that gets scaled up. So I see the
[33:45.800 - 33:50.720]
role of academics to do that same kind of proof of concept work, like discover these
[33:50.720 - 33:55.920]
new novel research ideas that work and then industry can have the role of scaling them
[33:55.920 - 33:59.600]
up, right? And so it just depends on what you want to be doing. Like, do you want to
[33:59.600 - 34:03.680]
be on a giant team working on infrastructure or do you want to be doing the kind of more
[34:03.680 - 34:08.520]
researchy, testing-out-ideas thing? And for me, I'm much more excited about the latter.
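[Editor's note: a minimal sketch of the "KL control from your prior" idea mentioned earlier, as used in RLHF-style fine-tuning: the task reward is penalized by the KL divergence of the fine-tuned policy from a pretrained prior, so the policy can't drift too far from it. The beta value and toy distributions below are made up for illustration.]

```python
import math

# Sketch of KL-regularized reward shaping: the effective reward is the task
# reward minus a beta-weighted penalty for deviating from a pretrained prior.
# The distributions are toy numbers over a 3-token vocabulary.

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def kl_shaped_reward(task_reward, policy_probs, prior_probs, beta=0.5):
    return task_reward - beta * kl_divergence(policy_probs, prior_probs)

prior = [0.5, 0.3, 0.2]   # pretrained language-model prior
policy = [0.7, 0.2, 0.1]  # fine-tuned policy drifting away from the prior

r = kl_shaped_reward(1.0, policy, prior)
print(r)  # a bit below 1.0: the drift from the prior is penalized
```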
[34:08.520 - 34:13.400]
That makes total sense. And like, I guess you're getting the credit from the citations
[34:13.400 - 34:18.120]
from these big papers that really work, but maybe not so much the public credit because
[34:18.120 - 34:23.280]
like everyone just points to ChatGPT and they think that is AI, like OpenAI invented AI,
[34:23.280 - 34:26.480]
but they're building on like the shoulders of all these giants from the past, including
[34:26.480 - 34:30.000]
yourself and all the academics know this, but for the public, it's like, Oh look, they
[34:30.000 - 34:31.000]
solved AI.
[34:31.000 - 34:37.400]
That's interesting. Yeah. I mean, I think my, my objective is more about like, well,
[34:37.400 - 34:41.680]
I just enjoy the process of like testing out ideas and seeing if they work, but my objective
[34:41.680 - 34:47.040]
is much more like, did you end up contributing something that was useful rather than did
[34:47.040 - 34:49.480]
you get the glory?
[34:49.480 - 34:54.600]
That's very legit. Okay. So, um, what do you plan to work on at UW? Have
[34:54.600 - 34:58.600]
you, do you have a clear idea of that or is that something that you'll decide?
[34:58.600 - 35:02.080]
I do have a clear idea because you kind of, they don't give you the job unless you can
[35:02.080 - 35:07.960]
kind of sell it and sell what you're going to do. So, um, yeah, I mean the pitch that
[35:07.960 - 35:12.360]
I was kind of pitching on the faculty job market is like, um, I want to do this thing
[35:12.360 - 35:16.960]
called social reinforcement learning. And the idea is what are the benefits you can
[35:16.960 - 35:21.640]
get in terms of improving AI when you consider the case that you're likely going to be learning
[35:21.640 - 35:26.400]
in an environment with other intelligent agents. So you can either think about that as like
[35:26.400 - 35:30.760]
setting up a multi-agent system to make your agent more robust. That would be like paired
[35:30.760 - 35:35.400]
would be in that kind of category of thing. Or you could think about this idea that, you
[35:35.400 - 35:38.280]
know, for most of what we want AI to do, you might be deployed in environments where there
[35:38.280 - 35:42.120]
are humans and humans are pretty smart and have a lot of knowledge that might benefit
[35:42.120 - 35:47.520]
you when you're trying to do a task. So not only thinking about how to flexibly learn
[35:47.520 - 35:52.040]
from humans, like when I think about social learning, I don't think about just indiscriminately
[35:52.040 - 35:58.520]
imitating every human, but maybe kind of the human skill of social learning is about identifying
[35:58.520 - 36:02.600]
which models are actually worth learning from and when you should rely on learning from
[36:02.600 - 36:07.160]
others versus your independent exploration. So I think that's like a whole set of questions.
[36:07.160 - 36:11.960]
And then finally, I want to just make AI that's useful for interacting with humans. So, you
[36:11.960 - 36:16.680]
know, how do you interact with a new human you've never seen before and cooperate with
[36:16.680 - 36:20.640]
them to solve a task? So kind of the zero shot cooperation problem, how do you perceive
[36:20.640 - 36:26.120]
what goal they're trying to solve? How do you learn from their feedback? And this is
[36:26.120 - 36:30.320]
including types of implicit feedback. And then finally, this whole branch of like, how
[36:30.320 - 36:34.440]
do you communicate with humans in natural language to solve tasks? So that's why I've
[36:34.440 - 36:38.320]
been working on this kind of language-conditioned RL, how do you train language models with
[36:38.320 - 36:43.160]
human feedback, this whole set of things. That's the pitch.
[36:43.160 - 36:47.560]
Awesome. And they obviously loved it because you're hired.
[36:47.560 - 36:51.120]
It depends, but yeah, I'm excited.
[36:51.120 - 36:55.840]
So I mean, it sounds like a lot of stuff that I had to learn as a young person, as an awkward
[36:55.840 - 37:03.520]
nerdy teen: how to talk to humans, which humans should I imitate. Right? Exactly. And then
[37:03.520 - 37:07.800]
do you want to talk about some of your recent papers since you've been on last,
[37:03.520 - 37:07.800]
which was three and a half years ago? I see on Google Scholar there's been lots
[37:07.800 - 37:11.880]
of papers since then with your name on them. But there were a few that we
[37:11.880 - 37:16.720]
had kind of talked about touching on today, including BASIS and PsiPhi-learning. Should we talk
[37:16.720 - 37:22.120]
about those?
[37:23.120 - 37:27.640]
Sure. So I think maybe I'll also add another paper that was like sort of the precursor
[37:27.640 - 37:32.280]
to PsiPhi-learning from my perspective, really touching on this idea of what is social learning
[37:32.280 - 37:37.040]
versus just like imitation learning versus RL. So I'm really thinking about this problem,
[37:37.040 - 37:41.680]
like you're in an environment with other agents that might have knowledge that's relevant
[37:41.680 - 37:45.680]
to the task, but you don't know if they do and they're pursuing self interested goals.
[37:45.680 - 37:49.840]
So you can think about like an autonomous car on the road. There are other cars that
[37:49.840 - 37:53.240]
are driving, but some of them are actually bad drivers. So you don't want to sort of
[37:53.240 - 37:58.240]
indiscriminately imitate. Or your robot in an office picking up trash: there are humans
[37:58.240 - 38:01.620]
that are going about their day. They don't want to stop and sort of explicitly teach
[38:01.620 - 38:05.220]
you what to do. They're trying to get work done. So how do you benefit from learning
[38:05.220 - 38:06.220]
from that?
[38:06.220 - 38:11.400]
So we had a couple of papers on this. The first paper was actually with Kamal Ndousse,
[38:06.220 - 38:11.400]
who's now at Anthropic. His paper was looking at whether RL agents benefit from social
[38:17.560 - 38:22.440]
learning by default. So if you're in an environment with another agent that's sort of constantly
[38:22.440 - 38:28.480]
showing you how to do the task correctly, do you learn any faster than an RL agent that's
[38:28.480 - 38:35.680]
in an environment by itself? And his conclusion was actually, no, they don't. So default RL
[38:35.680 - 38:40.040]
agents are actually really bad at social learning. And his work showed that if you just add this
[38:40.040 - 38:44.680]
auxiliary prediction task, like predicting your own next observation, then you're implicitly
[38:44.680 - 38:49.200]
modeling what's going on with the other agents in the environment. That makes its way into
[38:49.200 - 38:53.920]
your representation and you're more able to learn from their behavior. And the cool
[38:53.920 - 38:58.240]
part about this is, if you actually learn the social learning behavior, like how to
[38:58.240 - 39:02.840]
learn from other agents in your environment, then you can actually generalize much
[39:02.840 - 39:07.680]
more effectively to a totally new task that you've never seen before, because you can
[39:07.680 - 39:12.560]
apply that skill of social learning to master the new task. So you sort of learned how to
[39:12.560 - 39:17.160]
socially learn. And those social learning agents end up generalizing a lot better to new
[39:17.160 - 39:21.640]
tasks than agents that are trained with pure imitation learning or RL.
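[Editor's note: a minimal sketch of the auxiliary-prediction idea: alongside the usual RL loss, the agent also predicts its own next observation, and since other agents appear in that observation, minimizing the prediction error forces the representation to implicitly model them. The mean-squared error and toy vectors are illustrative choices, not the paper's actual architecture.]

```python
# Sketch: total training loss = RL loss + aux_weight * next-observation
# prediction loss. Other agents are part of the observation, so a good
# predictor has to model their behavior.

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(rl_loss, predicted_next_obs, actual_next_obs, aux_weight=0.5):
    return rl_loss + aux_weight * mse(predicted_next_obs, actual_next_obs)

# Say the last two entries of the observation are the other agent's position,
# so predicting them well means modeling where that agent will move.
predicted = [0.1, 0.0, 2.0, 3.0]
actual    = [0.1, 0.0, 2.0, 4.0]  # the other agent moved one cell

print(total_loss(1.2, predicted, actual))  # -> 1.325
```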
[39:21.640 - 39:27.120]
So I think that's quite exciting. And then PsiPhi-learning was like a follow-up that
[39:21.640 - 39:27.120]
does the social learning in a much more effective way. It's going to be hard to
[39:27.120 - 39:33.080]
describe, because it kind of uses the math of successor features. So it might
[39:37.440 - 39:44.800]
be a little hard to describe on a podcast, but the idea is you're going to model not
[39:44.800 - 39:49.760]
only your own policy, but every other agent's policy in the environment in a way that kind
[39:49.760 - 39:55.440]
of disentangles a representation of the states that they're going to experience from the
[39:55.440 - 39:59.880]
rewards that they're trying to optimize. So using this like successor representation.
[39:59.880 - 40:03.760]
And what that lets you do is you can kind of take out the part that models the other
[40:03.760 - 40:10.520]
agent's rewards and substitute your own reward function in with the other agent's policy.
[40:10.520 - 40:14.040]
And that lets you compute, hey, if I were to act like the other agent right now, if
[40:14.040 - 40:19.020]
I were to copy, you know, agent two over here, would I actually get more rewards under my
[40:19.020 - 40:25.980]
own reward function? And so that lets you flexibly choose who and what
[40:25.980 - 40:31.320]
to imitate and when. So at every time step, you can choose to rely on your own policy
[40:31.320 - 40:34.320]
or you can choose to copy someone else and you can choose who's the most appropriate
[40:34.320 - 40:40.240]
person to copy. And what we show is that that actually gets you better performance than
[40:40.240 - 40:44.560]
either purely relying on imitation learning, which is going to fail if the other agents
[40:44.560 - 40:50.680]
are doing bad stuff, or purely relying on RL, where you're going to miss out on a bunch
[40:50.680 - 40:53.920]
of useful behaviors that other agents know how to do if you're just trying to discover
[40:53.920 - 40:58.360]
everything yourself. So I think that whole direction is actually quite interesting to
[40:58.360 - 41:04.520]
me. I did skim that paper, and it reminded me of an old multi-agent competition
[41:04.520 - 41:10.600]
I once did, Bomberman. And it was quite challenging to work with these other agents. It would
[41:10.600 - 41:15.440]
have been pretty cool to be able to imitate them better. And I could imagine
[41:15.440 - 41:20.040]
that for humans, we're learning from other people all the time, probably ever
[41:20.040 - 41:25.200]
since birth. And we haven't really spent as much time thinking about that in AI.
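[Editor's note: a rough sketch of the successor-feature trick described above: each agent k's policy is summarized by successor features psi_k(s, a), the expected discounted feature counts under that policy, and rewards are assumed linear in those features, r = phi . w. Swapping your own reward weights into another agent's psi estimates "what would I earn if I behaved like them right now?". The agents, features, and numbers below are hypothetical, not the paper's notation.]

```python
# Sketch of choosing who to imitate with successor features: evaluate each
# agent's policy under MY reward weights and copy whoever scores highest.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def best_agent_to_copy(successor_features, w_self):
    """Return the agent whose policy gets the highest return under my reward."""
    returns = {k: dot(psi, w_self) for k, psi in successor_features.items()}
    return max(returns, key=returns.get)

# Expected discounted feature counts under each agent's policy (toy numbers).
successor_features = {
    "self":    [1.0, 0.5],
    "agent_2": [2.0, 0.1],  # spends a lot of time on feature 0
    "agent_3": [0.2, 3.0],  # spends a lot of time on feature 1
}
w_self = [1.0, 0.0]  # my reward only cares about feature 0

print(best_agent_to_copy(successor_features, w_self))  # -> agent_2
```

Re-running this comparison every time step, with your own policy included as a candidate, is what lets the agent flexibly switch between independent behavior and imitation.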
[41:25.200 - 41:27.680]
That's something I'm really excited about. I don't know if we talked about this last
[41:27.680 - 41:33.360]
time, but this whole idea that a big component of human intelligence and what sets us apart
[41:33.360 - 41:38.680]
from other animals or, you know, other forms of intelligence is that we rely so heavily
[41:38.680 - 41:43.720]
on social learning. Like we discover almost nothing completely independently, like, look
[41:43.720 - 41:47.520]
at research, right? So much of it is reading what everyone else has done and then making
[41:47.520 - 41:53.440]
a tiny tweak on top, right? So it's just that kind of standing on the shoulders
[41:53.440 - 41:58.080]
of giants, learning from others, that I see as really important. I also see social learning
[41:58.080 - 42:01.960]
as a path to address this sort of like truck on truck on truck problem we were talking
[42:01.960 - 42:08.040]
about earlier. Like you kind of need adaptive online generalization to solve some of these
[42:08.040 - 42:13.800]
safety-critical problems. So imagine I'm a self-driving car, and I encounter a
[42:13.800 - 42:18.280]
situation that I've never seen in my training data, which is like, there's a big flood.
[42:18.280 - 42:23.720]
And the bridge I'm trying to go under is completely flooded. Right? And if I just drive forward,
[42:23.720 - 42:30.080]
I can actually destroy my car and put the passengers in danger, right? But the other
[42:30.080 - 42:34.200]
humans on the road are probably gonna be pretty smart and realize what they should
[42:34.200 - 42:38.480]
do, or they'll have a better chance of realizing it than me, the self-driving car. So maybe
[42:38.480 - 42:42.640]
at that point I should actually be relying more on social learning to take cues from
[42:42.640 - 42:48.160]
others, and use that as a way to adapt to the situation, rather than just relying
[42:48.160 - 42:52.520]
on my pre-training data. And this isn't just my idea. I think Anca Dragan has a nice
[42:48.160 - 42:52.520]
paper on this: if your self-driving car is uncertain, it should be copying
[42:52.520 - 42:57.840]
other agents. But I think there's something really promising there.
[43:01.360 - 43:06.240]
Yeah, coming back to that truck on truck on truck, like there's no limit to what things
[43:01.360 - 43:06.240]
you might stack. I used to live in India, and the stuff you would see on a truck in India
[43:06.240 - 43:11.720]
is just so unpredictable. But the way I recognize what it is, is I look at
[43:11.720 - 43:16.520]
the lower part of it. And I'm like, oh, it has truck wheels. No matter what weird
[43:16.520 - 43:21.200]
thing is on top, that is a truck. And I think the models that we have right now aren't
[43:21.200 - 43:26.560]
very good at ignoring distractor stuff. That's more a problem with the
[43:26.560 - 43:31.680]
function approximator; I don't think it's a real RL issue. But that's
[43:31.680 - 43:36.080]
always disappointed me, that we haven't somehow got past those distractor features.
[43:40.800 - 43:46.360]
That's a really insightful point. And I think, you know, there's many different things we
[43:46.360 - 43:51.040]
have to solve with AI. If I'm channeling like Josh Tenenbaum's answer to the problem you
[43:51.040 - 43:55.040]
just brought up, I mean, he would basically, well, I don't know how good of a job I can
[43:55.040 - 43:59.040]
do channeling Josh Tenenbaum, but he would say like, we need more symbolic representations
[43:59.040 - 44:03.060]
where we can generalize the representation to understand that, like, a truck with hay on
[44:03.060 - 44:08.360]
it is still fundamentally a truck. Like there's some fundamental characteristics that make
[44:08.360 - 44:12.100]
the definition of this thing. Whereas if we're just doing this purely
[44:12.100 - 44:16.920]
inductive deep learning thing of, I've seen a bazillion examples of a truck, and
[44:16.920 - 44:20.520]
therefore I can recognize a truck. But if it goes out of my distribution, I can't recognize
[44:20.520 - 44:27.240]
it. I mean, maybe this is the problem of representation. And just to be very like, speculative,
[44:27.240 - 44:32.000]
I do think there's something promising about models that integrate language, speaking of
[44:32.000 - 44:36.120]
why I want to put language models into agents that actually like put an actual language
[44:36.120 - 44:41.420]
representation into an RL agent, like because language is compositional, you get these kind
[44:41.420 - 44:44.640]
of compositional representations that could potentially help you generalize better. So
[44:44.640 - 44:50.320]
like, if you look at like, image and language models, you know, like clip, or you look at
[44:50.320 - 44:55.440]
all these image generation models, we see very strong evidence of compositionality,
[44:55.440 - 45:00.960]
right? Like you get these prompts that clearly have never been in the training data. And
[45:00.960 - 45:05.660]
they're able to generate convincing images of them. And I think that's just because language
[45:05.660 - 45:11.360]
helps you organize your representation in a way that allows you to combine these components.
[45:11.360 - 45:15.200]
So maybe like a compositional representation of a truck is like, yeah, it's more like,
[45:15.200 - 45:18.840]
it definitely has to have wheels. But it doesn't matter what it's carrying.
[45:18.840 - 45:23.960]
This reminds me of a poster I saw at ICML called concept bottleneck model.
[45:23.960 - 45:29.720]
Oh, yeah. Exactly. I'm doing a concept bottleneck model for multi agent interpretability paper.
[45:29.720 - 45:34.520]
I think we're going to release it on archive very soon. I'm very excited about it. But
[45:34.520 - 45:36.160]
yeah, it's a cool idea.
[45:36.160 - 45:40.440]
Great looking forward to that too. Yeah, I just want to say it's always such a good time
[45:40.440 - 45:44.680]
chatting with you. It's really enjoyable. I always learn so much. I'm inspired. I can't
[45:44.680 - 45:49.280]
wait to see what you come up with next. Thanks so much for sharing your time with the
[45:49.280 - 46:12.400]
TalkRL audience. Thank you so much. I really appreciate being here.

Creators and Guests

Robin Ranjit Singh Chauhan