John Schulman

John Schulman, OpenAI cofounder and researcher and inventor of PPO/TRPO, talks RL from human feedback, tuning GPT-3 to follow instructions (InstructGPT) and answer long-form questions using the internet (WebGPT), AI alignment, AGI timelines, and more!

[00:00.000 -- 00:01.960] the answer was affirmative.
[00:01.960 -- 00:05.680] We can get an agent to basically use a set of tools
[00:05.680 -- 00:06.520] that we give it.
[00:06.520 -- 00:09.440] In this case, the browsing commands, like searching.
[00:09.440 -- 00:13.880] I would say I expect AI to be able to do a better job
[00:13.880 -- 00:16.860] than humans at most jobs that humans do now,
[00:16.860 -- 00:17.980] in five years or so.
[00:19.660 -- 00:20.500] Talk RL.
[00:22.660 -- 00:26.700] Talk RL podcast is all reinforcement learning all the time,
[00:26.700 -- 00:29.900] featuring brilliant guests, both research and applied.
[00:29.900 -- 00:33.520] Join the conversation on Twitter at Talk RL podcast.
[00:33.520 -- 00:35.160] I'm your host, Robin Chauhan.
[00:39.500 -- 00:41.900] John Schulman is a co-founder of OpenAI
[00:41.900 -- 00:44.480] and a researcher and engineer at OpenAI.
[00:44.480 -- 00:46.360] He is well known for major contributions
[00:46.360 -- 00:48.440] to the field of reinforcement learning,
[00:48.440 -- 00:50.760] including the TRPO algorithm,
[00:50.760 -- 00:52.920] that's Trust Region Policy Optimization,
[00:52.920 -- 00:56.000] GAE, Generalized Advantage Estimation.
[00:56.000 -- 00:58.120] Those are from his UC Berkeley dissertation.
[00:58.120 -- 01:02.080] And TRPO's descendant, Proximal Policy Optimization, or PPO.
[01:02.080 -- 01:06.040] His current focus at OpenAI is on RL from human feedback.
[01:06.040 -- 01:08.320] John, welcome to the show and thanks so much for being here.
[01:08.320 -- 01:09.360] Thanks a lot for having me.
[01:09.360 -- 01:11.380] You were literally one of the first people I thought of
[01:11.380 -- 01:13.840] when I started the show three years back.
[01:13.840 -- 01:14.880] Thanks, I'm honored.
[01:14.880 -- 01:17.320] It means a lot to me to have you here today.
[01:17.320 -- 01:20.920] I definitely remember your nuts and bolts of deep RL video
[01:20.920 -- 01:23.240] back in the day and watching that multiple times
[01:23.240 -- 01:24.360] and gaining a lot from that.
[01:24.360 -- 01:26.200] So I think you helped probably a generation
[01:26.200 -- 01:28.640] of RL practitioners back then.
[01:28.640 -- 01:31.280] By the way, there's going to be a reboot
[01:31.280 -- 01:33.360] of the nuts and bolts presentation.
[01:33.360 -- 01:37.320] I got invited to give a talk at NeurIPS this year on it.
[01:37.320 -- 01:41.200] So I'll have to revamp the guidelines and everything.
[01:41.200 -- 01:42.120] So that'll be fun.
[01:42.120 -- 01:42.960] Oh, that's awesome.
[01:42.960 -- 01:43.780] Can't wait for that.
[01:43.780 -- 01:47.240] So you were clearly one of the earlier pioneers in deep RL.
[01:47.240 -- 01:49.640] So how did you choose to move your focus to RL
[01:49.640 -- 01:50.800] from human feedback?
[01:50.800 -- 01:52.560] And why is that an important problem?
[01:52.560 -- 01:53.740] Why is that important to you?
[01:53.740 -- 01:57.560] After GPT-3 was trained, I was blown away by how smart it was.
[01:57.560 -- 02:00.040] And I realized the next frontier was figuring out
[02:00.040 -- 02:02.000] how to make language models actually useful.
[02:02.000 -- 02:03.800] I'm still really interested in RL,
[02:03.800 -- 02:07.400] but solving RL benchmarks isn't the end of the story.
[02:07.400 -- 02:10.360] To use your RL algorithm, you need a reward function.
[02:10.360 -- 02:12.680] But where does the reward function come from?
[02:12.680 -- 02:15.160] In RL benchmarks, you usually just code up
[02:15.160 -- 02:16.020] the reward function.
[02:16.020 -- 02:18.320] But if you're not in a simulator environment,
[02:18.320 -- 02:19.160] that doesn't work.
[02:19.160 -- 02:23.280] So what we have to do in any kind of real world use case
[02:23.280 -- 02:25.160] is have humans look at what the AI did
[02:25.160 -- 02:26.680] and decide if it was good or bad.
[02:26.680 -- 02:29.200] So how exactly you define this reward
[02:29.200 -- 02:31.800] becomes a really challenging and important problem,
[02:31.800 -- 02:34.160] especially as the tasks get harder to evaluate.
[02:34.160 -- 02:37.240] Another angle on this is that language models are very smart,
[02:37.240 -- 02:40.400] but it's hard to get them to do anything useful.
[02:40.400 -- 02:43.200] A big part of that is they're not necessarily
[02:43.200 -- 02:44.240] trying to do what you want.
[02:44.240 -- 02:46.400] They're just trying to imitate the training corpus.
[02:46.400 -- 02:48.440] So that means there's a big opportunity
[02:48.440 -- 02:50.640] to improve them a lot by just giving them
[02:50.640 -- 02:51.600] the right objective.
[02:51.600 -- 02:55.280] That's what we can do by applying RL to these language
[02:55.280 -- 02:58.560] models using human feedback to define the reward.
[02:58.560 -- 03:02.560] Is using human feedback harder or very different in some way
[03:02.560 -- 03:04.360] than using a synthetic reward?
[03:04.360 -- 03:06.600] There are a lot of new complications.
[03:06.600 -- 03:09.800] Now you have to collect a data set dynamically.
[03:09.800 -- 03:12.160] So you're always in the business of building data
[03:12.160 -- 03:14.720] sets of human preferences.
[03:14.720 -- 03:17.160] Often the data quality there matters more
[03:17.160 -- 03:19.320] than various algorithmic details.
[03:19.320 -- 03:22.440] And you also have to think a lot about exactly how you're
[03:22.440 -- 03:24.360] giving the task to the human trainers
[03:24.360 -- 03:25.680] and various other things that you
[03:25.680 -- 03:27.360] wouldn't have thought about if you just
[03:27.360 -- 03:29.040] had a programmatic reward function.
[03:29.040 -- 03:31.080] Does the difference between human raters
[03:31.080 -- 03:34.200] or the noisiness of the reward signal cause any problems?
[03:34.200 -- 03:36.640] I would say the noise, definitely
[03:36.640 -- 03:40.320] you need to be below some threshold of noise
[03:40.320 -- 03:41.360] to learn anything.
[03:41.360 -- 03:44.160] I think, in general, if you have a large noisy data
[03:44.160 -- 03:47.640] set that can be as good as a smaller, clean data set.
[03:47.640 -- 03:50.640] So actually, noise isn't the thing that worries me the most.
[03:50.640 -- 03:53.600] It's more that there are sometimes consistent biases
[03:53.600 -- 03:54.680] that people have.
[03:54.680 -- 03:58.920] For example, in settings like question answering or settings
[03:58.920 -- 04:02.000] where you have a model writing some text,
[04:02.000 -- 04:04.160] often people prefer longer answers.
[04:04.160 -- 04:06.680] You end up with these very verbose answers.
[04:06.680 -- 04:08.880] If you're not careful with the instructions, that is.
[04:08.880 -- 04:12.000] I mean, you can also instruct people, the raters,
[04:12.000 -- 04:14.440] to reward brevity.
[04:14.440 -- 04:17.200] But if you're not careful, you can
[04:17.200 -- 04:19.360] incentivize the wrong kinds of behaviors.
[04:19.360 -- 04:21.480] So let's move to some of your recent work.
[04:21.480 -- 04:24.640] First up is WebGPT, browser assisted question
[04:24.640 -- 04:26.200] answering with human feedback.
[04:26.200 -- 04:30.000] That's Nakano et al with yourself as a co-author in 2021.
[04:30.000 -- 04:32.880] Can you tell us what is the main idea of this paper?
[04:32.880 -- 04:33.880] What is WebGPT?
[04:33.880 -- 04:37.720] In WebGPT, we basically took our language models
[04:37.720 -- 04:40.040] and we hooked them up to a web browser
[04:40.040 -- 04:42.520] so they could retrieve information from the web.
[04:42.520 -- 04:44.480] And they can write an answer by summarizing
[04:44.480 -- 04:45.960] the relevant pages from the web.
[04:45.960 -- 04:48.760] So that way if you're asking a question about current events
[04:48.760 -- 04:51.520] or a question that requires some detailed scientific
[04:51.520 -- 04:53.840] or technical knowledge, this AI can go out
[04:53.840 -- 04:56.680] and look up the answer and with detailed citations
[04:56.680 -- 04:57.560] to its sources.
[04:57.560 -- 05:00.320] So I would say there's kind of two interesting points
[05:00.320 -- 05:01.160] to this.
[05:01.160 -- 05:03.600] One is we were exploring whether you could turn language
[05:03.600 -- 05:05.360] models into a kind of agent.
[05:05.360 -- 05:07.840] There's a lot of data on the web of different texts
[05:07.840 -- 05:09.920] that people have written, but there's not a lot of data
[05:09.920 -- 05:13.360] that shows how to actually do some multi-step process.
[05:13.360 -- 05:15.400] So it's not that clear a priori
[05:15.400 -- 05:16.880] whether you can get a language model
[05:16.880 -- 05:19.600] to actually carry out some iterative process.
[05:19.600 -- 05:22.480] We just have a lot of data like writing essays
[05:22.480 -- 05:23.960] and having chats and so forth.
[05:23.960 -- 05:25.840] So that was one thing we were exploring here.
[05:25.840 -- 05:28.120] And I think the answer was affirmative.
[05:28.120 -- 05:32.280] We can get an agent to basically use a set of tools
[05:32.280 -- 05:34.880] that we give it, in this case, the browsing commands
[05:34.880 -- 05:37.480] like searching, scrolling, clicking on links.
[05:37.480 -- 05:40.560] The second theme of this paper was around truthfulness.
[05:40.560 -- 05:44.120] I mean, a big issue with language models is,
[05:44.120 -- 05:45.600] I mean, they're not very reliable
[05:45.600 -- 05:47.080] at giving you true information.
[05:47.080 -- 05:49.680] They know a vastly superhuman amount,
[05:49.680 -- 05:51.640] but if you prompt them in the wrong way,
[05:51.640 -- 05:54.520] they'll just output lots of plausible sounding nonsense.
[05:54.520 -- 05:57.680] So how to fix that is a big research question
[05:57.680 -- 05:59.800] or one of the biggest research questions
[05:59.800 -- 06:01.640] in the world of language models.
[06:01.640 -- 06:03.480] I think it's gonna be challenging to fully fix it,
[06:03.480 -- 06:06.960] but I think a big part of the story involves retrieval
[06:06.960 -- 06:10.520] and having models write answers that contain citations,
[06:10.520 -- 06:12.600] citations to trusted sources.
[06:12.600 -- 06:14.440] So a person who's checking over the answer
[06:14.440 -- 06:16.160] doesn't have to go and try to figure out
[06:16.160 -- 06:18.200] where the model might've gotten this idea.
[06:18.200 -- 06:20.520] They can go and directly look at the source
[06:20.520 -- 06:23.280] and see if it supports the AI's statement.
[06:23.280 -- 06:25.960] With WebGPT, we just wanted to see
[06:25.960 -- 06:28.520] if we do give the language model
[06:28.520 -- 06:30.400] a really flexible interface of the web,
[06:30.400 -- 06:33.240] can we have it answer hard questions truthfully
[06:34.440 -- 06:36.280] with the help of all these citations?
[06:36.280 -- 06:38.360] And it's actually really non-trivial
[06:38.360 -- 06:41.040] because if you look at the dataset we use,
[06:41.040 -- 06:43.280] the Reddit Explain Like I'm Five (ELI5) dataset.
[06:43.280 -- 06:44.680] The questions are really varied,
[06:44.680 -- 06:46.840] like some of them are about science, history,
[06:46.840 -- 06:49.560] current events, like our raters didn't necessarily
[06:49.560 -- 06:51.520] know anything about these topics,
[06:51.520 -- 06:55.760] but still they had to judge the detailed answers.
[06:55.760 -- 06:57.640] So it would have been really hard to do it
[06:57.640 -- 06:59.960] without the supporting citations.
[06:59.960 -- 07:04.000] So we kind of validated that we could get good feedback
[07:04.000 -- 07:07.440] in a hard domain like this with the help of citations.
[07:07.440 -- 07:10.680] Can you talk about where the idea for WebGPT came from?
[07:10.680 -- 07:13.000] Is that an idea you've had kicking around for a while
[07:13.000 -- 07:15.800] or was it something that came up recently before the paper?
[07:15.800 -- 07:17.760] How did that play out?
[07:17.760 -- 07:19.800] Some of the ideas had been floating around,
[07:19.800 -- 07:22.400] like we actually had a project
[07:22.400 -- 07:26.160] at OpenAI very early on called World of Bits.
[07:26.160 -- 07:28.520] We were looking at controlling web browsers
[07:28.520 -- 07:31.120] or doing tasks on the internet
[07:31.120 -- 07:32.360] with a web browser,
[07:32.360 -- 07:34.520] but it was way too early at the time.
[07:34.520 -- 07:38.120] So we kind of abandoned it for a few years.
[07:38.120 -- 07:40.240] Actually, back then we were trying to do it
[07:40.240 -- 07:41.480] with full visual input.
[07:41.480 -- 07:45.040] So we thought, yeah, we could give some instructions
[07:45.040 -- 07:48.880] to the agent, like go and figure out the address
[07:48.880 -- 07:51.000] of this building or something.
[07:51.000 -- 07:54.000] The agent would go and search the web
[07:54.000 -- 07:57.000] or use Google maps or whatever to figure out the answer.
[07:57.000 -- 07:58.760] And we were trying to do this all in pixels.
[07:58.760 -- 08:00.640] That obviously didn't work very well,
[08:00.640 -- 08:03.640] but now we have these great language models
[08:03.640 -- 08:05.680] that work on text data.
[08:05.680 -- 08:08.960] We can also extract the text out of web pages
[08:08.960 -- 08:12.000] to get most of the information.
[08:12.000 -- 08:15.280] We can't really interact with a lot of dynamic websites.
[08:15.280 -- 08:16.960] Yeah, where there's a lot of JavaScript
[08:16.960 -- 08:18.000] and images and so forth,
[08:18.000 -- 08:19.960] but as long as it's just browsing
[08:19.960 -- 08:21.760] and reading texts, we're fine.
[08:21.760 -- 08:24.320] So yeah, we had good enough models
[08:24.320 -- 08:27.880] and that made it kind of feasible to revisit this idea
[08:27.880 -- 08:30.960] of using the internet as an environment.
[08:30.960 -- 08:33.640] So I would say that was one of the sources
[08:33.640 -- 08:36.760] of inspiration, that long kind of thread
[08:36.760 -- 08:39.320] about like using the internet as an environment.
[08:39.320 -- 08:44.320] Another motivation was just after we started playing
[08:44.680 -- 08:47.920] with GPT-3, we noticed that it had all these problems
[08:47.920 -- 08:51.400] with factual accuracy and the reliability
[08:51.400 -- 08:52.920] of the information it was giving us.
[08:52.920 -- 08:56.280] So that kind of motivated doing more research
[08:56.280 -- 08:58.960] on how to make language models more truthful.
[08:58.960 -- 09:01.040] We were kind of brainstorming what to do there
[09:01.040 -- 09:05.480] and we went through some docs and eventually decided
[09:05.480 -- 09:07.760] that we wanted to try some question answering
[09:07.760 -- 09:09.800] like using the web, looking up knowledge
[09:09.800 -- 09:11.560] on the web to help answer questions.
[09:11.560 -- 09:12.880] So actually the original version
[09:12.880 -- 09:15.000] of the project used trivia questions.
[09:15.000 -- 09:18.400] So there's this well-known dataset, TriviaQA,
[09:18.400 -- 09:20.080] that has some basic trivia questions.
[09:20.080 -- 09:23.600] So we first worked a little bit on that dataset
[09:23.600 -- 09:26.960] and tried to see if we could boost the model's accuracy
[09:26.960 -- 09:29.840] by giving it web search.
[09:29.840 -- 09:33.040] And yeah, that actually worked straight away.
[09:33.040 -- 09:34.160] That worked pretty easily.
[09:34.160 -- 09:36.120] So then we decided to move on
[09:36.120 -- 09:38.080] to long form question answering.
[09:38.080 -- 09:41.880] And so that was the project
[09:41.880 -- 09:43.880] we ended up working on for a while.
[09:43.880 -- 09:47.080] Seems like you use a few different datasets here
[09:47.080 -- 09:49.800] and a number of different training methods.
[09:50.760 -- 09:52.600] I'll just list them: behavior cloning,
[09:52.600 -- 09:55.080] reward modeling, reinforcement learning
[09:55.080 -- 09:56.800] and rejection sampling.
[09:56.800 -- 10:00.520] So we were using a fairly standard methodology
[10:00.520 -- 10:03.240] which was actually adapted from previous work
[10:03.240 -- 10:05.600] on RL from human preferences.
[10:05.600 -- 10:09.120] So the pipeline is you first train a model
[10:09.120 -- 10:13.320] with supervised learning where you have human demonstrators
[10:13.320 -- 10:15.560] show how to do the task, like show how to map
[10:15.560 -- 10:17.160] from observations to actions.
[10:17.160 -- 10:19.280] Yeah, so that's the supervised learning
[10:19.280 -- 10:20.440] or behavior cloning step.
[10:20.440 -- 10:24.400] Then we train a reward model or a preference model.
[10:24.400 -- 10:28.320] It looks at two actions or two trajectories
[10:28.320 -- 10:29.720] and decides which one is better.
[10:29.720 -- 10:32.640] In this case, like in a question answering setting
[10:32.640 -- 10:33.880] you're looking at two answers
[10:33.880 -- 10:35.480] and deciding which answer is better.
[10:35.480 -- 10:37.440] And we use that to train a reward model
[10:37.440 -- 10:39.640] that assigns higher score to the good answers
[10:39.640 -- 10:40.480] than the bad ones.
[10:40.480 -- 10:41.840] Then you do reinforcement learning
[10:41.840 -- 10:43.160] against that reward function.
[10:43.160 -- 10:45.560] And of course you can iterate these last two steps
[10:45.560 -- 10:46.960] after you do a little RL.
[10:46.960 -- 10:49.520] Now you've sort of exploited some of the flaws
[10:49.520 -- 10:52.080] of the reward model, or some of the noise
[10:52.080 -- 10:53.200] in the reward model.
[10:53.200 -- 10:55.120] And it's not necessarily accurate
[10:55.120 -- 10:56.760] on your new distribution of data.
[10:56.760 -- 10:59.040] You recollect more pairs of samples
[10:59.040 -- 11:01.680] and refit this preference model.
[11:01.680 -- 11:04.000] And then you do another iteration of RL.
[11:04.000 -- 11:06.160] So that's like, that's the whole RL
[11:06.160 -- 11:07.600] from human feedback pipeline.
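To make the reward-model step John just described a bit more concrete, here is a minimal sketch of the pairwise preference loss commonly used for this kind of reward model, assuming a model that outputs a scalar score per answer; the function and variable names are illustrative, not the actual OpenAI code.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    # reward_chosen, reward_rejected: tensors of shape (batch,) holding the
    # reward model's scalar scores for the preferred and rejected answers.
    # Minimizing this pushes the preferred answer's score above the other's
    # (a Bradley-Terry style pairwise objective).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage: scores for two answers to the same batch of questions.
chosen = torch.tensor([1.3, 0.2, 0.9])
rejected = torch.tensor([0.7, 0.5, -0.1])
loss = preference_loss(chosen, rejected)
```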
[11:07.600 -- 11:11.080] And there's this other idea called rejection sampling
[11:11.080 -- 11:12.400] or best of end sampling.
[11:12.400 -- 11:14.840] And in general, you can do other kinds of search too.
[11:14.840 -- 11:18.680] Where instead of doing RL once you have your reward model
[11:18.680 -- 11:21.040] you can just search against that reward model.
[11:21.040 -- 11:23.440] So you can collect a bunch of samples
[11:23.440 -- 11:25.960] and re-rank them with the reward model
[11:25.960 -- 11:28.960] and take the best one as your action.
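As a rough illustration of the best-of-n idea described here, the sketch below samples several candidate answers and keeps the one the reward model scores highest; `sample_answer` and `reward_model` are hypothetical callables standing in for the policy and the learned preference model.

```python
def best_of_n(question, sample_answer, reward_model, n=16):
    # Draw n candidate answers from the policy for the same question,
    # score each one with the learned reward model, and return the best.
    candidates = [sample_answer(question) for _ in range(n)]
    scores = [reward_model(question, answer) for answer in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```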
[11:28.960 -- 11:30.520] Kind of like MPC?
[11:30.520 -- 11:31.360] Yeah, exactly.
[11:31.360 -- 11:33.440] Yeah, it kind of depends exactly
[11:33.440 -- 11:35.640] what setting you're in, what you can do.
[11:35.640 -- 11:38.400] If you're in a setting where there's some environment
[11:38.400 -- 11:41.040] you're interacting with, then you'd have to simulate
[11:41.040 -- 11:44.160] the dynamics
[11:44.160 -- 11:45.920] of your environment.
[11:45.920 -- 11:47.920] So that would look kind of like MPC.
[11:47.920 -- 11:51.360] In our case, the only thing we had to learn
[11:51.360 -- 11:55.080] a model of was the human preference.
[11:55.080 -- 11:57.480] It's a question answering setting.
[11:57.480 -- 11:59.760] So it's really like a contextual bandit problem.
[11:59.760 -- 12:02.520] So it's kind of straightforward
[12:02.520 -- 12:04.320] to sample a bunch of actions where each action
[12:04.320 -- 12:06.880] is a full answer and re-rank them,
[12:06.880 -- 12:11.640] or search over answers.
[12:11.640 -- 12:13.760] So in terms of the action space,
[12:13.760 -- 12:16.040] was the action space just the list of commands,
[12:16.040 -- 12:17.800] or is it still generating tokens
[12:17.800 -- 12:20.440] like a regular generative model?
[12:20.440 -- 12:21.800] We were generating tokens.
[12:21.800 -- 12:26.800] We had two phases in each episode of the RL task.
[12:26.800 -- 12:31.280] So there was first a browsing phase where the model goes
[12:31.280 -- 12:33.960] and it issues searches and clicks on things
[12:33.960 -- 12:36.560] and quotes relevant information.
[12:36.560 -- 12:38.400] Like if it sees something useful on the page,
[12:38.400 -- 12:40.920] it'll quote it using this quote command.
[12:40.920 -- 12:44.560] And then once it's done browsing,
[12:44.560 -- 12:48.480] it'll issue another command called end browsing
[12:48.480 -- 12:49.920] and it'll write its answer.
[12:49.920 -- 12:52.120] That's also expressed in tokens.
[12:52.120 -- 12:55.400] But really we rolled this all into one big RL task
[12:55.400 -- 12:57.440] where your episode involves browsing
[12:57.440 -- 12:58.640] and writing out the answer
[12:58.640 -- 13:01.480] and it's all one big RL episode.
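Here is a hypothetical sketch of the two-phase episode structure described above. The command strings and the `policy` and `browser` interfaces are made up for illustration and are not the actual WebGPT implementation; they just show how a browsing phase followed by an answering phase can be wrapped into one episode.

```python
def run_episode(policy, browser, question, max_steps=50):
    # Phase 1: browsing. The policy emits text commands (e.g. search, click,
    # quote) until it issues an end-browsing command, collecting quotes.
    observation, quotes = question, []
    for _ in range(max_steps):
        command = policy.generate(observation)
        if command.startswith("End browsing"):
            break
        observation, quote = browser.execute(command)
        if quote is not None:
            quotes.append(quote)
    # Phase 2: answering. The policy writes a final answer, also as tokens,
    # conditioned on the question and the quotes it gathered.
    return policy.generate_answer(question, quotes)
```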
[13:01.480 -- 13:02.840] Did you think this is gonna work well
[13:02.840 -- 13:04.440] or were you kind of surprised?
[13:04.440 -- 13:06.360] At the very beginning of the project,
[13:06.360 -- 13:09.000] we didn't know if it was gonna work or not.
[13:09.000 -- 13:10.920] Like after we did the initial experiments
[13:10.920 -- 13:12.560] with TriviaQA,
[13:12.560 -- 13:15.560] which actually didn't take that long to get running,
[13:15.560 -- 13:19.120] then it became pretty clear that it would work,
[13:19.120 -- 13:20.640] that the browsing part worked at least.
[13:20.640 -- 13:22.880] And we already knew that we can get these models
[13:22.880 -- 13:26.760] to write pretty good long-form text
[13:26.760 -- 13:28.520] if you give them a bunch of snippets
[13:28.520 -- 13:31.080] of text that they can cite.
[13:31.080 -- 13:35.400] So I noticed the human raters' task was quite complicated.
[13:35.400 -- 13:38.200] It was a long guide and there was many types of feedback
[13:38.200 -- 13:39.040] that they were giving.
[13:39.040 -- 13:40.440] But in the end, the paper said
[13:40.440 -- 13:42.720] that only the final rating was used.
[13:42.720 -- 13:44.640] So I was just curious if you had any comment about that.
[13:44.640 -- 13:46.040] Like why do you think maybe the model
[13:46.040 -- 13:47.440] couldn't use that extra feedback
[13:47.440 -- 13:50.840] or is this maybe just too much or not enough samples?
[13:50.840 -- 13:55.200] Yeah, that's been one frustrating finding so far.
[13:55.200 -- 13:58.480] In that project and also some other projects,
[13:58.480 -- 14:01.480] we've had the same finding that you have your raters
[14:01.480 -- 14:05.760] go through this long process for each comparison they do
[14:05.760 -- 14:08.240] where they're comparing a pair of answers.
[14:08.240 -- 14:10.440] And then you only use one bit of information
[14:10.440 -- 14:13.080] from this whole process,
[14:13.080 -- 14:14.720] which might've taken like half an hour.
[14:14.720 -- 14:15.840] It seems like it would be better
[14:15.840 -- 14:19.320] if we were able to extract more information,
[14:19.320 -- 14:21.680] more about the process they went through
[14:21.680 -- 14:22.920] in arriving at the answer.
[14:22.920 -- 14:25.040] So we did collect all sorts of other information
[14:25.040 -- 14:27.160] like we had them provide ratings
[14:27.160 -- 14:28.600] along several different axes
[14:28.600 -- 14:32.760] like coherence and factual accuracy and so forth.
[14:32.760 -- 14:35.960] But in the end, we didn't really get much of a boost
[14:35.960 -- 14:39.160] out of using any of this other information.
[14:39.160 -- 14:44.160] So I'd say it seems like it should be possible to do better.
[14:44.800 -- 14:46.520] But unfortunately this methodology,
[14:46.520 -- 14:49.840] which seems kind of dumb so far is hard to beat.
[14:49.840 -- 14:52.760] And people have tried various other ideas
[14:52.760 -- 14:55.120] for how to use human feedback.
[14:55.120 -- 14:57.080] Instead of getting these preference scores,
[14:57.080 -- 14:58.400] there are various other things you can do.
[14:58.400 -- 15:00.840] Like you can have them write critiques
[15:00.840 -- 15:03.200] or maybe edit the responses.
[15:03.200 -- 15:07.080] Yeah, I think some of these things are also promising.
[15:07.080 -- 15:09.440] But yeah, this methodology
[15:09.440 -- 15:12.080] of collecting preference data works well.
[15:12.080 -- 15:15.160] Yeah, I think it's still an open area of research.
[15:15.160 -- 15:18.280] Oh yeah, regarding the really long instructions.
[15:18.280 -- 15:20.000] Yeah, I think for any of these tasks,
[15:20.000 -- 15:24.000] there is a lot of subtlety in how to do the task properly.
[15:24.000 -- 15:27.800] And so we ended up adding more and more details
[15:27.800 -- 15:29.640] of like what do you do in this situation?
[15:29.640 -- 15:30.960] What do you do in that situation?
[15:30.960 -- 15:33.320] I think it's starting to get pretty unwieldy
[15:33.320 -- 15:35.760] with these really long instruction manuals.
[15:35.760 -- 15:39.920] So there's some promising ideas for how to address this.
[15:39.920 -- 15:42.840] Like there's a paper from DeepMind recently,
[15:42.840 -- 15:45.920] Sparrow, that basically broke down the task:
[15:45.920 -- 15:48.520] they basically had people look
[15:48.520 -- 15:52.400] at one aspect of the response at a time.
[15:52.400 -- 15:54.640] And then they had a way of combining
[15:54.640 -- 15:56.480] these different rule-specific models:
[15:56.480 -- 15:58.680] they would train a bunch of rule-specific reward models
[15:58.680 -- 16:00.440] and then combine them at the end.
[16:00.440 -- 16:02.520] Yeah, I think there's some other interesting ideas
[16:02.520 -- 16:05.320] for how to make this process better.
[16:05.320 -- 16:08.480] So I gather from your answer about WebGPT,
[16:08.480 -- 16:10.720] and the whole idea of WebGPT, that you want
[16:10.720 -- 16:14.400] the language model to have access to external knowledge.
[16:14.400 -- 16:17.560] But I wonder where you think the line should really be
[16:17.560 -- 16:19.680] in terms of what a language model should know
[16:19.680 -- 16:21.920] and what the language model should look up
[16:21.920 -- 16:24.240] and maybe what the language model should not know
[16:24.240 -- 16:25.600] or not purport to know.
[16:25.600 -- 16:27.120] Do you have opinions about that?
[16:27.120 -- 16:28.560] Yeah, let's see.
[16:28.560 -- 16:30.200] Like some people are advocating
[16:30.200 -- 16:32.480] for very small language models that have
[16:32.480 -- 16:35.480] like no external knowledge aside from language,
[16:35.480 -- 16:37.000] I guess would be the extreme position.
[16:37.000 -- 16:39.680] And then other people have talked about language models
[16:39.680 -- 16:41.000] that just know everything
[16:41.000 -- 16:43.440] as opposed to having an external knowledge source.
[16:43.440 -- 16:45.000] There's some interesting questions there.
[16:45.000 -- 16:48.440] So I think it is a little hard to separate knowledge,
[16:48.440 -- 16:51.160] factual knowledge from understanding.
[16:51.160 -- 16:55.120] So as humans, we get by without memorizing
[16:55.120 -- 16:57.560] all sorts of facts and just knowing
[16:57.560 -- 16:59.720] that we can look them up if needed.
[16:59.720 -- 17:01.520] For working on a specific domain,
[17:01.520 -- 17:06.440] it is useful to like have a lot of facts internalized
[17:06.440 -- 17:08.520] so that you can recall them very quickly
[17:08.520 -- 17:11.480] and kind of combine them in your head.
[17:11.480 -- 17:14.840] So I wouldn't take an extreme position on either side.
[17:14.840 -- 17:18.400] I would say, I think retrieval is gonna be really useful
[17:19.520 -- 17:22.480] just at the very least for current events,
[17:22.480 -- 17:26.480] but also I don't think we wanna try to pack
[17:26.480 -- 17:29.960] all human knowledge into the weights of a neural net.
[17:29.960 -- 17:32.280] On the other hand, I think people have had a lot of luck
[17:32.280 -- 17:37.200] just scaling up models and like as they soak up
[17:37.200 -- 17:40.800] more factual knowledge, they also get better at reasoning
[17:40.800 -- 17:41.640] and other things.
[17:41.640 -- 17:44.280] And I think I haven't seen any demonstrations
[17:44.280 -- 17:48.080] of tiny models that just do lots of retrieval
[17:48.080 -- 17:50.320] and save all their weights for reasoning.
[17:50.320 -- 17:53.840] Yeah, I just haven't seen any evidence of this
[17:53.840 -- 17:57.480] or I haven't seen any successful attempts at making this.
[17:57.480 -- 17:59.640] Let's move on to training language models
[17:59.640 -- 18:01.680] to follow instructions with human feedback.
[18:01.680 -- 18:03.080] That was Ouyang et al.
[18:03.080 -- 18:05.640] And that was 2022 with yourself as a co-author.
[18:05.640 -- 18:08.040] Can you tell us the main idea with this paper?
[18:08.040 -- 18:09.760] This is the InstructGPT paper.
[18:09.760 -- 18:12.000] What is InstructGPT and what's going on here?
[18:12.000 -- 18:15.240] InstructGPT is a language model that's fine-tuned
[18:15.240 -- 18:16.480] to follow instructions.
[18:16.480 -- 18:19.000] And it's in fact the one that you can play with
[18:19.000 -- 18:23.280] if you go to the OpenAI website, you get a big text box
[18:23.280 -- 18:25.920] and you can write some text and then press the button
[18:25.920 -- 18:27.680] to generate a completion.
[18:27.680 -- 18:30.240] So the idea here was, I mean, language models
[18:30.240 -- 18:33.800] are pretty useful and you can sometimes get them
[18:33.800 -- 18:36.160] to do what you want by prompting them just right.
[18:36.160 -- 18:39.880] This idea of few-shot prompting has become pretty popular
[18:39.880 -- 18:41.560] where you give a few examples,
[18:41.560 -- 18:44.200] like a few question and answer examples.
[18:44.200 -- 18:45.720] And then if you ask another question,
[18:45.720 -- 18:48.520] it'll hopefully provide an answer in the same style.
[18:48.520 -- 18:51.600] So the idea, yeah, so you can get language models
[18:51.600 -- 18:53.240] to do great things with prompting,
[18:53.240 -- 18:55.240] but prompting is itself an art
[18:55.240 -- 18:56.480] and it's tricky to get right.
[18:56.480 -- 18:59.040] And it's also kind of not necessarily getting
[18:59.040 -- 19:01.600] the best possible performance out of the model.
[19:01.600 -- 19:03.120] If you just take a raw language model
[19:03.120 -- 19:06.000] and you try to talk to it, like you ask it a question,
[19:06.000 -- 19:08.840] it probably, it doesn't know that it should actually answer
[19:08.840 -- 19:10.560] that question as well as possible.
[19:10.560 -- 19:13.840] It, for all it knows, you want it to give a joke answer
[19:13.840 -- 19:15.320] or a riddle or something.
[19:15.320 -- 19:17.840] Yeah, so the idea of InstructGPT was,
[19:17.840 -- 19:21.120] let's make a kind of small change to our language models
[19:21.120 -- 19:22.880] so that they're much easier to use.
[19:22.880 -- 19:25.360] In particular, we're gonna train them to,
[19:25.360 -- 19:29.440] if you have a piece of text where there's an instruction,
[19:29.440 -- 19:32.840] the model will try to follow that instruction
[19:32.840 -- 19:34.120] to the best of its abilities.
[19:34.120 -- 19:36.480] And pretty much anything can be an instruction.
[19:36.480 -- 19:38.760] Like the instruction can be
[19:38.760 -- 19:43.760] to continue a chat or it can be to summarize this text
[19:44.400 -- 19:48.740] or give me a list of names for my company
[19:48.740 -- 19:50.240] that sells widgets.
[19:50.240 -- 19:51.680] Yeah, instructions can be anything
[19:51.680 -- 19:54.960] and that makes this kind of model very powerful.
[19:54.960 -- 19:56.000] So that's the idea
[19:56.000 -- 19:58.120] of an instruction-following model.
[19:58.120 -- 19:59.760] It's like a model that can do anything
[19:59.760 -- 20:01.460] that you specify with an instruction.
[20:01.460 -- 20:04.000] And by the way, I wasn't a core contributor to this work.
[20:04.000 -- 20:09.000] I was more involved with like getting the RL infrastructure
[20:09.360 -- 20:12.280] and some of the RL training details,
[20:12.280 -- 20:14.440] like helping out with that stuff.
[20:14.440 -- 20:16.840] But anyway, yeah, what we did in this project was
[20:16.840 -- 20:20.620] we ran this whole methodology that I just described
[20:20.620 -- 20:23.160] of RL from human preferences
[20:23.160 -- 20:24.900] in this instruction following setting.
[20:24.900 -- 20:28.080] So we did supervised fine tuning,
[20:28.080 -- 20:30.840] collected preference data, train a reward model
[20:30.840 -- 20:33.800] and then did RL against that reward model.
[20:33.800 -- 20:36.240] And one interesting detail is that
[20:36.240 -- 20:40.080] whereas the original data was just collected
[20:40.080 -- 20:41.840] using contractors,
[20:41.840 -- 20:46.840] at a certain point we had the API,
[20:47.040 -- 20:50.520] and we have this Playground on the website,
[20:50.520 -- 20:52.800] the big text box
[20:52.800 -- 20:54.800] where you can use the model.
[20:54.800 -- 20:57.200] So we took prompts
[20:57.200 -- 20:59.680] that users had put into the Playground
[20:59.680 -- 21:01.280] and used those for training,
[21:01.280 -- 21:04.680] like both to collect preference data and to do RL.
[21:04.680 -- 21:07.040] And this is
[21:07.040 -- 21:10.760] disclosed to users pretty prominently.
[21:10.760 -- 21:13.040] Like when people are using the Playground,
[21:13.040 -- 21:15.520] you get notified that your prompts might be used
[21:15.520 -- 21:16.480] for training.
[21:16.480 -- 21:19.120] And we're also careful to train in such a way
[21:19.120 -- 21:20.860] that we don't memorize any information
[21:20.860 -- 21:23.080] that was in the prompts.
[21:23.080 -- 21:24.760] And explicitly,
[21:24.760 -- 21:27.480] we have a pretty elaborate process
[21:27.480 -- 21:30.680] for making sure there's no private information
[21:30.680 -- 21:32.840] being leaked into the model.
[21:32.840 -- 21:36.960] But anyway, yeah, that's basically the experimental setup.
[21:36.960 -- 21:39.680] And the result was that
[21:39.680 -- 21:42.060] this methodology works quite well.
[21:42.060 -- 21:44.480] And you get a model that's vastly preferred
[21:44.480 -- 21:48.820] to the base model on this distribution of realistic prompts
[21:48.820 -- 21:50.880] that people are giving the model,
[21:50.880 -- 21:53.040] often which contain instructions.
[21:53.040 -- 21:56.040] So the raw, like the raw language models
[21:56.040 -- 21:58.760] generally do a really bad job following instructions.
[21:58.760 -- 22:02.920] But this RL trained instruction following model
[22:02.920 -- 22:04.120] is a lot better.
[22:04.120 -- 22:06.440] And if you just calculate
[22:06.440 -- 22:08.220] how much better it is,
[22:08.220 -- 22:09.200] it's something like
[22:09.200 -- 22:11.800] as good as a model that's a hundred times bigger.
[22:11.800 -- 22:13.200] That's a lot.
[22:13.200 -- 22:14.040] Yeah.
[22:14.040 -- 22:15.280] You wanted the model to be truthful.
[22:15.280 -- 22:17.640] Is that one of the criteria you wanted?
[22:17.640 -- 22:20.000] Yeah, truthfulness was one of the criteria.
[22:20.000 -- 22:22.200] That seems amazing to me that truthfulness
[22:22.200 -- 22:24.080] is something that I could learn by example.
[22:24.080 -- 22:26.480] Like does that mean that truthfulness is somehow
[22:26.480 -- 22:28.000] represented inside the network
[22:28.000 -- 22:31.240] or because there's no external way for the model to confirm
[22:31.240 -- 22:32.720] whether something is true or false?
[22:32.720 -- 22:35.440] So how might it know what is true
[22:35.440 -- 22:37.480] without any external reference?
[22:37.480 -- 22:38.960] I think to some extent,
[22:38.960 -- 22:42.420] there is some internal representation of truthfulness.
[22:42.420 -- 22:43.260] So I would say,
[22:43.260 -- 22:45.340] like one way to think about what language models do
[22:45.340 -- 22:48.200] is they're trained to imitate the whole internet.
[22:48.200 -- 22:50.520] And the internet is written by lots of different people
[22:50.520 -- 22:52.520] and has lots of different types of content
[22:52.520 -- 22:57.200] from fiction to nonfiction to like technical,
[22:57.200 -- 23:00.600] like detailed technical literature to like jokes
[23:00.600 -- 23:03.400] and like forum posts, whatever.
[23:03.400 -- 23:07.260] So the model is basically an ensemble of all these people
[23:07.260 -- 23:08.880] who wrote stuff on the internet,
[23:08.880 -- 23:11.000] the raw pre-trained model.
[23:11.000 -- 23:13.080] When you feed it a prompt,
[23:13.080 -- 23:15.580] what it's doing internally has to be something like
[23:15.580 -- 23:18.200] figuring out who wrote this prompt
[23:18.200 -- 23:20.020] and then trying to continue in that style.
[23:20.020 -- 23:21.880] So if it thinks it's reading,
[23:21.880 -- 23:26.180] just reading something on the Wall Street Bets Reddit,
[23:26.180 -- 23:28.440] it's gonna continue on that style.
[23:28.440 -- 23:30.640] But if it thinks it's in the New York Times,
[23:30.640 -- 23:33.320] it's gonna write in a very different way.
[23:33.320 -- 23:38.280] So effectively, the model must be calculating somewhere,
[23:38.280 -- 23:40.800] like what style is this or what ensemble,
[23:40.800 -- 23:43.900] what's the narrower ensemble of styles
[23:43.900 -- 23:46.400] that I'm trying to imitate now.
[23:46.400 -- 23:48.400] At the very least,
[23:48.400 -- 23:51.080] when you do training, either supervised fine-tuning
[23:51.080 -- 23:52.840] or RL from human feedback,
[23:52.840 -- 23:55.600] you can at least narrow down the set of styles
[23:55.600 -- 23:59.500] the model is producing and try to imitate the best,
[23:59.500 -- 24:02.680] the best person in the training set
[24:02.680 -- 24:04.300] or the best style in the training set.
[24:04.300 -- 24:06.480] And obviously best will differ a lot.
[24:06.480 -- 24:09.540] So what we'll end up with will depend on our instructions.
[24:09.540 -- 24:12.520] So, for example,
[24:12.520 -- 24:15.080] we might end up with something that's kind of safe,
[24:15.080 -- 24:19.000] not too controversial,
[24:19.000 -- 24:21.160] but a bit corporate;
[24:21.160 -- 24:23.240] we'll end up with something like that
[24:23.240 -- 24:25.680] depending on what our instructions are.
[24:25.680 -- 24:27.320] So at the very least,
[24:27.320 -- 24:29.880] like we can kind of narrow in on one style
[24:29.880 -- 24:32.160] instead of having the whole distribution
[24:32.160 -- 24:33.320] of styles on the internet.
[24:33.320 -- 24:35.780] I think probably there's more to it than that.
[24:35.780 -- 24:38.140] Like we're not just learning about style,
[24:38.140 -- 24:40.580] but the model probably is like internally
[24:40.580 -- 24:42.220] trying to determine if things are,
[24:42.220 -- 24:44.000] if statements are true or not,
[24:44.000 -- 24:47.320] like if the prompt contains incorrect information,
[24:47.320 -- 24:48.980] because that probably would be useful
[24:48.980 -- 24:51.560] for determining a likely completion.
[24:51.560 -- 24:53.340] I'm just talking about the raw pre-trained model.
[24:53.340 -- 24:54.520] So I think, yeah,
[24:54.520 -- 24:58.180] I think just the objective of predicting next tokens
[24:58.180 -- 24:59.520] probably gives you a lot.
[24:59.520 -- 25:02.120] It forces the model to like to determine
[25:02.120 -- 25:03.680] if things are true or not.
[25:03.680 -- 25:05.880] I think for RL fine tuning,
[25:05.880 -- 25:07.560] there's a lot more potential for the model
[25:07.560 -- 25:11.900] to actually like try to output something truthful
[25:11.900 -- 25:14.240] as opposed to trying to imitate a certain style.
[25:14.240 -- 25:16.120] Though it's hard to,
[25:16.120 -- 25:18.520] I guess it would be hard to like determine
[25:18.520 -- 25:21.400] if that's what the model is actually trying to do.
[25:21.400 -- 25:24.240] So it's almost like the prompt is guiding the model.
[25:24.240 -- 25:26.720] It's like, what corner of the internet
[25:26.720 -- 25:28.320] do we want to imitate here?
[25:28.320 -- 25:31.240] And maybe with InstructGPT we want it
[25:31.240 -- 25:33.520] to focus more on the more truthful corners
[25:33.520 -- 25:35.800] of the internet, or something similar to that.
[25:35.800 -- 25:36.880] Yeah, I would hope so.
[25:36.880 -- 25:38.680] At least I think that's a pretty good,
[25:38.680 -- 25:41.360] though maybe a little simplistic picture of what's going on.
[25:41.360 -- 25:42.200] At the very least,
[25:42.200 -- 25:44.920] we should be able to imitate the most truthful corner
[25:44.920 -- 25:45.760] of the internet.
[25:45.760 -- 25:47.760] So can you talk about a generalization
[25:47.760 -- 25:52.360] and how does this type of model perform out of distribution?
[25:52.360 -- 25:54.080] Like, I guess if it sees questions
[25:54.080 -- 25:56.480] that are a bit different than what it was trained on,
[25:56.480 -- 25:58.040] what happens if we get a little bit away
[25:58.040 -- 26:00.560] from the training data with the reward models?
[26:00.560 -- 26:02.320] I mean, language models in general,
[26:02.320 -- 26:03.840] generalize surprisingly well.
[26:03.840 -- 26:05.400] And I would say overall,
[26:05.400 -- 26:07.600] like these pre-trained models that are trained
[26:07.600 -- 26:09.760] on super diverse data sets from the internet,
[26:09.760 -- 26:12.920] they tend to generalize quite well, or surprisingly well,
[26:12.920 -- 26:15.200] at least it's surprising to those of us
[26:15.200 -- 26:19.000] who were around for the earlier days of machine learning
[26:19.000 -- 26:22.800] when everything was trained from scratch and very fragile.
[26:22.800 -- 26:25.640] For example, if you provide an instruction
[26:25.640 -- 26:29.280] in some other language, even a fairly rare language,
[26:29.280 -- 26:32.360] it'll often do a decent job following the instruction,
[26:32.360 -- 26:35.840] even if there's zero data in the whole instruction-following
[26:35.840 -- 26:39.360] training process that's in that language.
[26:39.360 -- 26:41.840] And that just carries over from the pre-training.
[26:41.840 -- 26:43.960] So I think generalization,
[26:43.960 -- 26:46.080] yeah, I think language models generalize quite well.
[26:46.080 -- 26:47.880] So you asked about reward models.
[26:47.880 -- 26:50.840] I think one of the tricky pieces about RL
[26:50.840 -- 26:52.400] from human feedback is that
[26:52.400 -- 26:53.880] you have this reward model
[26:53.880 -- 26:55.480] and you're actually training against it,
[26:55.480 -- 26:57.880] meaning you're training your policy to have high reward
[26:57.880 -- 27:01.200] and it's going to exploit the errors in the reward model.
[27:01.200 -- 27:04.280] So it's gonna eventually find adversarial examples
[27:04.280 -- 27:05.200] to the reward model.
[27:05.200 -- 27:07.200] This is worse than kind of normal
[27:07.200 -- 27:08.640] out of distribution behavior.
[27:08.640 -- 27:11.480] It's like targeted out of distribution examples.
[27:11.480 -- 27:13.800] So there are definitely some challenges
[27:13.800 -- 27:17.400] around getting reward models to generalize well
[27:17.400 -- 27:20.960] or generalize as far as possible from the training set.
[27:20.960 -- 27:22.760] Can these types of agents tell us
[27:22.760 -- 27:26.240] when they don't know something or is that a hard problem?
[27:26.240 -- 27:28.800] I'd say sort of, if you ask a question
[27:28.800 -- 27:31.480] that's kind of in the core of the model's knowledge,
[27:31.480 -- 27:34.160] it will know the answer and it'll know that it knows.
[27:34.160 -- 27:35.640] By the way, I'm talking about models
[27:35.640 -- 27:37.240] like for the instruct model.
[27:37.240 -- 27:40.360] If you ask it about something that's like very simple
[27:40.360 -- 27:42.160] at the core of its knowledge,
[27:42.160 -- 27:44.160] it'll know if you, there are certain things
[27:44.160 -- 27:45.920] that it knows that it doesn't know,
[27:45.920 -- 27:49.240] like current events where it's been trained
[27:49.240 -- 27:52.840] to know that it doesn't know certain things in real time.
[27:52.840 -- 27:55.000] But if you ask it about something
[27:55.000 -- 27:56.760] that's kind of on the edge of its knowledge,
[27:56.760 -- 27:59.480] it's gonna have a hard time.
[27:59.480 -- 28:01.640] It's necessarily gonna be inaccurate.
[28:01.640 -- 28:03.920] I mean, there have been a couple of papers
[28:03.920 -- 28:04.880] about this question.
[28:04.880 -- 28:08.080] So there was a paper from Anthropic recently
[28:08.080 -- 28:09.360] called Language Models
[28:09.360 -- 28:10.920] (Mostly) Know What They Know.
[28:10.920 -- 28:15.120] And there's also a paper from FHI and OpenAI
[28:15.120 -- 28:17.680] called Getting Language Models
[28:17.680 -- 28:20.080] to Express Their Uncertainty in Words.
[28:20.080 -- 28:22.000] These language models,
[28:22.000 -- 28:24.160] as well as a lot of other models in machine learning
[28:24.160 -- 28:26.560] are trained to maximize likelihood.
[28:26.560 -- 28:28.680] So maximize log-prob of data.
[28:28.680 -- 28:29.920] You're already training them
[28:29.920 -- 28:32.480] to always predict a distribution of outputs.
[28:32.480 -- 28:35.440] So for language models, given a prefix,
[28:35.440 -- 28:38.920] it's predicting a distribution over the next token.
[28:38.920 -- 28:41.760] These predictions for the next token
[28:41.760 -- 28:44.720] generally are pretty well calibrated.
[28:44.720 -- 28:47.680] If it puts 80% probability on something,
[28:47.680 -- 28:49.160] and you look at all the times
[28:49.160 -- 28:51.920] when it puts 80% probability on something,
[28:51.920 -- 28:54.080] it's right 80% of the time.
[28:54.080 -- 28:56.400] That's just a result of the training objective.
[28:56.400 -- 28:59.960] The training objective strongly incentivizes the model
[28:59.960 -- 29:01.400] to be calibrated,
[29:01.400 -- 29:05.320] meaning it has a reasonable estimate of its uncertainty.
[29:05.320 -- 29:07.240] So at the single token level,
[29:07.240 -- 29:08.960] models definitely are calibrated.
[29:08.960 -- 29:10.880] The question is
[29:10.880 -- 29:14.680] whether this calibration extends to settings
[29:14.680 -- 29:18.000] where they are generating multi-token outputs,
[29:18.000 -- 29:20.360] or whether they can like judge the correctness
[29:20.360 -- 29:22.000] of some multi-token statement.
[29:22.000 -- 29:25.000] So I would say since models are calibrated
[29:25.000 -- 29:26.600] at the single token level,
[29:26.600 -- 29:29.640] I think that they definitely have the information
[29:29.640 -- 29:32.840] to be calibrated in these other settings.
[29:32.840 -- 29:35.960] So that's why I think the problem of models
[29:35.960 -- 29:38.640] knowing what they know isn't actually that hard,
[29:38.640 -- 29:42.240] or at least getting a model to express its uncertainty
[29:42.240 -- 29:44.080] pretty much as well as a human does,
[29:44.080 -- 29:46.560] doesn't feel like an insurmountable problem,
[29:46.560 -- 29:48.360] but there are some practical difficulties
[29:48.360 -- 29:50.120] to getting there.
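To illustrate the single-token calibration John describes above, here is a minimal sketch that bins a model's next-token confidence and compares it to empirical accuracy; the `probs` and `correct` arrays are assumed to come from running a model over some held-out text, and the names are illustrative.

```python
import numpy as np

def calibration_report(probs, correct, n_bins=10):
    # probs: model's probability for the token it predicted at each position.
    # correct: 1 if that prediction matched the actual next token, else 0.
    probs, correct = np.asarray(probs, float), np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            # A well-calibrated model has mean confidence ~= accuracy per bin.
            print(f"[{lo:.1f}, {hi:.1f}): confidence {probs[mask].mean():.2f}, "
                  f"accuracy {correct[mask].mean():.2f}, n={int(mask.sum())}")
```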
[29:50.120 -- 29:52.720] People use the phrase AI alignment in different ways.
[29:52.720 -- 29:54.440] Can you talk about how you see alignment
[29:54.440 -- 29:57.680] in your work on RL from human feedback?
[29:57.680 -- 29:59.720] I think of alignment mostly as the problem
[29:59.720 -- 30:03.560] of getting the model to try to do the right thing.
[30:03.560 -- 30:05.000] So we can kind of make a distinction
[30:05.000 -- 30:08.240] between what the model is capable of doing.
[30:08.240 -- 30:10.200] Like if you just take a raw language model
[30:10.200 -- 30:13.240] and you ask it a question, like I said before,
[30:13.240 -- 30:14.720] it doesn't know that you actually wanted
[30:14.720 -- 30:17.120] to give the correct answer as opposed to,
[30:17.120 -- 30:20.160] it might think someone who's not very knowledgeable
[30:20.160 -- 30:21.000] is answering.
[30:21.000 -- 30:22.480] By doing some extra training,
[30:22.480 -- 30:24.800] we can get the model to actually try to do the right thing.
[30:24.800 -- 30:28.680] And so I would say that that's the main goal of alignment.
[30:28.680 -- 30:31.720] So there was an OpenAI blog post recently
[30:31.720 -- 30:34.560] that talked about the sequence in alignment.
[30:34.560 -- 30:38.800] One was training AI systems using human feedback,
[30:38.800 -- 30:42.800] two, training AI systems to assist human evaluation,
[30:42.800 -- 30:46.440] and three, training AI systems to do alignment research.
[30:46.440 -- 30:50.200] So is your current work mostly about this first item
[30:50.200 -- 30:51.800] and when and how do you see us
[30:51.800 -- 30:53.440] getting to these other stages?
[30:53.440 -- 30:56.240] I'm doing some work now on number two,
[30:56.240 -- 30:58.520] training AI systems to assist human feedback.
[30:58.520 -- 31:01.760] I think that sort of becomes increasingly necessary
[31:01.760 -- 31:05.120] as you start trying to get the systems
[31:05.120 -- 31:06.840] to solve harder and harder problems.
[31:06.840 -- 31:09.520] When you have models that are kind of very below human level
[31:09.520 -- 31:12.000] or maybe at human level at a certain task,
[31:12.000 -- 31:15.080] it's pretty straightforward to supervise them.
[31:15.080 -- 31:17.200] But once they're doing things that are very hard
[31:17.200 -- 31:19.480] or doing things that require a lot
[31:19.480 -- 31:21.960] of diverse technical knowledge,
[31:21.960 -- 31:24.480] it becomes pretty hard to provide
[31:24.480 -- 31:26.560] a useful supervision signal.
[31:26.560 -- 31:29.280] So we have to start doing things like one model
[31:29.280 -- 31:31.680] writes an answer to a question
[31:31.680 -- 31:35.320] and then another model provides a critique of that answer,
[31:35.320 -- 31:36.680] points out some flaws,
[31:36.680 -- 31:38.880] and then the human only has to judge
[31:38.880 -- 31:43.120] the first answer after looking at the critique,
[31:43.120 -- 31:45.440] meaning basically the critique helps the human
[31:45.440 -- 31:46.520] assess the answer.
[31:46.520 -- 31:48.840] So I think that kind of idea
[31:48.840 -- 31:51.000] is starting to become pretty relevant.
[31:51.000 -- 31:53.560] Colleagues and I are exploring that kind of idea now.
[31:53.560 -- 31:55.520] As for assisting alignment research,
[31:55.520 -- 31:56.960] there's some other work at OpenAI
[31:56.960 -- 31:58.600] that's starting to explore this.
[31:58.600 -- 32:02.040] It's also, that's sort of the furthest down the road.
[32:02.040 -- 32:05.080] So I saw Stuart Russell was on your PhD committee
[32:05.080 -- 32:07.680] and I really enjoyed his book, Human Compatible.
[32:07.680 -- 32:10.200] I wonder if you share the idea mentioned in the book
[32:10.200 -- 32:11.880] that the standard RL framing
[32:11.880 -- 32:14.760] with this fixed reward signal is problematic
[32:14.760 -- 32:16.360] and that agents, powerful agents,
[32:16.360 -- 32:18.960] should try to do what we want
[32:18.960 -- 32:21.880] and maintain some uncertainty about what it is we want
[32:21.880 -- 32:26.120] and the agents that are too certain will be problematic.
[32:26.120 -- 32:28.320] Do you have any thoughts on that idea?
[32:28.320 -- 32:31.560] Yeah, I totally agree with that idea.
[32:31.560 -- 32:34.120] So I think first it's really hard to write down
[32:34.120 -- 32:37.560] a simple reward function that actually captures
[32:37.560 -- 32:41.080] what we want or what any particular person wants.
[32:41.080 -- 32:43.720] I can say I want a little more of this
[32:43.720 -- 32:44.880] or a little more of that,
[32:44.880 -- 32:47.760] but you wouldn't want to take that to the extreme.
[32:47.760 -- 32:52.600] If we build agents that try to cater to our wishes,
[32:52.600 -- 32:55.200] we should make sure
[32:55.200 -- 32:58.240] they have a lot of uncertainty
[32:58.240 -- 33:00.080] about what we want or what we value.
[33:00.080 -- 33:03.480] And that'll also cause them to be a little more cautious
[33:03.480 -- 33:07.600] and say, not disturb anything that might be important to us.
[33:07.600 -- 33:10.600] So yeah, I agree with that.
[33:10.600 -- 33:13.360] Like Stuart Russell gave a very good
[33:13.360 -- 33:17.040] problem definition of what we want AI to do.
[33:17.040 -- 33:18.440] Basically,
[33:18.440 -- 33:21.040] we want to jointly play this game
[33:21.040 -- 33:23.760] where AI is trying to figure out what we want
[33:23.760 -- 33:24.840] and then trying to do that.
[33:24.840 -- 33:27.600] But simultaneously maintaining some uncertainty
[33:27.600 -- 33:28.640] about what we want.
[33:28.640 -- 33:30.560] I would say if you start to look
[33:30.560 -- 33:31.920] at how to get that in practice,
[33:31.920 -- 33:34.400] it actually looks quite a bit like the kind of RL
[33:34.400 -- 33:37.920] from human feedback that we're working on at OpenAI
[33:37.920 -- 33:41.280] and others are working on at other places.
[33:41.280 -- 33:44.720] I think, yeah, I see what we're doing
[33:44.720 -- 33:47.320] as a practical implementation
[33:47.320 -- 33:50.720] of getting towards this behavior that Russell described.
[33:50.720 -- 33:53.160] Do you think of AGI as an abstract goal
[33:53.160 -- 33:55.560] or are we gonna see a model come out one day
[33:55.560 -- 33:58.040] and people are gonna say, oh, that's the first AGI model?
[33:58.040 -- 34:01.640] Like, what does it have to do for people to say that?
[34:01.640 -- 34:04.920] I think people will say that many times
[34:04.920 -- 34:07.200] then realize that it doesn't quite do everything
[34:07.200 -- 34:08.080] that you want.
[34:08.080 -- 34:10.600] I think we're gonna have a long series
[34:10.600 -- 34:14.320] of models that are superhuman at most things
[34:14.320 -- 34:16.640] or at a certain class of things,
[34:16.640 -- 34:20.840] but they also have some failure modes and weaknesses.
[34:20.840 -- 34:24.640] Like I expect us to see multiple models
[34:24.640 -- 34:26.600] that are proclaimed as AGI
[34:26.600 -- 34:30.360] and then only after interacting with it a while,
[34:30.360 -- 34:33.880] do you realize it's not quite there.
[34:33.880 -- 34:35.520] What would you say is the relationship
[34:35.520 -- 34:39.760] between AGI and RL and AGI and these large language models?
[34:39.760 -- 34:41.680] How do those concepts fit together?
[34:41.680 -- 34:46.680] I'd say that RL is a useful component of training AGI
[34:47.160 -- 34:49.240] or an almost essential component.
[34:49.240 -- 34:52.440] The thing RL lets you do is it lets you optimize
[34:52.440 -- 34:54.960] any objective for the agents,
[34:54.960 -- 34:59.280] any objective that is a function of the agent's behavior.
[34:59.280 -- 35:03.720] So with pre-training, like what we do for language models,
[35:03.720 -- 35:05.760] you're kind of choosing an objective
[35:05.760 -- 35:09.400] that lets us do something with all the training data
[35:09.400 -- 35:11.720] we have, which is all this internet text.
[35:11.720 -- 35:14.200] So we choose this maximum likelihood objective,
[35:14.200 -- 35:17.000] which is basically the only, or not the only thing,
[35:17.000 -- 35:20.200] but it's like a sensible way to absorb all this knowledge.
[35:20.200 -- 35:24.040] But then if we really want to optimize the agent's behavior
[35:24.040 -- 35:25.440] for a specific objective,
[35:25.440 -- 35:29.040] RL is kind of the only framework that lets you do that.
[35:29.960 -- 35:32.240] Okay, John, we have a few questions from the audience
[35:32.240 -- 35:33.280] and I'm just going to pick the two
[35:33.280 -- 35:36.240] that have the highest score in terms of Twitter likes.
[35:36.240 -- 35:40.760] So the first is from Eric Jang, VP of AI at Halodi Robotics.
[35:40.760 -- 35:43.360] He asked, RL distributions are non-stationary,
[35:43.360 -- 35:46.080] making it hard to reason about PPO losses
[35:46.080 -- 35:48.520] and how that relates to return or generalization.
[35:48.520 -- 35:51.000] Are there any intermediate plots and visualizations
[35:51.000 -- 35:53.120] you'd like to generate to debug
[35:53.120 -- 35:56.200] or incrementally build up a large scale RL system?
[35:56.200 -- 35:59.760] Yeah, there are definitely some stats that I look at.
[35:59.760 -- 36:02.640] So I'll talk about this
[36:02.640 -- 36:07.640] in the nuts and bolts reboot later this year,
[36:07.760 -- 36:12.760] but I'd say things like looking at the explained variance
[36:12.800 -- 36:15.320] of the value function, looking at
[36:15.320 -- 36:18.120] how many samples are getting clipped in PPO,
[36:18.120 -- 36:23.120] and what the KL divergence between the policy before
[36:23.120 -- 36:25.680] and after the update is, yeah, things like that.
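As a rough illustration of the kind of stats mentioned here, a minimal sketch assuming PyTorch-style tensors; the function and variable names (ppo_diagnostics, values_pred, logp_old, and so on) are made up for this example rather than taken from any real codebase.

```python
import torch

def ppo_diagnostics(values_pred, returns, logp_new, logp_old, clip_eps=0.2):
    """Illustrative PPO health stats (not from any particular codebase).

    values_pred:       value-function predictions for the sampled states
    returns:           empirical returns (regression targets for the value function)
    logp_new/logp_old: log-probs of the sampled actions under the updated
                       policy and the data-collecting policy
    """
    # Explained variance of the value function: 1 is perfect, 0 means the
    # value function is no better than predicting the mean return.
    explained_var = 1.0 - (returns - values_pred).var() / (returns.var() + 1e-8)

    # Fraction of samples whose probability ratio got clipped by PPO.
    ratio = (logp_new - logp_old).exp()
    clip_frac = ((ratio - 1.0).abs() > clip_eps).float().mean()

    # Simple sample-based estimate of the KL between old and new policies.
    approx_kl = (logp_old - logp_new).mean()

    return {
        "explained_variance": explained_var.item(),
        "clip_fraction": clip_frac.item(),
        "approx_kl": approx_kl.item(),
    }
```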
[36:25.680 -- 36:30.640] And then Ethan Caballero from Mila asks,
[36:30.640 -- 36:33.760] what is your median estimate for the arrival date of AGI?
[36:33.760 -- 36:37.440] I think not too far away, but like I said,
[36:37.440 -- 36:39.480] I expect there to be a lot of false starts.
[36:39.480 -- 36:44.360] I would say I expect AI to be able to do
[36:44.360 -- 36:46.520] a better job than humans at most jobs
[36:46.520 -- 36:49.040] that humans do now, in five years or so.
[36:49.040 -- 36:51.040] That's not all jobs, but most jobs.
[36:51.040 -- 36:52.680] For a while, we're gonna discover things
[36:52.680 -- 36:54.080] that AI is very good at
[36:54.080 -- 36:56.440] and where we wanna keep humans in control.
[36:56.440 -- 36:59.440] So I think there'll be some kind of gradual process
[36:59.440 -- 37:01.240] over the next 10 or 15 years.
[37:01.240 -- 37:02.440] I've been curious about this.
[37:02.440 -- 37:05.160] I see that some RL work is patented,
[37:05.160 -- 37:08.800] but I could not find patents
[37:08.800 -- 37:10.160] on TRPO or PPO.
[37:10.160 -- 37:13.760] Are those patent protected at all?
[37:13.760 -- 37:18.320] Or how do you think of intellectual property protection
[37:18.320 -- 37:19.280] for that kind of work?
[37:19.280 -- 37:22.120] I haven't ever looked into patenting anything
[37:22.120 -- 37:25.080] and OpenAI hasn't either as far as I know.
[37:25.080 -- 37:26.960] I think the trend over time has been
[37:26.960 -- 37:29.600] for people to take patents
[37:29.600 -- 37:31.920] on machine learning algorithms less seriously.
[37:31.920 -- 37:34.520] There's this algorithm in computer vision called SIFT,
[37:34.520 -- 37:36.960] which is like this keypoint detector.
[37:36.960 -- 37:38.960] And this was patented.
[37:38.960 -- 37:42.080] I think the guy who patented it,
[37:42.080 -- 37:44.680] he probably made his university some money from the patent,
[37:44.680 -- 37:48.160] but in the end, all it did was cause people
[37:48.160 -- 37:52.080] a lot of annoyance because people had to come up
[37:52.080 -- 37:56.280] with alternative algorithms that had a different acronym
[37:56.280 -- 37:58.240] and weren't patented.
[37:58.240 -- 38:02.920] So the OpenCV open source library
[38:02.920 -- 38:05.400] had to be careful about putting this algorithm
[38:05.400 -- 38:07.960] in their library because of the patent risks.
[38:07.960 -- 38:11.960] So I think
[38:11.960 -- 38:13.920] these patent rights aren't exercised that much.
[38:13.920 -- 38:17.080] And I think big companies like Google will patent
[38:17.080 -- 38:19.280] a lot of stuff for defensive reasons.
[38:19.280 -- 38:22.040] So if they get in some big legal dispute
[38:22.040 -- 38:24.360] with another company, it can be used
[38:24.360 -- 38:26.520] as like one of the bargaining chips.
[38:26.520 -- 38:30.440] But I don't think anyone's gonna get sued
[38:30.440 -- 38:35.320] for not paying royalties
[38:35.320 -- 38:36.960] for the use of some algorithm.
[38:36.960 -- 38:40.080] Okay, and then there's been a ton of work in RL, of course,
[38:40.080 -- 38:43.560] since you first published TRPO and PPO.
[38:43.560 -- 38:45.200] But from your point of view,
[38:45.200 -- 38:46.440] if you had to pick a few highlights
[38:46.440 -- 38:50.360] in terms of a few important milestones in RL algorithms
[38:50.360 -- 38:51.600] since PPO came out,
[38:53.120 -- 38:55.080] and by the way, it's amazing that in 2022,
[38:55.080 -- 38:56.400] we're still using PPO,
[38:57.520 -- 39:01.000] I think quite similar to its original form.
[39:01.000 -- 39:01.840] Is that right?
[39:02.920 -- 39:03.920] Yeah, pretty much.
[39:03.920 -- 39:06.880] Yeah, so what would you say are the biggest
[39:06.880 -- 39:09.680] highlights for you in terms of RL algorithm
[39:09.680 -- 39:11.640] since you did PPO?
[39:11.640 -- 39:13.440] Yeah, there's definitely been some interesting stuff.
[39:13.440 -- 39:16.480] So I think a little after PPO,
[39:16.480 -- 39:19.120] there were TD3 and SAC,
[39:19.120 -- 39:23.000] and those seem like pretty solid value-based methods.
[39:23.000 -- 39:25.320] That was one development that was interesting.
[39:25.320 -- 39:27.840] I think, yeah, I thought MuZero
[39:27.840 -- 39:32.840] and its elaborations, like EfficientZero,
[39:32.840 -- 39:36.840] were also pretty impressive,
[39:36.840 -- 39:38.960] that you can get that good sample efficiency.
[39:38.960 -- 39:41.600] Both of the things I just mentioned were kind of,
[39:41.600 -- 39:45.000] well, I don't wanna say mostly on toy tasks or benchmarks
[39:45.000 -- 39:48.120] because yeah, I'm sure people are doing some real things
[39:48.120 -- 39:49.440] with these algorithms.
[39:49.440 -- 39:52.040] Yeah, so I think that stuff was interesting.
[39:52.040 -- 39:56.760] I think the whole recent
[39:56.760 -- 40:00.360] surge of interest in offline RL was also notable.
[40:00.360 -- 40:02.480] I would say the stuff we're doing
[40:02.480 -- 40:06.040] with RL from human feedback is a kind of offline RL
[40:06.040 -- 40:09.000] because we have a fixed dataset,
[40:09.000 -- 40:11.640] a fixed reward modeling dataset,
[40:11.640 -- 40:12.880] and we're training against that.
[40:12.880 -- 40:14.720] This is like offline RL,
[40:14.720 -- 40:15.960] but you're doing it in a different way.
[40:15.960 -- 40:19.640] You're using an on-policy algorithm with a reward model,
[40:19.640 -- 40:23.280] as opposed to a more typical way to do offline RL,
[40:23.280 -- 40:25.040] which would be to use an off-policy algorithm.
[40:25.040 -- 40:27.760] Would that work here or would that not work here?
[40:27.760 -- 40:30.160] What we're doing here is kind of like model-based RL
[40:30.160 -- 40:33.280] because the reward model is like a model
[40:33.280 -- 40:35.800] of the unknown part of the system.
[40:35.800 -- 40:38.920] So the unknown part of the system here
[40:38.920 -- 40:42.760] is the human rater, or yeah, the human.
[40:42.760 -- 40:46.880] It's not the part where your output gets appended to your list of tokens.
[40:46.880 -- 40:48.600] So this is kind of like the work
[40:48.600 -- 40:51.840] that takes a dynamics model of the environment
[40:51.840 -- 40:56.600] and just runs a policy gradient algorithm against it.
[40:56.600 -- 41:00.400] So the idea of running an online algorithm
[41:00.400 -- 41:03.720] against a model, that's kind of a well-established idea.
[41:03.720 -- 41:06.800] Though I would say the papers that previously did this,
[41:06.800 -- 41:08.520] they were in a pretty different regime.
[41:08.520 -- 41:11.200] We're in this regime of doing fairly small updates
[41:11.200 -- 41:14.600] to the policy because we have these awesome pre-trained models
[41:14.600 -- 41:19.000] and we don't need to actually change them that much.
[41:19.000 -- 41:21.520] So yeah, we use these online algorithms.
[41:21.520 -- 41:23.760] I'd say part of the reason why we can get away
[41:23.760 -- 41:28.000] with using just like an online algorithm
[41:28.000 -- 41:30.480] is because we've been just looking
[41:30.480 -- 41:32.480] at a contextual bandit problem.
[41:32.480 -- 41:35.080] Yeah, because we only have like one time step.
[41:35.080 -- 41:37.840] Like you get a query and you output a response
[41:37.840 -- 41:40.160] and then that response gets a reward.
[41:40.160 -- 41:43.120] So if we had a multi-step process,
[41:43.120 -- 41:48.120] such as a conversation where you can't assign a reward
[41:48.320 -- 41:50.280] until the very end of the conversation,
[41:50.280 -- 41:54.160] or you had some interaction
[41:54.160 -- 41:57.800] with some real-world system that's hard to simulate,
[41:57.800 -- 42:00.440] then it wouldn't be as straightforward;
[42:00.440 -- 42:03.760] you wouldn't be able to use exactly the same methodology.
[42:03.760 -- 42:08.360] You would probably have to train a Q function
[42:08.360 -- 42:10.600] or something like that.
[42:10.600 -- 42:13.080] If you want your method to be sample efficient,
[42:13.080 -- 42:15.640] you would probably have to do something slightly different.
[42:15.640 -- 42:19.120] I think we'll have to start exploring this
[42:19.120 -- 42:22.560] at some point soon, but so far we haven't,
[42:22.560 -- 42:27.480] at least I haven't seen any cases in the domain
[42:27.480 -- 42:29.680] I'm looking at that require this,
[42:29.680 -- 42:33.480] but I expect it to be relevant at some point.
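To make the contextual-bandit framing concrete, here is a toy sketch of training a policy against a reward model with a KL penalty toward the initial policy. The tabular policy, the reward_model stub, and all hyperparameters are invented for illustration; this is not how it is done at scale, and PPO would additionally clip the probability ratio rather than use the plain policy-gradient surrogate below.

```python
import torch
import torch.nn.functional as F

# Toy illustration (my own sketch): a "policy" that picks one of a few canned
# responses per query, treated as a contextual bandit. A frozen copy of the
# initial policy stands in for the pretrained model, and reward_model()
# stands in for a reward model trained on human preferences.
torch.manual_seed(0)
n_queries, n_responses = 4, 8
logits = torch.zeros(n_queries, n_responses, requires_grad=True)  # trainable policy
init_logits = logits.detach().clone()                             # frozen "pretrained" policy
optimizer = torch.optim.Adam([logits], lr=0.1)

def reward_model(query, response):
    # Placeholder for a learned model of the human rater.
    return float((query + response) % 3 == 0)

kl_coef = 0.1  # soft constraint keeping the policy near the pretrained one

for step in range(200):
    query = torch.randint(n_queries, (1,)).item()
    dist = torch.distributions.Categorical(logits=logits[query])
    response = dist.sample()

    # One time step: the sampled response immediately receives a reward.
    reward = reward_model(query, response.item())

    # KL(new || pretrained) for this query.
    kl = F.kl_div(
        F.log_softmax(init_logits[query], dim=-1),  # log-probs of pretrained policy
        F.softmax(logits[query], dim=-1),           # probs of current policy
        reduction="sum",
    )

    # Vanilla policy-gradient surrogate plus KL penalty.
    loss = -dist.log_prob(response) * reward + kl_coef * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```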
[42:33.480 -- 42:37.080] So we had Arvind Srinivas talking about decision transformer
[42:37.080 -- 42:39.360] on the show recently, that was a great episode.
[42:39.360 -- 42:41.360] And I see that you were also a co-author
[42:41.360 -- 42:43.920] on the 2016 RL squared paper.
[42:43.920 -- 42:46.680] I want to ask you about your thoughts on meta RL.
[42:46.680 -- 42:48.560] Arvind had some interesting things to say
[42:48.560 -- 42:50.640] about maybe the idea that a transformer
[42:50.640 -- 42:52.320] could kind of supersede the need
[42:52.320 -- 42:54.200] for an RL algorithm altogether.
[42:54.200 -- 42:56.200] What do you expect from meta RL?
[42:56.200 -- 42:58.600] Do you expect we'll still be using human-authored
[42:58.600 -- 43:00.600] RL algorithms in the future?
[43:00.600 -- 43:03.000] Yeah, that's a pretty bold statement,
[43:03.000 -- 43:05.400] that we won't need any RL algorithms anymore.
[43:05.400 -- 43:07.640] Yeah, since the RL squared paper,
[43:07.640 -- 43:10.920] people have been talking less about meta learning,
[43:10.920 -- 43:12.400] as far as I can tell,
[43:12.400 -- 43:15.760] actually because sequence modeling has gotten so good,
[43:15.760 -- 43:19.680] like transformer sequence models, that it's kind of clear
[43:19.680 -- 43:21.920] that meta learning is just a special case of learning.
[43:21.920 -- 43:26.560] It's just a certain kind of long-context learning,
[43:26.560 -- 43:28.720] learning involving long episodes.
[43:28.720 -- 43:31.120] And maybe it shouldn't be treated that differently
[43:31.120 -- 43:33.600] or addressed with special algorithms.
[43:33.600 -- 43:36.760] I would say, yeah, the ideas like decision transformer
[43:36.760 -- 43:37.880] are pretty interesting,
[43:37.880 -- 43:40.520] where you try to reduce RL to supervised learning.
[43:40.520 -- 43:43.800] It's still not certain exactly how these compare
[43:43.800 -- 43:47.320] in performance to RL; people have started to analyze
[43:47.320 -- 43:49.280] that empirically and theoretically.
[43:49.280 -- 43:53.320] And I would say in practice, sometimes it's better,
[43:53.320 -- 43:55.240] sometimes it's worse.
[43:55.240 -- 43:57.960] In my experience, it's been worse on the problems
[43:57.960 -- 44:01.920] where my colleagues and I have tested it.
[44:01.920 -- 44:05.480] But yeah, it's definitely an interesting direction.
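For readers who haven't seen decision transformer, here is a deliberately simplified sketch of the "reduce RL to supervised learning" idea: condition the model on a target return-to-go and predict logged actions with an ordinary supervised loss. The single-step MLP and all names here are my simplification; the actual method uses a transformer over whole trajectories.

```python
import torch
import torch.nn as nn

# Return-conditioned supervised learning in the decision-transformer spirit
# (simplified to one step; not the real architecture).
class ReturnConditionedPolicy(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden),  # +1 for the target return-to-go
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, return_to_go):
        x = torch.cat([obs, return_to_go.unsqueeze(-1)], dim=-1)
        return self.net(x)  # action logits

# Training is plain supervised learning on logged data: predict the action
# that was taken, given the observation and the return actually achieved.
policy = ReturnConditionedPolicy(obs_dim=4, n_actions=2)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
obs = torch.randn(32, 4)           # toy batch of logged observations
rtg = torch.rand(32)               # logged returns-to-go
actions = torch.randint(2, (32,))  # logged actions
loss = nn.functional.cross_entropy(policy(obs, rtg), actions)
opt.zero_grad(); loss.backward(); opt.step()
# At evaluation time you condition on a high target return instead.
```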
[44:05.480 -- 44:08.360] Dr. John Schulman, thank you so much for sharing your time
[44:08.360 -- 44:10.360] and your insight with the talk RL audience today.
[44:10.360 -- 44:11.480] Thanks so much.
