Outstanding Paper Award Winners - 1/2 @ RLC 2025
Talk RL
Speaker 2:So I'm here at RLC twenty twenty five in Edmonton at the University of Alberta. I'm with Alexander Goldie, who just won the outstanding paper award for scientific understanding in reinforcement learning. Congratulations, Alex.
Speaker 3:Thank you very much.
Speaker 2:So can you tell us a bit about this paper?
Speaker 3:Sure. So throughout the history of reinforcement learning, we've always relied on algorithms handcrafted by people. And the goal in meta-learning and algorithm discovery is that we take the human out of the loop and learn the algorithm from data instead. There's a bunch of work which does this, but the problem in general is that people don't think about how they should do the learning, just what they should learn. And so the real goal in this paper was comparing different ways of learning the algorithm.
Speaker 3:So, asking an LLM to propose something, or using evolution to train a neural network that replaces an algorithm in RL.
Speaker 2:Your paper title was "How Should We Meta-Learn Reinforcement Learning Algorithms?", first author Alexander Goldie et al. Congratulations again, Alex.
Speaker 3:Thank you very much.
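A toy, hedged sketch of the second approach Alex mentions: using evolution to learn a small network that replaces a hand-designed piece of an RL algorithm. The bandit task, the linear "update rule", and the simple pick-the-best evolution loop are all illustrative assumptions, not the paper's method or code.

```python
# Minimal sketch, assuming a toy 2-armed bandit: evolve the weights of a tiny
# "update rule" that replaces the hand-designed learning rule of the inner agent.
import numpy as np

rng = np.random.default_rng(0)

def inner_loop(theta, n_steps=150):
    """Run a bandit agent whose update rule is parameterised by theta.
    The learned rule maps (reward, prob of chosen arm, bias) -> logit increment."""
    arm_means = rng.uniform(0.0, 1.0, size=2)   # a fresh task for each inner run
    logits = np.zeros(2)
    total = 0.0
    for _ in range(n_steps):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(2, p=probs)
        r = rng.normal(arm_means[a], 0.1)
        total += r
        features = np.array([r, probs[a], 1.0])
        logits[a] += 0.1 * float(features @ theta)   # the "learned algorithm"
    return total / n_steps

def fitness(theta, n_tasks=5):
    """Meta-objective: average return of the inner agent across sampled tasks."""
    return np.mean([inner_loop(theta) for _ in range(n_tasks)])

# Outer loop: simple Gaussian-perturbation evolution over the update rule's weights.
theta = np.zeros(3)
for gen in range(20):
    population = theta + 0.1 * rng.normal(size=(16, 3))
    scores = np.array([fitness(p) for p in population])
    theta = population[scores.argmax()]              # keep the best candidate
    print(f"gen {gen:02d}  best fitness {scores.max():.3f}")
```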
Speaker 2:So I'm here with Ryan Sullivan, who just won the outstanding paper award on tooling, environments, and evaluation in reinforcement learning at RLC twenty twenty five. Ryan, can you tell us about this work?
Speaker 4:So we created this curriculum learning library, which we call Syllabus. It's a portable curriculum learning library. And the motivation behind it is that curriculum learning has been a pretty core component of a lot of the big breakthroughs in reinforcement learning, things like AlphaStar and OpenAI Five. But it's not used very commonly in academic benchmarks because it's very hard to work with. So what Syllabus does is it provides this sort of unified API for defining curriculum learning algorithms, it provides portable implementations of a lot of automatic curriculum learning methods, and most importantly, it provides this global synchronization infrastructure that makes it really easy to take these algorithms and apply them to any RL code base that you're already using.
Speaker 4:So it takes just a few lines of code to wrap around the environment, and you create a curriculum class. And you can apply curriculum learning to your code in any RL baseline or any RL code base, as long as you're using Python multiprocessing.
Speaker 2:Can you tell us a bit about some of the algorithms that it supports?
Speaker 4:Sure. So we have implementations of Prioritized Level Replay. There's a learning progress curriculum that was originally developed for Minecraft. Sampling for Learnability, which is a recent JAX-based algorithm, and OMNI, which sort of uses an LLM to filter out interesting tasks in the task space, based on the agent's current performance.
Speaker 2:Congratulations, Ryan.
Speaker 4:Thank you.
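A minimal sketch of the pattern Ryan describes: wrap the environment so a curriculum object picks each episode's task from feedback about the agent's returns. The names here (TaskCurriculum, CurriculumEnvWrapper, ToyEnv) and the learning-progress heuristic are illustrative assumptions, not Syllabus's actual API.

```python
# Sketch of a curriculum wrapper: the curriculum chooses a task at every reset
# and is updated with the episode return when the episode ends.
import numpy as np

class TaskCurriculum:
    """Samples tasks in rough proportion to estimated learning progress."""
    def __init__(self, n_tasks):
        self.returns = [[] for _ in range(n_tasks)]

    def sample_task(self):
        progress = np.array([
            abs(np.mean(r[-5:]) - np.mean(r[:5])) + 1e-3 if len(r) >= 10 else 1.0
            for r in self.returns
        ])
        return int(np.random.choice(len(progress), p=progress / progress.sum()))

    def update(self, task, episode_return):
        self.returns[task].append(episode_return)

class CurriculumEnvWrapper:
    """Wraps an env: asks the curriculum for a task at reset, reports returns back."""
    def __init__(self, env, curriculum):
        self.env, self.curriculum = env, curriculum
        self.task, self.ep_return = None, 0.0

    def reset(self):
        self.task = self.curriculum.sample_task()
        self.ep_return = 0.0
        return self.env.reset(task=self.task)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.ep_return += reward
        if done:
            self.curriculum.update(self.task, self.ep_return)
        return obs, reward, done, info

class ToyEnv:
    """Trivial environment where the 'task' just sets the episode length."""
    def reset(self, task):
        self.t, self.horizon = 0, task + 1
        return 0
    def step(self, action):
        self.t += 1
        return 0, float(action == 0), self.t >= self.horizon, {}

curriculum = TaskCurriculum(n_tasks=4)
env = CurriculumEnvWrapper(ToyEnv(), curriculum)
obs = env.reset()
for _ in range(200):
    obs, reward, done, info = env.step(np.random.randint(2))
    if done:
        obs = env.reset()
```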
Speaker 2:I'm with Joseph Suarez and Spencer Chang of Puffer AI. They just won the outstanding paper award for resourcefulness in RL. Congratulations, both of you.
Speaker 5:Thank you. Appreciate it.
Speaker 2:Can you tell us more about the paper that you won the award for?
Speaker 5:Sure. So PufferLib is an ultra-high-performance reinforcement learning library.
Speaker 5:It lets you train agents at millions of steps per second on a single GPU, and even faster distributed. The key details of PufferLib are ultra-high-performance simulation in C across a variety of domains, high-performance vectorization and multiprocessing to maintain that level of performance throughout the training step, and then the trainer itself also being optimized to millions of steps per second. And then we use all of this to do research incredibly quickly, to be able to run thousands or tens of thousands of experiments on only a few GPUs. It's been used by a few different labs that we have academic collaborations with, as well as a few different places in industry. One of the things we found is that there isn't one specific area that tends to heavily use our tools.
Speaker 5:It's really much more disparate. So we've done some work in animation generation. We've done some work in finance. We've done some work with other research labs, both in industry and in academia. And we're having all sorts of conversations with people in far different industries we would have never expected.
Speaker 5:And so with PufferLib, we've been able to apply our library to various real-world applications, some where, you know, people in those companies would have never thought RL would be useful, and being able to apply it there is kind of PufferLib's specialty at this point.
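A generic illustration of the vectorization idea mentioned above: step many environment copies as one batched array operation and measure raw simulation throughput in steps per second. The batched toy environment is an assumption made for illustration; this is not PufferLib code or its API.

```python
# Sketch: batched simulation of N trivial environment copies with array operations.
import time
import numpy as np

class BatchedToyEnv:
    """N copies of a trivial environment, stepped together."""
    def __init__(self, n_envs, horizon=128):
        self.n, self.horizon = n_envs, horizon
        self.t = np.zeros(n_envs, dtype=np.int32)

    def reset(self):
        self.t[:] = 0
        return np.zeros((self.n, 4), dtype=np.float32)

    def step(self, actions):
        self.t += 1
        obs = np.random.rand(self.n, 4).astype(np.float32)
        rewards = (actions == 0).astype(np.float32)
        dones = self.t >= self.horizon
        self.t[dones] = 0                      # auto-reset finished copies
        return obs, rewards, dones, {}

n_envs, n_steps = 4096, 1000
env = BatchedToyEnv(n_envs)
obs = env.reset()
start = time.time()
for _ in range(n_steps):
    actions = np.random.randint(0, 2, size=n_envs)
    obs, rewards, dones, _ = env.step(actions)
print(f"~{n_envs * n_steps / (time.time() - start):,.0f} env steps per second")
```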
Speaker 2:I'm here with Esraa Elelimy to hear about her outstanding award-winning paper.
Speaker 1:Hey. I'm Esraa, and we just won this outstanding paper award on the theory of reinforcement learning. The paper is mainly about how we can use this generalized projected Bellman error, which is simply an objective function that we can use to solve the value estimation problem in reinforcement learning. And you might be wondering, like, oh, we use semi-gradient TD methods all the time, they are great. What's wrong with that?
Speaker 1:And then, so there are many cases where we can show a counterexample that shows divergence for these methods. And then an alternative is that we can use methods that do actual gradient descent, and then they are guaranteed to converge in the same sense that gradient descent converges. And those methods are called gradient TD methods. They are quite old. They are not new.
Speaker 1:They exist. But what's new here is that those methods usually work only with linear function approximation, and they were mostly developed for these, like, one-step TD updates that can be very slow at doing credit assignment. And what we do in this paper is show that you can actually generalize those methods, extend them to the deep reinforcement learning setting, and use them with λ-returns, which allow you to do much faster credit assignment. And then you might be wondering, like, okay, so that looks cool. Does it actually work?
Speaker 1:And then the nice thing is that we show that it actually works in practice. And that's quite cool, because, like, you start from first principles, derive an objective function, and add the λ-returns that I really like, which have existed for many years but are really rarely used. And then you show that when you combine these first-principles pieces, you end up with a nice objective function that you can optimize and use in many deep RL settings, and it will outperform the baselines that we currently have. Thank you.
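For context on the gradient TD family Esraa mentions, here is the textbook linear-case TDC update, which follows the gradient of a projected Bellman error instead of the semi-gradient TD(0) update. The random-feature random-walk task is an assumption made for illustration; the paper's contribution, extending these methods to deep networks and λ-returns, is not shown here.

```python
# Sketch of linear TDC (a classic gradient TD method) on a toy random-walk task.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_features, gamma = 5, 3, 0.99
phi = rng.normal(size=(n_states, n_features))      # fixed random features per state

w = np.zeros(n_features)    # value-function weights
h = np.zeros(n_features)    # auxiliary weights for the gradient correction
alpha, beta = 0.01, 0.05

s = 0
for step in range(20_000):
    s_next = (s + rng.integers(-1, 2)) % n_states  # toy random-walk transition
    r = 1.0 if s_next == 0 else 0.0
    x, x_next = phi[s], phi[s_next]

    delta = r + gamma * w @ x_next - w @ x         # TD error
    # The extra '- gamma * x_next * (x @ h)' term is the gradient correction that
    # distinguishes TDC from semi-gradient TD(0) and gives convergence guarantees.
    w += alpha * (delta * x - gamma * x_next * (x @ h))
    h += beta * (delta - x @ h) * x
    s = s_next

print("learned state values:", phi @ w)
```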