Outstanding Paper Award Winners - 2/2 @ RLC 2025

Speaker 1:

TalkRL.

Speaker 2:

I'm here at RLC 2025 in Edmonton, Alberta, at the University of Alberta, and I'm with Ayush Jain, who just won the Outstanding Paper Award in the empirical reinforcement learning research category for his paper, "Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-Functions." Congratulations, Ayush.

Speaker 1:

Thank you so much, Robin. It's great to talk to you; I'm a big fan of the TalkRL podcast. I can give a brief introduction to the paper. The context of this paper is continuous action spaces, where we're trying to do Q-learning.

Speaker 1:

Traditionally, people use deterministic policy gradients for this, and we've identified a drawback of deterministic policy gradients in certain environments. The key problem is that because a deterministic policy gradient tries to find the optimal action by doing gradient ascent on the Q-function, it is susceptible to getting stuck in local optima of the Q-function in complex environments, such as dexterous manipulation or large-scale recommender systems. So what we do in our work is try to mitigate the suboptimality of these deterministic policy gradients.

Speaker 1:

There are two insights in our work. The first is that instead of using a single actor, which is the common case in deterministic policy gradients, we propose to use multiple actors. Just by virtue of using an ensemble of actors, you are more likely to end up at a better local optimum than with a single actor. But we can do something more interesting than this: you can make the life of each of these actors successively easier by simplifying the Q-function successively. The basic idea is that you can take one action, say the one your first actor suggested, and simplify the Q-function below that action's value.

Speaker 1:

And you can just remove that part from the search space. If you do that, each subsequent action you find using gradient ascent faces fewer and fewer local optima, because you are clearing out part of the Q-function's search space. In the end, combining these two techniques, adding more actors and simplifying the Q-function, improves the actions you find through our method. We find that we are able to outperform TD3 and DDPG, which are common algorithms in this space, in environments that have these complex, non-convex Q-functions over actions.
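To make those two ideas concrete, here is a minimal, hypothetical sketch, not the paper's actual implementation: an ensemble of actors each proposes an action, the highest-Q proposal is used, and later actors can be trained against a Q-surface flattened below the best value found so far. Names like `ensemble_action`, `surrogate_q`, `actors`, and `q_fn` are assumptions for illustration.

```python
import torch

def ensemble_action(state, actors, q_fn):
    """Pick the highest-Q action among the proposals of an actor ensemble.
    Assumes each actor(state) returns an action tensor and q_fn(state, action)
    returns a scalar tensor."""
    proposals = [actor(state) for actor in actors]            # one action per actor
    values = torch.stack([q_fn(state, a) for a in proposals]) # Q-value of each proposal
    return proposals[int(values.argmax())]

def surrogate_q(state, action, q_fn, prev_best_value):
    """Q-surface 'flattened' below the best value found by earlier actors
    (prev_best_value is a detached scalar tensor), so gradient ascent for the
    next actor faces fewer local optima in the cleared-out region."""
    return torch.maximum(q_fn(state, action), prev_best_value)
```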

Speaker 2:

I'm here with Calarina Muslimani, who just won the Outstanding Paper Award at RLC 2025 in the emerging topics in reinforcement learning category for her paper, "Towards Improving Reward Design in RL: A Reward Alignment Metric for RL Practitioners." Congratulations, Calarina.

Speaker 3:

Yeah, thank you so much.

Speaker 3:

Generally, I think reward design in RL is a pretty fundamental problem, yet we kind of overlook it. We just assume we have some given reward, and then we take our favorite RL algorithm and use that. But in this paper, we really investigated how we can identify whether our reward functions are actually properly specified, meaning: if we train an RL agent with that reward function, will the final behavior be what we expected?

Speaker 3:

And so we propose a metric which we call the trajectory alignment coefficient. It's a pretty simple idea: the metric compares the similarity between a set of rankings given by a domain expert over a set of trajectories, or just different agent behaviors, and the rankings induced by a given reward function and discount factor pair.

Speaker 3:

We say a reward function is aligned if it has a score close to one; the metric ranges from negative one to one. A score of one means it's aligned, and negative one means it's misaligned.
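As a rough illustration of what such a coefficient could look like, here is a hedged sketch that computes a Kendall's-tau-style rank agreement between the expert's ranking and the ranking induced by discounted returns; the paper's exact formulation may differ, and the helper names and reward signature are hypothetical.

```python
from itertools import combinations

def discounted_return(trajectory, reward_fn, gamma):
    """Return of a trajectory under a candidate (reward function, discount) pair.
    Assumes a trajectory is a list of (state, action, next_state) tuples."""
    return sum(gamma**t * reward_fn(s, a, s2)
               for t, (s, a, s2) in enumerate(trajectory))

def trajectory_alignment(expert_ranking, trajectories, reward_fn, gamma):
    """Rank agreement in [-1, 1] between an expert's ranking of trajectories
    (expert_ranking[i] = rank of trajectory i, higher = better) and the
    ordering induced by discounted returns under (reward_fn, gamma)."""
    returns = [discounted_return(tr, reward_fn, gamma) for tr in trajectories]
    concordant = discordant = 0
    for i, j in combinations(range(len(trajectories)), 2):
        agreement = (expert_ranking[i] - expert_ranking[j]) * (returns[i] - returns[j])
        if agreement > 0:
            concordant += 1     # expert and reward order this pair the same way
        elif agreement < 0:
            discordant += 1     # expert and reward disagree on this pair
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0
```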

Speaker 3:

In this paper, we not only wanted to propose a metric, but also to see whether we can actually use this metric to help RL practitioners design better reward functions, since that's the real problem we care about. To do that, we ran a user study with 11 RL practitioners, and we told them they were going to be collaborating with a domain expert to design a reward function for a very simple tabular task. In the study, we had them perform reward selection: they were given two reward functions at a time and had to pick the one they thought would satisfy the domain expert's preferences. And the funny thing is, we found that even in a simple four-by-four grid world, users could not reliably do this, and these are practitioners, not lay people.

Speaker 3:

These RL practitioners could not reliably pick the preference-respecting reward function, again, in a four-by-four tabular grid world. But at the same time, for most of those practitioners, when they had access to our metric during reward selection, they were able to choose the preference-respecting reward function at a rate close to 100%. And they felt better doing it; we also had a qualitative piece looking at what their experience was like.

Speaker 3:

So they feel better doing it, and they actually do a better job at it. All of this is really exciting for me because this is, again, a really big problem. If we can figure out better ways to design reward functions, then I think we can hopefully have more real-world applications overall.

Speaker 4:

Hi. This is Reginald McLean, and I'm one of the authors on the paper "Multi-Task Reinforcement Learning Enables Parameter Scaling." My coauthors on this work were Evangelos Chatzaroulas, Jordan Terry, Isaac Woungang, Nariman Farsad, and Pablo Samuel Castro. This paper received an award at the Reinforcement Learning Conference 2025 for scientific understanding in reinforcement learning. Our work attempted to tackle the common assumption in multi-task reinforcement learning that you need to create a specialized architecture to share information between different tasks.

Speaker 4:

One of the things we noticed in the literature was a missing comparison of these architectures to a comparably sized dense feed-forward architecture with the same, or a similar, number of parameters. That's where our paper starts: really trying to dig into that assumption. What we found was that this assumption is actually incorrect, and that simply making your network bigger, in our case wider rather than deeper, allowed us to perform better on the Meta-World benchmark. So that provides some insight that maybe we just need to do parameter scaling in this case, rather than building these complicated specialized architectures.
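For context, the kind of plain baseline being described might look like the sketch below: a dense feed-forward policy that just appends a one-hot task ID to the observation and scales width. The class name, widths, and depth are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class WideMultiTaskPolicy(nn.Module):
    """Plain dense feed-forward policy for multi-task RL: no task-specific
    modules or routing, just a one-hot task ID appended to the observation
    and a wide MLP. Width and depth here are illustrative only."""
    def __init__(self, obs_dim, num_tasks, act_dim, width=2048):
        super().__init__()
        self.num_tasks = num_tasks
        self.net = nn.Sequential(
            nn.Linear(obs_dim + num_tasks, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, act_dim),
        )

    def forward(self, obs, task_id):
        # obs: float tensor (..., obs_dim); task_id: integer tensor of task indices
        one_hot = nn.functional.one_hot(task_id, self.num_tasks).float()
        return self.net(torch.cat([obs, one_hot], dim=-1))
```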

Speaker 4:

We also had an interesting insight into the effect of scaling with regard to plasticity loss. In the literature, different reasons have been proposed for why neural networks lose their plasticity, their ability to keep learning. Using the percentage of dormant neurons, we found that by increasing the number of tasks and the number of parameters in our networks, we were actually able to mitigate plasticity loss. This has a key implication for reinforcement learning: maybe, even in the single-task setting, we need to introduce additional sources of data, whether from auxiliary tasks, things to predict in the environment, or other ideas people come up with, that would allow us to fully utilize our networks.
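As a reference point, here is a minimal sketch of a dormant-neuron statistic in the spirit of the definition used in the plasticity literature; the threshold value and normalization details are assumptions, not the paper's exact measurement.

```python
import torch

def dormant_fraction(activations, tau=0.025):
    """Fraction of 'dormant' neurons in one layer.
    activations: tensor of shape (batch, num_neurons) from a forward pass.
    A neuron counts as dormant when its mean absolute activation, normalized
    by the layer's average activity, falls below the threshold tau (assumed)."""
    scores = activations.abs().mean(dim=0)           # per-neuron activity over the batch
    normalized = scores / (scores.mean() + 1e-8)     # normalize by the layer mean
    return (normalized <= tau).float().mean().item()
```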

Speaker 2:

I'm sitting with Will Solow, a PhD student at Oregon State University. Will just won the Outstanding Paper Award for applications of RL. Congratulations, Will.

Speaker 5:

Thanks, Rob, I appreciate being here today. The paper, "Learning Annual and Perennial Crop Management Strategies," was accepted, and I'm really happy to be here presenting our work. I work alongside my adviser, Dr. Sandhya Saisubramanian, and Dr. Alan Fern, and we're funded by the AgAID Institute grant. We saw the need for a specialty crop simulator, given that the other available benchmarks and options for reinforcement learning don't really cover the problems we see come up in the agriculture domain: crop growth is inherently cyclic, feedback is incredibly delayed,

Speaker 5:

and we're always operating under extreme partial observability. So the common RL benchmarks we have don't really cover agricultural settings. More importantly, the available crop simulators out there are incredibly limited in terms of the tasks they can offer. And from an RL point of view, there's not a lot of incentive for an RL researcher to take up a new simulator in a new domain if it's not widely usable and easily accessible. So we tried to bridge that gap.

Speaker 2:

So can you talk about how you constructed the simulator? What were the components? Did you use a lot of historical data? How did that work?

Speaker 5:

Yeah. There are a lot of widely available crop growth models used in the agronomy community: Cycles, APSIM. We chose to work with WOFOST, which stands for the World Food Studies crop growth model. This crop growth model has been around for twenty-five-plus years, and it has been used widely in Europe, and also some in North America and China, to investigate both agronomic techniques and management problems across different crops.

Speaker 5:

And we've seen that it's been used for both annual and perennial crops. More importantly, and usefully for us, it's written in Python, which means it's much easier to integrate with the OpenAI Gym framework, or AI frameworks in general. There are a couple of other crop simulators written in compiled languages; you could certainly make that work, but it would be a lot more effort up front from a software engineering standpoint. Also, WOFOST is a really nicely, modularly written crop growth model, so you have different modules for every single crop subprocess.

Speaker 5:

As soon as you hear that, you might think: oh yeah, sim-to-real transfer down the line. That's eventually what we'd love to be doing, and you need a crop model that's going to support high-fidelity modeling of all crop processes. So we started there. We modified the base crop growth model to support long-horizon, i.e.,

Speaker 5:

multi-year, perennial crops. There's a repository of available crop parameters for 25 different annual crops and two different perennial crops that we leaned on. These were calibrated based on historical data, so you can make the argument that, with the right location, those parameters give a high-fidelity simulation of what happens in the field.

Speaker 5:

But, obviously, sim-to-real transfer is its own beast. From there, we took essentially all those configuration files, parsed them into really nice, easy-to-read YAML files, and created an OpenAI Gym wrapper around the WOFOST crop growth model. That allows AI or RL researchers to take all these different crops, see the names really easily, and essentially just configure a simulation, with command-line arguments, much like they would any other Gym environment. So we hope it's something that's much more straightforward for users, and we provide a couple of different built-in RL algorithms as baselines.
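To give a flavor of that workflow, here is a hypothetical Gym-style usage sketch; the environment ID, configuration keys, and action semantics are placeholders for illustration, not the simulator's actual API.

```python
import gymnasium as gym

# Hypothetical environment ID and config keys, shown only to illustrate the
# "configure a crop simulation like any other Gym environment" workflow.
env = gym.make(
    "CropGym-v0",            # placeholder name, not the real registered ID
    crop="winter_wheat",     # one of the calibrated crop parameter sets
    location="davis_ca",     # site / weather configuration
    horizon_years=3,         # long-horizon, perennial-style rollout
)

obs, info = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()  # e.g. an irrigation/fertilization decision
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
env.close()
```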

Speaker 2:

Can you talk about how this fits into your PhD more broadly?

Speaker 5:

More broadly, I've found myself really interested in the intersection of AI and applications, or RL and applications. Down the line, I see myself either teaching or working at the intersection of AI, robotics, agriculture, climate, or some other application. I think there's a lot of value that comes from working on a real-world problem, just because you're heavily constrained by what you have available in that problem. For agriculture, for example, if we have twenty years of data, that's all the data we have. So it becomes a question of what you do with that data and what problem you're trying to solve.

Speaker 5:

If you wanted to collect more data, you'd tack five years onto your PhD waiting for that data to come in. So I really enjoy the problems that arise in those spaces. More broadly, I'll be working on extensions to this project for the next couple of years, and maybe also bringing some robotics techniques and ideas into this space, because a lot of what we're doing now, now that we have this crop simulator, is different flavors of model calibration, which lets us pivot very naturally into the sim-to-real transfer problems that you see in robotics.

Speaker 2:

Thank you, Will Solow, and congrats again on your award.

Speaker 5:

Awesome. Thanks a bunch, Robin. Appreciate it.
