David Abel on the Science of Agency @ RLDM 2025
Talk RL.
Speaker 2:TalkRL podcast is all reinforcement learning, all the time. Featuring brilliant guests, both research and applied. Join the conversation on Twitter at talk r l podcast. I'm your host, Robin Chauhan. Today, I'm joined by David Abel, senior research scientist at DeepMind on the agency team, and an honorary fellow at the University of Edinburgh.
Speaker 2:Welcome, David.
Speaker 1:Thank you so much, Robin. Yeah. I'm real thrilled to be here. I've been a listener of the podcast for a while, so really excited for our conversation today.
Speaker 2:So I got to meet you at RLC twenty twenty four at Amherst, and you were really great at drawing people into these fascinating discussions and moving between big ideas. And so I think this is gonna be a great conversation. You're gonna be a great guest. How do you like to describe your area of focus?
Speaker 1:Thank you. So I tend to think of my area of focus as something like the science of agency, which naturally draws on topics like learning and computation and intelligence and, of course, reinforcement learning. And I tend to pursue this research agenda through a slightly philosophical lens. I'll often get started with something like a philosophical question that I think is really exciting or important, and then try to make progress in answering that question by grounding the study through the use of, say, new definitions, new axioms, or other kinds of mathematical language that can hopefully yield maybe a new way of thinking, a new insight, and hopefully, ultimately, some kind of new understanding.
Speaker 2:We are sitting here at the RLDM conference in Dublin, Ireland, which you helped organize as a workshop chair. How's the conference going for you?
Speaker 1:It's been great so far. Just getting started today and already had some really exciting conversations, connections, got to see some old friends, colleagues, things like that. Always quite exciting. Really fun tutorials this morning. My colleague Anna gave a really kind of mind-bending and eye-opening tutorial that I thought was quite an exciting way to start the conference.
Speaker 1:So great to be here.
Speaker 2:So I noticed your academic career includes a mix of CS and also philosophy. Can you say a little bit about how the philosophy angle might have influenced your perspective?
Speaker 1:Yeah. For sure. So philosophy was my first love growing up, and I always found myself drifting toward these questions that really force you to think deeply about different topics. And having that background in philosophy, I still try to pull from it today in figuring out which questions to focus on, but also in terms of the method. So what does it look like to make meaningful progress in a given area?
Speaker 1:I do find the style of making progress from philosophy to be quite valuable. It's this notion of, like, slow accumulation of insights or definitions or sometimes even things like paradoxes or puzzles or problems that can give us a clear way to think about a particular space. So I do find that quite valuable, and I really try to bring that over to the research that I do now. At the same time, I do find the methods of science, let's say math and computer science, also really valuable. And the concrete nature of that kind of progress is something I definitely do aspire to.
Speaker 1:So being able to try to blend this more philosophical style of questioning with really systematic progress that can come in the form of definitions or theorems or algorithms or experiments is something I aspire to. Can't always do it perfectly, of course.
Speaker 2:I've been telling people that I'm gonna go interview the RL philosopher. And yeah, is that a fit?
Speaker 1:I'm honored to have that title if that's something that folks think of. That's great. I love that, to be honest. Yeah.
Speaker 2:Nice. And also, I should mention, I'm really glad we planned this. You reached out. I was not originally planning to come to RLDM, and when you sent that message, I really thought about it. I wanted to come before, and it tipped the scales for me being here today.
Speaker 2:So thank you for that.
Speaker 1:Oh, amazing. That's great to hear. Yeah. I'm curious what you'll think of RLDM. It's a really fun conference.
Speaker 2:Yeah. I would say, just briefly, one of the highlights for me is getting a taste of the neuro side, and the mapping from RL to what the neuroscience people see in the brain. That's new to me, very exciting. But so, some of your work focuses on definitions, and I can imagine some people wondering, you know, is that overly pedantic, to get too deep in the weeds of the definitions? Like, maybe we should get back to building algorithms, if they think that's the key task. But on the other hand, how can we build these conceptual towers without a strong foundation?
Speaker 2:Like, maybe is that how you see it? Or can you point to moments where, like, defining something clearly in this way was really fruitful?
Speaker 1:Yeah. Definitely. That's a great question. So I do think definitions are extremely important in the clarity that they can bring and the opportunity to frame a field. And there's a few things I wanna say about this.
Speaker 1:So there's this beautiful quote from Quine. It says, the less a science is advanced, the more its terminology rests on an uncritical assumption of mutual understanding, which suggests that when we have some kind of conceptual confusion, it can leak into a lot of the ways that we conduct our experiments, how we form a hypothesis, how we pick what our goals are for the field, how we do the work that we do. And oftentimes with younger fields, I think they haven't quite shored up their definitions, and that can hold the field back in some ways.
Speaker 1:So there are opportunities, I think, as well to get overly focused on definitions and just go around the sort of semantic spiral, and maybe it doesn't lead anywhere. But there have been a few moments throughout history where to me getting definitions right is what really allowed a field to flourish. I'll just mention a couple examples. But again, I think the time scale is usually a definition is introduced, and then over a relatively long period, its influence is felt. So one that immediately comes to mind is the notion of belief.
Speaker 1:So I guess in the history of thinking about the concept of belief, it had this kind of squishier notion or maybe more immediate intuitive notion where we just can introspect and think about our own beliefs. But around the advent of say probability, we were able to come up with a formal language that characterizes something close to our intuitive picture of belief. And by creating that formal language, we were then able to go beyond what our intuitive notion of belief could give us on its own. And, of course, probability now is one of these kind of bedrocks across many different disciplines. So it's one of these cases where having that conceptual clarity can yield fruits that go far beyond our own intuition without having that definition.
Speaker 1:So now, for instance, if you wanted to better understand belief, I view probability as one of these sort of trailheads. I really like this idea of conceptual trailheads in research space, where if you wanna think about, for instance, information, there's another trailhead. And I say trailhead in the sense that it's a starting point. It's a place to familiarize yourself with a concept, to rule out which paths maybe not to go down, and to send you at least in the right direction, and maybe you can branch off later. But having this initial contact with some well thought out, in some cases precisely defined, notions can be really valuable.
Speaker 1:So with information, probably starting with, say, Shannon's information theory is a good trailhead even if you might branch out later. And that's another definition where once we had this notion of, say, Shannon's entropy, a lot can then follow from it. So there's a book by this Hungarian philosopher, Lakatos, called Proofs and Refutations that talks about the way in which definitions are formed in an empirical, almost iterative way in the sense that we don't just pluck a definition out of the sky and then start using it because it fits our kind of conception of an idea, but rather we pick a definition or a starting point and then see what we can derive from that definition. And if it reaches the points that we might have hoped to reach, then it's a sign that the definition is pointing us in the right direction. But if it's missing that, then we might go back and refine the definition.
Speaker 1:So there is this kind of incremental, iterative way in which our definitions are crafted. And I think this style of definition can be really valuable when it's done over the course of the history of, say, a field's development, because it allows the definitions to really reflect the bedrock of the field, and as you said before, give us a foundation to stand on.
Speaker 2:Okay. Let's get into some of your recent work. So we're gonna start with the paper that you called Plasticity as the Mirror of Empowerment. That's with yourself as the first author and a number of co-authors, including many well known names in RL. So can you talk about some of the main ideas in this paper to get us started?
Speaker 1:Yeah. Definitely. So as per the title, the two main concepts in this paper are plasticity and empowerment. And the paper is all about a relationship between these two concepts. So plasticity is derived from this notion of synaptic or neuroplasticity, which is something like, to what extent can our neural circuitry rewire itself?
Speaker 1:How pliable is it? And the second concept, empowerment, refers to how much control or power an agent has over its environment. And we, in this paper, were initially trying to explore and better understand what plasticity was. Coming back to our conversation about definitions, we were interested in trying to better define plasticity, as it has recently become a topic of excitement within the AI and machine learning space. And we sensed some conceptual differences across the way people were using this term plasticity.
Speaker 1:So we wanted to zoom in and try and give it a simple definition that we could build from. And in doing that, we ended up finding this relationship between plasticity and empowerment. Slightly more concretely, what are plasticity and empowerment? They're both properties of an agent, which will be an input output system for us, just anything that has observations coming in and actions going out. And the plasticity, as we define it, is the degree to which the agent's inputs can influence its outputs.
Speaker 1:And this could be expanded over time: so how much could the observations seen maybe five time steps into the past influence the actions five steps into the future. That's how we characterize plasticity. And empowerment has been around for a little while. It was, to my knowledge, first introduced by Klyubin et al. in 2005 or so. And the way they defined it was as a measure of how many different futures the agent can bring about.
Speaker 1:And this is defined in a similar language to how we define plasticity: how much can the agent's actions influence its future observations. So if an agent has zero empowerment, it means that its actions have no impact on the environment's observations. So the environment's gonna continue to do what it was going to do regardless of what the agent did. The environment's sort of not listening to the agent. And the main results of the paper, there's really two.
Speaker 1:The first is that plasticity and empowerment are basically two sides of the same coin. We call them a mirror of one another because how much an agent influences the environment, we call empowerment. But if you flip the perspectives and think about things from the environment's point of view, that's really just the plasticity of the environment. And similarly, if you look at things from the agent's point of view, how much its observations influence its actions is really the empowerment of the environment: how much can the environment steer or control the agent?
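As a rough reference point for the discussion above, here is one way to sketch the two quantities and the mirror relationship in information-style notation. This is schematic only: the exact windows, conditioning, and operators used in the paper differ, and the notation here is an assumption for illustration.

```latex
% Schematic only; the paper's exact definitions, windows, and conditioning differ.
\begin{align*}
\text{plasticity}(\text{agent})  &\approx I\big(O_{1:t} \to A_{1:t}\big) \quad \text{(inputs driving outputs)} \\
\text{empowerment}(\text{agent}) &\approx I\big(A_{1:t} \to O_{1:t}\big) \quad \text{(outputs driving inputs)} \\
\text{empowerment}(\text{agent}) &= \text{plasticity}(\text{environment}) \\
\text{plasticity}(\text{agent})  &= \text{empowerment}(\text{environment})
\end{align*}
```

The last two lines are the "mirror": relabelling the two signals swaps which system we call the agent and which we call the environment.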
Speaker 2:So I really had to pause when I read the notion of environment plasticity and environment empowerment, because I don't think I was used to thinking that way. I mean, we've talked about agent empowerment in the past, and agent plasticity. But these I really had to think about. I'm not fully sure I understand that, but let's get into it. So one aspect is your paper referred to Marko's '73 paper on bidirectional information theory, which was new to me, very interesting in and of itself. And in Marko's example setting, there were two people communicating over a noisy channel.
Speaker 2:And so, if I understand, your analogy is that RL is like a conversation between agent and environment, similar to these two parties communicating over the noisy channel?
Speaker 1:That's exactly right. Yeah. Perfect.
Speaker 2:And when Marko says the first party is monologuing, does that mean the first party has zero plasticity? Or that the other party has zero empowerment? Is that what it means?
Speaker 1:That's exactly right. Yeah. Perfect.
Speaker 2:Okay. So, I'm gonna try to relate that to a few other ideas. We often have a case where the environment is not really learning from the agent, but the agent is learning from the environment. Like in the basic canonical cart pole environment, cart pole is not collaborating or competing with the controller; its transition function is a stationary thing. Is that the general assumption here?
Speaker 2:Was that...
Speaker 1:Yeah. I would say so.
Speaker 2:The environment's not adversarial or cooperative. Or is that a separate concept?
Speaker 1:Interesting. Good question. I like that. That seems right to me. Maybe there's room for some of these concepts to slip into the way that the environment handles the agent's actions, but I think for the most part, by default, yeah, my take would be that cart pole just is, and the agent is manipulating the angle of the pole.
Speaker 2:Got it. Okay. So was it really about the agent's actions having any effect on the observations? Is that what we're talking about?
Speaker 1:When we talk about empowerment, that's exactly it. Exactly. Yeah.
Speaker 2:And then we've heard of this big world hypothesis, which is, I think, roughly that the agent is much smaller and simpler than the world. Mhmm. Does that somehow constrain plasticity in the limit, for an agent in a big world environment? Or is that a misconception?
Speaker 2:I
Speaker 1:would say there is a potential to to relate these two in quite a quite a precise way. So what would that look like? Agents that have no plasticity have effectively shut off their observational channel to the environment. They can't be influenced by what they observe any longer. And agents that have really high plasticity are those that can be influenced greatly by their environment.
Speaker 1:And in order for an agent to be really influenced by what it sees in its environment, there needs to be a lot of information in the environment. So under the big world hypothesis, there's always a lot of information available to the agent, because the agent is so much smaller than the environment. So I think one way of characterizing this would be in terms of the kind of information content present, where an agent can only be highly plastic if there's a lot to learn about, and that's definitely present in this big world hypothesis setting. Now, similarly, with the empowerment of an agent, you can imagine for a really big world, intuitively, the agent's empowerment is probably quite low as well. If the agent can fully control its environment, then that probably suggests something about the relationship between the agent and the environment.
Speaker 1:Maybe the agent's not smaller, for instance.
Speaker 2:Cool. Okay. And then the definitions in the paper were mostly about observations and actions. Does reward enter into this definition at all? Or do you consider reward part of the observation, or just separate?
Speaker 1:Yeah. Really good question. This is one that came up amongst the group quite a lot as we were working on this, and we ultimately decided to focus purely on actions and observations. So as you said, we could allow reward to be an element of the observation. Right?
Speaker 1:The observation might be a big vector, and the last element, whatever number that is, you might think of that as the reward. But we chose to set that aside for now and are focusing entirely just on action and observation. And one thing this lets us do is create the symmetry between agent and environment, where we can just switch their roles. Coming back to this point about environment plasticity: it has a meaning associated with it because, when it's just an agent and environment communicating, there's no difference between the actions and the observations in some sense. They're both just signals that they send each other, and we could just as easily relabel the signals as action and observation.
Speaker 1:We can swap their names. And as a result, you are also effectively just swapping the roles of the agent and the environment. Once reward is in the mix, there's an asymmetry that creates some additional nuance to think about. And, of course, the real reason that plasticity and empowerment could matter quite a lot, at least one reason, is because of the impact they might have on, say, learning from reward or learning to maximize or pursue some goal or something like that. And that's not an area where we have any concrete results yet.
Speaker 2:Okay. Can we possibly relate this to, like, normal form games, like a prisoner's dilemma situation? Could we imagine in that case, like, there's no environment there per se, maybe it's more just the other player, and the environment is just the reward matrix?
Speaker 2:So in that case, does that even make sense here? Would it be the empowerment of one agent with the plasticity of the other? Or is that just out of scope completely?
Speaker 1:Nailed it. No, that's a perfect example. Okay. If the environment in the traditional sort of single agent setup is just another agent, so now we have Alice and Bob sending each other action and observation back and forth, we can effectively recreate a two player game where the symmetry is explicit rather than implicit.
Speaker 1:And as a result, we can recover that setting. I think we haven't explored this too much, but I think there's an opportunity to connect these ideas of plasticity and empowerment with the kinds of analysis and properties you might expect in the context of, say, games. And as you mentioned, say, the plasticity of Alice in this context is exactly the empowerment of Bob, and vice versa. So in the case of, say, two musicians playing music together, if one musician is purely reacting, say Alice is purely reacting to what Bob plays, she's just repeating the chords or the notes, that's Alice being highly plastic. But at the same time, that means Bob is very empowered, because his music has a high influence over what Alice does, and vice versa.
Speaker 1:If Alice is very empowered, it means her music has a high degree of influence on what Bob plays. And so in this sense, we get this nice symmetry where the plasticity of one agent is exactly the empowerment of the other.
Speaker 2:This actually comes back to Marko; that scenario seems more similar to Marko's original paper with the two parties communicating. I stared at that diagram for a long time, trying to make sense of it. And when I finally understood it, I was like, yeah, this explains a lot about human communication. Some bad conversations are either monologues, or two people completely talking past each other.
Speaker 1:Exactly.
Speaker 2:And either the empowerment or plasticity has gone to zero somewhere.
Speaker 1:Exactly. I think it suggests that you need to have a little bit of both in order for good collaboration, good communication, this kind of thing.
Speaker 2:So is the use of plasticity here different from its use in continual learning, like the ability of a network to learn at some specific point in training? Is that the same definition, or slightly different than how you're using it here?
Speaker 1:I think in spirit, they're supposed to be quite similar, although we haven't done the work to really shore up that connection. But we certainly drew inspiration from a lot of recent work looking at the loss of plasticity in continual reinforcement learning. One of the key characteristics of our definition of plasticity is that it's relative to a particular window of time. And so we ask, and it's a finite horizon version of the definition: over this particular window, how much are the agent's actions influenced by the observations?
Speaker 1:And you could imagine, in some of this recent work that looks at the loss of plasticity, it's also potentially looking at a particular window of time, but that window coincides with some particular structural change in the environment. Maybe the task shifts or the labels shift, like this version of continual ImageNet, I think it is, where the labels are scrambled. And in this kind of experiment, my belief is that our notion of plasticity should line up with how they're thinking about a loss of plasticity. But I think there's some details to shore up to make that nice and tidy.
Speaker 2:And then you introduced a generalized directed information here. And if I understand, I think you're building on directed information from Marko, or maybe that was from Massey, but can you just very briefly mention that stack, from mutual information, to directed information, to generalized? How would you define each of those?
Speaker 1:Yeah. Definitely. Nice. So the original notion of mutual information, right, comes from Shannon, and it's a way of characterizing how much you learn about one variable when you learn about another variable. And so if two variables are completely disconnected, so learning about, say, x doesn't tell you anything about y, then their mutual information will be zero.
Speaker 1:And following Marko's bidirectional theory, there was this paper from Massey in the nineties that introduced this notion of directed information, which says: rather than capture just how much two variables share with one another, we wanna explicitly deal with how much a past variable tells you about some future variable. But we don't necessarily want to also capture how much that future variable tells you about the past. So in the context of Marko's bidirectional communication setup, so say again, Alice and Bob are communicating: when Alice sends a message to Bob and then Bob sends a message back to Alice, the directed information is designed just to tell you, well, how much did Alice's message inform what Bob said? So how much information is Alice sending into the future through her conversation with Bob?
Speaker 1:But what we want to remove when we move to the directed information is: on learning what Bob said, how much does that recover about Alice? So as the name suggests, there's this notion of direction present.
Speaker 2:And that's the arrow of time? Is that the direction?
Speaker 1:Exactly. The arrow of time. Causality. Temporal causality. Yeah.
Speaker 1:So it has some connections with certain notions of causality. It's a little bit weaker than full-on causality, but
Speaker 2:Like Granger causality.
Speaker 1:Granger causality. Exactly. Yeah. And so with directed information from Massey, what Massey ended up showing in a later paper was that there are two different quantities of directed information. One is how much Alice sends to Bob, how much Alice's messages influence Bob, and the second is how much Bob's messages influence Alice.
Speaker 1:And a really cool result from this later paper, by Massey and Massey, I believe, was that those two quantities, so that's the directed information from x to y plus the directed information from y to x, those two quantities summed together are equal to the mutual information between x and y. So it's a way of decomposing the total amount of information shared between two random variables into their kind of directed constituents. And when we were looking for a way to define plasticity, we were circling around directed information and related quantities for quite some time. And the limitation we kept running into was that directed information is only well defined over sequences that are the same length. So Alice has to send exactly some number of messages to Bob, and Bob has to send the same number of messages to Alice.
Speaker 1:And those messages both have to start at the beginning of time. And so the generalized directed information is a way of extending Massey's directed information, but now we're allowed to choose our windows of time arbitrarily. So for any past window, say from a to b, and then any future window c to d, and these two windows could overlap a little bit, or not at all, and a to b can come before or after c to d, we recover a valid notion of the amount of information sent from a to b out to c to d.
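For readers who want this stack written down, here is a hedged sketch in standard notation. The conservation identity is stated loosely, as in the conversation; Massey and Massey's precise statement involves a one-step delay on one of the sequences. The generalized form is shown only schematically, since the paper's exact construction is not reproduced here.

```latex
% Mutual information (Shannon): what one variable tells you about another.
I(X;Y) \;=\; H(Y) - H(Y \mid X)

% Directed information (Massey, 1990): information flowing causally from the
% past of X into Y over an exchange of length n.
I(X^{n} \to Y^{n}) \;=\; \sum_{i=1}^{n} I\big(X^{i};\, Y_{i} \mid Y^{i-1}\big)

% Conservation, stated loosely as in the conversation (the precise version
% places a one-step delay on one of the sequences).
I(X^{n}; Y^{n}) \;\approx\; I(X^{n} \to Y^{n}) \;+\; I(Y^{n} \to X^{n})

% Generalized directed information, schematically: the same construction with
% a freely chosen past window a..b and future window c..d.
I(X_{a:b} \to Y_{c:d})
```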
Speaker 2:So I noticed some people get a lot of fruitful results from applying information theory to RL, Mhmm, and to ML more generally. And I'm from a computer engineering background, where communication systems are so important. I guess one limitation of information theory is that it's measuring information by the pound or by the kilogram. Like, all bits are equally valuable, basically.
Speaker 2:Yes. Whereas, I'm going a little off-roading here, but the notion that, like, an intelligent agent might just pick out one bit from a very long message and say, oh, that's really important and I'm just gonna act on that, and then you'd look at value of information or something like that.
Speaker 1:Absolutely. No, I think that's spot on. By default, when we only have these two signals coming back and forth, we can't yet necessarily connect the notions of, say, problem solving or value-ladenness or things like this. But I am optimistic that there are ways to bridge that, now coming back to your question before on reward. So for instance, just intuitively, an agent that has no plasticity and no empowerment probably can't learn much, and it can't solve that many problems.
Speaker 1:And so just by knowing how much these bits can influence each other, we can get some clarity on what maybe the agent is capable of, but I do think there's room to flesh that out more. So the other main result of this paper, again dealing with this relationship between plasticity and empowerment, reveals that there can, in some cases, be a tension between the two capacities. That is to say, for any agent interacting with an environment in this kind of communication game, the one we described before, over a given window of time, the agent's plasticity and the agent's empowerment actually kind of borrow from the same resource. That is to say, an agent that's maximally plastic can't also be maximally empowered, and vice versa. And the intuition for this is that an agent that fully controls its environment can't also be surprised by that same environment.
Speaker 2:Right on. So let's move on to another paper of yours, A Definition of Continual Reinforcement Learning. Again, that was a first-author paper of yours with a number of co-authors. So can you start us off with some main ideas from this paper?
Speaker 1:Yeah. Happy to. So this paper was really looking to come up with a new definition of a problem. It's, of course, closely related to the sort of traditional framing of reinforcement learning, but makes a number of key departures that can get us to think about our problems in a slightly different way. So one of the key characteristics, looking back on reinforcement learning, maybe some of the successes we've had in the past as well as some of the analysis and algorithms, is that they're oftentimes about delivering a solution to a problem.
Speaker 1:So we imagine we have some given problem or task or domain in mind, and what we're looking for is a solution to that task or domain. And then once that solution is found, learning can stop, and the problem is solved. But there's, of course, this long tradition of folks looking at, say, lifelong learning and multitask learning, transfer learning, this long lineage that asks, how do agents just continue to learn and continue to adapt over time? And it just puts the emphasis a little bit differently than when we're looking for a solution. And so that's what we really wanted to capture: how can we come up with a simple set of ingredients that can motivate research to study precisely this problem, where there isn't necessarily a solution we're looking for.
Speaker 1:There isn't necessarily one optimal strategy that we'd like to find, after which learning can stop. We really wanna better understand agents and algorithms that just continue learning indefinitely. And the essence of how we ended up doing that was to say, look, our starting point is a class of representable solutions. So this is, say, the policy class in RL, or something like the hypothesis class in supervised learning. The main distinction we draw is whether the ultimate strategy that we'd like the agent to deliver is a path through that solution space rather than a point.
Speaker 1:And when we think back on our history, things like Markov decision processes or maybe bandits, the typical way we conceive of an optimal strategy is as a point in our space of representable strategies. So a single fixed policy in an MDP, or maybe in a bandit, you wanna find the best arm and then continue pulling that arm, or even report which arm that is. And our suggestion of how to best make this distinction is to say, in some problems, you might have a point as your desired output, but in others, there might not be any optimal point. In order to perform optimally, we need agents that can continue to adjust which point they're selecting and thus form a path through that solution space.
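A loose way to write down this point-versus-path distinction, as described here rather than as the paper's formal definition, with Lambda standing for the class of representable policies:

```latex
% Loose sketch of the point-versus-path distinction; not the paper's formal statement.
% Classic framing: the desired output is a single point of the representable class,
\exists\, \lambda^{*} \in \Lambda \ \text{ that is optimal, after which learning can stop.}
% Continual framing: no single element stays optimal, so the best behaviour is a path,
(\lambda_{1}, \lambda_{2}, \lambda_{3}, \dots) \in \Lambda^{\infty}, \ \text{with no point at which switching stops.}
```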
Speaker 2:I think in the multitask grid world example that you gave in the paper, the agent has no idea which MDP they're in, I think. So let's say the agent was designed to try to quickly determine which MDP they were in, would that change the conclusion of this being a continual problem, or no? Because I think I mixed that up the first time I read it.
Speaker 1:Oh, interesting.
Speaker 2:Do you know what I mean?
Speaker 1:I do know what you mean. Yeah. So we can imagine an agent that tries to figure out which MDP it's in, and then memorize the solution to that MDP. Is that right? And then as it switches, it just tries to... Yeah.
Speaker 2:Because there was a set, and I think in the paper, it just had no idea which MDP it was in, and you'd switch it, and it had no notion of that, it wouldn't even know. But maybe that's a very different setting, in which case maybe this question makes no sense. I actually don't know.
Speaker 1:So is it the case that the agent gets to know which MDP it's in? Or is it that it has to learn which MDP it's in?
Speaker 2:I was thinking if it has to figure out which MDP. There is a setting I've seen in other papers where there's a set of MDPs, and the agent tries to quickly discover which MDP it's in, and then, Yeah, perform well in that MDP. Whereas I think in the setting you gave in the paper, it was just suddenly switched to a different MDP and it wasn't trying to map it to some previous MDP or something like that. Exactly. But maybe I'm really out of scope of what you were trying to do.
Speaker 1:No. That's spot on. So take the case where an agent needs to figure out which MDP it's in, and the MDP might switch under its feet on a given time step. On our view, this is an instance of continual RL precisely because it has to continue figuring out: which MDP am I in now? And then if the MDP switches, it again has to figure out which MDP am I in now.
Speaker 1:And as long as that process continues indefinitely, we would call that continual learning according to our definition, because it comes back to this distinction. Once you have a policy class defined, the question is: is there an optimal policy in that class? Or is it better to switch between elements of that class?
Speaker 2:I guess it would be a difference between a continuing trail through the policy space versus, say, there's three points in that policy space and I'm just gonna jump between them. Yes. Do you see them both as CRL under this definition?
Speaker 1:Exactly. Under our definition, we define both of those as CRL. But I think that you're right. There is a further distinction we could draw between an infinite path that's not a cycle and cycles. Right.
Speaker 1:For our purposes, we just wanted that distinction between a path and a point. But I think it's a really good point, and one where it could be quite interesting to look at the differences that arise from cycles as opposed to infinite paths.
Speaker 2:Okay. Then is it true that this definition of CRL can hinge on the nature of the agent? If its capacity was small relative to the world, then it may be forced to keep changing its policy because it can't remember. Like, if it could just remember a bit more, then it could find a way to not continually adjust? Or is this completely independent of any notion of that?
Speaker 2:Do you know what I'm saying?
Speaker 1:That's exactly right. Yeah. So we, at times, played with notions of agent constraints along the way as part of this work. And our view is that there are a lot of different symptoms that can all give rise to this kind of problem. So it's similar to this fable about sages looking at an elephant, where one sage is looking at the tusk and one at the tail and one at the foot, and they all see a very different object with very different characteristics.
Speaker 1:But in reality, there's one underlying thing. It's the elephant. And I'm not quite sure we've uncovered what exactly that elephant is, but I do think that we have some of these entry points into thinking about the space, one of which is something like the big world hypothesis. We have a small agent in a big world where the interesting relationship is coming from the fact that, say, the agent has to contend with constraints. But at the same time, you could imagine, gosh, maybe the world is just inherently nonstationary no matter how large you make the agent.
Speaker 1:Or perhaps you can have what people refer to as open ended environments, and the way in which these things relate to each other is we have some intuitions that maybe connect them, but we don't yet have the kind of grand picture that unites all of them. But we did, as part of this, look at cases where the reason the agent has to keep switching its policy is because it can only represent a small number of policies.
Speaker 2:Let's talk about another recent paper, Agency is Frame Dependent. That was your first-author paper with a number of co-authors; look it up on arXiv. What's the basic idea there?
Speaker 1:Yeah. So our goal was to try to bring some closer connection between the concept of agency and a lot of existing analysis and results in reinforcement learning. And for me, really, the starting point came from this book by Tomasello called The Evolution of Agency, which looks at the emergence of agency throughout evolutionary history on Earth. And he closes the book with this, I think, very cool quote. I'll get pieces of it wrong, but it's something like: every field has some kind of first principle that it orbits around. Within biology, we might say life, and with psychology or maybe cognitive science, we have both behavior and cognition as two competing, almost first principles.
Speaker 1:And his suggestion is that to overcome the kind of competition between these starting points, dealing with behavior as the primary thing we care about as opposed to cognition, we should embrace agency as a first principle, because it can unify both behavioral considerations and cognitive ones. And that passage to me has been really inspirational for taking agency quite seriously as a potential starting point. I like this reference, I think it was Hofstadter, that says something like, one of the goals within AI is to really try to understand the mechanisms that differentiate, say, a rock from a squirrel or something. And there's something quite beautiful about that, just looking out in the world and realizing there's some kind of magic behind biology and life.
Speaker 1:And now within AI, of course, there's opportunity to bring computational perspectives to understand these kinds of phenomena in the world. So agency is just a super fascinating concept, and our goal here with this paper was to really try to wrestle with what RL can bring to agency and what agency can bring to RL. And it grew out of the definition of continual RL work, where we tried to come up with what we viewed as an objective perspective on what counted as learning. And one thing that we kept circling around as part of that work is that anytime you commit to some view of what learning is, there's always some extra reference point that this definition needs to be made with respect to. So for instance, Mitchell's textbook on machine learning proposes a definition of learning: an agent or an algorithm learns at a task when its performance on the task goes up.
Speaker 1:And in this way, that notion of learning is relative to, say, a task or a performance measure. And no matter how hard we tried, we kept coming across different ways of defining learning that always required some extra detail. And when we consider agency itself, which encompasses learning as a sub-piece, we came to the conclusion that agency itself requires these extra commitments that we've chosen here to call a reference frame. And we're not necessarily the first to make this kind of argument. There are versions of this that have been around for some time, but we felt it was valuable to put some reinforcement learning color on this argument, to really tie directly the way in which agency is frame dependent with some of the results within RL.
Speaker 2:So in the paper, you list these four different claims, or dimensions; I'm calling them claims. Can you just mention what those four concepts are?
Speaker 1:Yeah. For sure. So this comes from a mixture of definitions of agency in the literature, primarily this paper called Defining Agency by Barandiaran and colleagues. They proposed to define agency via three properties, and we chose to add a fourth from some other folks. So the four properties are: first, an agent is a thing that has a boundary.
Speaker 1:The second is that the agent has goals. The third is that the agent is what they call a source of action, which is like the free-will property in a way. And the fourth is that the agent adapts, so it's capable of adapting its outputs based on its inputs. And this is effectively Barandiaran's proposal of how to define agency. Other nearby definitions make some subtle departures on which properties are needed; rather than pick, say, adaptivity, they would pick something about having a persistent identity over time.
Speaker 1:Like, the agent's not suddenly up and changing all of its wishes from Monday to Tuesday or something like this. There's some kind of consistency that ties the agent together. There's a few different properties out there. And from our perspective, while I do, of course, as we discussed before, think definitions are really important, in this case, I think roughly most of the definitions of agency that are out there will all lead to this conclusion of frame dependence. I think it's pretty hard to get a reasonable definition of agency that doesn't also admit frame dependence.
Speaker 2:Okay. Can you give us some more examples of these situations where something is or isn't an agent, based on these different claims?
Speaker 1:Definitely. Yeah. Great question. In the Defining Agency paper, there are a few that are brought up to this point, and one that I quite like is to focus on, say, this source of action property, which I mentioned as the free will property. It's nice to think of an agent as a thing that could have done otherwise.
Speaker 1:So an agent has some set of options available to it, and it's choosing from among those options or picking a different course of action. And you can imagine that maybe a silly example would be to consider a brick wall that's, say, getting demolished by a wrecking ball. The wall is not making the choice to be knocked over by the wrecking ball. Instead, the interesting source of action, I think to most observers looking at the situation, would be the person controlling the wrecking ball. That's where the energy is coming from.
Speaker 1:That's where the real kind of causal chain of events bottoms out in this example. And so in that sense, we might say the agency doesn't live within the wall, because it's not "choosing" to get knocked over, quote unquote, but instead it lives within this kind of collective agent of maybe the bulldozer and the operator of the bulldozer. So that's one way in which the source of action is used to try to attribute agency to different systems. And there's this really cool paper by Kenton et al. called Discovering Agents, where they use a full causal account to identify, within a causal model, which things have something like this source of action present. And one thing they note in their paper is that depending on which causal model you pick, or which causal variables you pick, you'll actually reach different conclusions about whether or not a given subsystem has this property or something like this property.
Speaker 1:And so it's a way in which this notion of frame dependence is again rearing its head for source of action. It's what are your causal variables? And depending on how you pick those, you could reach a different conclusion about whether or not a given system has agency.
Speaker 2:So maybe the wrecking ball operator didn't really have a choice but to wreck that wall, because his boss told him he had to do it. If you go up the causal chain, it's more upstream, and the causal chain goes on all the way.
Speaker 1:That's right.
Speaker 2:What do you think of this statement? Evolution is an agent that designs adapted life forms. What do you think? Is it actually an agent?
Speaker 1:Interesting case. So while our claim is that agency is frame dependent, there is a sense in which different reference frames support more or less valid conclusions. So I think it's still early in the stage of this research, but I'm optimistic that over time, we can develop principles that will let us figure out which frames are well structured or lead to the right kinds of conclusions, maybe some kind of meta principles, minimum description length, these kinds of things. And with the case of evolution, my suspicion is that most people probably have the intuition that we do not wanna attribute agency to evolution. And so it could be a signal to us to think about what principles of choosing a reference frame will lead to reference frames that agree with that conclusion. Maybe that's one place my head goes.
Speaker 1:Another is to say, there is something to me valuable about being able to zoom in and out using these reference frames. So we can talk about macroscopic systems, similar to maybe the wrecking ball operator from before. We can talk about the agency of the operator, but we can also zoom out and invoke a different reference frame that allows us to think about the agency of the operator plus the machine around them. And we can zoom out much further to talk about the agency of, say, a village or a city or a country, a corporation, these sorts of things. That allows us to then attribute agency, and perhaps even measure it, relative to a reference frame at different scales, which maybe evolution evokes a little bit, because it's this much broader scale, both in time and space, kind of system, if it's even a system.
Speaker 2:Okay. I'm gonna jump back to an earlier paper of yours called The Three Dogmas. And you said a number of interesting things in this paper. One is that you said that we emphasize modeling environments rather than agents. And I guess I was surprised reading this section, as I immediately thought a lot of the focus of published papers has been about learning algorithms for RL agents, but maybe you're talking more about modeling in the sense that we have the MDP formalism.
Speaker 2:Is that right?
Speaker 1:That's exactly right. Yeah. So really this is hinting at the discrepancy between our ability to answer the question of what is the problem being solved or rather how do we model our environment? And in our classic two box diagram for reinforcement learning, there's a pretty clear sense of what the environment should be. And a lot of papers begin by saying, we model this problem as an agent interacting with a Markov decision process or some variant of it, like a POMDP or something.
Speaker 1:But at the same time, we don't have that canonical model of what an agent is. So I think it is really down to this modeling question as you mentioned. So, yes, Sutton had this paper. I think it was the last RLDM actually in '22 on the quest for the common model of the intelligent decision maker. I think it was a really great paper.
Speaker 1:And I think we're just adding fuel to that kind of premise that there's a big open question in our field, which is how to think about what agents are. Is there a model similar to the MDP that can give us that precision, that simplicity, that clarity that we can then build around to hopefully arrive at conclusions, coming back to our discussion about definitions, in the same way that probability can lead us to new frontiers, territory that we didn't know when we first started writing down probability or thinking about belief? The hope is that we can do the same thing with agents. If we can get our models right, get our premises right, they can take us to some new territory that might lead to some really new and exciting insights.
Speaker 2:Okay, and then you said learning is often framed as finding a solution, roughly paraphrasing. So can you talk about that? What's wrong with doing that?
Speaker 1:Yeah. For sure. So I will say as well, like, from my perspective, nothing wrong with that on its own. I think it's that if we then don't explore other perspectives or treatments of how to understand what learning is, then there's a missed opportunity. So I think we just wanted to add an encouragement and enthusiasm behind, like, other treatments of how to think about learning.
Speaker 1:In this kind of classical view, we said agents are trying to solve problems. Of course, problem solving is extremely valuable. But if there's another conception of learning that actually does differ, it's just about the point of emphasis, that starting point for research: what new frontiers might we find if we explore down these other paths? So that's really what we're after here. That, yeah, relates to the definition of continual RL paper, where we really wanted to try to think carefully about this sort of new class of agent, one where we have to think about memory quite seriously.
Speaker 1:We have to think about resource constraints quite seriously in order to contend with the challenges of the problem.
Speaker 2:And then you mentioned the reward hypothesis, the assumption that all goals are well thought of in terms of reward maximization. And you mentioned looking for a richer language for goal specification. Can you say more about that? What is hard to do with reward maximization?
Speaker 1:Yeah. Let's see. So this piece was a continuation of a line of work from a really wonderful group of colleagues on trying to understand the limits of reward. And just anecdotally, we can think about cases where you run into issues like Goodhart's law, where once you write down the reward function, you can create unintended consequences that weren't really what you wanted when you wrote the reward function down. And so reward has, over time, been notoriously difficult to specify in certain kinds of problems.
Speaker 1:With this line of work, we were getting at something a little bit different, which is to ask: if reward is taken to be the language we're using to communicate goals or purposes, is it a universal language? Are there goals that we cannot write down using reward? Like, we just wanted an in-principle kind of answer. Are there some things we can conceive of that cannot be written down with rewards? And we did end up finding, from a couple different perspectives, that there are certain kinds of tasks or goals or purposes that can't be written down, at least as a scalar reward.
Speaker 1:And so we first wanted to just highlight that fact that there is some limit to using reward. Whether or not you care about those things that can't be described, I think is a really important and maybe separate question. But we first just wanted to present that characterization of what can reward actually express.
Speaker 2:Great. I see one outstanding paper award at NeurIPS twenty twenty one for this paper on the expressivity of Markov reward, with you as first author. That's no easy feat. Yeah. That's a very interesting limitation.
Speaker 2:I guess I hadn't really thought of it before. There was just this assumption. I guess related to that, there was the paper from David Silver and company that said reward is enough, and this seems like the perfect counterpoint to that.
Speaker 1:Right. We were actually all kind of working on these at a similar time, in fact. And I think there's a sense in which they were ultimately focused on a slightly different outcome, so the two can exist in harmony in a way. But in the Markov reward paper, we were just looking at this: when we restrict ourselves to Markov reward functions, when we already have a state space present, what are the things that those Markov rewards can capture, and what do we give up by restricting to Markovian rewards? And going into it, we didn't really know whether there would be any limitations.
Speaker 1:And if there were limitations, whether they would be interesting. And through kind of poking around for a little bit of time with this group, we ended up finding a couple examples where, when we restrict to Markov reward, we could show that that class of rewards could not capture these kinds of tasks. And we found that surprising and fun and interesting, and so we wanted to share. And the two examples are as follows. So the first one had to do with effectively the fact that we're dealing with Markovian rewards.
Speaker 1:So the task of having an agent always move the same direction in a state space such as a grid world can't be captured by Markov reward, because to know whether the agent should keep moving left, you need to know what it was going to do in the future and what it already did in the past. And so in that sense, it needs to be a non-Markov reward function. But for that one you might just think, gosh, we can fix that by just moving to history based rewards or augmenting the agent state to take into account these past characteristics. The second kind of counterexample has to do with impossible outcomes and how they should impact goals. So for instance, if you have two policies, and they both do exactly what you want to do on all the states that they reach.
Speaker 1:But then in a state that neither policy actually goes to, one of the policies does something really bad that you really don't want to happen. Right? It pushes some big red button that you're not happy with. And the question is, should we care about that fact, or are these two policies actually equal from the perspective of goal satisfaction? And really, this comes down to what does it mean for a goal to be well captured by reward?
Speaker 1:And the way we chose to formalize that question, and I would say this was one of the bigger challenges as part of the work, figuring out how to ground this question and make it nice and concrete, was to say the value functions of the policies order the policies, and the value functions can only be influenced by the states that are actually reached by the policy. This is assuming we start in the start state. So the start state value function orders all the policies, but that means that impossible-to-reach states can't influence a policy's position in the ordering. So if a policy does something really bad in a state it won't reach, it doesn't impact its value. So again, it is a counterexample, but it's a counterexample with respect to the way we chose to characterize this result, the assumptions along the way, how we ground what it means for a goal to be captured by reward.
Speaker 1:So one response we got a couple times in presenting this work was, sure, you can't capture those tasks, but we don't really care about those tasks, do we? And I think there's something fair about that. But at the same time, I think it's valuable to know that these kind of examples do exist.
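A rough rendering of the expressivity question as described here, using the set-of-acceptable-policies flavour: a task given as a set of acceptable policies Pi_G is expressible by a Markov reward function R if some R makes every acceptable policy strictly outrank every unacceptable one by start-state value. The "always move in the same direction" gridworld task is the kind of example for which no such Markov R exists. This is only the flavour of the setup; the paper formalizes tasks in several ways, and the notation here is an assumption.

```latex
% Sketch of the "set of acceptable policies" flavour of expressivity: a task
% Pi_G is expressible by reward R if R strictly separates acceptable from
% unacceptable policies by start-state value.
\exists\, R \ \text{ such that } \quad
V^{\pi}_{R}(s_{0}) \;>\; V^{\pi'}_{R}(s_{0})
\qquad \forall\, \pi \in \Pi_{G},\ \forall\, \pi' \notin \Pi_{G}
```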
Speaker 2:The Goodhart's one is a really tough one. Right? How are you getting around that? Right. Yeah.
Speaker 2:Very cool. Okay. So besides your own work, can you talk about things that you find interesting in RL lately?
Speaker 1:Yeah. Honestly, I feel like the kid in the candy shop here at RLDM, because I just really love the field, I love the community, and I think people are just up to so many cool things. But there's a couple of little threads I can mention. One is this line of work that I wanna call something like an empirical science of RL, that really takes this view that we're just trying to gain understanding about systems we don't yet understand. And there's a really beautiful paper, I think Ostrovski was the lead author.
Speaker 1:It was called the difficulty of passive reinforcement learning, if I remember the name correctly. And what they showed is this: you have an active learner and a passive learner that get the same exact stream of experience, generated by the active learner's actions. The passive learner just kind of watches idly in the background, but continues to do updates along the way based on that stream of data. But crucially, its actions don't influence what data are seen next. The really surprising finding from this work is that if you ever untether the two and then allow them both to act in the world, the active learner will perform extremely well.
Speaker 1:It will continue to perform as well as it did before, maybe not so surprising. But the passive learner, its performance will degrade rapidly. It will do, if I remember correctly, effectively what it would have done at the start of learning in terms of performance. And to me, it speaks to the power of interaction, being able to try out your own ideas, try out your own hypotheses. Right?
Speaker 1:Agents or learning algorithms need to be able to falsify their beliefs in order to learn. And I think this is reflected in the fact that the passive learner could not try out the things that it was mistaken about in a way that it could learn from. Anyway, that's a beautiful paper. That was one that came to mind.
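To make the setup concrete, here is a small tabular sketch of an active and a passive Q-learner sharing a single stream of experience, where only the active learner's actions generate the data. The environment and all names here are invented for illustration, and in this tabular toy the two learners apply identical updates and so end up with identical values; the degradation described above is a deep RL finding from the paper, which this sketch does not reproduce. It only shows what "same data, only one learner acts" means.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny 5-state chain: action 0 moves left, action 1 moves right,
# and reaching the right end gives reward 1 and ends the episode.
N_STATES, N_ACTIONS, GAMMA = 5, 2, 0.9

def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward, nxt == N_STATES - 1

def q_update(Q, s, a, r, s2, done, lr=0.1):
    # Standard tabular Q-learning update toward the bootstrapped target.
    target = r + (0.0 if done else GAMMA * Q[s2].max())
    Q[s, a] += lr * (target - Q[s, a])

Q_active = np.zeros((N_STATES, N_ACTIONS))   # chooses the actions and learns
Q_passive = np.zeros((N_STATES, N_ACTIONS))  # learns from the same stream, never acts

for _ in range(200):
    s, done = 0, False
    while not done:
        # Only the active learner picks actions (epsilon-greedy on its own values).
        a = rng.integers(N_ACTIONS) if rng.random() < 0.1 else int(Q_active[s].argmax())
        s2, r, done = step(s, a)
        # Both learners update on exactly the same transition.
        q_update(Q_active, s, a, r, s2, done)
        q_update(Q_passive, s, a, r, s2, done)
        s = s2

def greedy_return(Q, max_steps=50):
    # Evaluate the greedy policy implied by a value table.
    s, total, done, t = 0, 0.0, False, 0
    while not done and t < max_steps:
        s, r, done = step(s, int(Q[s].argmax()))
        total += (GAMMA ** t) * r
        t += 1
    return total

print("active :", greedy_return(Q_active))
print("passive:", greedy_return(Q_passive))  # identical here; see the note above
```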
Speaker 2:That's like the difference between online and offline learning, and maybe between levels of causality. Yeah. Like not being able to try to test counterfactuals and...
Speaker 1:That's right. I think so. And the thing that's quite striking is that the data itself is the same, and I think there's often this view that if we just cracked exploration and got this perfectly curated dataset, then maybe we'd be able to train up the right learning algorithm to solve a given problem. But this is suggesting that there's actually an interplay between the learning trajectory and the generation of the data that's quite important as well.
Speaker 2:While you're here, is there anything else you wanna message the audience?
Speaker 1:Cool. So I'd be happy to share maybe a few pieces of research advice that I find myself kind of communicating somewhat regularly, that I really believe in. And one really comes from this little paper by Uri Alon called How to Choose a Good Scientific Problem. I'd encourage folks to read it. It's quite a fun and quick read, and I keep coming back to it.
Speaker 1:Alon has this little figure: on the x axis, I believe, it's the perceived difficulty of the problem, and on the y axis, it's the perceived gain in knowledge that you might expect if you were to sort of succeed in, you know, working on this problem or research question. And one of the first points that you can draw from this little diagram is that research topics in this kind of bottom left region, very difficult research questions that you'd anticipate would materialize only in a small gain in knowledge, are often not the right questions to study. Right? If you want to work on something really hard, you'd hope that the potential understanding you'd gain if things go well is big.
Speaker 1:And similarly, the second thing that Alon mentions with this plot is to think about the sort of Pareto frontier along it, where at the far right, you're looking at easy problems that materialize into maybe a small gain in knowledge, and then you can kind of walk yourself up to larger gains in knowledge, but at the expense of maybe a more difficult question. And it's really this risk profile that I think is really useful for thinking about the right time and maybe place to work on certain kinds of research questions. So what timeline do you have? When do you need to maybe apply for your next position? How much sort of freedom do you have?
Speaker 1:These sorts of things. And that can help you think about where you should live along that risk profile. The second thing that I always come back to really comes from Hamming's piece, You and Your Research, and is about what I like to call the sandbox, which is, like, how you get started on a given research topic. The way Hamming puts it is to say, you should have some kind of plan of attack in order to make progress on your research question. And I really like this notion of assembling a sandbox with your collaborators.
Speaker 1:That's the opportunity to play, the opportunity to get familiar with a topic, play with ideas, play with experiments, whatever it might be. But it's some way to get started where you can really start to build up intuition for a given space that can hopefully then over time gradually transform into greater and greater insights or deeper and deeper questioning. Now a third thing I often come back to: folks often say to find great mentors, and certainly, I feel a deep sense of appreciation for the mentors I've had throughout my life. So I definitely second that. But I also wanna add another category of person, or group of people, to watch out for throughout your career, which are those people kind of in your rough era, your kind of career siblings or pals or something like this, that are at a similar career stage to you.
Speaker 1:And you'll often find that as you navigate your career with these people, you can kind of lean on each other, support each other, and help to navigate difficult questions and decisions in a way where you can share insights, share feedback. And I feel, again, really grateful to have found some great folks in this space. And I regularly kind of turn to those people for advice, and for support, and for collaboration.
Speaker 2:Doctor David Abel, thanks so much for doing this with me today.
Speaker 1:Amazing. Thank you so much, Robin. Yeah. My pleasure.