Thomas Akam on Model-based RL in the Brain
Talk RL. Talk RL podcast is all reinforcement learning, all the time. Featuring brilliant guests, both research and applied. Join the conversation on Twitter at talk r l podcast. I'm your host, Robin Chohan.
Speaker 1:I'm very glad to be joined by Professor Thomas Akam. Professor Akam is a neuroscientist at the Oxford University Department of Experimental Psychology. He is a Wellcome Career Development Fellow and associate professor at the University of Oxford, and leads the Cognitive Circuits Research Group. Welcome, Professor Akam.
Speaker 2:Thanks, Robin. Thanks for having me on.
Speaker 1:How do you like to describe your area of focus?
Speaker 2:So I'm a behavioral neuroscientist, so I'm interested in how the brain generates behavior. And within that, I am interested in flexible behaviors, so how we can quickly adapt to different situations. And one of the ways that I think the brain's able to do that is by having very rich internal models of the outside world, which allow us to predict the consequences of our actions before we do them. And so I'm really interested in trying to understand how these models are learned and how they're used for action selection.
Speaker 1:How would you describe the holy grail that you're pursuing? Like, how will you know you're done, in terms of what you're pursuing?
Speaker 2:So I think it's very unlikely that we'll be anywhere close to done by the time I retire. The brain is, you know, a very complicated system. People have been working on it for well over a hundred years at this point. And I think that the amount that we still don't understand probably greatly outweighs what we do understand, although we do have a great deal of knowledge now about brain function, and we are starting to understand some of the principles.
Speaker 2:But I certainly don't think we're likely to have fully solved it in the near future. I think the types of answers that I find satisfying in my research are really trying to understand the types of learning algorithms that are implemented in the brain, which allow us to behave adaptively, and how they map onto brain circuits. So I think if I was able to make a significant contribution to understanding this kind of model based decision making, which underlies flexible behavior, I would be very satisfied.
Speaker 1:So we met at the RLDM conference in Dublin last month. That's the Reinforcement Learning and Decision Making conference, where you gave a fascinating tutorial on the mapping between computational RL and RL in the brain. Can you describe for our listeners some of the main ideas from that talk at a high level, and help us see to what extent RL in the brain might be similar or different to the deep RL algorithms that our listeners might be more familiar with?
Speaker 2:Yeah. So RLDM is an interdisciplinary conference which brings together people from neuroscience and psychology with people from computational machine learning and reinforcement learning. And at the start of the conference, to try and promote interaction and understanding between the communities, they have a pair of tutorial lectures, which are really introductory tutorials where a neuroscientist talks to the machine learners about brain reinforcement learning, and a person from the dry side, the machine learning side, talks to neuroscientists about their field. So I was asked to do the lecture for the machine learners, and what I tried to do with it was to convey the real success story that has emerged over the last twenty years around how certain ideas from reinforcement learning appear to map onto certain brain circuits. Most notably, the idea that the activity of dopamine neurons, which are a particular class of neuron in the midbrain, seems to correspond in many situations to this temporal difference reward prediction error signal from reinforcement learning theory.
Speaker 2:And this has really been a very influential idea, which has been fleshed out a lot. It was developed originally in the kind of late nineties and has really prompted an enormous amount of experimental work. And so what I was trying to do is to convey the sort of classical standard version of how that story works and talk about some of the evidence that's consistent with it, but also to highlight some of the complexities and some of the things that don't fit so readily into that picture, to try and give a balance between the things that we really do have some degree of confidence in, while not obscuring the fact that there's really a lot that we still don't understand.
Speaker 1:Can you tell us a bit about the different learning paradigms in the brain and how RL fits into the other types of learning?
Speaker 2:This sort of classical story about dopamine and temporal difference reinforcement learning really applies to this region of the brain called the basal ganglia, where you have these dopamine neurons and this signal that we think, at least in many situations, looks like a reward prediction error. So the difference between how good you thought things were gonna be and how good they turned out to be, taking into account your best beliefs about the long run future, not just immediate rewards. This signal converges in a region called the striatum with input coming from the cortex. The cortex is the outer layer of the brain. If you're looking at a human brain, it looks like a walnut, and those folds that you're seeing are the cortex. You can think of it as a large two-d sheet of neurons, which in the human is sort of folded to fit a larger surface area into the skull.
Speaker 2:In the mouse, it's completely smooth over the surface. And so I guess the kind of classical picture would be that the cortex is representing the state of the external world. The cortex is thought to be hierarchically organized, and so in the visual system, for example, low down in the hierarchy you'll get neurons that respond to very simple visual features like oriented edges, and as you move up the system, you get neurons that respond to progressively more complex features like faces or particular objects. So it's thought that the cortex is essentially learning a state representation which describes what's out there in the external world, and then that provides an input to the striatum, where dopamine can essentially drive learning about the values of different states and actions.
Speaker 2:So that's kind of the classical picture, I guess. Exactly how the learning in cortex works, how the representation learning works, isn't fully understood, but one very influential set of ideas comes from what's called predictive coding. The idea would be that essentially cortex is trying to predict its own sensory inputs, and that's organized hierarchically. So low level cortical areas are trying to directly predict the firing of the inputs from the sensory system, and then the high level cortical areas are trying to predict the activity of the low level areas. And in this sort of hierarchical structure, you eventually learn these very rich representations of the external world.
Speaker 2:And so that, I guess, we could think of as being very much related to self supervised learning in machine learning, where essentially you're trying to predict the next observation. So I guess those were the two main systems that I talked about in the talk: the cortex as this sort of self supervised learning system, and then the basal ganglia potentially implementing temporal difference reinforcement learning. But one sort of complication which I discussed in the lecture is that cortex does seem to be really a very active player in action selection.
Speaker 2:So there are projections from cortex directly to the spinal cord and to the brain stem, which are areas that are critical for actually generating movements and controlling the muscles. And there are also these feedback loops where the basal ganglia, where we think reinforcement learning may be happening, is actually able to gate activity in cortex. And again, that's something which doesn't fit so neatly, at least into the very simple picture of cortex as a state representation and basal ganglia as doing reinforcement learning. And I guess I was trying to convey some of that complexity along with the very classical story in this lecture.
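To connect this to the RL formalism, here is a minimal sketch of the temporal difference reward prediction error described above, where `delta` is the "better or worse than expected" signal that dopamine activity is thought to resemble. The state count, learning rate, and discount are illustrative values, not from the interview.

```python
# Minimal sketch of a temporal-difference reward prediction error (TD(0)).
import numpy as np

n_states, alpha, gamma = 5, 0.1, 0.9   # illustrative values
V = np.zeros(n_states)                 # state-value estimates

def td_update(s, r, s_next, done):
    """Apply one TD(0) update and return the reward prediction error."""
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]              # reward prediction error
    V[s] += alpha * delta              # value learning driven by the error
    return delta
```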
Speaker 1:Can you say more about when the brain chooses to use model free versus model based RL?
Speaker 2:I guess it's maybe just worth unpacking a little bit what we mean when we talk about a model, both in reinforcement learning and in neuroscience. Classically in reinforcement learning, a model is something which takes the current state and a possible action, and it makes a prediction about what will happen next, so about the next state. One very influential set of ideas about how model based and model free reinforcement learning might relate to behavior comes from a paper by Nathaniel Daw and colleagues in the mid-2000s, where they proposed that a psychological distinction between goal directed actions and habits might map onto a computational distinction between model based reinforcement learning and model free reinforcement learning. And just to briefly unpack what people mean by goal directed actions and habits: those terms have a kind of colloquial meaning. We can think about what it means to be goal directed, or we have an intuition about what habits mean.
Speaker 2:But in the nineties, psychologists really operationalized that distinction using particular tests. So for example, to test whether a subject was doing an action to attain a particular goal, you could train, say, a rat to press a lever to get a reward, and then you could devalue that reward, for example by pre-feeding the rat with as much of that food pellet or whatever, so it didn't want to eat any more. And what you could see is that devaluing that outcome reduced the tendency to press the lever, even if in that critical test session you weren't delivering any more of the outcome. So essentially, what that demonstrates is that at least some kind of rudimentary forward model is being used, which says not simply 'doing this action is good', but actually 'doing this action will lead to this specific outcome'.
Speaker 2:And so it's a sort of rudimentary form of planning. So, yeah, that was kind of one key way of operationalizing goal directed behavior: essentially, whether it's sensitive to changing the value of the outcome. Whereas what was observed was that under some circumstances, often with very extensive training, you could get behaviors which essentially were insensitive to changing the value of the outcome. And this kind of captures something about the day to day notion of habits: sometimes you'll leave the house to go somewhere you don't normally go, and you find that you've walked halfway to work before you realize you're going the wrong way. You can go into autopilot and just follow sequences of actions without really thinking why you're doing them.
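As a toy illustration of why the devaluation test separates these two accounts, here is a small sketch (hypothetical action and outcome names, not from any experiment described here): a forward model that maps action to outcome immediately reflects the devalued outcome, while a cached, model free value does not until it is relearned from new experience.

```python
# Toy sketch of outcome devaluation: model-based vs. model-free evaluation.
cached_value = {"press_lever": 1.0}              # model-free: cached value from past rewards
forward_model = {"press_lever": "food_pellet"}   # model-based: action -> predicted outcome
outcome_value = {"food_pellet": 1.0}

# Devaluation (e.g. pre-feeding until the animal no longer wants the food).
outcome_value["food_pellet"] = 0.0

model_based = outcome_value[forward_model["press_lever"]]  # 0.0 -> stops pressing
model_free = cached_value["press_lever"]                   # still 1.0 -> keeps pressing (habit-like)
print(model_based, model_free)
```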
Speaker 2:Yeah. So there's this very influential work from Nathaniel Daw and colleagues in the mid-2000s, who proposed that this distinction might map onto the computational distinction between goal directed behavior really being when you're using a forward model, and habitual behavior being when you're not. And I think that's one of a number of theories accounting for this distinction. I think there's probably a lot that's right about it, but it's probably not the full story. I think also something which is perhaps becoming increasingly apparent is that we also use our models of the world to just represent the world.
Speaker 2:So in a sense, even the model free reinforcement learning system that we think may be instantiated in the basal ganglia and the dopamine system is operating over a representation of the world which is itself learned, and which you could think about as being a model. And it may even be that the way that you learn that model is essentially by trying to predict the consequences of actions. One of the properties that you would like in a state representation, i.e. a way to represent the external world, is that the representation sums up your best estimate of the state of the external world and contains all the information that you need to act or to predict the future.
Speaker 2:And one way of learning these types of Markovian state representations, as they're called, is essentially just to try and learn to predict the consequences of actions. And so, yeah, I think it's becoming increasingly apparent that even the processes that we might have conceived of as being model free, i.e. they might be using just basic temporal difference reinforcement learning to actually work out what to do, are operating over a state representation that is itself like a very rich model of the world, and that does to some extent complicate, I guess, understanding what's going on.
Speaker 1:So it sounds like the line between model free and model based might be less crisp. Is that what you're saying?
Speaker 2:I think that's probably fair to say. That said, I think there are processes going on which really are very reminiscent of, for example, the kind of simulations of possible futures that are really characteristic of at least some kinds of model based algorithms in RL. So one really fascinating area of neuroscience is this phenomenon of internally generated sequences in this brain region, the hippocampus. The hippocampus famously has cells that fire in particular regions of space, called place cells. So if you record in this region as a rat wanders around in its environment, one cell will fire in one location and another cell will fire in another location.
Speaker 2:And so you can decode where in the environment the rat is from which cells are firing. And this really remarkable phenomenon happens where the cells don't just represent where the rat is right now, but actually play out sequences which represent possible locations or possible trajectories through the environment. That happens while a rat is, say, wandering around, where what you see are these sequences that essentially move from the position of the animal to shortly in front of it, with this sort of oscillatory rhythm at about eight hertz called the theta rhythm. They seem to be doing a sort of short range exploration or sweep in front of the animal. But then there's another phenomenon where, either during sleep or during periods of quiet restfulness, you get a network activity pattern where a lot of the cells in this region will fire in a short time window and will play out a possible sequence through the environment on a very compressed temporal scale.
Speaker 2:There are various ideas about what these internally generated sequences might be doing. They're probably serving a number of roles in the brain, but certainly one idea is that they might be doing something a little bit like this Dyna algorithm in reinforcement learning. So you could play out a possible sequence of states, and then you could effectively update value estimates in the basal ganglia using that replayed sequence. So, yeah, there are processes that really do look very reminiscent of ideas from model based reinforcement learning happening in brain circuits.
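For listeners who haven't met Dyna, here is a minimal sketch of the idea being referenced: real experience updates values and is stored in a simple model, and extra "replayed" updates are then drawn from that model offline. The hyperparameters and structure are illustrative, not from the interview.

```python
# Minimal Dyna-Q-style sketch: learn from real transitions, store them in a
# model, then do extra value updates from replayed (simulated) transitions.
import random
from collections import defaultdict

alpha, gamma, n_replay = 0.1, 0.9, 10
Q = defaultdict(float)        # (state, action) -> value estimate
model = {}                    # (state, action) -> (reward, next_state)

def q_update(s, a, r, s_next, actions):
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def dyna_step(s, a, r, s_next, actions):
    q_update(s, a, r, s_next, actions)       # learn from real experience
    model[(s, a)] = (r, s_next)              # remember the transition
    for _ in range(n_replay):                # offline "replay" from the model
        (ps, pa), (pr, pn) = random.choice(list(model.items()))
        q_update(ps, pa, pr, pn, actions)
```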
Speaker 1:You mentioned, and you referred to in your talk as well, Rich Sutton's Dyna architecture, which supplements real environment experience with simulated experience from the model. And the learning agent in Dyna doesn't really care which one it's experiencing; it learns in the same way. So I guess my question here is: do some parts of the brain not know or care whether they're experiencing the model's predictions or the real world?
Speaker 2:These sharp wave ripple sequences, which are these very fast sequences that happen during these activity bursts, during some phases of sleep, and during just quiet restfulness or reward consumption, this sort of thing, are happening on quite a different time scale from natural behavior, so they really are accelerated relative to natural behavior. And the brain is, I guess, in a different activity state. So that may well be driving learning, but the sequences are happening on a different time scale from what's happening during awake behavior. But then you also have dreaming. During sleep there are different phases: you have REM sleep, where the brain activity looks more similar to the awake state, and then you have slow wave sleep, where it's showing these much slower oscillations.
Speaker 2:And so my understanding is that during dreaming, you see activity which can resemble that in the awake state, and there are some really remarkable findings looking at how the brain maintains an estimate of heading direction. There are cells that basically code for which way your head is pointing in the environment. This is work which I'm not super familiar with, so I may be getting this wrong, but my understanding is that during rapid eye movement sleep, you get these movements of the eye, which happen when you're dreaming. And my understanding is that in this region of the brain, I think in mice, where there are cells that represent head direction, you see that these fast eye movements that occur during sleep actually correspond to changes in this representation of heading direction, in the circuit which represents heading direction in the awake state. So it really suggests that there is a kind of internal simulation happening of action sequences, potentially through the environment, and exactly how they're being used for learning, I think, remains pretty poorly understood.
Speaker 2:And certainly, subjectively, the fact that we have this experience during dreaming which resembles real experience, I guess that's also consistent with this idea that there are some states where the brain is internally generating activity which at least some parts of the brain perceive as being very much like awake activity. But, and I'm not an expert on sleep, I should say, my understanding is there are also substantial differences in neuromodulatory tone, so I guess which neuromodulators are active.
Speaker 2:And so, for example, I'm sure that there are ways in which you want to treat experience occurring during dreams differently from experience in the awake state. So yeah, my understanding is that there are activity patterns that occur during sleep that look like they could be a simulation of awake activity, but there are likely also important differences in how learning happens in the awake and sleeping states.
Speaker 1:Can you talk about the difference between learning in RL and then inference with RL? I guess in computational RL, it's pretty common to separate those phases, or to have a deployed policy that is not learning, like with LLMs that have deployed RL. So in the brain, once something has been learned with RL, are different circuits used to execute that policy, or is it really always the same, always-on learning?
Speaker 2:I think one big difference between the way reinforcement learning is used in machine learning applications and what's happening in the brain is that in most applications of machine learning, you do have this very clear separation between a training stage, where you're updating the weights of the model, and then deployment or test, where you're keeping the weights fixed but using the model to do some application or solve some problem. And obviously that's very different from the situation for biological brains, because biological brains don't have the luxury of being able to use large amounts of experience just as training data. Every moment when you're alive is important. If you die when you're a child, you don't get to just start again. So there's this problem that you have to be behaving adaptively at all times.
Speaker 2:You don't have the luxury of just training a huge amount before you actually have to behave. So in that sense, I think that's a big difference between machine learning and biology: there isn't a clear distinction between training and test. I think it is true to say, however, that when you first learn something, as learning is consolidated, you can change the circuits that you're using. This transition from goal directed actions to habitual behavior appears to be one example of that, where essentially the brain circuits which are necessary for goal directed actions, so for when you're thinking 'really, I'm gonna do this action to obtain this outcome', appear to be at least partially different from the brain circuits that would mediate the exact same action when it has become very habitual.
Speaker 2:Another sort of story that has a bit of that flavor is this idea that when you first experience something, the episodic memory, so the memory of precisely what happened yesterday, the particular sequence of things you did, appears to be stored in a structure called the hippocampus, which is also, interestingly, where you see these internally generated sequences. But then over time, at least some of the content of our memory gets transferred to cortex in this process known as memory consolidation. So famously, there was this patient, H.M., who back in the fifties, I think, had his hippocampus removed bilaterally in order to try and cure his epilepsy and was then unable to form any new long term memories. But critically, he still retained a lot of memories from earlier in his life. And this really, I think, kicked off this whole idea of memory consolidation and of transfer of memories from a shorter term storage to a longer term storage system.
Speaker 1:How does exploration work in the brain?
Speaker 2:I guess this is something I wouldn't say I'm particularly expert on, but there are a few things to say about it. There is a lot of stochasticity, or apparent stochasticity, in the choice behavior of humans and animals when they're doing, say, value guided tasks, and it's not completely clear the extent to which this represents directed exploration versus, I guess, just noise. There do appear to be some brain regions, particularly bits of frontal cortex, which are selectively activated when you take an action which is not the action that you think is best, but which might provide more information. So that does suggest perhaps a kind of active process guiding exploration.
Speaker 2:Also interestingly, the dopamine neurons which appear to carry this reward prediction error signal also respond to surprising events, even value neutral events that are surprising. And that response also does appear to be important for learning. That's always been quite mysterious to me, this aspect that the signal which appears to be important for learning values also seems to respond to surprising things happening. You could imagine that this might reflect coupling of some kind of exploration bonus or novelty bonus into the actual reward system itself. I guess another relevant thing might be that there is often very highly structured exploratory behavior: when an animal is allowed to explore a new environment for the first time, you'll see very structured exploratory behavior, which appears to be balancing, I guess, the risk of predation against the desire to gain more knowledge of the environment.
Speaker 2:So I think that exploration does appear to be something which is both actively regulated by the brain and potentially also coupled into the brain's reward learning system. But it's certainly not an area that I have really strong ideas about from a theoretical perspective.
Speaker 1:So you mentioned a bit about spatial neurons, and I've read about different types of spatial neurons. But can you talk about how time is modeled in the brain?
Speaker 2:Yeah. So there are various different lines of work that, I guess, speak to this, and on different time scales. On the short time scale of, I guess, seconds, sensory events will often drive not just a completely transient response, but a response that evolves over some number of seconds immediately after the event, which you could conceptualize as perhaps acting as basis functions in time for then learning about the consequences of that event. But then there are also tasks where you have to, I guess, keep track of time.
Speaker 2:So I'm thinking of work from Eva Pastalkova when she was in Buzsáki's lab, where she had rats running on a treadmill, and I think they had to run for a certain amount of time during the delay period of a working memory task. And she observed this phenomenon that there were cells that tile this time interval during the delay period. So that really seemed to be quite an explicit representation of time, but it was in the context of a learned task, and so that perhaps might be a learned representation. And then another really interesting observation about the representation of time was published by the Moser Lab a couple of years ago, in a region of the brain called the entorhinal cortex, where what they saw followed salient changes in the environment.
Speaker 2:So I think they were recording from rats and switching them between environments with black walls and white walls, but it was really just creating a salient change in the perceptual environment. They would see neurons whose activity would either ramp up or ramp down over really long time scales, sort of minutes or tens of minutes, following these changes. And you could imagine that as providing a sort of representation of the time elapsed since particular salient events, perhaps in the form of basis functions, which might allow you to learn, or to tag memories with, when they happened relative to other events. So there's certainly a rich representation of time in the brain, but exactly how that's learned and used, I think, is still very much being worked out.
Speaker 1:So you talk a lot about the relationship between reward prediction error and dopamine, and I understand you challenged some of the canonical theory in this area. Is that right? Can you tell us about the basic theory and then also what you found that challenged it?
Speaker 2:Yes. I mean, I think there's really a lot that's right about that theory, and I don't want to give the wrong impression: that's the main theoretical framework in which I think about dopamine, and I think there really is a lot that's right about it. I think what you're referring to is the Blanco-Pozo et al. paper that we published last year, where what we found was that, perhaps contrary to our expectations, manipulating, that is stimulating or inhibiting, the activity of dopamine neurons at the time of trial outcome in a behavioral task didn't drive changes in the subject's choices on the next trial. That's contrary to the idea that dopamine reward prediction errors in this behavior were acting on a fast time scale, from trial to trial, to update the values or policies. Instead, what we think is happening there is that dopamine is acting over a much slower time scale of task acquisition, and the fast changes in behavior from trial to trial, in response to changes in where rewards are available, we think are likely mediated by changes in a sort of persistent activity or recurrent activity in cortex.
Speaker 2:And we modeled that as being learned through self supervised learning. So we had a model in which a recurrent network, which we were sort of using to model frontal cortex, was just learning to predict the next observation: doing self supervised learning, essentially, to predict the next observation given the previous observation and the action. And in doing so, this network effectively learned to estimate the underlying state of the task, which is effectively to learn a kind of Markovian representation in this partially observable task. And then we conceptualized the basal ganglia as learning values and policies with reinforcement learning over this state representation.
Speaker 2:But critically, this model provides two different channels for information about the past to influence the future. You can have, on a very slow time scale, these synaptic weight updates, which we conceptualize as happening through self supervised learning in cortex and reinforcement learning in the basal ganglia. But then you have this other channel for information to be carried forward through time in the form of recurrent activity. And we're certainly not the first people to propose ideas like this; there's a very influential paper from DeepMind a couple of years ago where they considered a very related model in which a recurrent neural network is trained, in their case just using reinforcement learning, over a slow time scale of task acquisition, and then recurrent activity, so just the activity reverberating in this sort of recurrent network, is responsible for fast time scale changes in behavior.
Speaker 2:And so I guess that work and our work are really highlighting how changes in synaptic weights and changes in activity provide these two different channels for the past to affect the future and to affect action selection.
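To make the two-channel idea concrete, here is a rough sketch of that kind of architecture: a recurrent network predicts the next observation from the previous observation and action (slow, weight-based learning), and its hidden state doubles as the state representation for a value head (the RL part), while fast within-task adaptation can live in the recurrent activity itself. All dimensions, layer choices, and names are assumptions for illustration, not the published model.

```python
# Sketch: a recurrent "cortex" learns by next-observation prediction; a value
# head ("basal ganglia") learns over its hidden state. Weights = slow channel;
# recurrent activity (h) = fast channel carrying information forward in time.
import torch
import torch.nn as nn

class TwoChannelSketch(nn.Module):
    def __init__(self, obs_dim=8, act_dim=4, hidden_dim=32):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim, hidden_dim, batch_first=True)
        self.predict_obs = nn.Linear(hidden_dim, obs_dim)    # self-supervised head
        self.value = nn.Linear(hidden_dim, 1)                 # RL head over the state

    def forward(self, obs, act, h=None):
        # obs: (batch, time, obs_dim); act: (batch, time, act_dim), e.g. one-hot
        x = torch.cat([obs, act], dim=-1)
        states, h = self.rnn(x, h)                             # recurrent activity
        return self.predict_obs(states), self.value(states), h

model = TwoChannelSketch()
obs, act = torch.randn(1, 10, 8), torch.zeros(1, 10, 4)
next_obs_pred, values, h = model(obs, act)
# Self-supervised loss: predict observation t+1 from the state at time t.
pred_loss = nn.functional.mse_loss(next_obs_pred[:, :-1], obs[:, 1:])
```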
Speaker 1:So when someone like you looks at the human brain, do you see parsimonious, efficient structure, where everything is in its place and there's a place for everything? Or, you know, I've heard that, for example, in human DNA, a lot of the DNA is non functional, repetitive, or otherwise not crucial. And if you look at the organs in the human body, most organs clearly have a purpose, and maybe the few exceptions more like prove the rule. So do you see the brain as very logically organized, with everything having a function, or do you see a lot of chaos and accidental design?
Speaker 2:Brain tissue is very metabolically expensive. I believe in the human body, the brain is something like 2% of the weight of the total body, but it consumes about 20% of the energy that the body uses. And that's basically because these are electrically active cells that are maintaining negative voltages inside the cell. So you're both expending energy pumping ions across the cell membranes, and also synaptic transmission is energetically expensive: you're reuptaking the neurotransmitters and packaging them into vesicles and all this stuff.
Speaker 2:So basically, running brain tissue is energetically expensive for the body. And as such, there's really, I'm sure, a very strong evolutionary pressure to not have more brain tissue than you are actually using. So I guess my default assumption is that we have as much brain as has proved useful and no more. And actually, there are these nice organisms that have a brain when they're in a larval stage swimming in the sea, and then when they attach themselves to a rock, they just kind of digest their brain because they don't need it anymore.
Speaker 2:And I guess that kind of hints at just how expensive this tissue is. The brain is also just a really, really beautiful structure; it's really exquisitely organized. And yes, I think essentially the evolutionary pressure on the brain to do as much as possible with as little of this very expensive tissue as possible has made it into an extremely efficient and well organized structure. And this is particularly highlighted when you look at animals with very small brains.
Speaker 2:So Drosophila has a tiny brain with just a couple of hundred thousand neurons, but it's still able to do some really remarkable things, both in terms of generating adaptive behavior and in terms of basic memory and learning. And really remarkably, it appears that dopamine neurons in the fly are really important for reward processing. So it's kind of remarkable that there's at least some commonality in the function of this system across brains that are separated by hundreds of millions of years of evolution.
Speaker 1:So I understand that the human brain is vastly more efficient than computational AI running on silicon. Can you talk about possible reasons for this efficiency gap? Do you have any insight on that?
Speaker 2:So this is really not my core area of expertise at all, but I guess one difference between biological neurons and artificial neural networks is this combination of analog and digital processing. Essentially, the neuron maintains the voltage inside the cell at roughly a hundred millivolts lower than the voltage outside. And when synaptic input comes in, excitatory input pushes that voltage upwards, and inhibitory input pushes it downwards. And that's really a very analog process, where you have a continuous signal within the dendritic arbor of the cell.
Speaker 2:So you've got the cell body, and then you've got the dendrites, which are this tree-like input structure where synapses come in as inputs to the neuron. The processing within the dendrites and the cell body is really very analog, with these continuous voltage fluctuations being driven by synaptic input. But then when the voltage rises above a particular threshold, you get this discrete event called a spike, where essentially these voltage gated ion channels in the cell membrane open and pull the voltage up and then back down again, and you get this very fast, of order a millisecond or two, voltage spike, which spreads right down the axon, which is the output structure of the cell. The cell body and the dendrites, the input structures, are very spatially compact; they occupy a really small volume of brain tissue. But the axons can go a long way, to a whole other area of the brain.
Speaker 2:And I guess one way that contributes to efficiency is that you are doing all of the integration, and I guess also the learning, with these analog signals in very localized regions of the cell, in the dendrites and the cell body. And then the actual communication over long distances within the brain is done with this much more digital-like signal of sending these action potentials, these spikes, down the axon, which then trigger synaptic release. So I think that combination of analog and digital processing is probably an important part of it. But as I say, the energetics and energy efficiency aspects of brain tissue are not something that I'm really that expert on.
Speaker 1:So when you look at a modern LLM, what do you see as a neuroscientist? On one level, it's a very simple feed forward architecture; conceptually, it's almost like a single block. What do you make of modern LLMs, and what do you make of the idea that some people think we can get to AGI by scaling up this type of thing? How does that sound to you?
Speaker 2:You know, this is really, I would say, outside my core area of expertise, so take what I have to say with a pinch of salt. I guess one very salient aspect of LLMs is that they are sequence prediction machines. The core training is basically next token prediction on this enormous corpus of text taken from the Internet. And I certainly think that next sample prediction is a really important way that the brain learns.
Speaker 2:And so, as I was talking about earlier, I think that this sort of self supervised learning of predicting the next observation is really important for how cortex learns representations of the outside world. However, there is an important distinction there, in that the brain is also generating actions. It's learning a model in which you have a sequence of observations but also a sequence of actions, and so you ultimately learn things about how your actions affect the state of the world. Whereas, at least during the training phase of an LLM, they're simply doing sequence prediction. One thing I often wonder about with LLMs, and I've read some of the discussion about this but don't feel like I'm necessarily well enough informed to have a strong opinion on it, is the extent to which this next sample prediction done just over text really learns a good model of the world, or whether it really learns a model of the text that talks about things in the world.
Speaker 2:So for example, I've read a few things recently talking about how LLMs can often give very cogent accounts of the rules of chess, or rhyming, or things like this. But actually, when you then get them to play chess, they will try and make illegal moves, or they just won't play very well, or this sort of thing. So it does appear that in LLM training, you get something which is extremely good at talking about something and showing what looks like a very high level of understanding, but it's not clear that it's really built a good underlying model of the dynamics of the system being described by the text, as opposed to the generative process of the text itself. For some applications like generating code, there maybe isn't really a very clear distinction, because the text in that case is the thing you're trying to generate. I know my students use them for coding a lot, and my limited experience is that for simple things, they can be great.
Speaker 2:I've not used them enough in this domain to really have a good sense of how well they perform on really complex problems. But in terms of whether or not you can learn a good model of the causal structure of the world just by doing next sample prediction on sequences of text, I think that's really unclear to me. So I guess I'm somewhat skeptical that simply scaling LLMs will generate truly intelligent systems. But as I say, that's really quite a long way from my area of expertise.
Speaker 1:Can you give us a little bit of a sample of the current research that you're doing and what you're looking at right now?
Speaker 2:Yeah. So really the core thing that I would like to understand is how planning is implemented in the brain. I suspect that there probably isn't going to be a unitary answer. I think that this model based versus model free distinction, which has been really influential in neuroscience, is undoubtedly gonna end up more granular than that. And so I think, for example, these kind of Dyna-like mechanisms that happen through offline replay will be part of that story, but I think another aspect of it may be systems that use a kind of attractor dynamics in frontal cortex to effectively infer plans.
Speaker 2:Perhaps that would be something that's applicable in sort of simpler decision problems. So, yeah, really trying to understand how brains use models of the world to plan action sequences to achieve their goals, that's my core focus at the moment. As a behavioral neuroscientist, I mostly work with mice as a model organism, and we're doing a lot of work with navigation in complex mazes, where we essentially cue goal locations for the mice and they navigate to them to get a reward. And then we try and understand how both the environment and their own behaviors are represented in the brain.
Speaker 2:Ultimately, that's with a view to trying to understand how these internal models or cognitive maps of the environment are used to generate action sequences. We're just in the process now of writing up some of the first work from that line of research, where we've essentially been trying to understand what the representations of the behavior are in frontal cortex during this goal directed navigation. And, yeah, I'm excited about that line of research, and that's really the main thing we're focusing on currently.
Speaker 1:And the types of experimental methods, are they evolving?
Speaker 2:Yeah. So there's been really incredibly rapid tool development in neuroscience over the last years and decades. I'm not primarily a tool developer; I develop some open source tools for running experiments, but I don't push the technological boundaries of tool development. But I'm an avid tool user, and really, there's just been remarkably fast movement on that.
Speaker 2:So one area of development has been electrophysiology, so measuring voltage signals in the brain, and the development of new electrodes for that, particularly these Neuropixels probes, which essentially allow you to record from several hundred, currently sort of 380 or something, different sites on one probe, very closely spaced in the brain. It's essentially a probe made out of silicon, with active electronics on the probe that do the amplification and digitization of signals, so that you can record from about 400 sites on the probe at 30 kilohertz, and all of that data can go through two very fine wires. What this lets you do is record the activity of individual brain cells at millisecond resolution, but critically to record from up to several hundred cells simultaneously during behavior. If you just look at the number of cells that it's been possible to simultaneously record during behavior, it's essentially been a kind of exponential increase over decades. And that's really transformative, because it allows you to ask questions that you just couldn't ask if you're recording one cell at a time, about how the population of neurons together is representing behavior.
Speaker 2:So that's one area where the tools have really moved extremely fast. I guess another area is in the use of genetically encoded constructs to either read out or manipulate brain activity. Optogenetics is one very famous example of that, where essentially you can use, for example, a virus to express a construct in a particular genetically defined group of neurons, such that when you shine light on the neurons, it either drives activity of the cells or inhibits it. That ability to specifically target a very precisely defined group of neurons in the brain and drive or inhibit activity is, again, something which was just not possible twenty years ago and is now absolutely routine. And then another very powerful technique is essentially expressing constructs in cells which report some aspect of the physiology as a change in fluorescence.
Speaker 2:One of the early applications of that was these genetically encoded calcium indicators, which change their fluorescence with the level of intracellular calcium, which correlates with activity. My lab doesn't do a lot of calcium imaging, but people can now routinely record calcium activity from thousands or even tens of thousands of cells using two photon microscopes. And then recently, tools have been developed, again genetically encoded, which essentially couple a neurotransmitter receptor to a fluorescent protein in such a way that when the receptor binds a particular neurotransmitter, the fluorescence changes. And so that basically lets you read out, for example, dopamine release in the brain using optical methods.
Speaker 2:And, again, that's really transformative. So, yeah, the tools are moving incredibly quickly, and that's really just transforming the picture that you can get of brain activity.
Speaker 1:That sounds so much harder than explainable ML. Do you pay attention to the computational RL side, and what parts of that do you find interesting?
Speaker 2:You know, the rate at which publications come out in computational ML is like a fire hose, and there's no way I feel I can keep up with the amount of stuff that's published. I do try and at least keep some sense of what's going on in that space, but I certainly wouldn't claim to be comprehensively reading that literature at all. I guess the things that I find really exciting, because they relate to my area of work, are attempts to learn models from sensory data, so from pixels, which can be used for planning. These models will often have the flavor of having some kind of encoder that takes pixel level data and maps it down to a low dimensional latent space, and then you'll learn dynamics in that latent space. What you're trying to do is to learn some kind of representation of the environment which allows you to predict how your actions are gonna change things.
Speaker 2:There are lots of different approaches used there, but my understanding is that one that is often used is essentially trying to predict future observations by effectively rolling things out in this latent space, and then having a mapping from the latent space back to pixels. So those kinds of models, which are trying to learn models for planning but are trying to do it from pixel level data, I think are really relevant to the kind of work I'm doing, where I'm trying to understand how the brain is learning models for planning from sensory data.
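As a rough sketch of the pattern being described, here is what an encoder, latent dynamics model, and decoder can look like, with planning done by rolling candidate actions forward in the latent space. All layer sizes and names are assumptions for illustration, not any specific published model.

```python
# Sketch of a latent world model learned from pixels: encode frames into a
# low-dimensional latent, learn action-conditioned dynamics in that latent
# space, and keep a decoder back to pixels so predictions can be trained
# against future observations. Rollouts in latent space can support planning.
import torch
import torch.nn as nn

class LatentWorldModelSketch(nn.Module):
    def __init__(self, latent_dim=32, act_dim=4):
        super().__init__()
        self.encoder = nn.Sequential(                 # pixels -> latent
            nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.LazyLinear(latent_dim))
        self.dynamics = nn.Sequential(                # (latent, action) -> next latent
            nn.Linear(latent_dim + act_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim))
        self.decoder = nn.Sequential(                 # latent -> predicted pixels
            nn.Linear(latent_dim, 16 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (16, 16, 16)),
            nn.ConvTranspose2d(16, 3, 4, stride=4))

    def rollout(self, frame, actions):
        """Imagine future latent states from one frame and a list of actions."""
        z = self.encoder(frame)
        imagined = []
        for a in actions:                             # plan by rolling out in latent space
            z = self.dynamics(torch.cat([z, a], dim=-1))
            imagined.append(z)
        return imagined

model = LatentWorldModelSketch()
frame = torch.randn(1, 3, 64, 64)                     # one observation (pixels)
actions = [torch.zeros(1, 4) for _ in range(5)]       # a candidate action sequence
imagined_latents = model.rollout(frame, actions)
```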
Speaker 1:Is there anything else you'd wanna share with our listeners while you're here?
Speaker 2:Yeah. My understanding is your audience is mostly people in reinforcement learning and machine learning, and often graduate students. I guess one thing I would say is that neuroscience needs people with your skill set. The problems in neuroscience are really fascinating. The brain is really just a wonderful thing to study.
Speaker 2:And so, yeah, if you are curious about neuroscience, your skill set is really applicable here. It's really a great field to be working in, and it's a field that's moving very fast. So, yeah, consider it as a possible career path.
Speaker 1:Professor Thomas Akam, thank you so much for your time today. This has been fascinating.
Speaker 2:My pleasure.