class: center, middle, inverse, title-slide

# Neural mechanisms of learning and decision-making
### Shany Grossman and Ondrej Zika
### Max Planck Research Group NeuroCode
### Last update: 10 November, 2021

---
## Overview

### Part 1: Basic RL and dopamine (Ondrej)

- motivation to study RL - magpies!
- basic RL concepts
- dopamine and RL
- developing RL intuitions

### Part 2: Advanced RL (Shany)

- habitual vs goal-directed behaviour
- model-based and model-free learning
- neural correlates of MB/MF learning

---
## Part 1: Basic RL and dopamine

### Overview

We will:

- introduce basic RL algorithms and develop some intuition about how they work

--

- discuss how the brain might implement RL

--

- look at the role of dopamine in RL

---
## Reinforcement learning

---
### Associative (Pavlovian/Classical) learning

- learning associations between cues and outcomes, without an action
- second-order conditioning

---
### Introduction: learning theory (psychology)

- trial-by-trial learning: incorporating feedback into future predictions

**Pavlovian conditioning**

- building associations between neutral and biologically salient stimuli
- doesn't require an action; involuntary

**Operant (instrumental) conditioning**

- strength of **behaviour** is modified through reinforcements and punishments
- voluntary (i.e. action-dependent)

---
### Reinforcement learning (AI/ML)

- branch of machine learning concerned with learning the correct mapping between situations and actions, with the goal of maximizing some reward signal (Sutton & Barto 1998)

--

- neither supervised nor unsupervised ML: it has an *interactive* component, i.e. learning through the agent's *own experience*

--

<https://www.youtube.com/watch?v=VMp6pq6_QjI>

---
### Reinforcement learning (AI/ML)

Side question: what is this "reward"?

--

In ML/AI:

> *"...a general incentive mechanism that tells the agent what is correct and what is wrong using reward and punishment."*

In neuro/psych:

> *"rewards" are subjective, so a reward can be anything from information to juice*

???
- aversive events can be rewarding
- satiety (devaluation)

---
### Reinforcement learning (AI/ML)

The main features of RL algorithms are:

- trial-and-error updating of a *prediction*
- propagation of future rewards over time (temporally delayed rewards, considering all possible actions)

---
### RL and animal behaviour

--

- trial-and-error learning occurs in complex organisms (let's watch a video about it)

<https://www.youtube.com/watch?v=7oclPZx520k>

---
### RL and animal behaviour

- trial-and-error learning occurs in complex organisms
- some evidence for associative learning in single-cell organisms as well (see [Gershman et al. 2021](https://elifesciences.org/articles/61907))

---
### RL and the brain

What do we know so far?

- animals seem to learn from errors
- theoretical RL algorithms predict error-driven learning

### Associative learning and the brain: Hebbian learning

--

> *When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.*

--

or...

> *cells that fire together wire together*

(no reward here though)
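
---
### Hebbian learning: a minimal sketch

To make the postulate concrete, the rule can be written as a tiny simulation. This is a generic illustration rather than code from the original materials; the learning rate and firing probabilities are arbitrary.

```python
import numpy as np

# Plain Hebbian update: the weight from cell A to cell B grows whenever
# the two cells are active together ("fire together, wire together").
eta = 0.1        # learning rate (arbitrary)
w = 0.0          # synaptic weight A -> B

rng = np.random.default_rng(0)
for _ in range(100):
    pre = rng.binomial(1, 0.5)    # is cell A firing on this step?
    post = rng.binomial(1, 0.5)   # is cell B firing on this step?
    w += eta * pre * post         # strengthen only on co-activation

print(round(w, 2))  # the weight only ever grows with repeated co-activation
```

Note that the rule contains no reward and no error term; that is exactly the gap that prediction-error accounts of dopamine (next slides) address.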
---
### Dopamine (anatomical window)

Dopaminergic neurons are organised mainly in three sub-groups/sources:

1. retrorubral nucleus (RRN)
2. substantia nigra pars compacta (SNc/SNpc)
3. ventral tegmental area (VTA)

---
### Dopamine (anatomical window)

- projections to striatum + PFC
- while the signal leaving the dopaminergic areas is relatively uniform, it is highly variable in the striatum

---
### Dopamine and movement

- bits on movement control/vigour

---
### Dopamine and reward

---
### Dopamine - prediction error

- prediction & prediction error signals

--

(back to the magpie)

<https://www.youtube.com/watch?v=LJG3282QU4g>

--

- a link between a normative strategy and a feature of the nervous system

---
### Dopamine

- phasic DA activity related to finding reward (food), Schultz et al. 1986
- a number of papers have linked the dopamine signal to PE, both positive and negative (omission of reward)
- negative PE is harder to detect because the baseline firing rate is relatively low, but longer depression has been associated with more negative PEs
- DA also seems to be sensitive to reward expectancy
- see Bayer and Glimcher (2005) for a demonstration of the relationship between PE and the DA signal
- shown in many species and across imaging modalities (for example, see O'Doherty 2003 for fMRI)

O'Doherty et al. (2004), *Neuron*

---
### So DA is reward prediction and RPE and...

- reward scaling
- reward probability
- reward timing
- involved in movement control (Parkinson's) and vigour, go/no-go pathways
- involvement in reward (believed to signal obtained reward)
- timing (Walton)
- state signalling (Starkweather)
- distributed processing of RPE (Daw, recent)

---
### Appetitive versus aversive learning

- issue with the dopamine floor
- absolute prediction error; the end/avoidance of an aversive stimulus is rewarding (Moutoussis 2008)
- major differences between appetitive and aversive learning (discuss fight/flight pathway, PAG, etc.)

---
### Learning rates and uncertainty

- how to decide how much to learn?
- noise in observations
- changeability of the environment
- learning the volatility

---
### Rescorla-Wagner model

`\(V_t\)` ... expected reward for the current trial

`\(V_{t+1}\)` ... expected reward for the next trial

`\(r\)` ... reward received on the current trial

`\(\alpha\)` ... learning rate (i.e. "how much do you learn")

--

`\(V_{t+1} = V_t + \alpha(r - V_t)\)`

--

`\(\alpha=0.3\)`, `\(V_0=5\)`
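
---
### Rescorla-Wagner model: simulation sketch

A rough illustration of the update rule above in Python. The reward sequence and number of trials are made up for demonstration; only `\(\alpha=0.3\)` and `\(V_0=5\)` are taken from the previous slide.

```python
import numpy as np

def rescorla_wagner(rewards, alpha=0.3, v0=5.0):
    """Trial-by-trial updates: V[t+1] = V[t] + alpha * (r[t] - V[t])."""
    v = np.empty(len(rewards) + 1)
    v[0] = v0
    for t, r in enumerate(rewards):
        delta = r - v[t]                 # prediction error on trial t
        v[t + 1] = v[t] + alpha * delta  # move the estimate towards r
    return v

# Example: a constant reward of 1 on every trial, starting from V0 = 5
values = rescorla_wagner(np.ones(20), alpha=0.3, v0=5.0)
print(values.round(2))  # V decays from 5 towards the true reward of 1
```

A larger `\(\alpha\)` makes `\(V\)` track each new reward more strongly; a smaller `\(\alpha\)` averages over more trials.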
---
### Rescorla-Wagner model: alpha

---
### Rescorla-Wagner model: rare events

`\(V_t\)` ... current expected value

`\(V_{t+1}\)` ... future expected value

`\(r\)` ... reward

`\(\alpha\)` ... learning rate (i.e. "how much do you learn")

`\(\gamma\)` ... discount of future expected rewards

--

`\(V_{t+1} = V_t + \alpha(r + \gamma V_{t+1} - V_t)\)`

---
### Temporal-Difference learning

- taking temporal information into account, predicting aggregate reward over actions

--

- stimuli predicting the highest average reward elicit the highest DA responses

--

- (a minimal simulation sketch appears at the end of this deck)

---
### End of Part 1

---
### Dopamine

- anatomy of dopaminergic afferents and the BG
- large variability in terms of computation
- involved in movement control (Parkinson's) and vigour, go/no-go pathways
- involvement in reward (believed to signal obtained reward)
- timing (Walton), state signalling (Starkweather), distributed processing of RPE (Daw, recent)

---
### DA and RPE

- activity associated with obtaining food -> reward?
- maybe not reward itself, but the difference between what was expected and what was obtained (Schultz)
- asymmetry in firing-rate potential (relevant later for aversive learning)
- replicated in many single-cell recording, fMRI and optogenetic studies
- make the distinction between value and reward

---
### How rare this is

- a link between a normative strategy and a feature of the nervous system
- yes, but it is also not a surprise if you think about it, since evolution is based on trial-and-error learning

---
### Value

- predicting future rewards
- keeping track of *relative* values of different stimuli/features
- useful framework to study decision-making
- softmax
- vmPFC/OFC role in representing value as a common currency across categories
- maybe mention metaRL

---
### Useful resources

- [Basic RL Tutorial](http://hannekedenouden.ruhosting.nl/RLtutorial/Instructions.html)
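
---
### Temporal-Difference learning: simulation sketch

A minimal TD(0) illustration of how value (and hence the prediction error) propagates back from the reward to the earliest predictive cue. The trial structure and parameter values are made up for demonstration and are not taken from the original figures.

```python
import numpy as np

# A trial is a short sequence of time steps: a cue at t = 0,
# a delay, and a reward delivered at the final step.
n_steps = 5
alpha, gamma = 0.3, 0.95        # learning rate and temporal discount (arbitrary)
V = np.zeros(n_steps)           # value estimate for each time step

for episode in range(200):
    for t in range(n_steps):
        r = 1.0 if t == n_steps - 1 else 0.0           # reward only at the end
        v_next = V[t + 1] if t + 1 < n_steps else 0.0  # value of the next step
        delta = r + gamma * v_next - V[t]              # TD prediction error
        V[t] += alpha * delta                          # update towards the target

print(V.round(2))  # early time steps acquire value: the cue comes to predict reward
```

Early in training the prediction error occurs at the time of the reward; once the cue's value has been learned, the error (and, on the dopamine account, the phasic DA response) shifts to the cue.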