class: center, middle, inverse, title-slide

# Neural mechanisms of learning and decision-making

### Shany Grossman and Ondrej Zika

### Max Planck Research Group NeuroCode

### Last update: 10 November, 2021

---

## Overview

### Part 1: Basic RL and dopamine (Ondrej)

- motivation to study RL
- magpies!
- basic RL concepts
- dopamine and RL
- developing RL intuitions

### Part 2: Advanced RL (Shany)

- habitual vs goal-directed behaviour
- model-based and model-free learning
- neural correlates of MB/MF learning

---

## Part 1: Basic RL and dopamine

### Overview

We will:

--

- introduce basic RL algorithms and develop some intuition about how they work

--

- discuss how the brain might implement RL

--

- look at the role of dopamine in RL

---

## Reinforcement learning

---

### Associative (Pavlovian/Classical) learning

- learning associations between cues and outcomes without an action
- second-order conditioning

---

### Introduction: learning theory (psychology)

- trial-by-trial learning: incorporating feedback into future predictions

**Pavlovian conditioning**

- building associations between neutral and biologically salient stimuli
- doesn't require action, involuntary

**Operant (instrumental) conditioning**

- strength of **behaviour** modified through reinforcements and punishments
- voluntary (i.e. action-dependent)

---

### Reinforcement learning (AI/ML)

- branch of machine learning concerned with learning the correct mapping between situations and actions, with the goal of maximizing some reward signal (Sutton & Barto, 1998)

--

- neither supervised nor unsupervised ML; it has an *interactive* component, i.e. learning through the agent's *own experience*

--

[](https://www.youtube.com/watch?v=VMp6pq6_QjI)

---

### Reinforcement learning (AI/ML)

Side question: what is this "reward"?

--

In ML/AI:

> *"...a general incentive mechanism that tells the agent what is correct and what is wrong using reward and punishment."*

In Neuro/psych:

> *"rewards" are subjective, so it can be anything from information to juice*

???

- aversive events can be rewarding
- satiety (devaluation)

---

### Reinforcement learning (AI/ML)

The main features of RL algorithms are:

- trial-and-error updating of *expected rewards* (i.e. *predictions*); a minimal code sketch of such an update loop follows on the next slide
- propagation of future rewards (temporally delayed rewards, considering all possible actions)
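---

### Reinforcement learning (AI/ML): a minimal sketch

A minimal sketch (not from the original slides) of trial-and-error learning in a two-armed bandit: the agent keeps one reward estimate per action and nudges it toward each observed outcome. All names and parameter values (`alpha`, `epsilon`, `n_trials`, the reward probabilities) are illustrative assumptions.

```python
import random

def run_bandit(p_reward=(0.2, 0.8), alpha=0.1, epsilon=0.1, n_trials=500):
    estimates = [0.0, 0.0]              # expected reward per action
    total_reward = 0
    for _ in range(n_trials):
        # explore occasionally, otherwise pick the action that looks best
        if random.random() < epsilon:
            action = random.randrange(2)
        else:
            action = max(range(2), key=lambda a: estimates[a])
        reward = 1 if random.random() < p_reward[action] else 0
        # trial-and-error update: move the prediction toward the outcome
        estimates[action] += alpha * (reward - estimates[action])
        total_reward += reward
    return estimates, total_reward

print(run_bandit())  # estimates approach the true reward probabilities
```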
---

### RL and animal behaviour

--

- trial-and-error learning occurs in complex organisms (let's watch a video about it)

[Magpie learning](https://www.youtube.com/watch?v=7oclPZx520k)

---

### RL and animal behaviour

- trial-and-error learning occurs in complex organisms
- some evidence for associative learning in single-cell organisms as well (see [Gershman et al. 2021](https://elifesciences.org/articles/61907))

---

### RL and the brain

What do we know so far?

--

- animals seem to learn via trial-and-error

--

- theoretical RL algorithms predict error-driven learning

---

### Associative learning: example of conditioning in the amygdala

---

### Dopamine (anatomical window)

organised mainly in three sub-groups/sources in the midbrain:

1. retrorubral nucleus (RRN)
2. substantia nigra pars compacta (SNc/SNpc)
3. ventral tegmental area (VTA)

---

### Dopamine (anatomical window)

- projections to striatum, amygdala, cerebral cortex
- the signal leaving the dopaminergic areas is *relatively uniform* ("dopaminergic broadcasting")
- it is highly variable in the striatum

---

### Dopamine and movement

- DA has been implicated in movement control (the basal ganglia are tightly looped with cortical neurons)

--

- Parkinson's disease has been closely linked to DA deficiency (extensive inhibition of movement)

--

- action selection

---

### Dopamine and reward

- apart from movement, DA has been linked to reward processing

--

- most drugs are associated with direct or indirect DA release (alcohol, nicotine, morphine) or reuptake inhibition (cocaine, amphetamines)

---

### Movement or reward?

- some involvement in both
- interaction in action selection: rewarding favourable actions
- "tonic" activity = movement facilitation; "phasic" activity = reward-related activity

---

### Dopamine - prediction error

---

### Dopamine - prediction error

--

- *prediction* & *prediction error* signals

--

[Magpie vol. 2](https://www.youtube.com/watch?v=LJG3282QU4g)

--

- Bayer and Glimcher (2005) for a demonstration of the relationship between the PE and the DA signal

--

- a spectacular link between a normative strategy and a feature of the nervous system

---

### RPE signal in fMRI

O'Doherty, 2004

- but RPE is also found in many non-dopaminergic areas

---

### Appetitive and aversive PEs

DA is associated with both, but:

--

- negative PEs are harder to detect due to low baseline activity

--

- avoidance of an aversive stimulus is rewarding (Moutoussis et al. 2008)

--

- different parts of the SN are associated with appetitive vs aversive PEs (Pauli et al. 2015)

--

- aversive learning has a life of its own (present in many other regions: PAG, amygdala)

---

### The "whole story"

Apart from pure prediction errors, DA is sensitive to many other things:

- reward scaling
- reward probability
- effort (Walton & Bouret, 2019)
- state-dependent PEs (Papageorgiou et al. 2016)
- involved in movement control (Parkinson's) and vigour, go/no-go pathways
- state inference (Starkweather et al. 2017)
- distributed RPE (Dabney et al. 2020)

---

### Rescorla-Wagner model

---

### Rescorla-Wagner model

`\(V_t\)` ... expected reward for the current trial

`\(V_{t+1}\)` ... expected reward for the next trial

`\(r\)` ... reward received on the current trial

`\(\alpha\)` ... learning rate (i.e. "how much do you learn")

--

`\(V_{t+1} = V_t + \alpha(r - V_t)\)`

--

`\(\alpha=0.3\)`, `\(V_0=5\)`

---

### Rescorla-Wagner model: alpha

---

### Rescorla-Wagner model: rare events
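---

### Rescorla-Wagner model: a minimal simulation

A minimal simulation sketch of the update rule above (not from the original slides). Only `\(\alpha=0.3\)` and `\(V_0=5\)` come from the slides; the reward sequences and number of trials are illustrative assumptions.

```python
def rescorla_wagner(rewards, alpha=0.3, v0=5.0):
    """Delta-rule updating: V moves a fraction alpha of the
    prediction error (r - V) toward each observed reward."""
    v = v0
    values = [v]
    for r in rewards:
        v = v + alpha * (r - v)      # V_{t+1} = V_t + alpha * (r - V_t)
        values.append(v)
    return values

# constant reward of 10: V climbs from 5 toward 10
print(rescorla_wagner([10] * 10))

# a single rare large reward: V jumps up, then decays back toward 0
print(rescorla_wagner([0, 0, 0, 10, 0, 0, 0]))
```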
---

### Temporal-Difference learning

`\(V_t\)` ... current expected value

`\(V_{t+1}\)` ... future expected value

`\(r\)` ... reward

`\(\alpha\)` ... learning rate (i.e. "how much do you learn")

`\(\gamma\)` ... discount of future expected rewards

--

`\(V_t \leftarrow V_t + \alpha(r + \gamma V_{t+1} - V_t)\)`

---

### Temporal-Difference learning

- taking temporal information into account, predicting aggregate reward over actions

--

- stimuli predicting the highest average reward elicit the highest DA responses

---

### Temporal-Difference learning
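---

### Temporal-Difference learning: a minimal sketch

A minimal TD(0) sketch (not from the original slides): one conditioning trial unrolled into a few time steps, with reward delivered only at the final step. The step count, number of episodes and parameter values (`alpha`, `gamma`) are illustrative assumptions; over repeated trials the discounted reward prediction propagates back from the reward to the cue.

```python
alpha, gamma = 0.3, 0.95
n_steps = 5                       # cue at step 0, reward after the last step
V = [0.0] * (n_steps + 1)         # value of each time step (terminal value 0)

for episode in range(200):
    for t in range(n_steps):
        r = 1.0 if t == n_steps - 1 else 0.0   # reward only at the end
        delta = r + gamma * V[t + 1] - V[t]    # TD prediction error
        V[t] += alpha * delta                  # V_t <- V_t + alpha * delta

print([round(v, 2) for v in V])
# value rises toward the time of reward, and the cue (step 0)
# already predicts the discounted reward
```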
---

### What to remember (3 key points)

1. Dopamine behaves very similarly to a normative updating strategy

--

2. Intuitions about Rescorla-Wagner and TD-learning

--

3. Magpies are awesome! Beware...

---

### Dopamine

- anatomy of dopaminergic afferents and BG
- large variability in terms of computation
- involved in movement control (Parkinson's) and vigour, go/no-go pathways
- involvement in reward (believed to signal obtained reward)
- timing (Walton), state signalling (Starkweather), distributed RPE processing (Daw, recent)

---

### DA and RPE

- activity associated with obtaining food -> reward?
- maybe not reward, but the difference between expected and obtained reward (Schultz)
- asymmetry in firing-rate range (relevant later for aversive learning)
- replicated in many single-cell recording, fMRI and optogenetic studies
- make the distinction between value and reward

---

### How rare this is

- a link between a normative strategy and a feature of the nervous system
- yes, but on reflection it is not surprising, since evolution itself proceeds by trial and error

---

### TD-learning

- taking temporal information into account, predicting aggregate reward over actions
- stimuli predicting the highest average reward elicit the highest DA responses

---

### Value

- predicting future rewards
- keeping track of *relative* values of different stimuli/features
- useful framework to study decision making
- softmax (a minimal sketch is included at the end of this deck)
- vmPFC/OFC role in representing value as a common currency across categories
- maybe mention metaRL

---

### Appetitive versus aversive learning

- issue with the dopamine "floor" (low baseline firing)
- absolute prediction error; the end/avoidance of an aversive stimulus is rewarding (Moutoussis 2008)
- major differences between appetitive and aversive learning (discuss fight/flight pathway, PAG etc.)

---

### Learning rates and uncertainty

- how to decide how much to learn?
- noise in observations
- changeability of the environment
- learning the volatility

---

### Useful resources

- [Basic RL Tutorial](http://hannekedenouden.ruhosting.nl/RLtutorial/Instructions.html)
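---

### Appendix: softmax action selection (sketch)

The "Value" notes above mention the softmax rule for turning learned values into choice probabilities. A minimal sketch (not from the original slides); the inverse temperature `beta` and the example values are illustrative assumptions.

```python
import math

def softmax_choice_probs(values, beta=3.0):
    """Map a list of learned values to choice probabilities.

    Higher beta (inverse temperature) makes choices more deterministic,
    favouring the highest-valued option; lower beta makes them more random.
    """
    m = max(values)                   # subtract the max for numerical stability
    exps = [math.exp(beta * (v - m)) for v in values]
    z = sum(exps)
    return [e / z for e in exps]

print(softmax_choice_probs([0.2, 0.8]))  # the higher-valued option is chosen more often
```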