class: center, middle, inverse, title-slide

# Neural mechanisms of learning and decision-making

### Shany Grossman and Ondrej Zika

### Max Planck Research Group NeuroCode

### Last update: 11 November, 2021

---

## Overview

### Part 1: Basic Reinforcement Learning (RL) and dopamine (Ondrej)

- motivation to study RL
- basic RL concepts
- dopamine and RL
- developing RL intuitions

### Part 2: Advanced RL (Shany)

- habitual vs goal-directed behaviour
- model-based and model-free learning
- neural correlates of MB/MF learning

---

## Part 1: Basic RL and dopamine

### Overview

We will:

--

- introduce basic RL algorithms and develop some intuition about how they work

--

- discuss how the brain might implement RL

--

- look at the role of dopamine in RL

---

## Associative (Pavlovian/Classical) learning



---

### Associative (Pavlovian/Classical) learning

- learning associations between cues and outcomes without an action
- second-order conditioning



???

- psychiatry, e.g. panic disorder

---

### Introduction: learning theory (psychology)

- trial-by-trial learning: incorporating feedback into future predictions

**Pavlovian conditioning**

- building associations between neutral and biologically salient stimuli
- doesn't require action, involuntary

**Operant (instrumental) conditioning**

- strength of **behaviour** modified through reinforcements and punishments
- voluntary (i.e. action-dependent)

---

### Reinforcement learning



---

### Reinforcement learning (AI/ML)

- branch of machine learning concerned with learning the correct mapping between situations and actions, with the goal of maximizing some reward signal (Sutton & Barto, 1998)

--

- neither supervised nor unsupervised ML: it has an *interactive* component, i.e. learning through the agent's *own experience*

--

[Video](https://www.youtube.com/watch?v=VMp6pq6_QjI)

---

### Reinforcement learning (AI/ML)

The main features of RL algorithms are:

- trial-and-error updating of *expected rewards* (i.e. *predictions*)
- propagation of future rewards (temporally delayed rewards, considering all possible actions)

---

### Reinforcement learning (AI/ML)

Side question: what is this "reward"?

--

In ML/AI:

> *"...a general incentive mechanism that tells the agent what is correct and what is wrong using reward and punishment."*

In Neuro/psych:

> *"rewards" are subjective, so it can be anything from information to juice*

???

- aversive events can be rewarding
- satiety (devaluation)

---

### RL and animal behaviour

--

- trial-and-error learning occurs in complex organisms (let's watch a video about it)

[Magpie learning](https://www.youtube.com/watch?v=7oclPZx520k)

---

### RL and animal behaviour

- trial-and-error learning occurs in complex organisms
- some evidence for associative learning in single-cell organisms as well (see [Gershman et al. 2021](https://elifesciences.org/articles/61907))

---

### RL and the brain

What do we know so far?

--

- animals seem to learn via trial-and-error

--

- theoretical RL algorithms predict error-driven learning (a toy sketch follows on the next slide)
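---

### Aside: error-driven learning in code

A toy sketch of what "trial-and-error, error-driven learning" can look like computationally: an agent repeatedly chooses between two options, observes a reward, and nudges its estimate of the chosen option by a fraction of the prediction error. The option names, reward probabilities and parameters below are arbitrary, chosen only for illustration.

```python
import random

# Hypothetical two-armed bandit: each option pays out 1 with some probability
reward_prob = {"left": 0.8, "right": 0.2}
V = {"left": 0.0, "right": 0.0}    # learned reward estimates
alpha, epsilon = 0.3, 0.1          # learning rate, exploration rate

for trial in range(200):
    # Trial-and-error: mostly exploit the current best estimate, sometimes explore
    if random.random() < epsilon:
        action = random.choice(["left", "right"])
    else:
        action = max(V, key=V.get)
    reward = 1.0 if random.random() < reward_prob[action] else 0.0
    # Error-driven update: move the estimate towards the observed reward
    prediction_error = reward - V[action]
    V[action] += alpha * prediction_error

print(V)  # learned estimates for the two options
```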
---

### Dopamine (anatomical window)

Organised mainly in three sub-groups/sources in the midbrain:

1. Retrorubral nucleus (RRN)
2. Substantia nigra pars compacta (SNc/SNpc)
3. Ventral tegmental area (VTA)



---

### Dopamine (anatomical window)

- projections to striatum, amygdala, cerebral cortex
- the signal leaving the dopaminergic areas is *relatively uniform* ("dopaminergic broadcasting")
- it is highly variable in the striatum



---

### Dopamine and movement

- DA has been implicated in movement control (basal ganglia are tightly looped with cortical neurons)

--

- Parkinson's disease has been closely linked to DA deficiency (extensive inhibition of movement)

--

- action selection

---

### Dopamine and reward

- apart from movement, DA has been linked to reward processing

--

- most drugs are associated with direct or indirect DA release (alcohol, nicotine, morphine) or reuptake inhibition (cocaine, amphetamines)

--



---

### Movement or reward?

- some involvement in both
- interaction in action selection: rewarding favourable actions
- "tonic" activity = movement facilitation; "phasic" activity = reward-related activity

---

### Dopamine - prediction error



Schultz et al., 1997

---

### Dopamine - prediction error

--

- *prediction* & *prediction error* signals

--

[Magpie vol. 2](https://www.youtube.com/watch?v=LJG3282QU4g)

--

- Bayer and Glimcher (2005) for a demonstration of the relationship between PE and DA signal

--

- a spectacular link between a normative strategy and a feature of the nervous system

---

### RPE signal in fMRI



O'Doherty, 2004

- but RPE also found in many non-dopaminergic areas

---

### Appetitive and aversive PEs

DA associated with both, but:

--

- negative PE harder to detect due to low baseline activity

--

- avoidance of an aversive stimulus is rewarding (Moutoussis et al. 2008)

--

- different parts of the SN associated with appetitive vs aversive PEs (Pauli et al. 2015)



--

- aversive learning has a life of its own (present in many other regions, e.g. PAG, amygdala)

---

### The "whole story"

Apart from pure prediction errors, DA is sensitive to many other things:

- reward scaling
- reward probability
- effort (Walton & Bouret, 2019)
- state-dependent PEs (Papageorgiou et al. 2016)
- involved in movement control (Parkinson's) and vigor, go/nogo pathways
- state inference (Starkweather et al. 2017)
- distributed RPE (Dabney et al. 2020)

---

### Rescorla-Wagner model

---

### Rescorla-Wagner model

`\(V_t\)` ... expected reward for the current trial

`\(V_{t+1}\)` ... expected reward for the next trial

`\(r\)` ... reward received on the current trial

`\(\alpha\)` ... learning rate (i.e. "how much do you learn")

--

`\(V_{t+1} = V_t + \alpha(r - V_t)\)`

--

`\(\alpha=0.3\)`, `\(V_0=5\)`


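---

### Rescorla-Wagner model: simulation sketch

A minimal illustrative sketch of how value trajectories like the ones in these figures can be simulated: repeatedly apply the delta rule `\(V_{t+1} = V_t + \alpha(r - V_t)\)` to a sequence of rewards, here with the parameters above (`\(\alpha=0.3\)`, `\(V_0=5\)`) and an arbitrary, made-up reward sequence.

```python
# Rescorla-Wagner / delta-rule updating of an expected reward
def rescorla_wagner(rewards, alpha=0.3, v0=5.0):
    """Return the trial-by-trial expected values V_t."""
    values = [v0]
    for r in rewards:
        v = values[-1]
        prediction_error = r - v           # delta: received minus expected reward
        values.append(v + alpha * prediction_error)
    return values

# Example: a reward of 1 on every trial; V converges from 5 towards 1
print(rescorla_wagner([1.0] * 10))
```

With a larger `alpha` the estimate tracks recent rewards more quickly (and reacts more strongly to rare, one-off outcomes); with a smaller `alpha` it changes more slowly — the contrast shown on the following slides.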
---

### Rescorla-Wagner model: alpha



--



---

### Rescorla-Wagner model: rare events



---

### Temporal-Difference learning

`\(V_t\)` ... current expected value

`\(V_{t+1}\)` ... future expected value (at the next time step)

`\(r\)` ... reward

`\(\alpha\)` ... learning rate (i.e. "how much do you learn")

`\(\gamma\)` ... discount of future expected rewards

--

`\(V_t \leftarrow V_t + \alpha(r + \gamma V_{t+1} - V_t)\)`

(a small code sketch of this update is included in the appendix slide at the end of the deck)

---

### Temporal-Difference learning

- taking temporal information into account, predicting aggregate reward over actions

--

- stimuli predicting the highest average reward elicit the highest DA responses

---

### Temporal-Difference learning

<center>



</center>

---

### What to remember (key points)

1. Dopamine behaves very similarly to a normative updating strategy

--

2. Intuitions about Rescorla-Wagner and TD-learning

--

3. Beware the magpies...



---

### Questions

> *How does the dopaminergic reward system change across the lifespan?*

---

### Questions

> *I was wondering about individual differences in reward values, like preferences in rewards. Can one find differences in dopamine response to this? And can they be taken into account in models of learning/decision making in addition to the probability of reward in general?*

---

### Questions

> *In chapter 15, multiple animal studies are mentioned that show dopamine response to stimuli. It seems that in humans only the BOLD signal could be used as a prediction-error correlate. Are there no spectroscopy data available to show that metabolic changes / metabolic response correlate with decision-making responses in humans?*

---

### MR spectroscopy

- different molecules have different proton/electron density
- this alters the way in which they react to the magnetic field



see: https://radiopaedia.org/articles/mr-spectroscopy-1

---

### Dopamine and spectroscopy



---

### Glutamate and PE

**Contribution of SN glutamate to PE signals in schizophrenia**

White et al. 2015



---

### Useful resources

- [Basic RL Tutorial](http://hannekedenouden.ruhosting.nl/RLtutorial/Instructions.html)
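---

### Appendix: TD-learning sketch

A minimal illustrative sketch of the TD update referenced earlier: state values are learned along a simple chain, and with repeated episodes the reward prediction propagates backwards to the earliest predictive state (the cue). The three-state chain and parameters below are made up for illustration.

```python
# Temporal-Difference (TD(0)) learning on a simple chain: cue -> delay -> outcome
states = ["cue", "delay", "outcome"]
V = {s: 0.0 for s in states}       # state values, initialised at zero
alpha, gamma = 0.3, 0.9            # learning rate and discount factor

for episode in range(100):
    for i, s in enumerate(states):
        # a reward of 1 is associated with the outcome state
        r = 1.0 if s == "outcome" else 0.0
        v_next = V[states[i + 1]] if i + 1 < len(states) else 0.0
        td_error = r + gamma * v_next - V[s]   # prediction error
        V[s] += alpha * td_error               # TD(0) update

print(V)  # after learning, even the cue state predicts the (discounted) reward
```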