Studying cognition via bandit tasks: two tales on deviating from optimal learning and planning

Speaker: Prakhar Godara (TU Darmstadt)

2025/01/29 15:20-17:00

Location: Building S1|15 Room 133

Abstract:

This talk will summarize (the academic part of) my stay in Darmstadt. I will present my explorations into deviations from optimal learning and planning in bandit tasks.

On the learning side, we will explore recent results [1-4] (among others) which claim that human behavior in a two-armed Bernoulli bandit (TABB) task is described by positivity and confirmation biases, thereby implying [5] that “Humans do not integrate new information objectively”. The claim is based on fitting a Q-learning model with different (albeit constant) learning rates for positive and negative reward prediction errors to human data. However, we find that even if the agent updates its beliefs via arguably objective Bayesian inference, fitting the above model recovers both biases. This finding seems particularly surprising, as Bayesian inference, when written as an effective Q-learning algorithm, is described by monotonically decreasing but symmetric/unbiased learning rates. In this part of the talk, I will explain the reasons behind this observation by studying the stochastic dynamics of these learning systems using master equations.
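For concreteness, the asymmetric update rule fitted in [1-4] is commonly written as (the notation here is illustrative and may differ from that used in the talk)

$$ Q_{t+1}(a) \;=\; Q_t(a) + \begin{cases} \alpha^{+}\,\delta_t, & \delta_t \ge 0,\\ \alpha^{-}\,\delta_t, & \delta_t < 0, \end{cases} \qquad \delta_t = r_t - Q_t(a), $$

where $\alpha^{+} > \alpha^{-}$ is typically read as a positivity bias (and, with counterfactual feedback, a confirmation bias). By contrast, for a Bernoulli arm with a uniform Beta(1,1) prior, the posterior mean updates as

$$ Q_{t+1}(a) \;=\; Q_t(a) + \frac{1}{n_t(a) + 2}\,\bigl(r_t - Q_t(a)\bigr), $$

where $n_t(a)$ counts the pulls of arm $a$ up to and including trial $t$: a single learning rate that decays over trials but weighs positive and negative prediction errors symmetrically.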

On the planning side, I will present a theory of approximate planning based on the existing theory of metareasoning [6]. Conventionally, human metareasoning models assume that the agent knows the transition and reward distributions of the environment it inhabits (see [7] for a characteristic example). In the talk I will demonstrate how one can generalize such models by proposing a meta Bayes-Adaptive MDP (meta-BAMDP) framework that handles metareasoning in environments with unknown reward/transition distributions. I will also present two theorems that make the meta-level problem more tractable. These results offer a resource-rational perspective and a normative framework for understanding human exploration under cognitive constraints, and they provide experimentally testable predictions about human behavior in TABB tasks.
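As a rough illustration of the objects involved (my notation; the formulation presented in the talk may differ): in a TABB the Bayes-adaptive state can be taken to be the posterior counts over the two arms,

$$ b_t = (s_1, f_1, s_2, f_2), $$

where $s_i$ and $f_i$ record the observed successes and failures of arm $i$; pulling arm $i$ and observing $r \in \{0,1\}$ maps $s_i \mapsto s_i + r$ and $f_i \mapsto f_i + (1-r)$. In the spirit of [6], a metareasoning agent at belief $b$ then chooses between acting immediately and performing a further internal computation $c$ (e.g., one more step of planning), guided by the value of computation

$$ \mathrm{VOC}(c, b) \;=\; \mathbb{E}\bigl[\,U(\alpha_c \mid b)\,\bigr] - U(\alpha_0 \mid b) - \mathrm{cost}(c), $$

where $\alpha_0$ is the action the agent would take now and $\alpha_c$ the action it would take after computation $c$; computation continues only while some $c$ has positive VOC.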

[1] S Palminteri, G Lefebvre, EJ Kilford, SJ Blakemore, Confirmation bias in human reinforcement learning: Evidence from counterfactual feedback processing. PLoS Comput. Biol. 13, e1005684 (2017).

[2] G Lefebvre, M Lebreton, F Meyniel, S Bourgeois-Gironde, S Palminteri, Behavioural and neural characterization of optimistic reinforcement learning. Nat. Hum. Behav. 1, 0067 (2017).

[3] H den Ouden, et al., Dissociable effects of dopamine and serotonin on reversal learning. Neuron 80, 1090–1100 (2013).

[4] MJ Frank, AA Moustafa, HM Haughey, T Curran, KE Hutchison, Genetic triple dissociation reveals multiple roles for dopamine in reinforcement learning. Proc. Natl. Acad. Sci. 104, 16311–16316 (2007).

[5] S Palminteri, M Lebreton, The computational roots of positivity and confirmation biases in reinforcement learning. Trends Cogn. Sci. 26, 607–621 (2022).

[6] S Russell, E Wefald, Principles of metareasoning. Artif. Intell. 49, 361–395 (1991).

[7] F Callaway, B van Opheusden, S Gul, P Das, PM Krueger, TL Griffiths, F Lieder, Rational use of cognitive resources in human planning. Nat. Hum. Behav. 6, 1112–1125 (2022).