packages <- c('tidyverse', 'nnet', 'data.table', 'cowplot')
#invisible(lapply(packages, install.packages, character.only = TRUE)) #Install packages if any are missing
invisible(lapply(packages, require, character.only = TRUE)) #Load all packages
options(dplyr.summarise.inform = FALSE) #suppress annoying messages
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
set.seed(1234) #set seed for reproducibility
This is the R Notebook used in Tutorial 2 of the COSMOS summer school. In this session, we focus on describing and programming computational models of both individual and social learning mechanisms. We use these models to simulate data, and also show how they can be estimated on data (either from participants or from simulations) to infer parameters. Let's jump in!
Brisk Introduction to Reinforcement Learning
We start with a very quick introduction to the Reinforcement Learning (RL) framework by Sutton & Barto (2018), which describes how an agent (either biological or machine) can learn how to behave intelligently by interacting with the environment. At each time point \(t\), the agent selects an action \(a \in A\) from a set of possible actions \(A\), and then receives feedback from the environment in the form of a new state \(s_t\) and a reward observation \(r_t\). Since we will focus on a simple bandit task, we can ignore the state information for now and assume that rewards are solely a function of the action \(r_t := R(a_t)\).
Learning a value function
We will start with a Q-learning model (Watkins & Dayan, 1992), as one of the simplest RL models. By trying out actions and receiving rewards, the Q-learning agent learns a value function for each action \(Q(a)\) describing the expected reward. With each reward observation, the current Q-value \(Q_t(a)\) is updated with the following equation:
\[ \begin{equation} Q_{t+1}(a) \leftarrow Q_t(a) + \alpha\color{red}{[r_t(a) - Q_t(a) ]} \end{equation} \] The term colored red is known as the reward prediction error (RPE). It captures the difference between the expected reward \(Q_t(a)\) and the actual reward observation \(r_t(a)\). When the reward is lower than expected, we revise our expectation downwards, and vice versa.
The amount we update our predictions is governed by the learning rate parameter \(\alpha\), which is bounded to the range \([0,1]\). This is often implemented as a free parameter, which is then estimated from data to describe a characteristic of the learner(s). We can estimate \(\alpha\) separately for different experimental manipulations to see how learning changes, for example, in contexts of punishment vs. reward (Palminteri et al., 2015) or how group size influences the rate of individual learning in social contexts (Toyokawa et al., 2019).
The Q-learning model also requires some prior initialization of Q-values, which can often be set to \(Q_{t=0}=0\). However, there are also settings where optimistic initialization (Sutton & Barto, 2018) allows agents to learn and explore more efficiently. Signatures of optimistic exploration have been found in human behavior (Wilson et al., 2014; Wu et al., 2018a) and have been included in models of social learning (Ho et al., 2019).
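In code, this update rule is just a single line. Below is a minimal sketch (the helper name qUpdate is ours, not part of any package):
#Minimal sketch of the Q-learning update; the helper name qUpdate is ours
qUpdate <- function(Q, action, reward, alpha){
Q[action] <- Q[action] + alpha * (reward - Q[action]) #RPE-based update
return(Q)
}
qUpdate(Q = rep(0, 3), action = 1, reward = 5, alpha = 0.5) #e.g., returns 2.5 0.0 0.0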
Exercise 2.1. Can you modify the Q-learning model to be Bayesian? This will allow the model to learn not only the expected reward, but also quantify the level of uncertainty about the outcome of each option. You might be highly confident about the quality of your favorite coffee shop, but a single bad experience at a new restaurant might not be very strong evidence you should never return. One approach is to use a Kalman Filter (Dayan et al., 2000; Speekenbrink & Konstantinidis, 2015; Wu, Schulz, et al., 2022) to learn a distribution over rewards, with the simplifying assumption that each option corresponds to an independent Gaussian distribution and that learning follows linear Gaussian dynamics. Gershman (2015) shows how a Kalman filter provides a Bayesian analogue to the RPE learning updates of the Q-learning model we describe above.
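As a hint, a minimal sketch of the Kalman filter update for a single option might look like the following, assuming stationary rewards (innovation variance of zero) and a known error variance; the function name and default values are our own illustrative choices, not a definitive solution.
#Sketch of a Kalman filter update for one option (hint for Exercise 2.1); names and defaults are illustrative
kalmanUpdate <- function(m, v, reward, errVar = 1, innovVar = 0){ #m = posterior mean, v = posterior variance
kGain <- (v + innovVar) / (v + innovVar + errVar) #Kalman gain acts as an adaptive learning rate
m <- m + kGain * (reward - m) #mean update mirrors the RPE update above
v <- (1 - kGain) * (v + innovVar) #uncertainty shrinks with each new observation
return(c(mean = m, variance = v))
}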
Converting values to actions
In order to convert values to actions, we need a policy \(\pi\) defining a probability distribution over actions. Each action \(a\) is assigned probability \(\pi(a)\), which can be used to either predict or simulate future actions of the agent. One common method is to use a softmax policy, which selects actions proportional to their exponentiated Q-values:
\[ \begin{equation} \pi(a) \propto \exp\left(\beta Q(a)\right) \end{equation} \] \(\beta\) is the inverse temperature parameter, governing the amount of stochasticity in the policy. When \(\beta\) is small, the policy is more random, assigning higher probability to lesser-valued options. As \(\beta\) grows larger, the policy more strongly favors the option with the highest value. The code below visualizes the softmax policy across a range of \(\beta\) values and different combinations of Q-values for a simple 2-armed bandit.
softmax <- function(beta, Qvec){
p <- exp(beta*Qvec)
p <- p/sum(p) #normalize to sum to 1
return(p)
}
Qdiff <- seq(-5,5,length.out=100) #Let's fix the value of option 2 to 0 and then modulate the value of option 1 between -5 and 5
#Generate softmax policies for different beta values
softmaxDF <- data.frame()
for (beta in c(0.5, seq(1,4,by=1))){
sims <- sapply(Qdiff, FUN=function(diff) softmax(beta, c(diff,0))) #Compute softmax across the range of value differences
simDF <- data.frame(Qdiff = Qdiff, prob1 = sims[1,], prob2=sims[2,], beta = beta) #Put the value difference, choice probabilities, and beta parameter into a data frame
softmaxDF <- rbind(simDF, softmaxDF)
}
softmaxDF$beta <- factor(softmaxDF$beta)
ggplot(softmaxDF, aes(x = Qdiff, y = prob1, color = beta))+
geom_line()+
theme_classic()+
scale_color_viridis_d()+
labs(title = "Softmax policy",
x = expression(Q[1]-Q[2]),
y = 'Probability of action 1',
col = expression(beta))+
theme(legend.position = c(1,.1), legend.justification = c(1,0))
Softmax policy using a simplified case of a 2-armed bandit. Towards the right side, action \(a_1\) has a higher Q-value than the alternative \(a_2\), and has a higher probability of being chosen under the softmax policy.
Exercise 2.2. What other policies could you consider? Think about which strategies you tried in Tutorial 1 during the live demo of the individual learning experiment. Did you find that you sometimes chose an option randomly? You could describe that using an \(\epsilon\)-greedy policy (Sutton & Barto, 2018). Did you tend to explore the options you had visited the least? Then you could implement a count-based exploration policy (Machado et al., 2020) or an uncertainty-directed exploration policy such as upper confidence bound (UCB) sampling (Auer et al., 2002). Note that to implement the latter, you will need a Bayesian RL model (Exercise 2.1) in order to quantify the uncertainty of each option. The sketch below may help you get started.
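Here is a rough sketch of an \(\epsilon\)-greedy policy and a UCB-style rule; the function names are our own, and the UCB version assumes you already track a posterior mean and variance per option (e.g., from a Kalman filter).
#Sketch of an epsilon-greedy policy; function names are our own
epsilonGreedy <- function(Qvec, epsilon = 0.1){
p <- rep(epsilon / length(Qvec), length(Qvec)) #explore uniformly at random with probability epsilon
p[which.max(Qvec)] <- p[which.max(Qvec)] + (1 - epsilon) #otherwise exploit the highest-valued option
return(p)
}
#Sketch of UCB values, assuming posterior means and variances (e.g., from Exercise 2.1)
ucbValues <- function(means, variances, cWeight = 1){
means + cWeight * sqrt(variances) #expected value plus an uncertainty bonus; feed into softmax or argmax
}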
Simulating data
We're now ready to simulate some data by pairing this Q-learning model with the bandit environment we defined in Tutorial 1. We first need to specify some task parameters, such as the number of arms, the means and standard deviations of the rewards, the number of rounds, and the number of trials per round. Then we define the model parameters, starting with some sensible values of \(\alpha=.9\), \(\beta=1\), and \(Q_0=0\). The code block below runs the entire simulation and plots reward curves for each individual round (colored lines) and aggregated across agents and rounds (black line).
#Task parameters for bandit with Gaussian rewards
k <- 10 #number of arms
meanVec <- seq(-10,10, length.out=k) #Payoff means
sigmaVec <- rep(1, k) #Payoff stdevs
T <- 25 #total number of trials
nAgents <- 4
rounds <- 10
banditGenerator <- function(a) {#a is an integer or a vector of integers, selecting one of the 1:k arms of the bandit
payoff <- rnorm(n = length(a), mean = meanVec[a], sd = sigmaVec[a])
return (payoff)
}
#Model parameters
alpha <- .9 #learning rate
beta <- 1 #Softmax inverse temperature
Q0 <- Qvec <- rep(0,k) #prior initialization of Q-values
#Now simulate data for multiple agents over multiple rounds
simDF <- data.frame()
for (a in 1:nAgents){ #loop through agents
for (r in 1:rounds){ #loop through rounds
Qvec <- Q0 #reset Q-values
for (t in 1:T){ #loop through trials
p <- softmax(beta, Qvec) #compute softmax policy
action <- sample(1:k,size = 1, prob=p) #sample action
reward <- banditGenerator(action) #generate reward
Qvec[action] <- Qvec[action] + alpha*(reward - Qvec[action]) #update q-values
chosen <- rep(0, k) #create an index for the chosen option
chosen[action]<- 1 #1 = chosen, 0 = not
trialDF <- data.frame(trial = t, agent = a, round = r, Q = Qvec, action = 1:k, chosen = chosen, reward = reward)
simDF <- rbind(simDF,trialDF)
}
}
}
saveRDS(simDF, 'data/simChoicesQlearning.Rds')
#Plot results
ggplot(subset(simDF, chosen==1), aes(x = trial, y = reward, color = interaction(agent,round)))+
geom_line(size = .5, alpha = 0.5)+
stat_summary(fun = mean, geom='line', color = 'black', size = 1)+
theme_classic()+
xlab('Trial') +
ylab('Reward')+
ggtitle('Simulated performance')+
theme(legend.position='none')
Simulated performance of the Q-learning agents over multiple rounds. Each colored line is the learning curve of one agent in a single round, with the black line showing the mean across agents and rounds.
Demo 1: Tweaking individual learning parameters
Simulated behavior is sensitive to the choice of model parameters. Under the following link, you can access an interactive Shiny app to explore how different model parameters influence the simulated learning curves.
Likelihood function
Beyond simulating data, we also want to use computational models to describe experimental data collected from participants. Comparing which model provides a better description allows us to test different hypotheses, while the estimated parameters of the winning model provide an interpretable characterization of behavior.
To fit a model to data, we first need to define a likelihood function:
\[ \begin{equation} P(D|\theta) \end{equation} \] The likelihood function describes the likelihood of the observed data \(D\) (i.e., actions made by a subject) given a set of parameters \(\theta\) (e.g., \(\alpha\) and \(\beta\)).
Since we are hardly ever modeling a single behavioral data point, but rather a sequence of observations (e.g., over multiple trials of learning and multiple rounds of a task), we need to describe the joint likelihood over all observations under consideration \(P(D|\theta) = \prod_t P(d_t|\theta)\), which is the product of the likelihood of each observation \(P(d_t|\theta)\). This becomes much easier using logarithms, since we can replace multiplication with summation in log-space and simply sum over the log-likelihoods:
\[ \begin{equation} \log P(D|\theta) = \sum_t \log P(d_t|\theta) \end{equation} \] Note that natural logs \(\ln\) are most commonly used rather than base 10 logarithms \(\log_{10}\).
The (natural) log of any probability will be less than or equal to zero (since probabilities are never greater than 1). Thus, it's more convenient to express the fit of a model in terms of the negative log-likelihood (nLL), by simply inverting the sign of the previous equation:
\[ \begin{equation} nLL= -\sum_t \log P(d_t|\theta) \end{equation} \] The nLL will always be a value greater than zero, with smaller values indicating a better fit. Thus, nLL expresses the amount of error or loss in quantifying how well a model (given a set of parameters) fits the data, and is sometimes also called “log loss”.
In practice, we can define a likelihood function (using very similar code to our model simulations) that takes a set of parameters and participant data, and then returns the nLL. The likelihood function iteratively loops through the trials of the task and predicts the next action based on the current policy \(\pi_t\). We then take the log probability of the actual action that was selected, and use that to update our running tally of the nLL. Remember that we also need to use the actual action and reward observations to update the value function of the agent in order for the model predictions to evolve following the learning dynamics of our model.
Lastly, we wrap the likelihood function in an optimization routine in order to find the set of parameters that minimizes the nLL. By minimizing the loss of the model, we arrive at the most likely set of parameters, known as the Maximum Likelihood Estimate (MLE):
likelihood <- function(params, data, Q0=0){ #We assume that prior value estimates Q0 are fixed to 0 and not estimated as part of the model
names(params) <- c('alpha', 'beta') #name parameter vec
nLL <- 0 #Initialize negative log likelihood
rounds <- max(data$round)
trials <- max(data$trial)
for (r in 1:rounds){ #loop through rounds
Qvec <- rep(Q0,k) #reset Q-values each new round
for (t in 1:trials){ #loop through trials
p <- softmax(params['beta'], Qvec) #compute softmax policy
trueAction <- subset(data, chosen==1 & trial==t & round == r)$action
negativeloglikelihood <- -log(p[trueAction]) #compute negative log likelihood
nLL <- nLL + negativeloglikelihood #update running count
Qvec[trueAction] <- Qvec[trueAction] + params['alpha']*(subset(data, chosen==1 & trial==t & round == r)$reward - Qvec[trueAction]) #update q-values
}
}
return(nLL)
}
#Now let's optimize the parameters
init <- c(1,1) #initial guesses
lower <- c(0,-Inf) #lower and upper limits. We use very liberal bounds here, but you may want to set stricter bounds for better results
upper <- c(1,Inf)
MLE <- optim(par=init, fn = likelihood,lower = lower, upper=upper, method = 'L-BFGS-B', data = subset(simDF, chosen==1 & agent==1) )
Using the data we simulated earlier with \(\alpha\) = 0.9 and \(\beta\) = 1.0, we arrive at an MLE of \(\hat{\alpha}\) = 0.95 and \(\hat{\beta}\) = 1.10. This MLE corresponds to a negative log likelihood of about 109.57, quantifying how good of a fit it provides.
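These numbers are read straight from the output of optim(), which returns the fitted parameters in $par and the minimized nLL in $value:
round(MLE$par, 2) #fitted alpha and beta
MLE$value #negative log likelihood at the MLE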
Loss landscape
We can also compute nLLs across a range of parameter combinations in order to visualize the loss landscape. The plot below illustrates how some regions of the parameter space produce similar fits to the data.
#Create a grid of plausible parameter combinations
paramCombs <- expand.grid(alpha = seq(0,1, length.out=20), beta = seq(0,20, length.out=20))
#Now compute nLLs for each grid combination
nLLs <- sapply(1:nrow(paramCombs), FUN=function(i) likelihood(as.numeric(paramCombs[i,]), subset(simDF, chosen==1 & agent==1))) #This can be a bit slow
paramCombs$nLL <- nLLs #add to dataframe
bestFit <- paramCombs[paramCombs$nLL == min(nLLs),] #best fitting combination
#plot data
ggplot(paramCombs)+
geom_tile(aes(x = alpha, y = beta, fill = nLL)) +
geom_contour(aes(x = alpha, y = beta, z = nLL), color = 'white')+
geom_point(data=data.frame(alpha=alpha, beta=beta, type = 'true'), aes(alpha,beta, shape = type), size = 4, color = 'red')+ # true value
geom_point(data=data.frame(alpha=bestFit$alpha, beta=bestFit$beta, type = 'MLE'), mapping=aes(alpha,beta, shape = type), size = 4, color = 'red')+ # best fitting value
scale_fill_viridis_c('nLL', direction = -1)+
labs(title = "Loss landscape",
x = expression(alpha),
y = expression(beta))+
scale_shape_manual(values= c(4, 0), name = '')+
theme_classic()+
scale_x_continuous(expand = c(0, 0))+
scale_y_continuous(expand = c(0, 0))
Loss landscape over different parameter combinations. The MLE does not always provide a close match to the true generating parameters, and the white contour lines show areas of the parameter space providing similar fits.
Scaling up to more complex problems
While the learning mechanisms we have presented are very simple, similar principles can be used in arbitrarily complex problems. Here, we consider a few examples across the three dimensions of complexity we introduced in Tutorial 1.
Environmental complexity: Structured rewards
So far, the learning models we have introduced assume that each option in the environment has independent rewards. This means that everything our farmer has learned about growing wheat and potatoes provides no information about new crops like barley or corn. This corresponds to a tabular RL approach.
In contrast, many environments have rich underlying structure. When foraging for mushrooms, you will often find them clustered in similar locations. When trying out new cooking recipes, similar combinations of ingredients produce similar flavors. And friends in similar social circles often have similar tastes in music or cinema. This kind of structure can be captured by learning models that use function approximation to learn a value function across the choice landscape, using principles of interpolation and extrapolation to generalize from a small handful of observations to a potentially infinite range of possibilities.
One framework that has been widely used to model how humans generalize in these settings is Gaussian process (GP) regression (Wu, Meder, et al., 2025), which provides a Bayesian framework that can be applied to spatial structure (Wu et al., 2018b), graph structures (Wu et al., 2021), or any arbitrary set of abstract features (Wu et al., 2020). Below, we provide a very simple demonstration of the tabular Q-learning model in comparison to function approximation using a GP, where the orange diamonds represent the same set of reward observations given to each model.
# Generate some synthetic data along an input space of 1 - 25 ---
inputs <- 1:25
simData <- data.frame(
x = c(1,2,3,6,18,10,11,15,16),
y = c(-38, -34, -32, -18, 14, 16, 24, 33, 29)
)
# Q-learning: Tabular RL
alpha <- 0.7 # learning rate
Q0 <- 0 # prior/initial value
Q <- rep(Q0, length(inputs))
for (i in seq_len(nrow(simData))) {
a <- simData$x[i]
r <- simData$y[i]
Q[a] <- Q[a] + alpha * (r - Q[a])
}
#Gaussian process regression: Function approximation
# Radial Basis Function (squared exponential) kernel
rbf <- function(A, B, lambda = 1, signalVariance = 1) { # A, B can be numeric vectors or 1D matrices, lambda is the length scale (how smooth), and signal is the range of expected reward values
A <- as.matrix(A); B <- as.matrix(B)
if (ncol(A) != ncol(B)) stop("A and B must have same number of columns.")
sq <- matrix(0, nrow(A), nrow(B))
for (i in seq_len(ncol(A))) { # pairwise squared distance along each dimension
sq <- sq + (outer(A[, i], B[, i], "-") / lambda)^2
}
signalVariance * exp(-0.5 * sq)
}
# GP regression with mean function mu_0
# kfun is a function(A, B) -> covariance matrix
gpr <- function(X.test, X, Y, kfun, mu_0 = 0, noise = 1e-6) {
X <- as.matrix(X); Xstar <- as.matrix(X.test)
y <- as.numeric(Y - mu_0)
#GP math, see Wu et al., 2018 for details
K <- kfun(X, X) + diag(noise, nrow(X))
Kxs <- kfun(X, Xstar)
Kss <- kfun(Xstar, Xstar)
Kinv <- chol2inv(chol(K))
mu <- t(Kxs) %*% Kinv %*% y
covS <- Kss - t(Kxs) %*% Kinv %*% Kxs
# Numerical safety for any tiny negatives on the diagonal
var <- pmax(diag(covS), 0)
data.frame(mu = as.numeric(mu + mu_0), var = as.numeric(var))
}
#prepare plots
Qvalues <- data.frame(input = inputs, mu = Q)
# --- Plot 1: Q-learning reward predictions ---
pQ <- ggplot(Qvalues, aes(x = input, y = mu)) +
geom_col(fill = "#56B4E9", width = 0.9) +
geom_point(data = simData, aes(x = x, y = y),
fill = "#E69F00", shape = 23, size = 2) +
theme_classic() +
coord_cartesian(xlim = c(0, 26), ylim = c(-50, 50), expand = FALSE) +
xlab("Option") + ylab("Q-learning Reward Prediction") + ggtitle("Tabular RL (Q-learning)")
# GP posterior
GPinputs <- seq(0, 25, by = 0.1) #here we can define the posterior over a denser input grid for visualization
GPpost <- gpr( # Fit GP to (x, y) with an RBF kernel and prior mean 0
X.test = GPinputs,
X = simData$x,
Y = simData$y,
kfun = function(A, B) rbf(A, B, lambda = 2, signalVariance = 100), #RBF kernel
mu_0 = 0, #prior mean
noise = 1e-6
)
GPpost$input <- GPinputs
# --- Plot 2: GP reward predictions (pGP1) ---
pGP <- ggplot(GPpost, aes(x = input, y = mu)) +
geom_line(color = "#009E73", linewidth = 1) +
geom_ribbon(aes(ymin = mu - 1.96 * sqrt(var), ymax = mu + 1.96 * sqrt(var)),
alpha = 0.3, fill = "#009E73") +
geom_point(data = simData, aes(x = x, y = y),
fill = "#E69F00", shape = 23, size = 2) +
theme_classic() +
coord_cartesian(xlim = c(0, 26), ylim = c(-50, 50), expand = FALSE) +
xlab("Option") + ylab("GP Reward Prediction") + ggtitle('Function Approximation (GP)')
p <- plot_grid(pQ, pGP)
p
In social settings, generalization is just as important. Many imitation-based approaches to social learning assume that social information can simply be copied verbatim. However, this is rarely the case when there are individual differences in preferences, goals, or skills. If you are a vegetarian but copy the food choices of a carnivore, you will find yourself hungry. If you follow a random person on the subway, you're very unlikely to arrive at your destination. And if you try to imitate the moves of a professional snowboarder, you might even find yourself in the hospital. Thus, social information often needs to be taken with a "grain of salt".
You can check out Witt et al. (2024) to find out how our social generalization model extends the GP framework to account for social information from people with different reward functions, and how it yields better normative performance and provides better descriptive accounts of human behavior than decision-biasing (DB) or value shaping (VS).
Cognitive Complexity: Theory of Mind
So far, we have described some of the simplest social learning mechanisms. With frequency-dependent imitation or decision-biasing, an agent directly imitates socially observed behaviors. With value shaping, social information translates rather directly into a value bonus for the observed action, without requiring much interpretation. A rough sketch of both is shown below.
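Here, the mixing weight gamma and the bonus weight eta are illustrative parameter names of our own; they simply stand in for however these mechanisms are parameterized.
#Rough sketch of decision-biasing and value shaping; gamma and eta are illustrative parameters
decisionBiasing <- function(p_ind, socialFreq, gamma = 0.5){
p_soc <- socialFreq / sum(socialFreq) #normalize the frequencies of socially observed choices
(1 - gamma) * p_ind + gamma * p_soc #mix the individual policy with the social frequencies
}
valueShaping <- function(Qvec, socialFreq, eta = 1){
Qvec + eta * socialFreq #socially observed choices add a bonus to the value function
}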
However, this neglects an important aspect of what makes human social learning so powerful, which is our ability to “unpack” observed actions into imputed mental states using Theory of Mind (ToM). We can describe ToM as a form of inference, where observed behaviors are used to infer a variety of different mental representations.
Value inference
Actions can be used to infer the other person's values or goals, such as when we recognize that an individual sprinting across the street is trying to catch a bus, or when we observe someone's food choices at a buffet and realize they are quite likely a vegetarian. Performing these inferences allows us to use social information more flexibly, tailored to our own personal circumstances. For instance, we might decide whether to also chase after the sprinting individual based on whether we think it's the same bus we need. Or we might imitate some, but not all, of the food choices of another person with different dietary restrictions.
One formal framework for describing value inference is Inverse Reinforcement Learning (IRL; Sutton & Barto, 2018; Jara-Ettinger, 2019). Using Bayesian inference, we can compute a posterior distribution over the Values (e.g., Q-values) that gave rise to the observed Actions:
\[ P(V|A) \propto P(A|V)P(V) \] To perform this computationally, we need to integrate over all possible value functions \(P(V)\) and evaluate the likelihood at each possible value function \(P(A|V)\). This type of computation is clearly intractable for almost all problems, since there are an infinite number of possible value functions someone could have. Nevertheless, it provides a rational framework that has been used to uncover inductive biases that simplify the required computations. For instance, the principle of efficient action is a bias that assumes other people are acting in the most efficient manner (Gergely & Csibra, 2003). This allows us to constrain the likelihood to be a value-maximizing policy, although it still does not necessarily rescue the computations of IRL from intractability.
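To make the shape of this computation concrete, here is a toy version over a small discrete set of candidate value functions, using a softmax likelihood; the candidate values, prior, and observed actions are all made up for illustration.
#Toy value inference over a discrete set of candidate value functions; all numbers are illustrative
candidateV <- list(c(1, 0, 0), c(0, 1, 0), c(0, 0, 1)) #three hypotheses about which of 3 options is valued
prior <- rep(1/3, 3) #uniform prior P(V)
observedActions <- c(1, 1, 2) #three observed choices
lik <- sapply(candidateV, function(V) prod(softmax(2, V)[observedActions])) #P(A|V) under a softmax policy
posterior <- lik * prior / sum(lik * prior) #P(V|A); most mass on the hypothesis that option 1 is valued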
Belief inference
Beyond inferring values, we can also use the same IRL framework to decompose actions into even more primitive building blocks of representation. Social information can be used to make inferences about someone’s Beliefs about the causal structure of the world (“she thinks local markets have fresher produce”), or about their intrinsic Reward specification (“she loves the taste of broccoli”). We can use a similar formalism as above to describe this form of inference:
\[ P(B,R|A) \propto P(A|B,R)P(B,R) \] The more we use ToM to decompose actions into their primitives, the more flexibility it affords. By inferring values, beliefs, and intrinsic reward representations from observed actions, we can better integrate social information into our own individual circumstances, goals, and preferences. However, this additional flexibility comes with substantial computational costs, and it is an exciting line of research to better understand how humans navigate this trade-off.
Combining multiple forms of complexity
While the learning mechanisms we have presented are very simple, the same principles can be used in arbitrarily complex problems.
For instance, Wu, Deffner, et al. (2025) uses an immersive collective foraging task implemented in the Minecraft game engine, where a realistic field of view creates attentional trade-offs for social learning (looking at other players for imitation comes at the cost of slower individual foraging). The task provides rich spatial and visual dynamics, which can nevertheless be incorporated into a computational model by sequentially predicting each of the \(k\) blocks participants destroy:
\[ P(\text{Choice}_{k+1}) \propto \exp(\mathbf{f}_k\cdot \mathbf{w}) \] This is modeled as a softmax choice function over a linear combination of features \(\mathbf{f}\) times weights \(\mathbf{w}\). The features capture hypotheses about individual and social learning, which incorporate rich spatial and visual dynamics of the task, while model weights are hierarchically estimated using Bayesian methods in STAN (see Tutorial 4 from COSMOS 2023) in order to control for individual and group-level variability.
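As a schematic of what this choice rule looks like in code, the snippet below computes choice probabilities from a feature matrix and a weight vector; the feature values and weights are made up for illustration, not taken from the paper.
#Schematic of the feature-based softmax choice rule; feature values and weights are made up
f <- matrix(rnorm(5 * 3), nrow = 5, ncol = 3) #5 candidate blocks x 3 hypothetical features
w <- c(1, 0.5, -0.2) #feature weights
choiceProbs <- exp(f %*% w) / sum(exp(f %*% w)) #softmax over the linear combination f . w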
Evolutionary simulations
Another important computational tool for studying social learning is the use of evolutionary simulations. Since social learning has frequency-dependent fitness (Rogers, 1988), the best strategy to use depends on what others in the population are deploying. Thus, in order to determine the best normative strategy, it is often helpful to use evolutionary simulations (Tump et al., 2019; Witt et al., 2024):
- Initialize a population of agents
- Simulate performance on the task
- Select agents to seed the next generation (e.g., based on performance)
- Add mutation (change agent type, modify parameters)
- Repeat until convergence
Below, we provide some code to run an evolutionary simulation over 250 generations, starting from a population split evenly between individual and social learners, to see which strategy comes to dominate. The results can often be quite noisy, and usually need to be aggregated over multiple replications for robustness.
set.seed(1234)
#Define simulation parameters
popSize <- 100 #population size of each generation
groupSize <- 10 #number of agents playing each game
indPop <- 50 #number of individual learners in the initial population
socPop <- 50 #number of social learners in the initial population
initial <- cbind( #initial agent parameters, stored in a matrix
c(rep(0, indPop), rep(1, socPop)), #0 = ind agent, 1 = soc agent
c(runif(indPop), rep(NA, socPop)), #alpha (only for ind agents), sampled from a uniform distribution
c(rexp(indPop), rep(NA, socPop)), #beta (only for ind agents), sampled from an exponential distribution
c(rep(NA,indPop), rexp(socPop)) #theta (only for soc agents), sampled from an exponential distribution
)
Tsteps <- 25 #Trials in each game
k<- 10 #arms of the bandit
meanVec <- seq(-1, 1, length.out = k)
sigmaVec <- rep(2, k)
banditGenerator <- function(a) {#a is an integer or a vector of integers, selecting one of the 1:k arms of the bandit
payoff <- rnorm(n = length(a), mean = meanVec[a], sd = sigmaVec[a])
return (payoff)
}
typeMutation <- .05 #rate of type mutation: ind <--> soc
paramMutation <- .1 #rate of parameter mutation
paramMutationVariance <- .1 #How much parameters mutate
#Evolution function
doTheEvolution <- function(initialPop, generations = 250){
outputList <- list()
outputList[[1]] <- initialPop
for (gen in 2:generations){
pop <- outputList[[gen - 1]] # current population (matrix)
winners <- rep(0L, popSize)
# Indices of agent types
indAgents <- which(pop[,1] == 0)
socAgents <- which(pop[,1] == 1)
# Assign agents to games (rows = games, cols = positions in the game)
agentAssignments <- matrix(NA, popSize, groupSize)
probs <- rep(1/popSize, popSize)
for (iter in 1:popSize){
ids <- sample(1:popSize, groupSize, prob = probs, replace = FALSE)
probs[ids] <- probs[ids] / groupSize
agentAssignments[iter,] <- ids
}
# Play each game
for (i in 1:popSize){
Qmat <- matrix(0, nrow = groupSize, ncol = k) # Q-values for agents in THIS game
socialObs <- rep(NA_integer_, groupSize) # last actions (for social frequency)
gameScore <- rep(0, groupSize)
for (t in 1:Tsteps){
for (j in 1:groupSize){
agent <- agentAssignments[i, j]
if (agent %in% indAgents){
# --- individual learner: softmax over own Q row ---
alpha <- pop[agent, 2]
beta <- pop[agent, 3]
prob_ind <- softmax(beta, Qmat[j, ]) #softmax over the agent's own Q-values (softmax function defined earlier)
action <- sample.int(k, size = 1, prob = prob_ind)
reward <- banditGenerator(action)
gameScore[j] <- gameScore[j] + reward
Qmat[j, action] <- Qmat[j, action] + alpha * (reward - Qmat[j, action])
socialObs[j] <- action
} else if (agent %in% socAgents){
# --- social learner: frequency-dependent copying ---
theta <- pop[agent, 4]
if (t == 1){
prob_soc <- rep(1/k, k) # no observations yet → uniform
} else {
# counts of neighbors' actions (exclude self), tiny epsilon to avoid zeros
counts <- table(factor(socialObs[-j], levels = 1:k))
counts <- as.numeric(counts) + 1e-6
prob_soc <- counts^theta
prob_soc <- prob_soc / sum(prob_soc)
}
action <- sample.int(k, size = 1, prob = prob_soc)
reward <- banditGenerator(action)
gameScore[j] <- gameScore[j] + reward
# Social learners do not maintain Q-values in this simple model; a hybrid agent could add an update rule here
socialObs[j] <- action
}
}
}
# Winner of game i reseeds: pick the agent with highest gameScore
winners[i] <- agentAssignments[i, which.is.max(gameScore)]
}
# New generation
newPop <- pop[winners, , drop = FALSE]
# Mutations
mutateType <- runif(popSize) < typeMutation
mutateParams <- runif(popSize) < paramMutation
for (player in 1:popSize){
if (mutateType[player]){
if (newPop[player,1] == 0){ # Ind → Soc
newPop[player, ] <- c(1, NA, NA, rexp(1, rate = 1))
} else { # Soc → Ind
newPop[player, ] <- c(0, runif(1), rexp(1), NA)
}
}
if (mutateParams[player]){
if (newPop[player,1] == 0){
# tweak individual params (alpha in (0,1), beta > 0)
newPop[player, c(2,3)] <- newPop[player, c(2,3)] + rnorm(2, 0, sqrt(paramMutationVariance))
if (is.na(newPop[player,2]) || newPop[player,2] <= 0 || newPop[player,2] >= 1) newPop[player,2] <- runif(1)
if (is.na(newPop[player,3]) || newPop[player,3] <= 0) newPop[player,3] <- rexp(1)
} else {
# tweak social param theta > 0
newPop[player, 4] <- newPop[player, 4] + rnorm(1, 0, sqrt(paramMutationVariance))
if (is.na(newPop[player,4]) || newPop[player,4] <= 0) newPop[player,4] <- rexp(1)
}
}
}
outputList[[gen]] <- newPop
}
return(outputList)
}
#Run simulations
outputList <- doTheEvolution(initial, generations = 250)
#Convert output to dataframe
evoSimDF <-rbindlist(lapply(1:length(outputList), function(i) data.frame(outputList[[i]], generation = rep(i, popSize))))
colnames(evoSimDF) <- c('agentType', 'alpha', 'beta', 'theta', 'generation')
evoSimDF$agentType <- factor(evoSimDF$agentType, labels = c('Ind', 'Soc'))
#What ratio of agents evolve?
agentRatio <- evoSimDF %>% group_by(generation) %>% summarize(Ind = mean(agentType=='Ind'), Soc = mean(agentType == 'Soc'))
agentRatio <- agentRatio %>% pivot_longer(cols = -generation, names_to = 'AgentType', values_to = 'Prop')
p1 <- ggplot(agentRatio, aes(x = generation, y = Prop, color = AgentType))+
geom_line()+
theme_classic()+
labs(x ="Generation", y = "Proportion")+
scale_color_manual(values = c("#d7191c","#2b83ba"), name = "")+
theme(legend.position = c(1,1), legend.justification = c(1,1), legend.background=element_blank())
#How do parameters evolve?
paramDF <- evoSimDF %>% group_by(generation) %>% summarize(alpha = mean(alpha, na.rm = TRUE), beta = mean(beta, na.rm = TRUE), theta = mean(theta, na.rm = TRUE))
paramDF <- paramDF %>% pivot_longer(cols = -generation, names_to = 'param')
p2 <- ggplot(paramDF, aes(x = generation, y = value, color = param))+
geom_line()+
theme_classic()+
labs(x ="Generation", y = "Parameter Value")+
scale_color_manual(values = c("#E69F00", "#009E73", "#56B4E9"), name = "")+
theme(legend.position = c(0.0,1), legend.justification = c(0,1), legend.background=element_blank())
plot_grid(p1,p2, labels = 'auto')
Evolutionary simulations. (a) Proportion of individual vs. social learners across generations. (b) Mean parameter values across generations.
Social Complexity: Network structures
Previously, we assumed that all agents have access to information about all other agents. However, social interactions are often structured by the environment, the history of previous interactions, or a variety of other forms of social structure. Consider what network structures define which set of people you have access to for social information, informing you about news, scientific ideas, or the latest fashions.
Here, we consider how the nature of social network structures shapes how information flows through groups of individuals, and how this influences their behavior. One key insight is that the connectivity of a network influences the exploration-exploitation trade-off at the group level. More locally connected networks restrict the flow of information, allowing a greater diversity of solutions to co-exist in the group and leading to more exploration. These types of networks typically perform better on more complex, high-dimensional problems (Lazer & Friedman, 2007), such as in cultural evolution settings (Derex & Boyd, 2016). In contrast, more fully connected networks enable a faster flow of information, leading the group to converge more quickly on exploiting what is thought to be the best solution. These networks often perform best in relatively smooth environments without too many local optima (Mason & Watts, 2012). More generally, the ideal network structure is shaped by both the demands of the environment (Barkoczi et al., 2016) and the nature of the task (Xi & Toyokawa, 2025).
Here, we provide a very simple illustration of how different network structures (local vs. fully connected) perform on a 50-armed bandit task with a population of decision-biasing agents. In contrast to our previous simulations, here the agents only have access to social information from other agents they are explicitly connected to via the social network.
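As a preview of how such structures can be encoded, here is a minimal sketch using adjacency matrices for a fully connected network and a ring lattice; the group size, neighborhood size, and object names are our own choices.
#Minimal sketch of two network structures as adjacency matrices; sizes and names are our own choices
nNodes <- 16
fullyConnected <- matrix(1, nNodes, nNodes) - diag(nNodes) #everyone observes everyone else
ringLattice <- matrix(0, nNodes, nNodes) #locally connected: each agent observes 2 neighbors on each side
for (i in 1:nNodes){
neighbors <- (c(i - 2, i - 1, i + 1, i + 2) - 1) %% nNodes + 1 #wrap around the ring
ringLattice[i, neighbors] <- 1
}
#In the simulation, agent i would then only use social observations from which(ringLattice[i, ] == 1)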