Tutorial 2 - Models of social and individual learning

Charley Wu and Wataru Toyokawa

2025-09-29

packages <- c('tidyverse', 'nnet', 'data.table', 'cowplot')
#invisible(lapply(packages, install.packages, character.only = TRUE)) #Install packages if any are missing
invisible(lapply(packages, require, character.only = TRUE)) #Load all packages
options(dplyr.summarise.inform = FALSE) #suppress annoying messages
knitr::opts_chunk$set(warning = FALSE, message = FALSE) 
set.seed(1234) #set seed for reproducibility

This is the R Notebook used in Tutorial 2 of the COSMOS summer school. In this session, we focus on describing and programming computational models of both individual and social learning mechanisms. We use these models to simulate data, and we also show how they can be fit to data (either from participants or from simulations) to infer parameters. Let’s jump in!

Brisk Introduction to Reinforcement Learning

We start with a very quick introduction to the Reinforcement Learning (RL) framework by Sutton & Barto (2018), which describes how an agent (either biological or machine) can learn how to behave intelligently by interacting with the environment. At each time point \(t\), the agent selects an action \(a \in A\) from a set of possible actions \(A\), and then receives feedback from the environment in the form of a new state \(s_t\) and a reward observation \(r_t\). Since we will focus on a simple bandit task, we can ignore the state information for now and assume that rewards are solely a function of the action \(r_t := R(a_t)\).

Adapted from Sutton & Barto (2018)

Learning a value function

We will start with a Q-learning model (Watkins & Dayan, 1992), as one of the simplest RL models. By trying out actions and receiving rewards, the Q-learning agent learns a value function for each action \(Q(a)\) describing the expected reward. With each reward observation, the current Q-value \(Q_t(a)\) is updated with the following equation:

\[ \begin{equation} Q_{t+1}(a) \leftarrow Q_t(a) + \alpha\color{red}{[r_t(a) - Q_t(a) ]} \end{equation} \] The term colored red is known as the reward prediction error (RPE). It captures the difference between the expected reward \(Q_t(a)\) and the actual reward observation. When the observed reward is lower than expected, we revise our expectation downwards, and vice versa.

The amount we update our predictions is governed by the learning rate parameter \(\alpha\), which is bounded to the range \([0,1]\). This is often implemented as a free parameter, which is then estimated from data to describe a characteristic of the learner(s). We can estimate \(\alpha\) separately for different experimental manipulations to see how learning changes, for example, in contexts of punishment vs. reward (Palminteri et al., 2015) or how group size influences the rate of individual learning in social contexts (Toyokawa et al., 2019).

The Q-learning model also requires some prior initialization of Q-values, which can often be set to \(Q_{t=0}=0\). However, there are also settings where optimistic initialization (Sutton & Barto, 2018) allows agents to learn and explore more efficiently. Signatures of optimistic exploration have been found in human behavior (Wilson et al., 2014; Wu et al., 2018a) and have been included in models of social learning (Ho et al., 2019).
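To make this concrete, here is a minimal sketch of the update rule as an R function (the function name and arguments are illustrative; the simulation code later in this tutorial implements the same update inline):

q_learning_update <- function(Qvec, action, reward, alpha){
  rpe <- reward - Qvec[action] #reward prediction error
  Qvec[action] <- Qvec[action] + alpha * rpe #shift the estimate towards the observed reward
  return(Qvec)
}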

Exercise 2.1. Can you modify the Q-learning model to be Bayesian? This will allow the model to learn not only the expected reward, but also quantify the level of uncertainty about the outcome of each option. You might be highly confident about the quality of your favorite coffee shop, but a single bad experience at a new restaurant might not be very strong evidence you should never return. One approach is to use a Kalman Filter (Dayan et al., 2000; Speekenbrink & Konstantinidis, 2015; Wu, Schulz, et al., 2022) to learn a distribution over rewards, with the simplifying assumption that each option corresponds to an independent Gaussian distribution and that learning follows linear Gaussian dynamics. Gershman (2015) shows how a Kalman filter provides a Bayesian analogue to the RPE learning updates of the Q-learning model we describe above.
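As a starting point for this exercise, here is a minimal sketch of a Kalman filter update for a single option, assuming stationary rewards with independent Gaussian noise of known variance; the names (e.g., obsNoise) are illustrative rather than part of the tutorial code:

kalman_update <- function(m, v, reward, obsNoise = 1){ #m = posterior mean, v = posterior variance for one option
  k_gain <- v / (v + obsNoise) #Kalman gain: acts like a learning rate that shrinks as uncertainty decreases
  m <- m + k_gain * (reward - m) #mean update mirrors the RPE update of Q-learning
  v <- (1 - k_gain) * v #posterior uncertainty decreases with each observation
  return(c(mean = m, variance = v))
}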

Converting values to actions

In order to convert values to actions, we need a policy \(\pi\) defining a probability distribution over actions. Each action \(a\) is assigned probability \(\pi(a)\), which can be used to either predict or simulate future actions of the agent. One common method is to use a softmax policy, which selects actions proportional to their exponentiated Q-values:

\[ \begin{equation} \pi(a) \propto \exp\left(\beta Q(a)\right) \end{equation} \] \(\beta\) is the inverse temperature parameter, governing the amount of stochasticity in the policy. When \(\beta\) is small, there is more randomness in the policy, with higher probability given to lesser-valued options. As \(\beta\) grows larger, the policy tends to more strongly favor the option with the highest value. The code below visualizes the softmax policy across a range of different \(\beta\) values and at different combinations of Q-values for a simple 2-armed bandit.

softmax <- function(beta, Qvec){
  p <- exp(beta*Qvec)
  p <- p/sum(p) #normalize to sum to 1
  return(p)
}

Qdiff <- seq(-5,5,length.out=100) #Let's fix the value of option 2 to 0 and then modulate the value of option 1 between -5 and 5

#Generate softmax policies for different beta values
softmaxDF <- data.frame()
for (beta in c(0.5, seq(1,4,by=1))){
  sims <- sapply(Qdiff, FUN=function(diff) softmax(beta, c(diff,0))) #Compute softmax across the range of value differences
  simDF <- data.frame(Qdiff = Qdiff, prob1 = sims[1,], prob2=sims[2,], beta = beta) #Put the value difference, choice probabilities, and beta parameter into a data frame
  softmaxDF <- rbind(simDF, softmaxDF)
}
softmaxDF$beta <- factor(softmaxDF$beta)

ggplot(softmaxDF, aes(x = Qdiff, y = prob1, color = beta))+
  geom_line()+
  theme_classic()+
  scale_color_viridis_d()+
  labs(title = "Softmax policy",
       x = expression(Q[1]-Q[2]),
       y = 'Probability of action 1', 
       col = expression(beta))+
  theme(legend.position = c(1,.1), legend.justification = c(1,0))
Softmax policy using a simplified case of a 2-armed bandit. Towards the right side, action \(a_1\) has a higher Q-value than the alternative \(a_2\), and has a higher probability of being chosen under the softmax policy.

Exercise 2.2. What other policies could you consider? Think about which strategies you tried in Tutorial 1 during the live demo of the individual learning experiment. Did you find that you sometimes chose an option randomly? You could describe that using an \(\epsilon\)-greedy policy (Sutton & Barto, 2018). Did you tend to explore the options you had visited the least? Then you could implement a count-based exploration policy (Machado et al., 2020) or an uncertainty-directed exploration policy such as upper confidence bound (UCB) sampling (Auer et al., 2002). Note that to implement the latter, you will need to have implemented a Bayesian RL model (Exercise 2.1) in order to quantify the uncertainty for each option. Minimal sketches of both policies are shown below.
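As a hint, here are minimal sketches of the two policies, assuming a vector of Q-values Qvec and, for UCB, a vector of posterior standard deviations (e.g., from the Kalman filter in Exercise 2.1); the names and parameter values are illustrative:

epsilon_greedy <- function(Qvec, epsilon = 0.1){
  p <- rep(epsilon/length(Qvec), length(Qvec)) #with probability epsilon, choose uniformly at random
  best <- which(Qvec == max(Qvec)) #index of the greedy option(s)
  p[best] <- p[best] + (1 - epsilon)/length(best) #remaining probability mass goes to the greedy choice(s)
  return(p)
}

ucb <- function(Qvec, uncertainty, c_ucb = 1){ #returns upper confidence bounds; choose the max or pass to softmax()
  Qvec + c_ucb * uncertainty
}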

Simulating data

We’re now ready to simulate some data with this Q-learning model, paired with the bandit environment we defined in Tutorial 1. We first need to specify some task parameters, such as the number of arms, the means and standard deviations of rewards, the number of rounds, and the number of trials per round. Then we need to define the model parameters, where we start with some sensible values of \(\alpha=.9\), \(\beta=1\), and \(Q_0=0\). The code block below describes the entire simulation procedure and plots the reward curves for each agent and round (colored lines), along with the average across all simulations (black line).

#Task parameters for bandit with Gaussian rewards
k <- 10 #number of arms
meanVec <- seq(-10,10, length.out=k) #Payoff means
sigmaVec <- rep(1, k) #Payoff stdevs
T <- 25 #total number of trials
nAgents <- 4
rounds <- 10

banditGenerator <- function(a) {#a is an integer or a vector of integers, selecting one of the 1:k arms of the bandit
  payoff <- rnorm(n = length(a), mean = meanVec[a], sd = sigmaVec[a])
  return (payoff)
}

#Model parameters
alpha <- .9 #learning rate
beta <- 1 #Softmax inverse temperature
Q0 <- Qvec <-  rep(0,k) #prior initialization of Q-values

#Now simulate data for multiple agents over multiple rounds
simDF <- data.frame()
for (a in 1:nAgents){ #loop through agents
  for (r in 1:rounds){ #loop through rounds
    Qvec <- Q0 #reset Q-values
    for (t in 1:T){ #loop through trials
      p <- softmax(beta, Qvec) #compute softmax policy
      action <- sample(1:k,size = 1, prob=p) #sample action
      reward <- banditGenerator(action) #generate reward
      Qvec[action] <- Qvec[action] + alpha*(reward - Qvec[action]) #update q-values
      chosen <- rep(0, k) #create an index for the chosen option
      chosen[action]<- 1 #1 = chosen, 0 = not
      trialDF <- data.frame(trial = t, agent = a, round = r, Q = Qvec, action = 1:k, chosen = chosen, reward = reward)
      simDF <- rbind(simDF,trialDF)
    }
  }
}

saveRDS(simDF, 'data/simChoicesQlearning.Rds')
#Plot results
ggplot(subset(simDF, chosen==1), aes(x = trial, y = reward, color = interaction(agent,round)))+
  geom_line(size = .5, alpha = 0.5)+
  stat_summary(fun = mean, geom='line', color = 'black', size = 1)+
  theme_classic()+
  xlab('Trial') +
  ylab('Reward')+
  ggtitle('Simulated performance')+
  theme(legend.position='none')
Simulated performance of the Q-learning agents over multiple rounds. Each colored line is the learning curve of one agent in a single round, with the black line showing the mean across agents and rounds.

Demo 1: Tweaking individual learning parameters

Simulated behavior is sensitive to the choice of model parameters. Under the following link, you can access an interactive Shiny app to explore how different model parameters influence the simulated learning curves.

Demo 1: Tweaking individual learning parameters

Likelihood function

Beyond only simulating data, we also want to use computational models to describe experimental data collected from subjects. Comparing which type of model provides a better description allows us to test different hypotheses, while the estimated parameters of the winning model provide an interpretable characterization of behavior.

To fit a model to data, we first need to define a likelihood function:

\[ \begin{equation} P(D|\theta) \end{equation} \] The likelihood function describes the likelihood of the observed data \(D\) (i.e., actions made by a subject) given a set of parameters \(\theta\) (e.g., \(\alpha\) and \(\beta\)).

Since we are hardly ever modeling a single behavioral data point, but rather a sequence of observations (e.g., over multiple trials of learning and multiple rounds of a task), we need to describe the joint likelihood over all observations under consideration \(P(D|\theta) = \prod_t P(d_t|\theta)\), which is the product of the likelihood of each observation \(P(d_t|\theta)\). This becomes much easier using logarithms, since we can replace multiplication with summation in log-space and simply sum over the log-likelihoods:

\[ \begin{equation} \log P(D|\theta) = \sum_t \log P(d_t|\theta) \end{equation} \] Note that natural logs \(\ln\) are most commonly used rather than base 10 logarithms \(\log_{10}\).

The (natural) log of any probability will be negative (since probabilities are always less than or equal to 1). Thus, it’s more convenient to express the fit of a model in terms of the negative log-likelihood (nLL), by simply inverting the sign of the previous equation:

\[ \begin{equation} nLL= -\sum_t \log P(d_t|\theta) \end{equation} \] The nLL will always be a value greater than zero, with smaller values indicating a better fit. Thus, nLL expresses the amount of error or loss in quantifying how well a model (given a set of parameters) fits the data, and is sometimes also called “log loss”.
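As a quick numeric illustration, suppose a model assigned probabilities of .5, .2, and .8 to three observed choices; the nLL is simply the negated sum of their logs:

choiceProbs <- c(.5, .2, .8) #probability the model assigned to each observed choice
nLL <- -sum(log(choiceProbs)) #identical to -log(prod(choiceProbs)); here approximately 2.53
nLL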

In practice, we can define a likelihood function (using very similar code to our model simulations) that takes a set of parameters and participant data, and then returns the nLL. The likelihood function iteratively loops through the trials of the task and predicts the next action based on the current policy \(\pi_t\). We then take the log probability of the actual action that was selected, and use that to update our running tally of the nLL. Remember that we also need to use the actual action and reward observations to update the value function of the agent in order for the model predictions to evolve following the learning dynamics of our model.

Lastly, we wrap the likelihood function in an optimization function in order to find the set of parameters that minimizes the nLL. By minimizing the loss of the model, we arrive at an estimate of the most likely set of parameters known as the Maximum Likelihood Estimate:

likelihood <- function(params, data, Q0=0){ #We assume that prior value estimates Q0 are fixed to 0 and not estimated as part of the model
  names(params) <- c('alpha', 'beta') #name parameter vec
  nLL <- 0 #Initialize negative log likelihood
  rounds <- max(data$round)
  trials <- max(data$trial)
  for (r in 1:rounds){ #loop through rounds
    Qvec <- rep(Q0,k) #reset Q-values each new round
    for (t in 1:trials){ #loop through trials
      p <- softmax(params['beta'], Qvec) #compute softmax policy
      trueAction <- subset(data, chosen==1 & trial==t & round == r)$action
      negativeloglikelihood <- -log(p[trueAction]) #compute negative log likelihood
      nLL <- nLL + negativeloglikelihood #update running count
      Qvec[trueAction] <- Qvec[trueAction] + params['alpha']*(subset(data, chosen==1 & trial==t & round == r)$reward - Qvec[trueAction]) #update q-values
    }
  }
  return(nLL)
}

#Now let's optimize the parameters
init <- c(1,1) #initial guesses
lower <- c(0,-Inf) #lower and upper limits. We use very liberal bounds here, but you may want to set stricter bounds for better results
upper <- c(1,Inf)

MLE <- optim(par=init, fn = likelihood,lower = lower, upper=upper, method = 'L-BFGS-B', data = subset(simDF, chosen==1 & agent==1) )

Using the data we simulated earlier with \(\alpha\) = 0.9 and \(\beta\) = 1.0, we arrive at MLE estimates of \(\hat{\alpha}\) = 0.95 and \(\hat{\beta}\) = 1.10. This MLE corresponds to a negative log-likelihood of 109.5659023, quantifying how good of a fit it provides.

Loss landscape

We can also compute nLLs across a range of parameter combinations in order to visualize the loss landscape. The plot below illustrates how some regions of the parameter space produce similar fits to the data.

#Create a grid of plausible parameter combinations
paramCombs <- expand.grid(alpha = seq(0,1, length.out=20), beta = seq(0,20, length.out=20)) 

#Now compute nLLs for each grid combination
nLLs <- sapply(1:nrow(paramCombs), FUN=function(i) likelihood(as.numeric(paramCombs[i,]), subset(simDF, chosen==1 & agent==1))) #This can be a bit slow
paramCombs$nLL <- nLLs #add to dataframe

bestFit <- paramCombs[paramCombs$nLL ==  min(nLLs),] #best fitting combination 

#plot data
ggplot(paramCombs)+
  geom_tile(aes(x = alpha, y = beta, fill = nLL)) +
  geom_contour(aes(x = alpha, y = beta, z = nLL), color = 'white')+
  geom_point(data=data.frame(alpha=alpha, beta=beta, type = 'true'), aes(alpha,beta, shape = type), size = 4, color = 'red')+ # true value
  geom_point(data=data.frame(alpha=bestFit$alpha, beta=bestFit$beta, type = 'MLE'), mapping=aes(alpha,beta, shape = type), size = 4, color = 'red')+ # best fitting value
  scale_fill_viridis_c('nLL', direction = -1)+
  labs(title = "Loss landscape",
       x = expression(alpha),
       y = expression(beta))+
  scale_shape_manual(values= c(4, 0), name = '')+
  theme_classic()+
  scale_x_continuous(expand = c(0, 0))+
  scale_y_continuous(expand = c(0, 0))
Loss landscape over different parameter combinations. The MLE does not always provide a close match to the true generating parameters, and the white contour lines show areas of the parameter space providing similar fits.

Learning from social information

Having described how to model a simple individual learner, let us finally describe social learning. We don’t attempt to provide an exhaustive list of potential social learning mechanisms (Heyes, 2016; see Hoppitt & Laland, 2013; Wu, Vélez, et al., 2022 for more details), but simply highlight a few models that have recently been applied to human social learning.

Imitating actions

Arguably, one of the simplest social learning strategies is to copy the actions of others. We can define a simple frequency-dependent copying (FDC) policy (McElreath et al., 2005; Toyokawa et al., 2019) that selects actions proportional to the number of other agents observed performing the same action:

\[ \begin{equation} \pi_{FDC}(a) \propto f(a)^\theta \end{equation} \] where \(f(a)\) is the number of other agents observed choosing action \(a\) and \(\theta\) is a conformity parameter. Higher values of \(\theta\) correspond to a stronger bias towards copying the majority. A minimal sketch of this policy is shown below.
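Below is a minimal sketch of the FDC policy as a counterpart to our softmax() function, taking a vector of socially observed actions instead of Q-values (the small constant avoids assigning zero probability to options that were never observed, mirroring the simulation code further down):

fdc_policy <- function(theta, socialObs, k){ #socialObs is a vector of actions (1:k) observed from the other agents
  socialFreq <- table(factor(socialObs, levels = 1:k)) + .0001 #frequency of each action + a small constant
  p <- socialFreq^theta #conformity weighting
  p <- p/sum(p) #normalize to sum to 1
  return(as.numeric(p))
}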

Demo 2 - Imitation

Under the following link, you can access an interactive Shiny app to explore different parameters of both individual learning agents and social imitation agents. Observe how the individual learning parameters influence the imitators, but not the other way around.

Demo 2: Imitation

Importantly, FDC (like other forms of imitation) has frequency-dependent fitness, meaning how well it performs depends on the proportion of social learners in the population. Typically, imitation is most effective when there are few other imitators in the population. But when most peers are also imitating others, we can observe maladaptive information cascades, where imitators are simply imitating other imitators. We can illustrate this pattern of behavior using our social bandit task, allowing us to reproduce a key theoretical result known as Rogers’ (1988) paradox:

#Simulation function
simSocialData <- function(nAgents, rounds=10, trials = 25, k = 10, alpha = .9, beta = 1, Q0 = 0, theta = 5){
  socialSimDF <- data.frame() #Initialize dataframe
  for (socLearners in c(seq(1,nAgents-1,1))){ #loop through different ratios of social learners
    indLearners <- nAgents - socLearners #number of individual learners
    group <- c(rep('Ind', indLearners), rep('Soc', socLearners)) #group makeup
    for (r in 1:rounds){ #loop through rounds
      set.seed(nrow(socialSimDF)) #set the seed for each new game
      Qmat <- matrix(Q0,  nrow = nAgents, ncol = k) #reset Q-values; this variable is now a matrix with one row per agent (unused for social learners)
      socialObs <- rep(NA, nAgents) #social observations
      for (t in 1:trials){ #loop through trials
        for (agent in 1:nAgents){ #loop through agents
          if (group[agent] == 'Ind'){
            #Individual learners
            p <- softmax(beta, Qmat[agent,]) #compute softmax policy; we assume for now that all individual agents have the same parameter combo as before
            action <- sample(1:k,size = 1, prob=p) #sample action
            reward <- banditGenerator(action) #generate reward
            Qmat[agent,action] <- Qmat[agent,action] + alpha*(reward - Qmat[agent,action]) #update q-values
            chosen <- rep(0, k) #create an index for the chosen option
            chosen[action]<- 1 #1 = chosen, 0 not
            socialObs[agent] <- action
            #save data
            trialDF <- data.frame(trial = t, agent = agent, type = group[agent], ratio = socLearners, round = r, reward = reward)
            socialSimDF <- rbind(socialSimDF,trialDF)
          }else if (group[agent] == 'Soc'){ 
            #social learning agent
            if (t==1){ #first trial is random, since there are no social observations
              p = rep(1/k, k) #uniform probabilities
            }else{
              socialFreq <- table(factor(socialObs[-agent], levels = 1:k)) + .0001 #Frequency of each action + a very small number to avoid divide by zero
              p <- socialFreq^theta #compute probabilities
              p <- p/sum(p) 
            }
            action <- sample(1:k,size = 1, prob=p) #sample action according to policy
            reward <- banditGenerator(action) #generate reward
            chosen <- rep(0, k) #create an index for the chosen option
            chosen[action]<- 1 #1 = chosen, 0 not
            socialObs[agent] <- action
            #save data
            trialDF <- data.frame(trial = t, agent = agent, type = group[agent], ratio = socLearners, round = r, reward = reward)
            socialSimDF <- rbind(socialSimDF,trialDF)
          }
        }
      }
    }
  }
  return(socialSimDF)
}

#socialsimDF <- simSocialData(nAgents = 10, rounds = 100, trials = 25) #commented out because this is very slow. Load it from precomputed data instead
socialsimDF <- readRDS('data/socialSimsTutorial2.Rds')

perfDF <- socialsimDF  %>% group_by(type,  ratio) %>% summarize(performance = mean(reward))
popDF <- socialsimDF  %>% group_by(ratio) %>% summarize(performance = mean(reward))
popDF$type = 'Pop'

combinedDF <- rbind(perfDF,popDF )
combinedDF$type <- factor(combinedDF$type, levels = c('Ind', 'Soc', 'Pop'))
#Plot results
ggplot(subset(combinedDF, ratio>0 & ratio<8),  aes(x = ratio/8, y = performance, color = type))+
  geom_line(aes(linetype=type))+
  theme_classic()+
  labs( x =  "% Social Learners", y = "Avg. Performance")+
  scale_color_manual(values = c("#d7191c","#2b83ba",  "black"), name = "")+
  scale_linetype_manual(values = c("dashed", "dotted", "solid"), name = "")+
  ylab('Average Rewards')+
  scale_x_continuous(labels=scales::percent)+
  ggtitle("Rogers' Paradox")+
  theme(legend.position = c(0.1,0.1), legend.justification = c(0,0), legend.background=element_blank())
Rogers’ paradox is primarily concerned with the evolution of social learning, but it also illustrates how social learning via imitation has frequency-dependent fitness. Social learning (Soc) can outperform individual learning (Ind) when the proportion of social learners is low. But as the proportion of social learners increases, the fitness of the social learners and of the entire population collapses.

Combining imitation and value-learning

While the pure imitation agent we described previously can sometimes outperform individual learning, it lacks any learning of its own. You may have noticed from the code block above that the FDC agent maintains no beliefs about rewards in the environment. However, we can define a hybrid decision-biasing (DB) agent (Toyokawa et al., 2019) that combines Q-learning (or any other RL model) with FDC using a mixture policy:

\[ \begin{equation} \pi_{DB} = (1-\gamma)\pi_{Ind} + \gamma\pi_{FDC} \end{equation} \] Here, we use \(\pi_{Ind}\) to refer to the softmax policy of the Q-learning agent to avoid confusion, and use \(\gamma\) as a mixture weight (bounded between 0 and 1) to describe how much social imitation to perform relative to individual learning.

#Simulator for decision-biasing agent
dbSimulator <- function(gammaVec, nAgents = 4, trials = 25, rounds = 1, alpha = .9, beta = 1, Q0 = 0){
  dbDF <- data.frame() #initialize dataframe
  for (r in 1:rounds){ #loop through rounds
     Qmat <- matrix(Q0,  nrow = nAgents, ncol = k) #Initialize Q-values in a matrix with one row per agent
     socialObs <- rep(NA, nAgents) #social observations
      for (t in 1:trials){ #loop through trials
          for (agent in 1:nAgents){ #loop through agents
              p_ind <- softmax(beta, Qmat[agent,]) #compute individual learning policy; we assume for now that all individual agents have the same parameters as before
              if (t==1){ #first trial has no social
                p_fdc = rep(1/k, k) #uniform probabilities
              }else{
                socialFreq <- table(factor(socialObs[-agent], levels = 1:k)) + .0001 #Frequency of each socially observed action + a very small number to avoid divide by zero
                p_fdc <- socialFreq/sum(socialFreq) #compute probabilities
              }
              p_db <- ((1-gammaVec[agent]) * p_ind) + (gammaVec[agent] * p_fdc) #mixture policy
              action <- sample(1:k,size = 1, prob=p_db) #sample action
              reward <- banditGenerator(action) #generate reward
              Qmat[agent,action] <- Qmat[agent,action] + alpha*(reward - Qmat[agent,action]) #update q-values
              chosen <- rep(0, k) #create an index for the chosen option
              chosen[action]<- 1 #1 = chosen, 0 not
              socialObs[agent] <- action #update social observation vector
              #save data
              trialDF <- data.frame(trial = t, round = r, agent = agent,  reward = reward)
              dbDF <- rbind(dbDF,trialDF)
              }
       }
  }
 
  return(dbDF)
}

gammaVec <- runif(4) #sample mixture weights uniformly between 0 and 1
dbDF <- dbSimulator(gammaVec) #run simulation


ggplot(dbDF, aes(x = trial, y = reward, color = factor(agent)))+
  geom_line()+
  theme_classic()+
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73", "#F0E442"), name = 'Agent')+
  theme(legend.position=c(1,0.1), legend.justification = c(1,0))+
  labs(x = 'Trial', y = 'Reward', title = 'Decision-biasing agents')

Demo 3 - Decision-Biasing

You can use the following link to test out different parameter combinations for the decision-biasing agent. Which values of \(\gamma\) typically produce the best results?

Demo 3: Decision-Biasing

Value shaping

Instead of only using social information to influence one’s behavioral policy, we can also describe social learning strategies where social information is integrated higher up in the decision-making hierarchy. Social information can also shape the value representations we form, or the beliefs we hold about the structure of the environment and which outcomes are intrinsically rewarding (Wu, Vélez, et al., 2022).

Adapted from Wu, Vélez, et al. (2022)

Here, we define a very simple form of social value learning known as value-shaping (VS), where we can simply add a value bonus proportional to the popularity of an option (Biele et al., 2011; Najar et al., 2020):

\[ \begin{equation} Q(a) \leftarrow Q(a) + \eta f(a) \end{equation} \] As before, \(f(a)\) is the frequency of other agents that select action \(a\) and \(\eta\) is a free parameter representing how large of a social bonus to add.

#Simulate data
vsSimulator <- function(etaVec, nAgents = 4, trials = 25, rounds = 1, alpha = .9, beta = 1, Q0 = 0){
  vsDF <- data.frame() #initialize dataframe
  for (r in 1:rounds){ #loop through rounds
    Qmat <- matrix(Q0,  nrow = nAgents, ncol = k) #Initialize Q-values in a matrix with one row per agent
    socialObs <- rep(NA, nAgents) #social observations
    for (t in 1:trials){ #loop through trials
        for (agent in 1:nAgents){ #loop through agents
            socialFreq <- table(factor(socialObs[-agent], levels = 1:k)) + .0001 #Frequency of each socially observed action + a very small number
            Qmat[agent,] <- Qmat[agent,] + (etaVec[agent]*socialFreq) #add value bonus
            p <- softmax(beta, Qmat[agent,]) #compute policy
            action <- sample(1:k,size = 1, prob=p) #sample action
            reward <- banditGenerator(action) #generate reward
            Qmat[agent,action] <- Qmat[agent,action] + alpha*(reward - Qmat[agent,action]) #update q-values
            chosen <- rep(0, k) #create an index for the chosen option
            chosen[action]<- 1 #1 = chosen, 0 not
            socialObs[agent] <- action #update social observation vector
            #save data
            trialDF <- data.frame(trial = t, round = r,agent = agent,  reward = reward, action = action, eta = etaVec[agent])
            vsDF <- rbind(vsDF,trialDF)
        }
      } 
    }
  return(vsDF)
}

etaVec <- rexp(4) #value bonus parameter; let's sample this from an exponential distribution
vsDF <- vsSimulator(etaVec) #run simulation

#We can also run this for several rounds of the task, which will be necessary for model fitting. The commented out lines of code will be used to simulate data for Tutorial 3
#vsDF <- vsSimulator(etaVec, rounds = 10) #run simulation
#saveRDS(vsDF,'data/simChoicesVSagent.Rds') #save data

ggplot(vsDF, aes(x = trial, y = reward, color = factor(agent)))+
  geom_line()+
  theme_classic()+
  scale_color_manual(values = c("#E69F00", "#56B4E9", "#009E73", "#F0E442"), name = 'Agent')+
  theme(legend.position=c(1,0.1), legend.justification = c(1,0))+
  labs(x = 'Trial', y = 'Reward', title = 'Value-shaping agents')

Demo 4 - Heterogeneous groups

For the last demo of this tutorial, you can use the Shiny app linked below to simulate heterogeneous groups of individual learners, decision-biasing agents, and value-shaping agents.

Demo 4: Heterogeneous groups

Exercise 2.3. The social learning models we have described treat all other individuals as the same. These are known as conformity-biased strategies, since they only care about the frequency of behaviors in the population but don’t distinguish who is performing each action. Yet social learning is often selective (Wu, Deffner, et al., 2025), for instance, preferentially learning from successful or prestigious individuals. How could you modify decision-biasing and value shaping to be selectively influenced by some individuals? One possible starting point is sketched below.
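One possible starting point (a sketch of just one of many solutions) is to weight each observed action by the demonstrator's observed payoff rather than counting all demonstrators equally. Here socialObs holds the observed actions and socialRewards the corresponding observed payoffs; both are hypothetical inputs that your own task would need to provide:

success_biased_policy <- function(socialObs, socialRewards, k, theta = 1){
  w <- exp(socialRewards) #transform payoffs into positive weights (other monotonic transforms are possible)
  weightedFreq <- sapply(1:k, function(a) sum(w[socialObs == a])) + .0001 #payoff-weighted frequency of each action
  p <- weightedFreq^theta #conformity weighting, as in FDC
  return(p/sum(p))
}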

Scaling up to more complex problems

While the learning mechanisms we have presented are very simple, similar principles can be used in arbitrarily complex problems. Here, we consider a few examples across the three dimensions of complexity we introduced in Tutorial 1.

Dimensions of complexity

Environmental complexity: Structured rewards

So far, the learning models we have introduced assume that each option in the environment has independent rewards. This means that everything our farmer has learned about growing wheat and potatoes provides no information about new crops like barley or corn. This corresponds to a tabular RL approach.

In contrast, many environments have rich underlying structure. When foraging for mushrooms, you will often find them clustered in similar locations. When trying out new cooking recipes, similar combinations of ingredients will produce similar flavors. And friends in similar social circles often have similar tastes in music or cinema. This kind of structure can be captured by learning models that use function approximation to learn a value function across the choice landscape, using principles of interpolation and extrapolation to generalize from a small handful of observations to a potentially infinite range of possibilities.

One framework that has been widely used to model how humans generalize in these settings is Gaussian process (GP) regression (Wu, Meder, et al., 2025), which provides a Bayesian framework that can be applied to spatial structure (Wu et al., 2018b), graph structures (Wu et al., 2021), or any arbitrary set of abstract features (Wu et al., 2020). Below, we provide a very simple demonstration of the tabular Q-learning model in comparison to function approximation using a GP, where the orange diamonds represent the same set of reward observations given to each model.

# Generate some synthetic data along an input space of 1 - 25 ---
inputs <- 1:25
simData <- data.frame(
  x = c(1, 2, 3, 6, 18, 10, 11, 15, 16),
  y = c(-38, -34, -32, -18, 14, 16, 24, 33, 29)
)

# Q-learning: Tabular RL  
alpha <- 0.7 # learning rate
Q0 <- 0 # prior/initial value 
Q <- rep(Q0, length(inputs))


for (i in seq_len(nrow(simData))) {
  a <- simData$x[i] #option that was sampled
  r <- simData$y[i] #observed reward
  Q[a] <- Q[a] + alpha * (r - Q[a]) #tabular Q-learning update
}

#Gaussian process regression: Function approximation

# Radial Basis Function (squared exponential) kernel
rbf <- function(A, B, lambda = 1, signalVariance = 1) { # A, B can be numeric vectors or 1D matrices, lambda is the length scale (how smooth), and signalVariance is the range of expected reward values
  A <- as.matrix(A); B <- as.matrix(B)
  if (ncol(A) != ncol(B)) stop("A and B must have same number of columns.")
  sq <- matrix(0, nrow(A), nrow(B))
  for (i in seq_len(ncol(A))) { # pairwise squared distance along each dimension
    sq <- sq + (outer(A[, i], B[, i], "-") / lambda)^2
  }
  signalVariance * exp(-0.5 * sq)
}

# GP regression with mean function mu_0
# kfun is a function(A, B) -> covariance matrix
gpr <- function(X.test, X, Y, kfun, mu_0 = 0, noise = 1e-6) {
  X <- as.matrix(X); Xstar <- as.matrix(X.test)
  y <- as.numeric(Y - mu_0)

  #GP math, see Wu et al., 2018 for details
  K <- kfun(X, X) + diag(noise, nrow(X))
  Kxs <- kfun(X, Xstar)
  Kss <- kfun(Xstar, Xstar)
  Kinv <- chol2inv(chol(K))
  mu <- t(Kxs) %*% Kinv %*% y
  covS <- Kss - t(Kxs) %*% Kinv %*% Kxs

  # Numerical safety for any tiny negatives on the diagonal
  var <- pmax(diag(covS), 0)
  data.frame(mu = as.numeric(mu + mu_0), var = as.numeric(var))
}

#prepare plots


Qvalues <- data.frame(input = inputs, mu = Q)

# --- Plot 1: Q-learning reward predictions ---
pQ <- ggplot(Qvalues, aes(x = input, y = mu)) +
  geom_col(fill = "#56B4E9", width = 0.9) +
  geom_point(data = simData, aes(x = x, y = y),
  fill = "#E69F00", shape = 23, size = 2) +
  theme_classic() +
  coord_cartesian(xlim = c(0, 26), ylim = c(-50, 50), expand = FALSE) +
  xlab("Option") + ylab("Q-learning Reward Prediction") + ggtitle("Tabular RL (Q-learning)")


# GP posterior
GPinputs <- seq(0, 25, by = 0.1) #here we can define the posterior over a denser input grid for visualization 
GPpost <- gpr( # Fit GP to (x, y) with an RBF kernel and prior mean 0
  X.test = GPinputs,
  X = simData$x,
  Y = simData$y,
  kfun = function(A, B) rbf(A, B, lambda = 2, signalVariance = 100), #RBF kernel
  mu_0 = 0, #prior mean
  noise = 1e-6
)
GPpost$input <- GPinputs


# --- Plot 2: GP reward predictions (pGP1) ---
pGP <- ggplot(GPpost, aes(x = input, y = mu)) +
  geom_line(color = "#009E73", linewidth = 1) +
  geom_ribbon(aes(ymin = mu - 1.96 * sqrt(var), ymax = mu + 1.96 * sqrt(var)),
              alpha = 0.3, fill = "#009E73") +
  geom_point(data = simData, aes(x = x, y = y),
             fill = "#E69F00", shape = 23, size = 2) +
  theme_classic() +
  coord_cartesian(xlim = c(0, 26), ylim = c(-50, 50), expand = FALSE) +
  xlab("Option") + ylab("GP Reward Prediction") + ggtitle('Function Approximation (GP)')

p <- plot_grid(pQ, pGP)
p

In social settings, generalization is just as important. Many imitation-based approaches to social learning assume that social information can simply be copied verbatim. However, this is rarely the case when there are individual differences in preferences, goals, or skills. If you are a vegetarian but copy the food choices of a carnivore, you will find yourself hungry. If you follow a random person on the subway, you’re very unlikely to arrive at your destination. And if you try to imitate the moves of the professional snowboarder, you might even find yourself in the hospital. Thus, social information often needs to be taken with a “grain of salt”.

You can check out Witt et al. (2024) to find out how our social generalization model extends the GP framework to account for social information from people with different reward functions, and how it yields better normative performance and provides better descriptive accounts of human behavior than decision-biasing (DB) or value shaping (VS).

Social generalization from Witt et al. (2024)

Social Complexity: Network structures

Previously, we assumed that all agents have access to information about all other agents. However, social interactions are often structured by the environment, the history of previous interactions, or a variety of other forms of social structure. Consider how network structures determine which people you have access to for social information, informing you about news, scientific ideas, or the latest fashions.

Here, we consider how the nature of social network structures shapes how information flows through groups of individuals, and how this influences their behavior. Specifically, one key insight is that the connectivity of a network influences the exploration-exploitation trade-off at the group level. More locally connected networks restrict the flow of information, allowing a greater diversity of solutions to co-exist in the group and leading to more exploration. These types of networks typically perform better on more complex, high-dimensional problems (Lazer & Friedman, 2007), such as in cultural evolution settings (Derex & Boyd, 2016). In contrast, more fully connected networks enable a faster flow of information, leading the group to converge more quickly on exploiting what is thought to be the best solution. These networks often perform best in relatively smooth environments without too many local optima (Mason & Watts, 2012). More generally, the ideal network structure is shaped by both the demands of the environment (Barkoczi et al., 2016) and the nature of the task (Xi & Toyokawa, 2025).

Network structures shape exploration-exploitation trade-off; from Barkoczi et al. (2016)

Here, we provide a very simple illustration of how different network structures (local vs. fully connected) perform on a 50-armed bandit task with a population of decision-biasing agents. In contrast to our previous simulations, here the agents only have access to social information from other agents they are explicitly connected to via the social network.

set.seed(2025)
# 1. Define bandit
k <- 50 #number of arms
meanVec <- rexp(k, 0.1) #Payoff means sampled from an exponential distribution
sigmaVec <- rep(1, k) #Payoff stdevs
T <- 100 #total number of trials
nAgents <- 8

banditGenerator <- function(a) {#a is an integer or a vector of integers, selecting one of the 1:k arms of the bandit
  payoff <- rnorm(n = length(a), mean = meanVec[a], sd = sigmaVec[a])
  return (payoff)
}

# 2. Define networks

# Fully connected network (no self-loops), scoped to a single group
net_full_group <- function(n) {
  A <- matrix(1L, n, n); diag(A) <- 0L; A
}

# Ring (local) network with even degree k (k/2 neighbors on each side)
net_ring_group <- function(n, k = 2) {
  stopifnot(k %% 2 == 0, k >= 2, k <= max(2, n - 1))
  A <- matrix(0L, n, n)
  half <- k / 2
  for (i in seq_len(n)) for (d in 1:half) {
    j1 <- ((i - 1 + d) %% n) + 1
    j2 <- ((i - 1 - d + n) %% n) + 1
    A[i, j1] <- 1L; A[i, j2] <- 1L
  }
  A
}


# 3. Policies & updates (Q-learning + softmax)

# Softmax: converts Q-values to action probabilities using inverse temperature beta.
# Higher beta -> greedier; lower beta -> more exploratory.
# Subtracting max(Q_row) improves numerical stability. Returns a proper probability vector.
softmax_probs <- function(Q_row, beta) {
  z <- (Q_row - max(Q_row)) * beta
  p <- exp(z)
  p <- p / sum(p)
  return(p)
}

# Q-learning update
# Q[a] <- Q[a] + alpha * (r - Q[a])
q_update <- function(Q_row, a, r, alpha) {
  Q_row[a] <- Q_row[a] + alpha * (r - Q_row[a])
  Q_row
}

# 4. Define simulation of decision-biasing agents
run_networked_gaussian_bandit <- function(
  n_arms      = length(meanVec),  # keep consistent with the bandit you defined
  n_groups    = 2,                # number of independent groups
  group_size  = 8,                # agents per group 
  T           = 100,              # time steps
  alpha       = 0.1,              # Q-learning rate
  beta        = 2.0,              # softmax inverse temperature
  gamma       = 0.5,              # weight on frequency-dependent copying (0..1)
  network     = c("ring", "full"),# topology within each group
  ring_k      = 2                 # ring degree (even)
) {
  stopifnot(n_arms == length(meanVec))  # make sure bandit and code agree
  network <- match.arg(network)

  # Build per-group adjacency (no edges across groups)
  A_list <- vector("list", n_groups)
  for (g in seq_len(n_groups)) {
    A_list[[g]] <- if (network == "full") {
      net_full_group(group_size)
    } else {
      net_ring_group(group_size, k = ring_k)
    }
  }

  # Q-values for each agent and arm (start at 0)
  Q <- array(0, dim = c(n_groups, group_size, n_arms))

  # Keep track of last-round choices to build neighbor frequencies
  last_choice <- matrix(NA_integer_, n_groups, group_size)

  # Population-level metrics
  frac_opt <- numeric(T)    # fraction of agents who choose the truly best arm
  mean_rew <- numeric(T)    # mean reward across all agents

  # The truly best arm under your Gaussian bandit
  opt_arm <- which.max(meanVec)

  for (t in seq_len(T)) {
    choices <- matrix(NA_integer_, n_groups, group_size)
    rewards <- matrix(0, n_groups, group_size)

    for (g in seq_len(n_groups)) {
      A <- A_list[[g]]

      for (i in seq_len(group_size)) {
        # --- Individual policy (softmax over own Q) ---
        p_ind <- softmax_probs(Q[g, i, ], beta)  # normalized

        # --- Frequency-dependent policy from neighbors' last choices ---
        nbr_idx <- which(A[i, ] == 1L)
        nbr_last <- last_choice[g, nbr_idx]
        nbr_last <- nbr_last[!is.na(nbr_last)]

        if (length(nbr_last) > 0) {
          counts <- tabulate(nbr_last, nbins = n_arms)
          if (sum(counts) > 0) {
            p_freq <- counts / sum(counts)
          } else {
            p_freq <- rep(0, n_arms)
          }
        } else {
          # No neighbor history yet (e.g., t = 1) → fall back to own policy
          p_freq <- rep(0, n_arms)
        }

        # --- Decision-biasing mixture ---
        p_mix <- (1 - gamma) * p_ind + gamma * p_freq
        if (sum(p_mix) <= 0) p_mix <- p_ind
        p_mix <- p_mix / sum(p_mix)

        # --- Act, observe reward, learn ---
        a <- sample.int(n_arms, size = 1, prob = p_mix)
        r <- banditGenerator(a)[1]              # uses meanVec/sigmaVec you defined
        Q[g, i, ] <- q_update(Q[g, i, ], a, r, alpha)

        choices[g, i] <- a
        rewards[g, i] <- r
      }
    }

    # Update memory for next round's frequency counts
    last_choice <- choices

    # Record summary metrics
    frac_opt[t] <- mean(choices == opt_arm)
    mean_rew[t] <- mean(rewards)
  }

  list(
    opt_arm  = opt_arm,
    frac_opt = frac_opt,
    mean_rew = mean_rew
  )
}

# 5. run simulations across conditions and save results
compare_topologies <- function(
  n_arms = length(meanVec), n_groups = 2, group_size = 8,
  T = 100, reps = 10,
  alpha = 0.1, beta = 2.0, gamma = 0.5,
  ring_k = 2
) {
  stopifnot(n_arms == length(meanVec))

  run_many <- function(which_network) {
    F <- M <- matrix(0, nrow = reps, ncol = T)
    for (r in seq_len(reps)) {
      # (If you want different bandits per repetition, redefine meanVec here.)
      sim <- run_networked_gaussian_bandit(
        n_arms = n_arms, n_groups = n_groups, group_size = group_size,
        T = T, alpha = alpha, beta = beta, gamma = gamma,
        network = which_network, ring_k = ring_k
      )
      F[r, ] <- sim$frac_opt
      M[r, ] <- sim$mean_rew
    }
    list(frac_opt = colMeans(F), mean_rew = colMeans(M))
  }

  out_ring <- run_many("ring")
  out_full <- run_many("full")

  time <- seq_len(T)
  df1 <- rbind(
    data.frame(time, value = out_ring$frac_opt, metric = "Fraction choosing optimal arm", network = sprintf("Local (k=%d)", ring_k)),
    data.frame(time, value = out_full$frac_opt, metric = "Fraction choosing optimal arm", network = "Fully connected")
  )
  df2 <- rbind(
    data.frame(time, value = out_ring$mean_rew, metric = "Mean reward (per round)", network = sprintf("Local (k=%d)", ring_k)),
    data.frame(time, value = out_full$mean_rew, metric = "Mean reward (per round)", network = "Fully connected")
  )
  rbind(df1, df2)
}


# 6. run the simulation
results_df <- compare_topologies(
  n_arms = length(meanVec),           # more arms than any group can cover alone
  n_groups = 2,          # several independent groups
  group_size = 8,        # each group is small (< n_arms)
  T = 100, reps = 10,
  alpha       = 0.1,              # Q-learning rate
  beta        = 2.0,              # softmax inverse temperature
  gamma       = 0.5,             # social learning mixture param
  ring_k = 2            # degree in the ring (even numbers)
)

# 7. plot results
p <- ggplot(results_df, aes(x = time, y = value, linetype = network)) +
  geom_line(linewidth = 1) +
  facet_wrap(~ metric, ncol = 2, scales = "free_y") +
  labs(x = "Time", y = NULL, linetype = NULL) +
  theme_classic(base_size = 12) +
  theme(
    legend.position = "right",
    strip.background = element_blank(),
    strip.text = element_text(face = "bold")
  )
p

Cognitive Complexity: Theory of Mind

So far, we have described some of the simplest social learning mechanisms. With frequency-dependent imitation or decision-biasing, an agent directly imitates socially observed behaviors. With value shaping, social information translates rather directly into a value bonus for the observed action, without needing a great deal of interpretation.

However, this neglects an important aspect of what makes human social learning so powerful, which is our ability to “unpack” observed actions into imputed mental states using Theory of Mind (ToM). We can describe ToM as a form of inference, where observed behaviors are used to infer a variety of different mental representations (red arrows).

Varieties of Theory of Mind Inference. Adapted from Wu, Vélez, et al. (2022)

Value inference

Actions can be used to infer another person’s values or goals, such as when we recognize that an individual sprinting across the street is attempting to catch a bus, or when we observe someone’s food choices at a buffet and realize they are quite likely to be a vegetarian. Performing these inferences allows us to use social information more flexibly and tailor it to our own personal circumstances. For instance, we might decide whether to also chase after the sprinting individual based on whether we think it’s the same bus we need. Or we might imitate some but not all of the food choices of another person with different dietary restrictions.

One formal framework for describing value inference is Inverse Reinforcement Learning (IRL; Sutton & Barto, 2018; Jara-Ettinger, 2019). Using Bayesian inference, we can compute a posterior distribution over the Values (e.g., Q-values) that gave rise to the observed Actions:

\[ P(V|A) \propto P(A|V)P(V) \] To perform this computationally, we need to integrate over all possible value functions \(P(V)\) and evaluate the likelihood at each possible value function \(P(A|V)\). This type of computation is clearly intractable for almost all problems, since there are an infinite number of possible value functions someone could have. Nevertheless, it provides a rational framework that has been used to uncover inductive biases that simplify the required computations. For instance, the principle of efficient action is a bias that assumes other people are acting in the most efficient manner (Gergely & Csibra, 2003). This allows us to constrain the likelihood to be a value-maximizing policy, although it still does not necessarily rescue the computations of IRL from intractability.
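To make this concrete, here is a toy sketch of value inference for a 2-armed bandit, placing a discrete grid prior over the possible difference in Q-values between the two options and using our softmax() function as the likelihood of each observed action; the observed actions and the assumed inverse temperature are made up for illustration:

valueGrid <- seq(-5, 5, length.out = 100) #candidate values of Q(1) - Q(2)
posterior <- rep(1/length(valueGrid), length(valueGrid)) #start from a uniform prior over the grid
observedActions <- c(1, 1, 2, 1, 1) #hypothetical sequence of actions observed from another agent
betaObs <- 1 #assumed inverse temperature of the observed agent

for (a in observedActions){
  lik <- sapply(valueGrid, function(diff) softmax(betaObs, c(diff, 0))[a]) #P(A|V) under a softmax policy
  posterior <- posterior * lik #Bayes' rule (unnormalized)
  posterior <- posterior / sum(posterior) #normalize
}
valueGrid[which.max(posterior)] #maximum a posteriori estimate of the value difference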

Belief inference

Beyond inferring values, we can also use the same IRL framework to decompose actions into even more primitive building blocks of representation. Social information can be used to make inferences about someone’s Beliefs about the causal structure of the world (“she thinks local markets have fresher produce”), or about their intrinsic Reward specification (“she loves the taste of broccoli”). We can use a similar formalism as above to describe this form of inference:

\[ P(B,R|A) \propto P(A|B,R)P(B,R) \] The more we use ToM to decompose actions into their primitives, the more flexibility it affords. By inferring values, beliefs, and intrinsic reward representations from observed actions, we can better integrate social information into our own individual circumstances, goals, and preferences. However, this additional flexibility comes with substantial computational costs, and it is an exciting line of research to better understand how humans navigate this trade-off.

Combining multiple forms of complexity

The same principles can also be scaled up to problems that combine multiple forms of complexity at once.

Collective foraging task. (a) Birds-eye perspective of a single round. (b) Screenshot from the experiment. (c) Automated transcription of each player’s field of view (FOV) used to model social visibility. (d) Task design. (e) Different reward structures, where only smooth environments afford benefits for social imitation. (f) Agent-based simulations show a benefit for success-biased social learning over asocial learning in smooth, but not random environments, whereas unbiased social learning performs poorly in both. Adapted from Wu, Deffner, et al. (2025)

For instance, Wu, Deffner, et al. (2025) uses an immersive collective foraging task implemented in the Minecraft game engine, where a realistic field of view creates attentional trade-offs for social learning (looking at other players for imitation comes at the cost of slower individual foraging). The task provides rich spatial and visual dynamics, which can nevertheless be incorporated into a computational model by sequentially predicting each of the \(k\) blocks participants destroy:

\[ P(\text{Choice}_{k+1}) \propto \exp(\mathbf{f}_k\cdot \mathbf{w}) \] This is modeled as a softmax choice function over a linear combination of features \(\mathbf{f}\) times weights \(\mathbf{w}\). The features capture hypotheses about individual and social learning, which incorporate rich spatial and visual dynamics of the task, while model weights are hierarchically estimated using Bayesian methods in STAN (see Tutorial 4 from COSMOS 2023) in order to control for individual and group-level variability.

Computational modeling of choices in a dynamic and immersive environment.

Evolutionary simulations

Another important computational tool for studying social learning is the use of evolutionary simulations. Since social learning has frequency-dependent fitness (Rogers, 1988), the best strategy to use depends on what others in the population are deploying. Thus, in order to determine the best normative strategy, it is often helpful to use evolutionary simulations (Tump et al., 2019; Witt et al., 2024):

  1. Initialize a population of agents
  2. Simulate performance on the task
  3. Select agents to seed the next generation (e.g., based on performance)
  4. Add mutation (change agent type, modify parameters)
  5. Repeat until convergence

Below, we provide some code to run an evolutionary simulation over 250 generations, starting from an even split of individual and social learners to see which strategy comes to dominate. The results can often be quite noisy, and usually need to be aggregated over multiple replications for robustness.

set.seed(1234)
#Define simulation parameters
popSize <- 100 #population size of each generation
groupSize <- 10 #number of agents playing each game
indPop <- 50 #start with an even split of individual learners...
socPop <- 50 #...and social learners, to see which strategy comes to dominate
initial <- cbind( #initial agent parameters, stored in a matrix
    c(rep(0, indPop), rep(1, socPop)), #0 = ind agent, 1 = soc agent
    c(runif(indPop), rep(NA, socPop)), #alpha (only for ind agents), sampled from a uniform distribution
    c(rexp(indPop), rep(NA, socPop)), #beta (only for ind agents), sampled from an exponential distribution
    c(rep(NA,indPop), rexp(socPop)) #theta (only for soc agents), sampled from an exponential distribution 
  )
Tsteps <- 25 #Trials in each game
k<- 10 #arms of the bandit
meanVec  <- seq(-1, 1, length.out = k)
sigmaVec <- rep(2, k)
banditGenerator <- function(a) {#a is an integer or a vector of integers, selecting one of the 1:k arms of the bandit
  payoff <- rnorm(n = length(a), mean = meanVec[a], sd = sigmaVec[a])
  return (payoff)
}
typeMutation <- .05 #rate of type mutation: ind <--> soc
paramMutation <- .1 #rate of parameter mutation
paramMutationVariance <- .1 #How much parameters mutate

#Evolution function
doTheEvolution <- function(initialPop, generations = 250){
  outputList <- list()
  outputList[[1]] <- initialPop

  for (gen in 2:generations){
    pop <- outputList[[gen - 1]]             # current population (matrix)
    winners <- rep(0L, popSize)

    # Indices of agent types
    indAgents <- which(pop[,1] == 0)
    socAgents <- which(pop[,1] == 1)

    # Assign agents to games (rows = games, cols = positions in the game)
    agentAssignments <- matrix(NA, popSize, groupSize)
    probs <- rep(1/popSize, popSize)
    for (iter in 1:popSize){
      ids <- sample(1:popSize, groupSize, prob = probs, replace = FALSE)
      probs[ids] <- probs[ids] / groupSize
      agentAssignments[iter,] <- ids
    }

    # Play each game
    for (i in 1:popSize){
      Qmat <- matrix(0, nrow = groupSize, ncol = k)   # Q-values for agents in THIS game
      socialObs <- rep(NA_integer_, groupSize)        # last actions (for social frequency)
      gameScore <- rep(0, groupSize)

      for (t in 1:Tsteps){
        for (j in 1:groupSize){
          agent <- agentAssignments[i, j]

          if (agent %in% indAgents){
            # --- individual learner: softmax over own Q row ---
            alpha <- pop[agent, 2]
            beta  <- pop[agent, 3]
            prob_ind <- softmax_probs(Q_row = Qmat[j, ], beta = beta)
            action <- sample.int(k, size = 1, prob = prob_ind)
            reward <- banditGenerator(action)

            gameScore[j]    <- gameScore[j] + reward
            Qmat[j, action] <- Qmat[j, action] + alpha * (reward - Qmat[j, action])
            socialObs[j]    <- action

          } else if (agent %in% socAgents){
            # --- social learner: frequency-dependent copying ---
            theta <- pop[agent, 4]

            if (t == 1){
              prob_soc <- rep(1/k, k)  # no observations yet → uniform
            } else {
              # counts of neighbors' actions (exclude self), tiny epsilon to avoid zeros
              counts <- table(factor(socialObs[-j], levels = 1:k))
              counts <- as.numeric(counts) + 1e-6
              prob_soc <- counts^theta
              prob_soc <- prob_soc / sum(prob_soc)
            }

            action <- sample.int(k, size = 1, prob = prob_soc)
            reward <- banditGenerator(action)

            gameScore[j]    <- gameScore[j] + reward
            # Social learners don't update Qmat in this model (pure imitation); add an update rule here if they should also learn from rewards.
            socialObs[j]    <- action
          }
        }
      }

      # Winner of game i reseeds: pick the agent with highest gameScore
      winners[i] <- agentAssignments[i, which.is.max(gameScore)]
    }

    # New generation
    newPop <- pop[winners, , drop = FALSE]

    # Mutations
    mutateType   <- runif(popSize) < typeMutation
    mutateParams <- runif(popSize) < paramMutation

    for (player in 1:popSize){
      if (mutateType[player]){
        if (newPop[player,1] == 0){            # Ind → Soc
          newPop[player, ] <- c(1, NA, NA, rexp(1, rate = 1))
        } else {                                # Soc → Ind
          newPop[player, ] <- c(0, runif(1), rexp(1), NA)
        }
      }
      if (mutateParams[player]){
        if (newPop[player,1] == 0){
          # tweak individual params (alpha in (0,1), beta > 0)
          newPop[player, c(2,3)] <- newPop[player, c(2,3)] + rnorm(2, 0, sqrt(paramMutationVariance))
          if (is.na(newPop[player,2]) || newPop[player,2] <= 0 || newPop[player,2] >= 1) newPop[player,2] <- runif(1)
          if (is.na(newPop[player,3]) || newPop[player,3] <= 0) newPop[player,3] <- rexp(1)
        } else {
          # tweak social param theta > 0
          newPop[player, 4] <- newPop[player, 4] + rnorm(1, 0, sqrt(paramMutationVariance))
          if (is.na(newPop[player,4]) || newPop[player,4] <= 0) newPop[player,4] <- rexp(1)
        }
      }
    }

    outputList[[gen]] <- newPop
  }
  return(outputList)
}

#Run simulations
outputList <- doTheEvolution(initial, generations = 250)

#Convert output to dataframe
evoSimDF <- rbindlist(lapply(1:length(outputList), function(i) data.frame(outputList[[i]], generation = rep(i, popSize))))
colnames(evoSimDF) <- c('agentType', 'alpha', 'beta', 'theta', 'generation')
evoSimDF$agentType <- factor(evoSimDF$agentType, labels = c('Ind', 'Soc'))

#What ratio of agents evolve?
agentRatio <- evoSimDF %>% group_by(generation) %>% summarize(Ind = mean(agentType=='Ind'), Soc = mean(agentType == 'Soc'))
agentRatio <- agentRatio %>% pivot_longer(cols = -generation, names_to = 'AgentType', values_to = 'Prop')

p1 <- ggplot(agentRatio,  aes(x = generation, y = Prop, color = AgentType))+
  geom_line()+
  theme_classic()+
  labs(x ="Generation", y = "Proportion")+
  scale_color_manual(values = c("#d7191c","#2b83ba"), name = "")+
  theme(legend.position = c(1,1), legend.justification = c(1,1), legend.background=element_blank())

#How do parameters evolve?
paramDF <- evoSimDF %>% group_by(generation) %>% summarize(alpha = mean(alpha, na.rm = TRUE), beta = mean(beta, na.rm = TRUE), theta = mean(theta, na.rm = TRUE))
paramDF <- paramDF %>% pivot_longer(cols = -generation, names_to = 'param')

p2 <- ggplot(paramDF,  aes(x = generation, y = value, color = param))+
  geom_line()+
  theme_classic()+
  labs(x ="Generation", y = "Parameter Value")+
  scale_color_manual(values = c("#E69F00", "#009E73", "#56B4E9"), name = "")+
  theme(legend.position = c(0.0,1), legend.justification = c(0,1), legend.background=element_blank())

plot_grid(p1,p2, labels = 'auto')
Evolutionary simulations.
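Note that the simulation is stochastic: game assignments, payoffs, and mutations all involve randomness, so any single run can be noisy. Below is a minimal sketch (not part of the tutorial pipeline; the replication count and number of generations are arbitrary and kept small for speed) of how one might repeat the evolution a few times and compare the final proportion of social learners across runs.

#Optional robustness check: repeat the evolution and record the final proportion of social learners in each run
nReps <- 3 #kept small for speed; increase for more stable estimates
finalPropSoc <- sapply(1:nReps, function(rep){
  out <- doTheEvolution(initial, generations = 100) #fewer generations than above, purely for speed
  mean(out[[length(out)]][,1] == 1) #proportion of social learners in the final generation
})
finalPropSoc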

Other forms of social learning

Social learning encompasses a wide range of mechanisms. So far we have looked at:

  • Imitation: directly copying observed behaviors
  • Value-shaping: adding a heuristic bonus to the value of observed behaviors
  • Value inference: inferring the hidden value representations that drive others' behavior
  • Belief inference: inferring others' hidden beliefs about the world

However, these mechanisms represent only a subset of social learning strategies and of the characteristics of collective intelligence. In addition:

  • Intelligent behavior is not only a function of each individual, but also of how well groups collectively solve problems (Galesic et al., 2023)
  • Over large time scales, simple innovations can cumulatively add up to produce massively complex cultural solutions (Acerbi et al., 2020)
  • So far we have focused on observational learning, but social learning also involves pedagogy (Vélez et al., 2023) and explicit communication (Fan et al., 2020)

Yet for each mechanism we can describe verbally, we can also define a computational model that makes more precise commitments about the underlying processes. Through experimentation and modeling, we can then iteratively tweak and refine our understanding of social learning.
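As a concrete illustration of this mapping, the verbal description of value-shaping above can be turned into a choice rule by adding a frequency-weighted bonus to an agent's own Q-values before applying the softmax. The sketch below is illustrative rather than the exact formulation used earlier in the tutorial: the bonus weight eta is a hypothetical free parameter, while softmax_probs and k are reused from the simulation code above.

#Illustrative sketch of value-shaping as a choice rule (not necessarily the formulation used earlier)
valueShapingProbs <- function(Q_row, socialObs, eta = 1, beta = 1){
  #Q_row: the agent's own Q-values; socialObs: others' observed choices (integers in 1:k, possibly NA)
  nObs <- sum(!is.na(socialObs)) #number of valid social observations
  socialFreq <- if (nObs > 0) as.numeric(table(factor(socialObs, levels = 1:k))) / nObs else rep(0, k) #relative choice frequencies
  Q_shaped <- Q_row + eta * socialFreq #heuristic social bonus added to private value estimates
  softmax_probs(Q_row = Q_shaped, beta = beta) #same softmax choice rule as in the simulation above
}
valueShapingProbs(Q_row = rep(0, k), socialObs = c(3, 3, 7), eta = 1, beta = 1) #example: the bonus shifts probability toward options 3 and 7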

References

Acerbi, A., Mesoudi, A., & Smolla, M. (2020). Individual-based models of cultural evolution: A step-by-step guide using R.
Auer, P., Cesa-Bianchi, N., & Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2), 235–256.
Barkoczi, D., Analytis, P. P., & Wu, C. M. (2016). Collective search on rugged landscapes: A cross-environmental analysis. In A. Papafragou, D. Grodner, D. Mirman, & J. C. Trueswell (Eds.), Proceedings of the 38th Annual Conference of the Cognitive Science Society (pp. 918–923). Cognitive Science Society.
Biele, G., Rieskamp, J., Krugel, L. K., & Heekeren, H. R. (2011). The neural basis of following advice. PLoS Biology, 9(6), e1001089.
Dayan, P., Kakade, S., & Montague, P. R. (2000). Learning and selective attention. Nature Neuroscience, 3, 1218–1223.
Derex, M., & Boyd, R. (2016). Partial connectivity increases cultural accumulation within groups. Proceedings of the National Academy of Sciences, 113(11), 2982–2987.
Fan, J. E., Hawkins, R. D., Wu, M., & Goodman, N. D. (2020). Pragmatic inference and visual abstraction enable contextual flexibility during visual communication. Computational Brain & Behavior, 3, 86–101.
Galesic, M., Barkoczi, D., Berdahl, A. M., Biro, D., Carbone, G., Giannoccaro, I., Goldstone, R. L., Gonzalez, C., Kandler, A., Kao, A. B., et al. (2023). Beyond collective intelligence: Collective adaptation. Journal of the Royal Society Interface, 20(200), 20220736.
Gergely, G., & Csibra, G. (2003). Teleological reasoning in infancy: The naïve theory of rational action. Trends in Cognitive Sciences, 7(7), 287–292.
Gershman, S. J. (2015). A unifying probabilistic view of associative learning. PLoS Computational Biology, 11(11), e1004567.
Heyes, C. (2016). Who knows? Metacognitive social learning strategies. Trends in Cognitive Sciences, 20(3), 204–213.
Ho, M. K., Cushman, F., Littman, M. L., & Austerweil, J. L. (2019). People teach with rewards and punishments as communication, not reinforcements. Journal of Experimental Psychology: General, 148(3), 520.
Hoppitt, W., & Laland, K. N. (2013). Social learning: An introduction to mechanisms, methods, and models. Princeton University Press.
Jara-Ettinger, J. (2019). Theory of mind as inverse reinforcement learning. Current Opinion in Behavioral Sciences, 29, 105–110.
Lazer, D., & Friedman, A. (2007). The network structure of exploration and exploitation. Administrative Science Quarterly, 52(4), 667–694.
Machado, M. C., Bellemare, M. G., & Bowling, M. (2020). Count-based exploration with the successor representation. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 5125–5133.
Mason, W., & Watts, D. J. (2012). Collaborative learning in networks. Proceedings of the National Academy of Sciences, 109(3), 764–769.
McElreath, R., Lubell, M., Richerson, P. J., Waring, T. M., Baum, W., Edsten, E., Efferson, C., & Paciotti, B. (2005). Applying evolutionary models to the laboratory study of social learning. Evolution and Human Behavior, 26(6), 483–508.
Najar, A., Bonnet, E., Bahrami, B., & Palminteri, S. (2020). The actions of others act as a pseudo-reward to drive imitation in the context of social reinforcement learning. PLoS Biology, 18(12), e3001028.
Palminteri, S., Khamassi, M., Joffily, M., & Coricelli, G. (2015). Contextual modulation of value signals in reward and punishment learning. Nature Communications, 6(1), 1–14.
Rogers, A. R. (1988). Does biology constrain culture? American Anthropologist, 90(4), 819–831.
Speekenbrink, M., & Konstantinidis, E. (2015). Uncertainty and exploration in a restless bandit problem. Topics in Cognitive Science, 7(2), 351–367.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction (Second). MIT press.
Toyokawa, W., Whalen, A., & Laland, K. N. (2019). Social learning strategies regulate the wisdom and madness of interactive crowds. Nature Human Behaviour, 3(2), 183–193.
Tump, A. N., Wu, C. M., Bouhlel, I., & Goldstone, R. L. (2019). The evolutionary dynamics of cooperation in collective search. In A. K. Goel, C. M. Seifert, & C. Freksa (Eds.), Proceedings of the 41st Annual Conference of the Cognitive Science Society (pp. 883–889). Cognitive Science Society.
Vélez, N., Chen, A. M., Burke, T., Cushman, F. A., & Gershman, S. J. (2023). Teachers recruit mentalizing regions to represent learners’ beliefs. Proceedings of the National Academy of Sciences, 120(22), e2215015120.
Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3), 279–292.
Wilson, R. C., Geana, A., White, J. M., Ludvig, E. A., & Cohen, J. D. (2014). Humans use directed and random exploration to solve the explore–exploit dilemma. Journal of Experimental Psychology: General, 143(6), 2074.
Wu, C. M., Deffner, D., Kahl, B., Meder, B., Ho, M. K., & Kurvers, R. H. (2025). Adaptive mechanisms of social and asocial learning in immersive foraging environments. Nature Communications, 16, 3539. https://doi.org/10.1038/s41467-025-58365-6
Wu, C. M., Meder, B., & Schulz, E. (2025). Unifying principles of generalization: Past, present, and future. Annual Review of Psychology, 76, 275–302. https://doi.org/10.1146/annurev-psych-021524-110810
Wu, C. M., Schulz, E., Garvert, M. M., Meder, B., & Schuck, N. W. (2020). Similarities and differences in spatial and non-spatial cognitive maps. PLOS Computational Biology, 16, 1–28. https://doi.org/10.1371/journal.pcbi.1008149
Wu, C. M., Schulz, E., & Gershman, S. J. (2021). Inference and search on graph-structured spaces. Computational Brain & Behavior, 4, 125–147. https://doi.org/10.1007/s42113-020-00091-x
Wu, C. M., Schulz, E., Pleskac, T. J., & Speekenbrink, M. (2022). Time pressure changes how people explore and respond to uncertainty. Scientific Reports, 12, 1–14. https://doi.org/10.1038/s41598-022-07901-1
Wu, C. M., Schulz, E., Speekenbrink, M., Nelson, J. D., & Meder, B. (2018b). Generalization guides human exploration in vast decision spaces. Nature Human Behaviour, 2(12), 915–924. https://doi.org/10.1038/s41562-018-0467-4
Wu, C. M., Schulz, E., Speekenbrink, M., Nelson, J. D., & Meder, B. (2018a). Generalization guides human exploration in vast decision spaces. Nature Human Behaviour, 2(12), 915–924.
Wu, C. M., Vélez, N., & Cushman, F. A. (2022). Representational exchange in human social learning: Balancing efficiency and flexibility. In I. C. Dezza, E. Schulz, & C. M. Wu (Eds.), The Drive for Knowledge: The Science of Human Information-Seeking. Cambridge University Press.
Xi, J., & Toyokawa, W. (2025). Collective behavior emerging from social learning strategies and network structures. Proceedings of the Annual Meeting of the Cognitive Science Society, 47.