Benjamin Rosche and Wicia Fang
Most courses on social network analysis (SNA) focus on descriptive SNA, such as measuring the density of a network, identifying subgroups within a network, or examining the centrality of actors within a network. Inferential SNA, which focuses on explaining the formation of networks and behaviors and beliefs of actors embedded in networks, by contrast, is often eschewed.
This short course is an introduction to inferential SNA. The focus of this course is on models to explain how the behaviors and beliefs of actors are influenced by the networks in which they are embedded. Covered are cross-sectional, panel, and dynamic panel models to estimate exogenous and endogenous peer effects.
In the following, we examine under which circumstances correct peer effects are recovered and under which circumstances they are not. For that, I created four cross-sectional scenarios (dat1.1 through dat1.4) and four panel scenarios (dat2.1 through dat2.4). These datasets are simulated and differ with respect to whether the peer effect is exogenous or endogenous and whether the networks have formed randomly or not. For each of the simulated scenarios, there is a corresponding adjacency matrix (w1.1 through w2.4) describing the network structure.
In this first scenario (dat1.1 and w1.1), we consider exogenous peer effects on a social network that has formed randomly.
The network formation process is very simple: each dyad has a 50% chance of existing:
\(Pr(ij) = 0.5\)
The outcome in all examples is achievement and the explanatory variables are SI and I. Both explanatory variables are independent draws from a normal distribution and there are no confounding influences in the data generation process (DGP).
SI (selection+influence) is a variable that affects both selection into dyads and the outcome, while I (influence) only affects the outcome. However, since network formation is random in scenario 1.1, both variables only affect the outcome:
\(achievement_i = 1SI_i + 1 I_i + 1 \sum_{j}w_{ij}SI_j + 1 \sum_{j}w_{ij}I_j\)
That is, the achievement of i is influenced by her own values of SI and I and by the average SI and I of her peers. As such, the exogenous peer effect is local because individuals are only affected by those peers to whom they are directly connected.
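The data-generating process of scenario 1.1 can be sketched in base R. This is a minimal, hypothetical re-creation and not the code used to produce dat1.1 and w1.1: the sample size, the seed, the undirected ties, and the names A, W, lag_SI, and lag_I are assumptions made for illustration.

```r
# Hypothetical re-creation of scenario 1.1 (not the authors' simulation code)
set.seed(1)
n <- 200                                   # assumed number of actors

# Random network: every undirected dyad exists with probability 0.5
A <- matrix(0, n, n)
A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, size = 1, prob = 0.5)
A <- A + t(A)                              # symmetric, zero diagonal

# Row-standardize so that W %*% x is the peer average of x
W <- A / rowSums(A)

SI <- rnorm(n)
I  <- rnorm(n)

# Peer averages of both covariates
lag_SI <- drop(W %*% SI)
lag_I  <- drop(W %*% I)

# Outcome: own effects plus exogenous (local) peer effects, all equal to 1
achievement <- 1 * SI + 1 * I + 1 * lag_SI + 1 * lag_I

# Regress the outcome on own and peer-averaged covariates
fit <- lm(achievement ~ SI + I + lag_SI + lag_I)
round(coef(fit), 3)
```

Because this sketch, like the DGP stated above, contains no error term, the regression that includes all four terms should return slopes that are essentially equal to 1.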
lm(achievement ~ SI + I, data=dat1.1) # A: conventional regression model
##
## Call:
## lm(formula = achievement ~ SI + I, data = dat1.1)
##
## Coefficients:
## (Intercept) SI I
## 0.0354 1.0127 0.9773
In (A), the exogenous peer effects (\(SI_j\), \(I_j\)) are omitted. We observe that ego’s corresponding features (\(SI_i\), \(I_i\)) are nonetheless estimated correctly.
lmSLX(
achievement ~ SI + I,
Durbin = ~ SI + I,
listw = mat2listw(w1.1, style="W"), # adjacency matrix
data = dat1.1
) # B: Spatial lagged-X model
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Coefficients:
## (Intercept) SI I lag.SI lag.I
## -9.992e-17 1.000e+00 1.000e+00 1.000e+00 1.000e+00
lmSLX(
achievement ~ SI,
Durbin = ~ SI ,
listw = mat2listw(w1.1, style="W"), # adjacency matrix
data = dat1.1
) # C: Spatial lagged-X model
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Coefficients:
## (Intercept) SI lag.SI
## -0.02141 1.03459 1.00996
In (B), we include the exogenous peer effects. We observe that the estimates are perfectly recovered. We observe in (C) that even if covariates are omitted, the remaining individual and exogenous peer effects are still estimated correctly.
Generally, it makes sense to also include the adjacency matrix in the disturbance term as it will produce more appropriate standard errors (and with that, hypothesis tests). This is done in (D):
errorsarlm(
achievement ~ SI + I,
Durbin = ~ SI + I, # exogenous peer effects
listw = mat2listw(w1.1, style="W"), # adjacency matrix, also enters the disturbance term
data = dat1.1
) # D: Spatial Durbin error model (not run)
In dat1.2, we consider a situation in which network formation is endogenous:
\(Pr(ij) = 1 / (1 + exp(-( 1 * similaritySI_{ij} )))\)
That is, dyads are more likely between peers who are similar in SI. The DGP for the outcome is the same:
\(achievement_i = 1SI_i + 1 I_i + 1 \sum_{j}w_{ij}SI_j + 1 \sum_{j}w_{ij}I_j\)
lm(achievement ~ SI + I, data=dat1.2) # A: Conventional regression model, both exogenous peer effects omitted
##
## Call:
## lm(formula = achievement ~ SI + I, data = dat1.2)
##
## Coefficients:
## (Intercept) SI I
## 0.0158 1.7799 1.0361
In (A), we observe that the ego feature I is still estimated correctly with a conventional regression model. The estimate for SI, however, is now biased. More generally, the effect of individual features that are involved in the selection process will be biased if the variables also have an effect on peers.
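One way to see where this bias comes from is to simulate homophilous tie formation and check how strongly ego's SI correlates with the average SI of her peers. The sketch below assumes that similaritySI is the negative absolute difference in SI, which the text does not specify. The sample size, the seed, and the handling of isolates are likewise illustrative assumptions.

```r
# Hypothetical sketch of endogenous (homophilous) network formation
set.seed(2)
n  <- 300                                  # assumed number of actors
SI <- rnorm(n)

# ASSUMPTION: similarity is the negative absolute difference in SI
similarity <- -abs(outer(SI, SI, "-"))
p_tie <- plogis(1 * similarity)            # 1 / (1 + exp(-(1 * similarity)))

# Draw each undirected dyad once with its own tie probability
A  <- matrix(0, n, n)
up <- upper.tri(A)
A[up] <- rbinom(sum(up), size = 1, prob = p_tie[up])
A <- A + t(A)

# Row-standardize; isolates (if any) keep a zero row instead of NaN
rs <- rowSums(A)
rs[rs == 0] <- 1
W <- A / rs

# Under homophily, ego's SI and the peer average of SI are correlated
lag_SI <- drop(W %*% SI)
cor(SI, lag_SI)
```

If the correlation is clearly positive, a model that omits the peer average of SI attributes part of the peer effect to ego's own SI, which is the pattern seen in (A).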
lmSLX(
achievement ~ SI + I,
Durbin = ~ SI,
listw = mat2listw(w1.2, style="W"),
data = dat1.2
) # B: Spatial lagged-X model, one exogenous peer effect (I) omitted
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Coefficients:
## (Intercept) SI I lag.SI
## 0.03845 0.40807 1.00870 1.78274
As we observe in (B), SI is also biased if we do not include all relevant exogenous peer effects. Moreover, the included exogenous peer effect (SI) is likewise biased!
lmSLX(
achievement ~ SI + I,
Durbin = ~ SI + I,
listw = mat2listw(w1.2, style="W"),
data = dat1.2
) # C: Spatial lagged-X model
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Coefficients:
## (Intercept) SI I lag.SI lag.I
## -5.551e-17 1.000e+00 1.000e+00 1.000e+00 1.000e+00
In (C), we observe that endogenous network formation will not bias estimates if we capture the true DGP.
In dat1.3, networks are random again, \(Pr(ij) = 0.5\), but here we consider an endogenous peer effect. The endogenous peer effect has a global impact on the network since any change in the outcome of an individual will not only affect all direct peers but also the peers of peers. The endogenous peer effect can be conceptualized as a social interaction process, in which the value of the dependent variable for each individual is jointly determined with that of her peers. This is the DGP: \(achievement_i = 0.5\sum_{j}w_{ij}achievement_j + 1SI_i + 1I_i\).
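Because achievement appears on both sides of this DGP, simulating it requires solving for all outcomes jointly. A minimal sketch, assuming a row-standardized W and using the reduced form \(y = (I_n - \rho W)^{-1}(SI + I)\), is given below. The sample size, seed, and object names are illustrative and not taken from the original simulation.

```r
# Hypothetical sketch: simulate an endogenous peer effect via the reduced form
set.seed(3)
n <- 100                                   # assumed number of actors

A <- matrix(0, n, n)
A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, size = 1, prob = 0.5)
A <- A + t(A)
W <- A / rowSums(A)                        # row-standardized adjacency matrix

SI  <- rnorm(n)
I   <- rnorm(n)
rho <- 0.5                                 # endogenous peer effect

# Solve (I_n - rho * W) y = SI + I for y
achievement <- drop(solve(diag(n) - rho * W, SI + I))

# Check: the solution satisfies the structural equation of the DGP
structural <- rho * drop(W %*% achievement) + 1 * SI + 1 * I
max(abs(achievement - structural))
```

The final line should return a value that is zero up to floating-point error, confirming that the jointly determined outcomes are consistent with the stated DGP.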
m13.listw <- mat2listw(w1.3, style="W")
m13 <-
lagsarlm(
achievement ~ SI + I,
listw=m13.listw,
data=dat1.3
) # A: Spatial autoregressive model
m13
##
## Call:
## lagsarlm(formula = achievement ~ SI + I, data = dat1.3, listw = m13.listw)
## Type: lag
##
## Coefficients:
## rho (Intercept) SI I
## 5.000000e-01 4.531651e-09 1.000000e+00 1.000000e+00
##
## Log likelihood: 1890.228
We observe in (A) that the endogenous peer effect (rho=\(\rho\)) is estimated correctly. The interpretation of the effect differs from exogenous peer effects since the spillover traverses the network. To understand the impact better, we can calculate marginal effects (dy/dx) of the features (x) using the impacts() function:
impacts(m13, listw = m13.listw)
## Impact measures (lag, exact):
## Direct Indirect Total
## SI 1.007514 0.9924863 2
## I 1.007514 0.9924863 2
We observe that SI and I each exhibit a direct and an indirect effect. The direct effects approximately equal the regression coefficients of SI and I. They are slightly larger (1.0075 rather than 1) because part of the spillover feeds back to ego through her peers. The indirect effects are the portion of the total effect that is due to the ripple effect of the endogenous peer effect. That is, if \(\beta_{SI}=1\) and \(\rho=0.5\), the total effect is \(\beta_{SI}/(1-\rho)=2\) and the indirect effect is approximately 1. More generally, \(IND \approx \beta\rho/(1-\rho)\). Note that this is only true for a row-standardized adjacency matrix (i.e., a mean peer effect).
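The impact measures can also be reproduced by hand, which makes the decomposition explicit. The sketch below is not a call into spatialreg. It assumes a row-standardized W and computes the matrix of partial derivatives \(\beta(I_n - \rho W)^{-1}\) directly. The network, seed, and object names are illustrative.

```r
# Hypothetical sketch: direct, indirect, and total impacts computed by hand
set.seed(4)
n <- 100                                   # assumed number of actors

A <- matrix(0, n, n)
A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, size = 1, prob = 0.5)
A <- A + t(A)
W <- A / rowSums(A)                        # row-standardized adjacency matrix

rho  <- 0.5                                # endogenous peer effect
beta <- 1                                  # coefficient of the covariate

# S[i, j] is the derivative of y_i with respect to x_j
S <- beta * solve(diag(n) - rho * W)

direct   <- mean(diag(S))                  # average own-effect, incl. feedback
total    <- mean(rowSums(S))               # average effect of shifting all x
indirect <- total - direct                 # average spillover from peers

c(direct = direct, indirect = indirect, total = total)
```

With a row-standardized W the total impact equals \(\beta/(1-\rho)\), while the direct impact is slightly above \(\beta\) because of feedback through peers.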
lagsarlm(
achievement ~ SI,
listw=m13.listw,
data=dat1.3
) # B: Spatial autoregressive model in which I is omitted
##
## Call:
## lagsarlm(formula = achievement ~ SI, data = dat1.3, listw = m13.listw)
## Type: lag
##
## Coefficients:
## rho (Intercept) SI
## 0.60682068 -0.01935358 1.02810393
##
## Log likelihood: -69.25506
We observe in (B) that the endogenous peer effect (\(\rho\)) is biased if there are omitted variables, but the individual feature effect of SI remains largely unaffected.
Generally, it makes sense to also include the adjacency matrix in the disturbance term as it will produce more appropriate standard errors (and with that, hypothesis tests). This is done in (C):
sacsarlm(
achievement ~ SI + I,
listw=m13.listw,
data=dat1.3
) # C: Spatial autoregressive combined model (not run)
In dat1.4, we consider an endogenous network formation process:
\(Pr(ij) = 1 / (1 + exp(-( 1 * similaritySI_{ij} )))\)
as well as an endogenous peer effect:
\(achievement_i = 0.5\sum_{j}w_{ij}achievement_j + 1I_i + 1SI_i\).
lagsarlm(
achievement ~ SI + I,
listw=mat2listw(w1.4, style="W"),
data=dat1.4
) # A: Spatial autoregressive model with full DGP
##
## Call:
## lagsarlm(formula = achievement ~ SI + I, data = dat1.4, listw = mat2listw(w1.4,
## style = "W"))
## Type: lag
##
## Coefficients:
## rho (Intercept) SI I
## 5.000000e-01 -1.036521e-09 1.000000e+00 1.000000e+00
##
## Log likelihood: 1959.575
We observe in (A) that the endogenous peer effect estimate is correct despite the endogenous network formation if the true DGP is captured.
lagsarlm(
achievement ~ I,
listw=mat2listw(w1.4, style="W"),
data=dat1.4
) # B: Spatial autoregressive model, covariate involved in the selection process (SI) omitted
##
## Call:
## lagsarlm(formula = achievement ~ I, data = dat1.4, listw = mat2listw(w1.4,
## style = "W"))
## Type: lag
##
## Coefficients:
## rho (Intercept) I
## -1.5087776 -0.2677616 1.0456252
##
## Log likelihood: -17.04281
lagsarlm(
achievement ~ SI,
listw=mat2listw(w1.4, style="W"),
data=dat1.4
) # C: Spatial autoregressive model, covariate NOT involved in the selection process (I) omitted
##
## Call:
## lagsarlm(formula = achievement ~ SI, data = dat1.4, listw = mat2listw(w1.4,
## style = "W"))
## Type: lag
##
## Coefficients:
## rho (Intercept) SI
## 0.53848730 -0.03228798 1.03979048
##
## Log likelihood: -69.26538
We observe in (B) that effects of covariates that are not involved in the selection process (I) are correctly recovered. The endogenous peer effect (\(\rho\)), however, is estimated incorrectly because SI is omitted and the selection process is therefore unobserved.
In (C), we observe that omitting a variable that is not part of the selection process (I), by contrast, is less problematic. The endogenous peer effect is close to the true value of 0.5 and the individual feature effect (SI) is close to the true value of 1.
We now move to panel data. dat2.1 contains three waves but there are no temporal trends. That is, the DGP for each wave is:
\(Pr(ijt) = 0.5\) and \(achievement_{it} = 1SI_i + 1I_i + 1(\sum_{j}w_{ijt}SI_{jt})_{it} + 1\sum_{j}(w_{ijt}I_{jt})_{it}\)
The advantage of using a panel model is that individual-specific and/or time-specific fixed or random effects can be estimated. Let’s start with a pooled (i.e. cross-sectional) model and omit the time-constant individual-specific effects to observe the advantages of a panel model. Since SI and I do not change across time, we can use them as time-constant unobservables.
lm(
achievement ~ alter_SI + alter_I,
data=dat2.1
) # A: Pooled model
##
## Call:
## lm(formula = achievement ~ alter_SI + alter_I, data = dat2.1)
##
## Coefficients:
## (Intercept) alter_SI alter_I
## -0.1067 0.4424 0.7858
Since lmSLX does not allow including an exogenous peer effect of a covariate without also including it as a main effect, I use lm() instead and manually compute \(alter_{SI} = WSI\) and \(alter_I = WI\). We observe in (A) that all effects are biased due to the unobserved time-constant individual-specific confounders (SI and I).
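The manual computation of the peer averages can be sketched as follows. The sketch assumes that the networks are available as a list with one row-standardized matrix per wave. That storage format, the helper random_W, and the objects W_list and panel are hypothetical, while the column names uid, wave, alter_SI, and alter_I mirror those used in the models here.

```r
# Hypothetical sketch: build alter_SI = W %*% SI and alter_I = W %*% I per wave
set.seed(5)
n     <- 50                                # assumed number of actors
waves <- 3

SI <- rnorm(n)                             # time-constant covariates
I  <- rnorm(n)

# Helper: draw a random undirected network and row-standardize it
random_W <- function(n) {
  A <- matrix(0, n, n)
  A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, size = 1, prob = 0.5)
  A <- A + t(A)
  A / rowSums(A)
}
W_list <- lapply(seq_len(waves), function(w) random_W(n))

# Long-format panel: one row per actor and wave
panel <- do.call(rbind, lapply(seq_len(waves), function(w) {
  W <- W_list[[w]]
  data.frame(
    uid      = seq_len(n),
    wave     = w,
    alter_SI = drop(W %*% SI),             # peer average of SI in wave w
    alter_I  = drop(W %*% I)               # peer average of I in wave w
  )
}))

# Outcome following the panel DGP with exogenous peer effects
panel$achievement <- SI[panel$uid] + I[panel$uid] + panel$alter_SI + panel$alter_I
head(panel)
```

Because the network is redrawn in every wave, alter_SI and alter_I vary within individuals even though SI and I themselves are constant.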
We know that a fixed-effect panel model will remove any bias from time-constant individual-specific effects. The drawback of this model is that no time-constant effects can be estimated. Even though SI and I are such variables, the peer effect can nonetheless be estimated because the networks are probabilistic and change randomly across waves. This is an important insight. If networks change across waves, peer effects of time-constant variables can nonetheless be estimated:
plm(
achievement ~ alter_SI + alter_I,
model="within",
index = c("uid", "wave"),
data=dat2.1
) # B: Panel fixed effect model, all exogenous peer effects included
##
## Model Formula: achievement ~ alter_SI + alter_I
##
## Coefficients:
## alter_SI alter_I
## 1 1
plm(
achievement ~ alter_SI,
model="within",
index = c("uid", "wave"),
data=dat2.1
) # B: Panel fixed effect model, alter_I omitted
##
## Model Formula: achievement ~ alter_SI
##
## Coefficients:
## alter_SI
## 1.0061
We observe in (B) that the exogenous peer effects are recovered: perfectly when all exogenous peer effects are included and almost perfectly (1.0061) when alter_I is omitted. The estimated effects are thus correct regardless of whether all exogenous peer effects are included.
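The reason the fixed-effect model works here can be written out in a few lines. The notation \(\alpha_i\) for the sum of the time-constant own effects, \(alterSI_{it} = \sum_j w_{ijt}SI_j\) and \(alterI_{it} = \sum_j w_{ijt}I_j\) for the peer averages, and the bar for an individual's mean across waves are introduced here for illustration only.

```latex
\begin{aligned}
achievement_{it} &= \alpha_i + 1\,alterSI_{it} + 1\,alterI_{it},
  \qquad \alpha_i = 1SI_i + 1I_i \\
\overline{achievement}_{i} &= \alpha_i + 1\,\overline{alterSI}_{i} + 1\,\overline{alterI}_{i} \\
achievement_{it} - \overline{achievement}_{i}
  &= 1\,(alterSI_{it} - \overline{alterSI}_{i}) + 1\,(alterI_{it} - \overline{alterI}_{i})
\end{aligned}
```

Subtracting the individual means removes \(\alpha_i\) entirely, while the peer averages survive because the networks, and therefore \(alterSI_{it}\) and \(alterI_{it}\), change from wave to wave.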
The same result is achieved with the spatial panel model in (C). The only difference from plm() is that spml() also models a peer effect on the disturbances.
spml(
achievement ~ alter_SI + alter_I,
lag=FALSE, # no endogenous peer effect
spatial.error="b", # peer effect on disturbances
model="within", # change within observations is considered (i.e. individual FE)
effect="individual", # individual FE (vs time FE)
index = c("uid", "wave"),
listw=mat2listw(w2.1, style="W"), # is assumed to be constant across time
data=dat2.1
) # C: Panel spatial error model (not run)
In dat2.2, networks form endogenously. Note that the selection process is time-constant:
\(Pr(ij) = 1 / (1 + exp(-( 1 * similaritySI_{ij} )))\)
The DGP for the outcome is the same as before: \(achievement_{it} = 1SI_i + 1I_i + 1(\sum_{j}w_{ijt}SI_{jt})_{it} + 1\sum_{j}(w_{ijt}I_{jt})_{it}\)
lm(
achievement ~ alter_SI + alter_I,
data=dat2.2
) # A: Pooled model
##
## Call:
## lm(formula = achievement ~ alter_SI + alter_I, data = dat2.2)
##
## Coefficients:
## (Intercept) alter_SI alter_I
## -0.07823 2.37146 -0.34580
Endogenous network formation amplifies the bias of the pooled model in (A).
plm(
achievement ~ alter_SI + alter_I,
model="within",
index = c("uid", "wave"),
data=dat2.2
) # B: Fixed-effect panel model
##
## Model Formula: achievement ~ alter_SI + alter_I
##
## Coefficients:
## alter_SI alter_I
## 1 1
We observe in (B) that the panel model is not affected by a time-constant selection process. This is another important insight. If we can credibly argue that the selection process is time-constant, then the panel fixed-effect model will not only remove confounding from variables that influence the outcome but also the selection process!
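This insight can be checked with a small simulation in which ties are redrawn in every wave from the same homophilous tie probabilities. The sketch is hypothetical: it assumes that similaritySI is the negative absolute difference in SI, applies the within transformation by hand with ave() instead of calling plm(), and uses illustrative object names, sample size, and seed.

```r
# Hypothetical sketch: within estimator under a time-constant selection process
set.seed(7)
n     <- 100                               # assumed number of actors
waves <- 3

SI <- rnorm(n)
I  <- rnorm(n)

# ASSUMPTION: similarity = -|SI_i - SI_j|; tie probabilities are the same in every wave
p_tie <- plogis(-abs(outer(SI, SI, "-")))

# Helper: draw an undirected network from p and row-standardize it
draw_W <- function(p) {
  A  <- matrix(0, nrow(p), ncol(p))
  up <- upper.tri(A)
  A[up] <- rbinom(sum(up), size = 1, prob = p[up])
  A  <- A + t(A)
  rs <- rowSums(A)
  rs[rs == 0] <- 1                         # isolates keep a zero row instead of NaN
  A / rs
}

panel <- do.call(rbind, lapply(seq_len(waves), function(w) {
  W <- draw_W(p_tie)
  data.frame(uid = seq_len(n), wave = w,
             alter_SI = drop(W %*% SI), alter_I = drop(W %*% I))
}))
panel$achievement <- SI[panel$uid] + I[panel$uid] + panel$alter_SI + panel$alter_I

# Pooled model: ignores the time-constant own effects
pooled_fit <- lm(achievement ~ alter_SI + alter_I, data = panel)

# Within transformation: subtract each individual's mean across waves
demean <- function(x, g) x - ave(x, g)
within_data <- data.frame(
  y    = demean(panel$achievement, panel$uid),
  x_SI = demean(panel$alter_SI, panel$uid),
  x_I  = demean(panel$alter_I, panel$uid)
)
within_fit <- lm(y ~ 0 + x_SI + x_I, data = within_data)

rbind(pooled = coef(pooled_fit)[-1], within = coef(within_fit))
```

In this sketch the demeaned regression should return coefficients that are essentially equal to 1, whereas the pooled coefficients need not, because the time-constant own effects and the time-constant part of the selection process are swept out together with the individual means.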
plm(
achievement ~ alter_SI,
model="within",
index = c("uid", "wave"),
data=dat2.2 # (! network formation is endogenous)
) # C: Fixed-effect panel model in which alter_I is omitted
##
## Model Formula: achievement ~ alter_SI
##
## Coefficients:
## alter_SI
## 1.1261
We observe in (C), however, that if relevant exogenous peer effects are omitted, the remaining exogenous peer effects will be biased.
plm(
achievement ~ alter_SI,
model="within",
index = c("uid", "wave"),
data=dat2.1 # (! network formation is exogenous)
) # D: Fixed-effect panel model in which alter_I is omitted but the network formation process is exogenous
##
## Model Formula: achievement ~ alter_SI
##
## Coefficients:
## alter_SI
## 1.0061
As can be observed in (D), this is not the case if network formation is exogenous.
The DGP in dat2.3 is the following:
\(Pr(ij) = 0.5\)
\(achievement_{it} = 1SI_i + 1I_i + 0.5\sum_{j}(w_{ijt}y_{jt})_{it}\)
lm(
achievement ~ alter_achievement + SI + I,
data=dat2.3
) # A: Pooled model with full DGP
##
## Call:
## lm(formula = achievement ~ alter_achievement + SI + I, data = dat2.3)
##
## Coefficients:
## (Intercept) alter_achievement SI I
## -4.166e-17 5.000e-01 1.000e+00 1.000e+00
lm(
achievement ~ alter_achievement,
data=dat2.3
) # B: Pooled model with time-constant individual effects omitted
##
## Call:
## lm(formula = achievement ~ alter_achievement, data = dat2.3)
##
## Coefficients:
## (Intercept) alter_achievement
## 0.05337 -1.32800
plm(
achievement ~ alter_achievement,
model="within",
index = c("uid", "wave"),
data=dat2.3
) # C: Panel fixed-effect model
##
## Model Formula: achievement ~ alter_achievement
##
## Coefficients:
## alter_achievement
## 0.5
We observe in (A), (B), and (C) that, just like with exogenous peer effects, the pooled model only recovers the correct effect estimate if the full DGP is captured. As soon as time-constant unobservables are present, the estimates are incorrect. The panel fixed-effect model, however, estimates the correct endogenous peer effect if networks are random.
The DGP in dat2.4 is the same as before, except that networks form endogenously in a time-constant selection process:
\(Pr(ij) = 1 / (1 + exp(-( 1 * similaritySI_{ij} )))\)
\(achievement_{it} = 1SI_i + 1I_i + 0.5\sum_{j}(w_{ijt}y_{jt})_{it} + 1(\sum_{j}w_{ijt}SI_{jt})_{it} + 1\sum_{j}(w_{ijt}I_{jt})_{it}\)
lm(
achievement ~ alter_achievement + SI + I,
data=dat2.4
) # A: Pooled model with full DGP
##
## Call:
## lm(formula = achievement ~ alter_achievement + SI + I, data = dat2.4)
##
## Coefficients:
## (Intercept) alter_achievement SI I
## 7.692e-17 5.000e-01 1.000e+00 1.000e+00
lm(
achievement ~ alter_achievement,
data=dat2.4
) # B: Pooled model with time-constant individual effects omitted
##
## Call:
## lm(formula = achievement ~ alter_achievement, data = dat2.4)
##
## Coefficients:
## (Intercept) alter_achievement
## 0.05541 1.33311
plm(
achievement ~ alter_achievement,
model="within",
index = c("uid", "wave"),
data=dat2.4
) # C: Panel fixed-effect model
##
## Model Formula: achievement ~ alter_achievement
##
## Coefficients:
## alter_achievement
## 0.5
We observe in (A), (B), and (C) the same pattern as before. The pooled model only recovers the correct results if the DGP is fully captured. The panel fixed-effect model, however, estimates the correct endogenous peer effect even if the networks formed endogenously, as long as the selection process is time-constant.
Not yet covered.
remotes::install_github("RozetaSimonovska/SDPDmod")
Let us draw some conclusions from this exercise:
In the cross-sectional setting:
In the panel setting:
Note that this simulation ignored important complications: