---
title: "Inferential SNA: Peer effect models"
output:
  html_document:
    toc: true
    toc_float: true
    theme: united
---

Benjamin Rosche and Wicia Fang

Most courses on social network analysis (SNA) focus on descriptive SNA, such as measuring the density of a network, identifying subgroups within a network, or examining the centrality of actors within a network. Inferential SNA, which focuses on explaining the formation of networks and the behaviors and beliefs of actors embedded in networks, is by contrast often eschewed.

This short course is an introduction to inferential SNA. The focus of this course is on models that explain how the behaviors and beliefs of actors are influenced by the networks in which they are embedded. Covered are cross-sectional, panel, and dynamic panel models to estimate exogenous and endogenous peer effects.

In the following, we examine under which circumstances correct peer effects are recovered and under which circumstances they are not. For that, I created four cross-sectional scenarios (`dat1.1` through `dat1.4`) and four panel scenarios (`dat2.1` through `dat2.4`). These datasets are simulated and differ with respect to whether the peer effect is exogenous or endogenous and whether the networks have formed randomly or not. For each of the simulated scenarios, there is a corresponding adjacency matrix (`w1.1` through `w2.4`) describing the network structure.

```{r, echo=FALSE, message=FALSE}
rm(list=ls())

# install.packages(c("plm", "spdep", "spatialreg", "splm"))
library(plm)        # panel regression
library(spdep)      # spatial functions
library(spatialreg) # spatial cross-sectional regression
library(splm)       # spatial panel regression

load("C:/Users/benja/OneDrive - Cornell University/GitHub/peersimulation/2022 isna workshop/isna.RData")
```

## 1. Cross-sectional models

---

### 1.1. Exogenous peer effects and random networks

:::dgp
In this first scenario (`dat1.1` and `w1.1`), we consider exogenous peer effects on a social network that has formed randomly. The network formation process is very simple: each dyad has a 50% chance of existing:

$Pr(ij) = 0.5$

The outcome in all examples is `achievement` and the explanatory variables are `SI` and `I`. Both explanatory variables are drawn independently from a normal distribution, and there are no confounding influences in the data generation process (DGP). `SI` (*selection+influence*) is a variable that affects both selection into dyads and the outcome, while `I` (*influence*) only affects the outcome. However, since network formation is random in 1.1, both variables only affect the outcome:

$achievement_i = 1SI_i + 1 I_i + 1 \sum_{j}w_{ij}SI_j + 1 \sum_{j}w_{ij}I_j$

That is, the achievement of $i$ is influenced by her own values of `SI` and `I` and by the average `SI` and `I` of her peers. As such, the exogenous peer effect is local because individuals are only affected by those peers to whom they are directly connected.
:::

```{r, warning=FALSE}
lm(achievement ~ SI + I, data=dat1.1) # A: conventional regression model
```

In (A), the exogenous peer effects ($SI_j$, $I_j$) are omitted. We observe that the effects of ego's own features ($SI_i$, $I_i$) are nonetheless estimated correctly.

```{r, warning=FALSE}
lmSLX(
  achievement ~ SI + I,
  Durbin = ~ SI + I,
  listw = mat2listw(w1.1, style="W"), # adjacency matrix
  data = dat1.1
) # B: Spatial lagged-X model

lmSLX(
  achievement ~ SI,
  Durbin = ~ SI,
  listw = mat2listw(w1.1, style="W"), # adjacency matrix
  data = dat1.1
) # C: Spatial lagged-X model, I omitted
```

In (B), we include the exogenous peer effects. We observe that the estimates are perfectly recovered. We observe in (C) that even if covariates are omitted, the remaining individual and exogenous peer effects are still estimated correctly.
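To make explicit what the spatial lagged-X specification in (B) estimates, the following sketch simulates a small dataset in the spirit of scenario 1.1 and fits the same specification with plain `lm()` on manually lagged covariates. Everything in it is an assumption made for illustration: the sample size, the seed, the small noise term, and the names `n`, `A`, `W`, `d`, `alter_SI`, and `alter_I` are not taken from `isna.RData`.

```r
set.seed(1)
n <- 300

# Random network: each undirected dyad exists with probability 0.5
A <- matrix(0, n, n)
A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, 1, 0.5)
A <- A + t(A)                      # symmetric, zero diagonal

# Row-standardize so that W %*% x is the peer average (as with style="W")
W <- A / rowSums(A)

# Independent, normally distributed features
d <- data.frame(SI = rnorm(n), I = rnorm(n))

# Exogenous peer variables: average feature values of ego's peers
d$alter_SI <- as.vector(W %*% d$SI)
d$alter_I  <- as.vector(W %*% d$I)

# Outcome following the DGP of 1.1, plus a small noise term (assumption)
d$achievement <- 1 * d$SI + 1 * d$I + 1 * d$alter_SI + 1 * d$alter_I +
  rnorm(n, sd = 0.02)

# Same specification as (B), written as an ordinary regression
fit <- lm(achievement ~ SI + I + alter_SI + alter_I, data = d)
round(coef(fit), 2)
```

Because every individual is connected to roughly half of the network, the peer averages vary little across individuals, so the peer-effect coefficients are estimated less precisely than the ego coefficients.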
Generally, it makes sense to also include the adjacency matrix in the disturbance term because this produces more appropriate standard errors (and, with that, hypothesis tests). This is done in (D):

```{r, eval=FALSE}
errorsarlm(
  achievement ~ SI + I,
  Durbin = ~ SI + I, # exogenous peer effects
  listw = mat2listw(w1.1, style="W"),
  data = dat1.1
) # D: Spatial Durbin error model (not run)
```

---

### 1.2. Exogenous peer effects and endogenous networks

:::dgp
In `dat1.2`, we consider a situation in which network formation is endogenous:

$Pr(ij) = 1 / (1 + exp(-( 1 * similaritySI_{ij} )))$

That is, a dyad is more likely to exist between individuals who are similar in `SI`. The DGP for the outcome is the same:

$achievement_i = 1SI_i + 1 I_i + 1 \sum_{j}w_{ij}SI_j + 1 \sum_{j}w_{ij}I_j$
:::

```{r, warning=FALSE}
lm(achievement ~ SI + I, data=dat1.2) # A: Conventional regression model, both exogenous peer effects omitted
```

In (A), we observe that the effect of the ego feature `I` is still estimated correctly with a conventional regression model. The effect of `SI`, however, is now biased. More generally, the effect of an individual feature that is involved in the selection process will be biased if the feature also has an effect on peers and that peer effect is omitted.

```{r, warning=FALSE}
lmSLX(
  achievement ~ SI + I,
  Durbin = ~ SI,
  listw = mat2listw(w1.2, style="W"),
  data = dat1.2
) # B: Spatial lagged-X model, one exogenous peer effect (I) omitted
```

As we observe in (B), the ego effect of `SI` is also biased if we do not include all relevant exogenous peer effects. Moreover, the included exogenous peer effect (of `SI`) is likewise biased!

```{r, warning=FALSE}
lmSLX(
  achievement ~ SI + I,
  Durbin = ~ SI + I,
  listw = mat2listw(w1.2, style="W"),
  data = dat1.2
) # C: Spatial lagged-X model
```

In (C), we observe that endogenous network formation will not bias estimates if we capture the true DGP.

---

### 1.3. Endogenous peer effect and random networks

:::dgp
In `dat1.3`, networks are random again:

$Pr(ij) = 0.5$

but, here, we consider an endogenous peer effect.
The endogenous peer effect has a global impact on the network since any change in the outcome of an individual will affect not only all direct peers but also the peers of peers. The endogenous peer effect can be conceptualized as a social interaction process in which the value of the dependent variable for each individual is jointly determined with that of her peers. This is the DGP:

$achievement_i = 0.5\sum_{j}w_{ij}achievement_j + 1SI_i + 1I_i$.
:::

```{r, warning=FALSE}
m13.listw <- mat2listw(w1.3, style="W")

m13 <- lagsarlm(
  achievement ~ SI + I,
  listw=m13.listw,
  data=dat1.3
) # A: Spatial autoregressive model

m13
```

We observe in (A) that the endogenous peer effect ($\rho$, reported as `rho`) is estimated correctly. The interpretation of the effect differs from that of exogenous peer effects since the spillover traverses the network. To understand the impact better, we can calculate marginal effects (dy/dx) of the features (x) using the `impacts()` function:

```{r, warning=FALSE}
impacts(m13, listw = m13.listw)
```

We observe that the effects of `SI` and `I` each have a direct and an indirect component. The direct effects are approximately equal to the regression coefficients of `SI` and `I`. The indirect effects are the part of the total effect that is due to the ripple effect of the endogenous peer effect. That is, if $\beta_{SI}=1$ and $\rho=0.5$, the indirect effect also equals (approximately) 1. More generally, $IND \approx \beta\rho/(1-\rho)$, so that the total effect is $\beta/(1-\rho)$. Note that this is only true for a row-standardized adjacency matrix (i.e., a mean peer effect).

```{r, warning=FALSE}
lagsarlm(
  achievement ~ SI,
  listw=m13.listw,
  data=dat1.3
) # B: Spatial autoregressive model in which I is omitted
```

We observe in (B) that the endogenous peer effect ($\rho$) is biased if there are omitted variables but the individual feature effect of `SI` remains unaffected.

Generally, it makes sense to also include the adjacency matrix in the disturbance term because this produces more appropriate standard errors (and, with that, hypothesis tests).
This is done in (C):

```{r, eval=FALSE}
sacsarlm(
  achievement ~ SI + I,
  listw=m13.listw,
  data=dat1.3
) # C: Spatial autoregressive combined model (not run)
```

---

### 1.4. Endogenous peer effect and endogenous networks

:::dgp
In `dat1.4`, we consider an endogenous network formation process:

$Pr(ij) = 1 / (1 + exp(-( 1 * similaritySI_{ij} )))$

as well as an endogenous peer effect:

$achievement_i = 0.5\sum_{j}w_{ij}achievement_j + 1I_i + 1SI_i$.
:::

```{r, warning=FALSE}
lagsarlm(
  achievement ~ SI + I,
  listw=mat2listw(w1.4, style="W"),
  data=dat1.4
) # A: Spatial autoregressive model with full DGP
```

We observe in (A) that the endogenous peer effect estimate is correct despite the endogenous network formation if the true DGP is captured.

```{r, warning=FALSE}
lagsarlm(
  achievement ~ I,
  listw=mat2listw(w1.4, style="W"),
  data=dat1.4
) # B: Spatial autoregressive model, covariate involved in the selection process (SI) omitted

lagsarlm(
  achievement ~ SI,
  listw=mat2listw(w1.4, style="W"),
  data=dat1.4
) # C: Spatial autoregressive model, covariate NOT involved in the selection process (I) omitted
```

We observe in (B) that the effects of covariates that are not involved in the selection process (`I`) are correctly recovered. The endogenous peer effect ($\rho$), however, is estimated incorrectly because `SI` is omitted and the selection process is therefore unobserved. In (C), we observe that omitting a variable that is not part of the selection process (`I`) is, by contrast, less problematic. The endogenous peer effect is close to the true value of 0.5 and the individual feature effect (`SI`) is close to the true value of 1.

---

## 2. Panel models

---

### 2.1. Exogenous peer effects and random networks

:::dgp
We now move to panel data. `dat2.1` contains three waves but there are no temporal trends.
That is, the DGP for each wave is:

$Pr(ijt) = 0.5$

and

$achievement_{it} = 1SI_i + 1I_i + 1\sum_{j}w_{ijt}SI_{jt} + 1\sum_{j}w_{ijt}I_{jt}$

The advantage of using a panel model is that individual-specific and/or time-specific fixed or random effects can be estimated. Let's start with a pooled (i.e., cross-sectional) model and omit the time-constant individual-specific effects to observe the advantages of a panel model. Since `SI` and `I` do not change across time, we can use them as time-constant unobservables.
:::

```{r, warning=FALSE}
lm(
  achievement ~ alter_SI + alter_I,
  data=dat2.1
) # A: Pooled model, SI and I omitted
```

Since `lmSLX` does not allow including an exogenous peer effect of `I` without including it as a main effect, I use `lm()` instead and manually compute $alter_{SI} = WSI$ and $alter_I = WI$.

We observe in (A) that all effects are biased due to the unobserved time-constant individual-specific confounders (`SI` and `I`).

We know that a fixed-effect panel model will remove any bias from time-constant individual-specific effects. The drawback of this model is that no time-constant effects can be estimated. Even though `SI` and `I` are such variables, their peer effects can nonetheless be estimated because the networks are probabilistic and change randomly across waves. This is an important insight: if networks change across waves, peer effects of time-constant variables can still be estimated:

```{r, warning=FALSE}
plm(
  achievement ~ alter_SI + alter_I,
  model="within",
  index = c("uid", "wave"),
  data=dat2.1
) # B: Panel fixed effect model, all exogenous peer effects included

plm(
  achievement ~ alter_SI,
  model="within",
  index = c("uid", "wave"),
  data=dat2.1
) # B: Panel fixed effect model, alter_I omitted
```

We observe in (B) that the exogenous peer effects are perfectly recovered. The estimated effects are correct regardless of whether all exogenous peer effects are included.

The same result is achieved with the spatial panel model in (C).
The only difference from `plm()` is that `spml()` also models a peer effect on the disturbances.

```{r, eval=FALSE}
spml(
  achievement ~ alter_SI + alter_I,
  lag=FALSE,           # no endogenous peer effect
  spatial.error="b",   # peer effect on disturbances
  model="within",      # change within observations is considered (i.e. individual FE)
  effect="individual", # individual FE (vs time FE)
  index = c("uid", "wave"),
  listw=mat2listw(w2.1, style="W"), # is assumed to be constant across time
  data=dat2.1
) # C: Spatial panel error model (not run)
```

---

### 2.2. Exogenous peer effects and endogenous networks

:::dgp
In `dat2.2`, networks form endogenously. Note that the selection process is time-constant:

$Pr(ij) = 1 / (1 + exp(-( 1 * similaritySI_{ij} )))$

The DGP for the outcome is the same as before:

$achievement_{it} = 1SI_i + 1I_i + 1\sum_{j}w_{ijt}SI_{jt} + 1\sum_{j}w_{ijt}I_{jt}$
:::

```{r, warning=FALSE}
lm(
  achievement ~ alter_SI + alter_I,
  data=dat2.2
) # A: Pooled model
```

Endogenous network formation amplifies the bias of the pooled model in (A).

```{r, warning=FALSE}
plm(
  achievement ~ alter_SI + alter_I,
  model="within",
  index = c("uid", "wave"),
  data=dat2.2
) # B: Fixed-effect panel model
```

We observe in (B) that the panel model is not affected by a time-constant selection process. This is another important insight: if we can credibly argue that the selection process is time-constant, then the panel fixed-effect model will remove confounding not only from variables that influence the outcome but also from the selection process!

```{r, warning=FALSE}
plm(
  achievement ~ alter_SI,
  model="within",
  index = c("uid", "wave"),
  data=dat2.2 # (! network formation is endogenous)
) # C: Fixed-effect panel model in which alter_I is omitted
```

We observe in (C), however, that if relevant exogenous peer effects are omitted, the remaining exogenous peer effects will be biased.
```{r, warning=FALSE}
plm(
  achievement ~ alter_SI,
  model="within",
  index = c("uid", "wave"),
  data=dat2.1 # (! network formation is exogenous)
) # D: Fixed-effect panel model in which alter_I is omitted but the network formation process is exogenous
```

As can be observed in (D), this is not the case if network formation is exogenous.

---

### 2.3. Endogenous peer effect and random networks

:::dgp
The DGP in `dat2.3` is the following:

$Pr(ij) = 0.5$

$achievement_{it} = 1SI_i + 1I_i + 0.5\sum_{j}w_{ijt}achievement_{jt}$
:::

```{r, warning=FALSE}
lm(
  achievement ~ alter_achievement + SI + I,
  data=dat2.3
) # A: Pooled model with full DGP

lm(
  achievement ~ alter_achievement,
  data=dat2.3
) # B: Pooled model with time-constant individual effects omitted

plm(
  achievement ~ alter_achievement,
  model="within",
  index = c("uid", "wave"),
  data=dat2.3
) # C: Panel fixed-effect model
```

We observe in (A), (B), and (C) that, just like with exogenous peer effects, the pooled model only recovers the correct effect estimate if the full DGP is captured. As soon as time-constant unobservables are present, the estimates are incorrect. The panel fixed-effect model, however, estimates the correct endogenous peer effect if networks are random.

---

### 2.4. Endogenous peer effect and endogenous networks

:::dgp
The DGP in `dat2.4` is the same as before with the exception that networks form endogenously in a time-constant selection process:

$Pr(ij) = 1 / (1 + exp(-( 1 * similaritySI_{ij} )))$

$achievement_{it} = 1SI_i + 1I_i + 0.5\sum_{j}w_{ijt}achievement_{jt}$
:::

```{r, warning=FALSE}
lm(
  achievement ~ alter_achievement + SI + I,
  data=dat2.4
) # A: Pooled model with full DGP

lm(
  achievement ~ alter_achievement,
  data=dat2.4
) # B: Pooled model with time-constant individual effects omitted

plm(
  achievement ~ alter_achievement,
  model="within",
  index = c("uid", "wave"),
  data=dat2.4
) # C: Panel fixed-effect model
```

We observe in (A), (B), and (C) the same pattern as before. The pooled model only recovers the correct results if the DGP is fully captured. The panel fixed-effect model, however, estimates the correct endogenous peer effect even if the networks formed endogenously, as long as the selection process is time-constant.

```{r, echo=FALSE, eval=FALSE}
spml(
  achievement ~ alter_SI + alter_I,
  lag=TRUE,
  spatial.error="none",
  model="within",
  effect="individual",
  index = c("uid", "wave"),
  listw=mat2listw(w2.4, style="W"),
  data=dat2.4
) # D: Spatial panel fixed-effect lag model

# Note that spml() calls the spatial autoregressive parameter lambda and the error correlation rho.
# spml() does not recover the correct results in 2.3 and 2.4 - WHY?
```

---

## 3. Dynamic panel models

Not yet covered.

```{r, eval=FALSE}
remotes::install_github("RozetaSimonovska/SDPDmod")
```

---

## 4. Conclusions

Let us draw some conclusions from this exercise:

#### When are individual effect (ego features) estimates biased?

- The effect of individual features ($\beta$) will not be biased in the presence of peer effects if networks are random
- Endogenous network formation does not bias the effect of individual features ($\beta$) in the presence of an endogenous peer effect
- However, $\beta$ will be biased if the individual feature is part of the selection process and influences peers (exogenous peer effect) and that peer effect is omitted from the model
- These results hold for cross-sectional and panel data using an RE estimator. The FE estimator cannot estimate time-constant individual effects

#### When are peer effect estimates biased?

**In the cross-sectional setting:**

- If networks are random, exogenous peer effects are estimated correctly even if there are omitted variables
- If networks are random, the endogenous peer effect is biased if there are omitted variables
- If networks formed endogenously, both exogenous and endogenous peer effects are biased unless the model captures the full DGP, in particular the variables involved in the selection process

**In the panel setting:**

- If networks evolve over time, we can estimate time-constant exogenous peer effects using an FE estimator
- If networks are random, the FE model recovers the correct exogenous peer effects even if other exogenous peer effects are omitted
- The FE model also recovers correct exogenous peer effects in the presence of time-constant selection effects if all relevant exogenous peer effects are included in the model
- The FE model recovers the correct endogenous peer effect as long as the selection process is time-constant (!)

---

**Note that this simulation ignored important complications:**

- These results only hold for randomly distributed features. Individual features that are themselves affected by peer effects (i.e., $X=WX\Theta$) will be affected more by endogenous network formation.
- We have not examined what changes if both exogenous and endogenous peer effects are present.
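
As a numerical footnote to section 1.3, the direct and indirect effects of a feature under an endogenous peer effect can be computed by hand from the reduced form $y = (I - \rho W)^{-1}X\beta$. The sketch below does this for a simulated random network. The sample size, the seed, and the object names (`n`, `A`, `W`, `S`) are assumptions for illustration, and $\rho$ and $\beta$ are simply set to the true values of the DGP in 1.3 rather than estimated.

```r
set.seed(1)
n    <- 100
rho  <- 0.5   # endogenous peer effect (true value in the DGP of 1.3)
beta <- 1     # effect of an individual feature such as SI

# Random undirected network with Pr(ij) = 0.5, then row-standardized
A <- matrix(0, n, n)
A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, 1, 0.5)
A <- A + t(A)
W <- A / rowSums(A)

# Matrix of partial derivatives dy_i / dx_j implied by y = rho * W y + x * beta
S <- solve(diag(n) - rho * W) * beta

direct   <- mean(diag(S))      # average effect of own x on own y
total    <- mean(rowSums(S))   # average effect of everyone's x on own y
indirect <- total - direct     # spillover through peers

c(direct = direct, indirect = indirect, total = total)
```

With a row-standardized `W`, the total effect equals $\beta/(1-\rho)$ exactly (here 2). In a dense network such as this one, the direct effect is only slightly larger than $\beta$, so the indirect effect is close to $\beta\rho/(1-\rho)$.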