diff --git a/content/topics/Analyze/causal-inference/panel-data/system-gmm.md b/content/topics/Analyze/causal-inference/panel-data/system-gmm.md
index c66080b2b..8d8916f8c 100644
--- a/content/topics/Analyze/causal-inference/panel-data/system-gmm.md
+++ b/content/topics/Analyze/causal-inference/panel-data/system-gmm.md
@@ -14,9 +14,9 @@ aliases:
 
 ## Overview
 
-[Panel data](\paneldata) tracks observations of individuals over multiple time periods, enabling researchers to uncover dynamic patterns that can't be observed in cross-sectional or time-series data alone. While traditional static panel data models assume that idiosyncratic errors are uncorrelated across time periods, dynamic panel models account for temporal dependencies in the data by including the lagged dependent variable as a regressor, often providing a more accurate representation of economic relationships. For example, current employment and wage levels are likely dependent on their past levels. Or, interest rates are very likely to be influenced by last year's interest rate.
+[Panel data](/topics/analyze/causal-inference/panel-data/paneldata/) tracks observations of individuals over multiple time periods, enabling researchers to uncover dynamic patterns that can't be observed in cross-sectional or time-series data alone. While traditional static panel data models assume that idiosyncratic errors are uncorrelated across time periods, dynamic panel models account for temporal dependencies in the data by including the lagged dependent variable as a regressor, often providing a more accurate representation of economic relationships. For example, current employment and wage levels are likely dependent on their past levels. Or, interest rates are very likely to be influenced by last year's interest rate.
 
-This topic introduces the dynamic panel model and demonstrates how to estimate it, given that the estimation methods for panel data (e.g. [Fixed Effects](\within)) are likely to produce biased results. After introducing the dynamic panel data model and System-GMM estimation, a simple example of estimation in R is provided.
+This topic introduces the dynamic panel model and demonstrates how to estimate it, given that the estimation methods for panel data (e.g. [Fixed Effects](/topics/analyze/causal-inference/panel-data/within-estimator/)) are likely to produce biased results. After introducing the dynamic panel data model and System-GMM estimation, a simple example of estimation in R is provided.
 
 ## Dynamic panel data model
 
@@ -24,7 +24,7 @@ A general form of the dynamic panel data model is expressed as follows:
 
 $Y_{it} = \beta_1 Y_{i,t-1} + \beta_2 x_{it} + u_{it}$
 
-where
+Where:
 
 - $Y_{it}$: Dependent variable for individual $i$ at time $t$
 - $Y_{i,t-1}$: Lagged dependent variable
@@ -50,7 +50,7 @@ As discovered by [Nickell (1981)](https://www.jstor.org/stable/1911408), includi
 
 $E(v_{it} | Y_{i,t-1}) ≠ 0$.
 
-As a result, standard panel data estimators like [Fixed Effects](\within), [Random Effects](\random), and [First-Difference](\firstdifference) become biased and inconsistent. The coefficients of the lagged dependent variable ($\beta_1$) and other coefficients of interest ($\beta_2$) are potentially biased. The size of the bias depends on the length of the time period (T) and the persistence of the correlation, making the bias particularly significant in data with a short T and a large number of individuals (N). Increasing N will not decrease the bias.
+As a result, standard panel data estimators like [Fixed Effects](/topics/analyze/causal-inference/panel-data/within-estimator/), [Random Effects](/topics/analyze/causal-inference/panel-data/random-effects/), and [First-Difference](/topics/analyze/causal-inference/panel-data/first-difference/) become biased and inconsistent. The coefficients of the lagged dependent variable ($\beta_1$) and other coefficients of interest ($\beta_2$) are potentially biased. The size of the bias depends on the length of the time period (T) and the persistence of the correlation, making the bias particularly significant in data with a short T and a large number of individuals (N). Increasing N will not decrease the bias.
 
 Specifically, the Nickell bias leads to:
 
@@ -60,7 +60,7 @@ Since the dependent varaible ($Y_{it}$) is a function of the unobserved individu
 
 - *Underestimation with Fixed Effects (FE)*
 
-The [FE estimator](\within) eliminates $\mu_{i}$ through demeaning (within transformation). This transformation introduces a negative correlation between the transformed lagged dependent variable and the error term, resulting in a downward bias.
+The [FE estimator](/topics/analyze/causal-inference/panel-data/within-estimator/) eliminates $\mu_{i}$ through demeaning (within transformation). This transformation introduces a negative correlation between the transformed lagged dependent variable and the error term, resulting in a downward bias.
 
 {{% tip %}}
 *How does this bias exactly occur?*
@@ -91,6 +91,7 @@ Difference GMM is the original estimator that uses only lagged levels of the end
 {{% /tip %}}
 
 The *two-step system GMM* estimation process consists of:
+
 1. *First-differencing* the variables to eliminate individual fixed effects
 2. *Instrumenting* the differenced equations using lagged levels and differences of the variables.
 
@@ -111,12 +112,13 @@ To ensure instrument validity, the Sargan test can be used. It tests whether all
 
 {{% tip %}}
 For more background on Instrumental Variable Estimation:
-- [Intro to IV Estimation](\iv)
+
+- [Intro to IV Estimation](/topics/analyze/causal-inference/instrumental-variables/iv/)
 - [Bastardoz et al. (2023)](https://www.sciencedirect.com/science/article/abs/pii/S1048984322000765): A comprehensive review of IV Estimation discussing valid instruments.
 {{% /tip %}}
 
 
-## Example in R
+## Example application
 
 To illustrate the estimation of a dynamic panel data model, we use an example adjusted from [Blundell & Bond (1998)](https://www.sciencedirect.com/science/article/pii/S0304407698000098?casa_token=dYWIhT8f8OMAAAAA:ABPXjapGCr7BAZKtJVamMFPhU2yvYbgDcnAd7Usvp6H2QqyxhJftVQQ9i-KXcfAg_qH8BbAs). We are interested in the impact of wages and capital stock on employment rates. As current employment rates are expected to depend on the values of the previous year, a dynamic model is more suitable. The dataset consists of unbalanced panel data of 140 firms in the UK over the years 1976-1984. Specifically, the following model is estimated:
 
@@ -124,6 +126,7 @@ $Emp_{it} = \beta_1 Emp_{i,t-1} + \beta_2 Wage_{it} + \beta_3 Wage_{i,t-1} + \be
 
 
 Where:
+
 - $Emp_{it}$: Log of employment in firm $i$ in year $t$
 - Independent variables include both current and lag values of log wages and log capital
 - $\mu_{i}$: Fixed effects per firm
@@ -135,6 +138,10 @@ Where:
 - The dataset starts in 1976 because no employment data is provided for earlier years. Therefore, the first observation for each firm in the sample is not special, supporting the initial conditions restriction outlined by Blundell and Bond (1998). This restriction assumes that the initial observations of the dependent variable are not correlated with the individual-specific effects, which is crucial for the validity of the System-GMM estimation.
 {{% /tip %}}
 
+### Model estimation in R
+
+We estimate the dynamic model in R using System-GMM with the [`pgmm` function](https://rdrr.io/cran/plm/man/pgmm.html) from the `plm` package.
+
 {{% codeblock %}}
 ```R
 
@@ -163,11 +170,19 @@ dyn_model <- pgmm(log(emp) ~ lag(log(emp), 1) +
 - `transformation = "ld"` applies the first-difference transformation (a System GMM model instead of a Difference GMM).
 - `collapse = TRUE` reduces the number of instruments to avoid overfitting the model.
 
-{{% tip %}} Refer to the [pgmm documentation](https://rdrr.io/cran/plm/man/pgmm.html) for further information on the arguments within this function.
+
+{{% tip %}}
+
+*Stata users*
+
+For those using Stata, the `xtabond2` command is recommended for implementing System-GMM estimations. For guidance on using `xtabond2`, you can refer to [the presentation by Roodman](https://www.stata.com/meeting/5nasug/How2Do_xtabond2.ppt).
+Additionally, the full paper can be accessed [here](https://journals.sagepub.com/doi/epdf/10.1177/1536867X0900900106).
+
 {{% /tip %}}
 
 
-## Interpreting the output
+
+### Interpreting the output
 
 {{% codeblock %}}
 ```R
@@ -201,11 +216,35 @@ indicates that first-order serial correlation is present, as expected.
 
 All the test outcomes confirm the validity of the model. However, dynamic panel data estimators are highly sensitive to the specific model specification and the choice of instruments. Therefore, it is good practice to conduct several robustness checks and experiment with different model specifications, for example varying the lag lengths.
 
+### Reporting instrument count
+
+In System-GMM estimated models, it is crucial to report the number of instruments used. An excessive number of instruments can lead to overfitting, which reduces the validity of the estimation by making the model too complex and potentially unreliable.
+
+To obtain and report the number of instruments from our System-GMM model in R:
+
+
+{{% codeblock %}}
+```R
+# Extract the list of instrument matrices (one matrix per firm)
+W_list <- dyn_model$W
+
+# Every firm's matrix has the same columns, so count the columns of one matrix
+total_instruments <- ncol(W_list[[1]])
+
+# Print the result
+total_instruments
+
+```
+{{% /codeblock %}}
+
+In this model, the number of instruments used is `32`.
+
 {{% summary %}}
 
 Dynamic models account for temporal dependencies by including the lagged dependent variable, often providing more accurate results than static panel models.
 
 A Nickell bias arises from including the lagged dependent variable as an explanatory variable, making standard panel data estimators (FE, RE, FD) inconsistent, especially in analyses with short T and large N.
 
 System-GMM estimation addresses this by instrumenting the endogenous variables with their lagged values.
+
 {{% /summary %}}