- Overview
- R files
2.1 Preparation
2.2 Running the R files
2.2.1 functions.R
2.2.2 simulation_design_1.R and simulation_design_2.R
2.2.3 simulation_results_analysis.R and simulation_results_visualizations.R - Citation
- Updates
Title: Fair and Robust Estimation of Heterogeneous Treatment Effects for Optimal Policies in Multilevel Studies
Authors: Youmi Suk, Chan Park, Chenguang Pan, Kwangho Kim
Recently, there have been growing efforts in developing fair algorithms for treatment effect estimation and optimal treatment recommendations to mitigate discriminatory biases against disadvantaged groups. While most of this work has primarily focused on addressing discrimination due to individual-level attributes (e.g., race/ethnicity), it overlooks the broader impact of societal structures and cultural norms (e.g., structural racism) beyond the individual level. In this paper, we formalize the concept of multilevel fairness for estimating heterogeneous treatment effects to improve fairness in optimal policies. Specifically, we propose a general framework for the estimation of conditional average treatment effects under multilevel fairness constraints that incorporate individual-level sensitive variables, cluster-level sensitive variables, and their combinations. Using this framework, we analyze the trade-off between fairness and the maximum achievable utility by the optimal policy. We evaluate the effectiveness of our framework through a simulation study and a real data study on math course-taking plans using data from the High School Longitudinal Study of 2009.
For more details of our proposed methods, see our paper.
Before running the R files, please be sure all required R packages are installed successfully.
install.packages("Rmosek")
install.packages("MASS")
install.packages("matrixcalc")
install.packages("mbend")
install.packages("Matrix")
install.packages("ggplot2")
install.packages("lme4")
install.packages("bartCause")
install.packages("grf")
install.packages("parallel")
install.packages("dplyr")
install.packages("fastDummies")
Please note that installing Rmosek
package is a little bit different from the traditional way. Please check this link for instructions and this link for requesting a free license.
This file contains four functions for GLMM model fitting, estimating fair CATE on multilevel data, and simulated data generating:
GLMM_model()
Description
to fit the outcome model and treatment model using GLMM method.
Usage
GLMM_model <- function(data,
outcome,
treatment,
cluster,
multicategorical=NULL,
n_AGQ = 1,
fixed_intercept = TRUE,
glmer_Control=NULL)
Arguments
data
: a dataframe. You do not need to convert the multicategorical covariates into dummy (one-hot) codings before model fitting. We recommend users to normalize all the continuous covariates in advance for faster convergence in GLMM model fitting.outcome
: a string to indicate the name of the outcome variable.treatment
: a string to indicate the name of the treatment variable. Treatment must be a binary variable for the current version.cluster
: a string to indicate the name of the cluster id.multicategroical
: a string array to specify all the multicategorical variables. UseNULL
by default to indicate that there is no multicategorical variable in your dataframe. If specified, this function will automatically convert these multicategorical variables into one-hot codings. The multicategorical variable(s) must be in factor format.nAGQ
: integer scalar, same to the argument inglmer()
function from thelme4
packages. Values greater than 1 produce greater accuracy in the evaluation of the log-likelihood at the expense of speed.fixed_intercept
: logical. Whether to include a fixed grand intercept in the outcome model. UsingTRUE
by default. This may depend on your research setting.gmler_Control
: a list of correct class, resulting fromglmerControl()
containing control parameters. Same as the argumentcontrol
from thelme4
package. It is used to solve the non-convergence issue in fitting the GLMM model.
Example
# customize the optimizer and the maximal estimation iterations
glmer_Control <- glmerControl(optimizer = "bobyqa",
optCtrl = list(maxfun=100000))
# model fitting
glmm_out <- GLMM_model(data = dat0, outcome = "Y",
treatment = "A",
cluster = "id",
n_AGQ = 2,
fixed_intercept = FALSE,
glmer_Control = glmer_Control)
# check the fitted objects
summary(glmm_out$outcome.LMM)
summary(glmm_out$ps.GLMM)
fairCATE_multilevel()
Description:
to estimate the CATE with fairness constraints. It will return:
tau.hat
: the estimated CATEs.unfair
: the unfairness on each condition you specified in the argumentfairness
.cf_outcomes
: the estimated counterfactual outcomes from the fitted outcome model.ps_scores
: the estimated propensity scores from the fitted treatment model.
Usage
fairCATE_multilevel <- function(data,
sensitive,
legitimate=NULL,
fairness,
treatment,
outcome,
cluster,
multicategorical=NULL,
outcome.LMM,
ps.GLMM,
fixed_intercept = TRUE,
delta,
ps.trim="Sturmer.1")
Argument
-
data
: a dataframe. You do not need to convert the multicategorical covariates into dummy (one-hot) codings before model fitting. We recommend users to normalize all the continuous covariates in advance for faster convergence in GLMM model fitting. -
sensitive
: a string array to indicate the sensitive variable(s). -
legitimate
: a string array to indicate the legitimate variable(s) for conditional statistical disparity. Using NULL by default for statistical disparity. -
fairness
: a formula style string array to indicate the fairness constraints. It should start withtau ~
, where we usetau
to represent the CATE. For example, suppose your sensitive variable isS1
and the legitimate variable isL1
, your fairness constraint (conditional statistical disparity) should be written astau ~ S1|L1
, which means "get the CATE with fairness constraint on S1 after controlling for L1". The spaces between any letters or symbols in thisfairness
specification do not matter. -
treatment
: a string to indicate the name of the treatment variable. Treatment must be a binary variable for the current version. -
outcome
: a string to indicate the name of the outcome variable. -
cluster
: a string to indicate the name of the cluster id. -
multicategroical
: a string array to specify all the multicategorical variables. UseNULL
by default to indicate that there is no multicategorical variable in your dataframe. If specified, this function will automatically convert these multicategorical variables into one-hot codings. The multicategorical variable(s) must be in factor format. -
outcome.LMM
: a fitted outcome model returned from the functionGLMM_model
. -
ps.GLMM
: a fitted treatment model returned from the functionGLMM_model
. -
fixed_intercept
: logical. Whether to include a fixed grand intercept in the outcome model. UsingTRUE
by default. This may depend on your research setting. -
delta
: a numeric array or a list to indicate the (un)fairness tolerance level$\delta$ . Usually, you can use 20 to indicate no fairness constraint (i.e.,$\delta = \infty$ ) and 0.0001 for the most strict fairness constraint (i.e.,$\delta = 0$ ). For example, if you set thefairness
argument to bec("tau~S1", "tau~S2")
and you want to give fairness constraints on both two conditions, yourdelta
should bedelta = c(0.0001, 0.0001)
. -
ps.trim
: a string to choose the trimming method for propensity scores. It should be either "Sturmer.1" (the default) or "Sturmer.2". "Sturmer.1" is the common range method ver.1 by Stürmer et al. Am J Epidemiol 2021;190:1659–1670. "Sturmer.2" is the common range method ver.2 by Stürmer et al. Am J Epidemiol 2010;172:843–854.
Example
# customize the optimizer and the maximal estimation iterations
glmer_Control <- glmerControl(optimizer = "bobyqa",
optCtrl = list(maxfun=100000))
# model fitting
glmm_out <- GLMM_model(data = dat0, outcome = "Y",
treatment = "A",
cluster = "id",
n_AGQ = 2,
fixed_intercept = FALSE,
glmer_Control = glmer_Control)
# estimate CATEs with fairness constraints
ml_fr_out <- fairCATE_multilevel(data = dat0,
sensitive = c("S1","S2"),
legitimate = c("L1"),
fairness = c("tau~S1|L1","tau~S2"),
treatment = "A",
outcome = "Y",
cluster = "id",
multicategorical = NULL,
outcome.LMM = glmm_out$outcome.LMM,
ps.GLMM = glmm_out$ps.GLMM,
fixed_intercept = FALSE,
delta = c(20,20),
ps.trim="Sturmer.1")
# retrieve the estimated CATEs
cates_est <- ml_fr_out$tau_hat
# retrieve the details of unfairness
unfair_detail <- ml_fr_out$unfair
# retrieve the estimated counterfactual outcomes and propensity scores
cf_out <- ml_fr_out$cf_outcomes
ps_out <- ml_fr_out$ps_scores
create_multileveldata_D1()
Description
To generate the multilevel (individual-level and cluster-level) data. If setting the argument clustereffect=TRUE
, it will return a dataframe with:
- indivdiaul-level covariates:
X11
,X12
,X13
, andX14
. - individual-level sensitive variable:
S1
. - cluster-level covariates:
X21
,X22
, andX23
. - cluster-level sensitive variable:
S2
. - cluster ID:
id
. - true CATEs:
tau
. - true optimal treatment regime:
policy
. - propensity scores:
ps
. - observed treatment assignments:
A
. - the observed outcome:
Y
.
Usage
create_multileveldata_D1 <- function(cluster_num = 150,
cluster_size = 25,
R_var = 18,
V_var = 0.0001,
U_var = 0.0001,
clustereffect=FALSE)
Argument
cluster_num
: an integer to define the number of clusters.cluster_size
: an integer to define the cluster size.R_var
: a float to define the variance of residual in the outcome model.V_var
: a float to define the variance of cluster effect in treatment.U_var
: a float to define the variance of cluster effect in outcome.clustereffect
: logical. To determine whether to generate multilevel data.
Example
# Generate 7500 observations nested in 300 clusters, with the conditional ICC of .105
# for the outcome model, and .372 for the treatment model.
df <- create_multileveldata_D1(cluster_num = 300,
cluster_size = 25,
R_var = 0.6653, # var of residual
V_var = 1.95, # var of cluster effect in selection
U_var = 0.0776, # var of cluster effect in outcome
clustereffect=TRUE)
# retrieve the true heterogeneous treatment effect
tau <- df$tau
# retrieve the true ATE
ATE_true <- mean(df$tau)
# retrieve the true optimal regimes
OTRs_true <- df$policy
# retrieve the true Value under OTRs
value_true <- mean(ifelse(df$policy > 0, df$Y1, df$Y0))
create_multileveldata_D2()
Description
Based on the function create_multileveldata_D1
, to generate multilevel data with intersectionality. If setting the argument clustereffect=TRUE
, it will return a dataframe with:
- indivdiaul-level covariates:
X11
,X12
,X13
, andX14
. - individual-level sensitive variable:
S1
. - cluster-level covariates:
X21
,X22
, andX23
. - cluster-level sensitive variable:
S2
. - intersectional sensitive variable:
S_is_0
,S_is_1
,S_is_2
, andS_is_3
for the intersectional groups$(S1=0, S2=0)$ ,$(S1=1, S2=0)$ ,$(S1=0, S2=1)$ , and$(S1=1, S2=1)$ , respectively. - cluster ID:
id
. - true CATEs:
tau
. - true optimal treatment regime:
policy
. - propensity scores:
ps
. - observed treatment assignments:
A
. - the observed outcome:
Y
.
Usage
create_multileveldata_D2 <- function(cluster_num = 150,
cluster_size = 25,
R_var = 18,
V_var = 0.0001,
U_var = 0.0001,
clustereffect=FALSE)
Argument
cluster_num
: an integer to define the number of clusters.cluster_size
: an integer to define the cluster size.R_var
: a float to define the variance of residual in the outcome model.V_var
: a float to define the variance of cluster effect in treatment.U_var
: a float to define the variance of cluster effect in outcome.clustereffect
: logical. To determine whether to generate multilevel data.
Example
# Generate 7500 observations nested in 300 clusters, with the conditional ICC of .105
# for the outcome model, and .372 for the treatment model.
df <- create_multileveldata_D2(cluster_num = 300,
cluster_size = 25,
R_var = 0.6653, # var of residual
V_var = 1.95, # var of cluster effect in selection
U_var = 0.0776, # var of cluster effect in outcome
clustereffect=TRUE)
# retrieve the true heterogeneous treatment effect
tau <- df$tau
# retrieve the true ATE
ATE_true <- mean(df$tau)
# retrieve the true optimal regimes
OTRs_true <- df$policy
# retrieve the true Value under OTRs
value_true <- mean(ifelse(df$policy > 0, df$Y1, df$Y0))
# subset the generated dataframe for model building
dat0 <- df[,c("id","X11","X12","X13","S1","X14","X21","X22",
"X23","S2","A","Y","S_is_1","S_is_2","S_is_3")]
# mutate the dat0 with intersectional sensitive variables
library(dplyr)
dat0_is <- dat0 %>% select(-S1, -S2)
These two R files use parallel computing techniques to accelerate the simulation study process. Running either of them for 500 repetitions will be finished in about 2 hours on our platform, which is as follows:
- OS: Windows 11 Pro
- CPU: i7 14700F with PL1 250W, PL2 250W
- RAM: 96 GB
- R version: 4.4.1 (2024-06-14 ucrt) -- "Race for Your Life"
A large RAM (at least 64 GB) is recommended. R will borrow space from your hard drive if RAM is full, which costs much more running time like 5 hours on 32GB RAM compared to 2 hours on 96GB RAM.
These two scripts run well on Apple's Mac systems.
Please scroll down to the last few lines of code and replace 20
with the repetition number you'd like in this line of code:
results <- parLapply(cl, 1:20, run_iteration)
For details about our simulation design, please refer to section 5 Simulation Study in our paper.
After running the simulation scripts in section 2.2.2, please run these two files for results analysis and visualizations. simulation_results_analysis.R
will return the results similar to Table 1
and Table 2
in our paper. And, simulation_results_visualizations.R
will return the plots similar to Figure 1
, Figure 2
, and Figure S1
in our paper.
@article{Suk2024fair,
title={Fair and Robust Estimation of Heterogeneous Treatment Effects for Optimal Policies in Multilevel Studies},
author={Suk, Youmi and Park, Chan and Pan, Chenguang and Kim, Kwangho},
journal={PsyArXiv preprint},
year={2024},
doi={10.31234/osf.io/xz3jw}
}
- 2024.09: Repo initialized;
- 2024.10: Adjusted the layouts, revised the code's comments, changed the parameters' names to be consistent with our paper;