Code and Data for Examining the Algorithmic Fairness in Predicting High School Dropouts (accepted by EDM 2024)

Chenguang Pan and Zhou Zhang
Teachers College, Columbia University

1. Instruction

We have open-sourced the data and code for our paper, Examining the Algorithmic Fairness in Predicting High School Dropouts, accepted at Educational Data Mining 2024 (EDM 2024). The project is based on the public version of a nationally representative dataset, the High School Longitudinal Study of 2009 (HSLS:09). You can find the detailed information here.

The raw data can be accessed on NCES's DataLab. Click the download button on the right side of that webpage and choose the R version (i.e., the .rdata format). The cleaned data are in the 01_data/02_processed_data folder.

The codebook and official documents are in the 04_files folder.

We welcome comments, questions, and bug reports on this study. The study was accepted as a short (less mature) paper rather than a full paper. We are currently working on an extension in which we plan to improve the predictive performance, examine algorithmic fairness in high school dropout prediction more comprehensively, and investigate ways to reduce predictive bias.

GitHub Copilot, an AI coding assistant, was used to assist in developing these scripts.

2. Running the code

2.1 Our coding environment

System: macOS Sonoma 14.5
CPU & GPU: Apple silicon M1 Pro
Unified RAM: 32 GB
R version: 4.3.0 (2023-04-21) -- "Already Tomorrow"
Python version: 3.8.17
PyTorch version: 2.0.1
scikit-learn version: 1.2.2
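
To compare a local Python setup against the versions listed above, a small check along the following lines may help. This is only a sketch and not part of the repository: the function name `check_environment` is hypothetical, and it assumes the packages were installed under their usual PyPI distribution names (`torch` and `scikit-learn`).

```python
import sys
from importlib import metadata

# Versions listed in this README. The keys are assumed PyPI distribution names.
EXPECTED = {"torch": "2.0.1", "scikit-learn": "1.2.2"}
EXPECTED_PYTHON = (3, 8)


def check_environment(expected=EXPECTED, expected_python=EXPECTED_PYTHON):
    """Return a list of mismatch messages; an empty list means everything matches."""
    problems = []
    if tuple(sys.version_info[:2]) != tuple(expected_python):
        found_py = ".".join(str(part) for part in sys.version_info[:2])
        wanted_py = ".".join(str(part) for part in expected_python)
        problems.append(f"Python: found {found_py}, README lists {wanted_py}")
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed (README lists {wanted})")
            continue
        # Drop local build tags such as "+cpu" before comparing.
        if found.split("+")[0] != wanted:
            problems.append(f"{name}: found {found}, README lists {wanted}")
    return problems


if __name__ == "__main__":
    for message in check_environment():
        print(message)
```

A mismatch does not necessarily mean the code will fail; the list only shows where a local setup differs from the one we used.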

For Mac users: We noticed that newer versions of Python can cause a conflict between scikit-learn and PyTorch, which sometimes crashes Jupyter Notebook. We therefore switched to Python 3.8.17, and the code ran well. Also note that, with PyTorch 2.0.1, running the neural network model (in 02_NN_models.ipynb) on the Apple CPU is faster than running it on the mps (Metal Performance Shaders) backend.
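
Given the observation above, one way to make the device choice explicit is to default to the CPU and use mps only on request. The helper below is a hypothetical sketch, not code taken from 02_NN_models.ipynb; `pick_device` and its `prefer_mps` parameter are names introduced here for illustration.

```python
def pick_device(prefer_mps: bool = False) -> str:
    """Return "mps" only when it is requested and available; otherwise "cpu"."""
    if not prefer_mps:
        return "cpu"
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    mps_backend = getattr(torch.backends, "mps", None)
    if mps_backend is not None and mps_backend.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    # Default to the CPU, which was the faster option in our environment.
    print(pick_device())
```

The returned string can be passed to `torch.device(...)` and then to `model.to(...)`. Whether the CPU or mps is faster may differ across PyTorch versions and hardware, so it is worth timing both.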

For Windows users: all of the code should run well on the latest versions of R and Python with the required packages.

2.2 Running the scripts

In the 02_codings folder:

01_Data_cleaning.r shows in detail how we cleaned the raw dataset. The cleaned data are already stored in the 01_data folder, so you can skip this file if you are not interested.
02_NN_models.ipynb builds the neural network models.
03_ML_models.ipynb builds all the other machine learning models: logistic regression, random forest, XGBoost, and support vector machine.
functions.py contains all the functions used in the scripts above. You can check the details of the DAF functions if you want to apply them to another project.
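
As a generic illustration of the kind of subgroup comparison a fairness examination involves, the sketch below computes per-group false negative and false positive rates for binary predictions. It is not the DAF implementation in functions.py; the function name `group_error_rates` and the toy labels are assumptions made up for this example.

```python
from collections import defaultdict


def group_error_rates(y_true, y_pred, groups):
    """Return per-group false negative rate (fnr) and false positive rate (fpr)."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth == 1:
            key = "tp" if pred == 1 else "fn"
        else:
            key = "fp" if pred == 1 else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        rates[group] = {
            # None marks a rate that is undefined for this group.
            "fnr": c["fn"] / positives if positives else None,
            "fpr": c["fp"] / negatives if negatives else None,
        }
    return rates


# Toy data only: 1 = dropout, 0 = non-dropout; "A" and "B" are made-up subgroups.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
rates = group_error_rates(y_true, y_pred, groups)
print(rates)
```

Large gaps between groups in either rate are one common signal of predictive bias. See functions.py for the method actually used in the paper.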

3. Citation

@inproceedings{pan2024examining,
  title={Examining the Algorithmic Fairness in Predicting High School Dropouts},
  author={Pan, Chenguang and Zhang, Zhou},
  booktitle={Proceedings of the 17th International Conference on Educational Data Mining},
  pages={262--269},
  year={2024}
}

4. Contact

Please email us for feedback, questions, comments, and bug reports. Thank you very much!
Chenguang Pan and Zhou Zhang
Email: [email protected]

