Analysis of Combined Sewer Overflow Dataset for FSE21 Paper Submission

nicholas-maltbie/CSO-Analysis

CSO Case Study

This repository is provided as supplementary material for the paper "XAI Tools in the Public Sector: A Case Study on Predicting Combined Sewer Overflows" by Nicholas Maltbie, Nan Niu, Reese Johnson, and Matthew VanDoren.

These notes describe the CSO case study: how the data is prepared, how the ML models are tuned and created, and how the final interpretability analysis is performed.

This repository contains instructions on how to use the code required to create models for the dataset, how to apply these models to a sample dataset, and how to gather explainability results for our research.

The data in this repository is randomized because the data used for the research is proprietary to our stakeholder.

This project is designed to run on a Linux system with an available NVIDIA GPU and a minimum of 4GB of RAM. It also needs disk space for libraries, datasets, and results. These are fairly small, and 8GB of disk storage should be enough for all libraries and installations.

Contents

Here is a description of each item in this project.

Documentation

Scripts

  • run_lstm_hparam.py - A Python script that generates an LSTM model for a given set of hyper parameters.
  • hparams_search.sh - A script to automate searching through hyper parameters.

IPython Notebooks

These are Jupyter notebook files that document how to run the project and include example visualizations and information.

  • Data_Preparation.ipynb - Prepare the data from the original sensors into a synchronized and interpolated form
  • Interpretability.ipynb - Apply interpretability tools to the various models.
  • Paper Charts.ipynb - Notebook with code to generate various charts used in the paper

Folders

  • Datasets - This is a representation of the data used in the project. We are not able to release the proprietary data we used from our stakeholder as part of the case study, but this is randomized data to help show how this code operates and how to use this in future projects.
  • Datasets-Synchronized - This is the synchronized and interpolated dataset generated from the sensor output
  • Dataset-Analysis - This folder holds results of tuning models (or a sample of model tuning) from the Datasets-Synchronized data. This folder is generated by the run_lstm_hparam.py script.

Setup Instructions

This project requires Python 3.8 (installation guide) and Anaconda (installation guide).

Steps for installing and setting up the project can be found in the INSTALL file.

Project Organization

The project contains copies of all the files generated using the randomized data as part of the project. The files are all derived from the csv files in the Datasets folder.

The project uses assets in the following order

  1. Data_Preparation.ipynb to prepare and clean the data
  2. hparams_search.sh to find a set of tuned hyper parameters
  3. run_lstm_hparam.py to create final LSTM based models
  4. Interpretability.ipynb to complete an analysis using XAI tools
  5. Paper Charts.ipynb to create the charts based on results and analysis
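
Before starting, it can help to confirm that every asset in the list above is present. The following is a minimal sketch, not part of the repository: it assumes all five files sit in the repository root (the layout is not stated here), and it uses the notebook name as spelled in the Contents section. The names `PIPELINE` and `missing_assets` are hypothetical.

```python
from pathlib import Path

# Assets in the order they are used, as listed above.
PIPELINE = [
    "Data_Preparation.ipynb",
    "hparams_search.sh",
    "run_lstm_hparam.py",
    "Interpretability.ipynb",
    "Paper Charts.ipynb",
]


def missing_assets(root="."):
    """Return the pipeline files that are not found under root (assumed layout)."""
    root = Path(root)
    return [name for name in PIPELINE if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_assets()
    for step, name in enumerate(PIPELINE, start=1):
        status = "missing" if name in missing else "ok"
        print(f"{step}. {name}: {status}")
```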

A more detailed description of how to use these tools is written next.

Project Usage Explanation

To run this project, first follow the setup instructions to set up an environment. Then use the Data_Preparation.ipynb notebook to read the raw sensor data from the Datasets folder and write synchronized and interpolated datasets to the Datasets-Synchronized folder.
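
To illustrate what "synchronized and interpolated" means, here is a minimal sketch that places readings from several sensors onto one shared time grid using linear interpolation. It is an assumption-laden illustration, not the notebook's actual method: the sensor names, the integer timestamps, and the functions `interpolate_to_grid` and `synchronize` are all hypothetical, and the notebook may use a different interpolation scheme.

```python
from bisect import bisect_left


def interpolate_to_grid(samples, grid):
    """Linearly interpolate sorted (timestamp, value) samples onto grid.

    Grid points outside the sampled range become None (no extrapolation).
    """
    times = [t for t, _ in samples]
    values = [v for _, v in samples]
    result = []
    for g in grid:
        if not times or g < times[0] or g > times[-1]:
            result.append(None)
            continue
        i = bisect_left(times, g)
        if times[i] == g:
            # Exact hit on a recorded timestamp.
            result.append(values[i])
            continue
        # g lies strictly between times[i - 1] and times[i].
        t0, t1 = times[i - 1], times[i]
        v0, v1 = values[i - 1], values[i]
        result.append(v0 + (v1 - v0) * (g - t0) / (t1 - t0))
    return result


def synchronize(sensors, start, stop, step):
    """Resample every sensor onto the same regular grid (inclusive of stop)."""
    grid = list(range(start, stop + 1, step))
    return grid, {name: interpolate_to_grid(s, grid) for name, s in sensors.items()}


if __name__ == "__main__":
    # Hypothetical readings: (seconds since start, value).
    sensors = {
        "level": [(0, 1.0), (10, 3.0)],
        "rain": [(0, 0.0), (4, 2.0), (10, 2.0)],
    }
    grid, synced = synchronize(sensors, start=0, stop=10, step=5)
    print(grid)
    print(synced)
```

After this step every sensor has one value per grid timestamp, so the rows can be aligned into a single table.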

Next, hparams_search.sh can be used to search over various hyper parameters. It uses run_lstm_hparam.py to generate an LSTM-based model. The final set of hyper parameters we used (2 layers with 24 nodes) can be reproduced with this command:

python run_lstm_hparam.py \
    --end_offset 1 --start_offset 0 --seq_len 12 \
    --num_units 24 --dropout 0 --num_layers 2 \
    --class_weight 2 --learning_rate 0.001 \
    --batch_size=1024
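
The same command can also be assembled programmatically, which is one way to sweep several settings. The sketch below is not the content of hparams_search.sh. The flag names and base values come from the command above, while the search grid in `GRID` and the helper names `build_command`, `grid_commands`, and `run_all` are hypothetical. By default it only prints the commands.

```python
import itertools
import subprocess
import sys

# Flags and values taken from the command above.
BASE_ARGS = {
    "end_offset": 1,
    "start_offset": 0,
    "seq_len": 12,
    "num_units": 24,
    "dropout": 0,
    "num_layers": 2,
    "class_weight": 2,
    "learning_rate": 0.001,
    "batch_size": 1024,
}

# Hypothetical search grid, for illustration only.
GRID = {"num_units": [12, 24, 48], "num_layers": [1, 2]}


def build_command(overrides):
    """Build the argument list for one run_lstm_hparam.py invocation."""
    args = {**BASE_ARGS, **overrides}
    cmd = [sys.executable, "run_lstm_hparam.py"]
    for name, value in args.items():
        cmd += [f"--{name}", str(value)]
    return cmd


def grid_commands(grid):
    """Yield one command per combination of the grid values."""
    names = list(grid)
    for combo in itertools.product(*(grid[n] for n in names)):
        yield build_command(dict(zip(names, combo)))


def run_all(grid, dry_run=True):
    """Print each command; train only when dry_run is False."""
    for cmd in grid_commands(grid):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_all(GRID, dry_run=True)
```

Calling `run_all(GRID, dry_run=False)` from the repository root would launch the runs one after another and stop at the first failure.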

To visualize the results of the training, you can use TensorBoard:

tensorboard --logdir Dataset-Analysis/lstm_hparams/logs

Once the model has been generated, results for each run can be found either in TensorBoard under the hparams menu or by looking up the generated results in Dataset-Analysis/lstm_hparams/logs/complete/{model_name}. This folder holds the results for both the validation subset and the training subset of the data.

Now that we have the functional results of the model, we can move on to generating the interpretability analysis of the model. To do this, use the Interpretability.ipynb notebook. This notebook has to be run using jupyter notebook and NOT jupyter lab due to limitations of the tools.

The final notebook Paper Charts.ipynb has code to generate various charts used in the paper.
