diff --git a/.gitignore b/.gitignore
index 77ecf9f..0711a5a 100644
--- a/.gitignore
+++ b/.gitignore
@@ -176,3 +176,4 @@ cython_debug/
 /Submodule_1/Exercises/*.csv
 /Submodule_1/Tutorials/foo.*
 /Submodule_3/Tutorials/*.csv
+/Submodule_3/Tutorials/*.names
diff --git a/Submodule_3/Tutorials/Submodule_3_Tutorial_1_Basic_Data_Cleaning.ipynb b/Submodule_3/Tutorials/Submodule_3_Tutorial_1_Basic_Data_Cleaning.ipynb
index edc7ca7..6223f73 100644
--- a/Submodule_3/Tutorials/Submodule_3_Tutorial_1_Basic_Data_Cleaning.ipynb
+++ b/Submodule_3/Tutorials/Submodule_3_Tutorial_1_Basic_Data_Cleaning.ipynb
@@ -6,6 +6,8 @@
 "source": [
 "# Basic Data Cleaning\n",
 "\n",
+ "Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).\n",
+ "\n",
 "## Overview\n",
 "\n",
 "This tutorial covers basic data cleaning techniques using Python and pandas. We'll explore common data quality issues and learn how to address them effectively.\n",
@@ -47,22 +49,46 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "# Install required packages\n",
- "%pip install numpy\n",
 "%pip install pandas\n",
- "%pip install requests\n",
- "\n",
- "# Import necessary libraries\n",
+ "%pip install requests"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### Import necessary libraries"
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
 "from pathlib import Path\n",
 "\n",
- "import numpy as np\n",
 "import pandas as pd\n",
- "import requests\n",
+ "import requests"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
 "### Define utility functions\n",
 "\n",
 "Define a helper function for downloading example datasets. \n",
 "\n",
- "# Define helper function for downloading example datasets.\n",
- "# (It is not essential that you understand the following code--it is just for\n",
- "# getting the example data.)\n",
 "*Note!* It is not essential that you understand the following code. It is just for getting the example data."
 ]
 },
 {
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
 "def download(url, to_file):\n",
 " \"\"\"Download content from the given URL and save it to a file.\n",
 "\n",
@@ -72,20 +98,8 @@
 "\n",
 " \"\"\"\n",
 " response = requests.get(url, timeout=10)\n",
- " Path(to_file).write_bytes(response.content)"
 ]
 },
 {
 "cell_type": "markdown",
 "metadata": {},
 "source": [
- "In this tutorial, you will learn:\n",
- "\n",
- "* How to identify and remove column variables that only have a single value.\n",
- "* How to identify and consider column variables with very few unique values.\n",
- "* How to identify and remove rows that contain duplicate observations.\n",
- "\n",
- "Adpated from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/)."
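Side note on the helper being reworked above: as written, it saves whatever bytes the server returns, even for a 404 or 500 response. A minimal hardened variant is sketched below; it is not part of this PR and only adds the standard `requests` status check (`raise_for_status()`) on top of the same logic.

```python
from pathlib import Path

import requests


def download(url, to_file):
    """Download content from the given URL and save it to a file.

    Raises requests.HTTPError when the server answers with a 4xx/5xx
    status, instead of silently writing the error page to disk.
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on HTTP errors
    Path(to_file).write_bytes(response.content)
    print(f"downloaded file '{to_file}'")
```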
+ " Path(to_file).write_bytes(response.content)\n", + " print(f\"downloaded file '{to_file}'\")" ] }, { diff --git a/Submodule_3/Tutorials/Submodule_3_Tutorial_2_Mark_and_Remove_Missing_Data.ipynb b/Submodule_3/Tutorials/Submodule_3_Tutorial_2_Mark_and_Remove_Missing_Data.ipynb index 8a3fa2d..0b9250f 100644 --- a/Submodule_3/Tutorials/Submodule_3_Tutorial_2_Mark_and_Remove_Missing_Data.ipynb +++ b/Submodule_3/Tutorials/Submodule_3_Tutorial_2_Mark_and_Remove_Missing_Data.ipynb @@ -4,32 +4,121 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#**Mark and Remove Missing Data**" + "# Mark and Remove Missing Data\n", + "\n", + "Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).\n", + "\n", + "## Overview\n", + "\n", + "In this tutorial, we will learn how to handle missing data in datasets, specifically focusing on marking and removing missing values. By applying these techniques, you'll be able to prepare high-quality datasets for machine learning, improving model reliability and accuracy.\n", + "\n", + "We'll use the Pima Indians Diabetes dataset as an example to demonstrate these techniques.\n", + "\n", + "## Learning Objectives\n", + "\n", + "- Learn how to identify and mark invalid or corrupt values as missing in a dataset\n", + "- Understand how the presence of marked missing values affects machine learning algorithms\n", + "- Learn how to remove rows with missing data from a dataset\n", + "- Evaluate a learning algorithm on a dataset after removing rows with missing values\n", + "\n", + "## Prerequisites\n", + "\n", + "- Basic understanding of Python programming\n", + "- Familiarity with pandas, numpy, and scikit-learn libraries" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Get Started\n", + "\n", + "To start, we install required packages, import the necessary libraries, and define a helper function to download data using the `requests` library.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Install required packages\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "%pip install numpy\n", + "%pip install pandas\n", + "%pip install requests\n", + "%pip install scikit-learn" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Import necessary libraries" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from pathlib import Path\n", + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "import requests\n", + "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", + "from sklearn.model_selection import KFold, cross_val_score" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "In this tutorial, you will learn:\n", + "### Define utility functions\n", "\n", - "* How to mark invalid or corrupt values as missing in your dataset.\n", - "* How to confirm that the presence of marked missing values causes problems for learning algorithms.\n", - "* How to remove rows with missing data from your dataset and evaluate a learning algorithm on the transformed dataset.\n", + "Define a helper function for downloading example datasets. \n", "\n", - "Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/)." 
+ "*Note!* It is not essential that you understand the following code. It is just for getting the example data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "def download(url, to_file):\n", + " \"\"\"Download content from the given URL and save it to a file.\n", + "\n", + " Args:\n", + " url (str): The URL to download the content from.\n", + " to_file (str): The name of the file to save the downloaded content to.\n", + "\n", + " \"\"\"\n", + " response = requests.get(url, timeout=10)\n", + " Path(to_file).write_bytes(response.content)\n", + " print(f\"downloaded file '{to_file}'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "#Diabetes Dataset\n", + "## Diabetes Dataset\n", + "\n", "The dataset classifies patient as\n", - "either an onset of diabetes within five years or not. \n", + "either an onset of diabetes within five years or not.\n", + "\n", "```\n", "Number of Instances: 768\n", - "Number of Attributes: 8 plus class \n", + "Number of Attributes: 8 plus class\n", "For Each Attribute: (all numeric-valued)\n", " 1. Number of times pregnant\n", " 2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test\n", @@ -47,19 +136,20 @@ " 0 500\n", " 1 268\n", "```\n", + "\n", "You can learn more about the dataset here:\n", "\n", - "* Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))\n", - "* Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))\n", + "- Diabetes Dataset File ([pima-indians-diabetes.csv](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv))\n", + "- Diabetes Dataset Details ([pima-indians-diabetes.names](https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names))\n", "\n", - "The description of Diabetes Dataset can be found [here](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database)." 
+ "The description of Diabetes Dataset can be found [here](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "##Download Diabetes data files" + "### Download Diabetes data files\n" ] }, { @@ -68,9 +158,22 @@ "metadata": {}, "outputs": [], "source": [ - "!pip install wget\n", - "!python -m wget \"https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv\" -o pima-indians-diabetes.csv\n", - "!python -m wget \"https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names\" -o pima-indians-diabetes.names" + "download(\n", + " url=\"https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv\",\n", + " to_file=\"pima-indians-diabetes.csv\",\n", + ")\n", + "\n", + "download(\n", + " url=\"https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names\",\n", + " to_file=\"pima-indians-diabetes.names\",\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load and summarize the dataset\n" ] }, { @@ -79,10 +182,9 @@ "metadata": {}, "outputs": [], "source": [ - "# load and summarize the dataset\n", - "from pandas import read_csv\n", - "# load the dataset\n", - "dataset = read_csv('pima-indians-diabetes.csv', header=None)\n", + "# Load the dataset\n", + "dataset = pd.read_csv(\"pima-indians-diabetes.csv\", header=None)\n", + "\n", "# Peek into the top five rows\n", "dataset.head()" ] @@ -93,7 +195,7 @@ "metadata": {}, "outputs": [], "source": [ - "# summarize the dataset\n", + "# Summarize the dataset\n", "print(dataset.describe())" ] }, @@ -102,21 +204,23 @@ "metadata": {}, "source": [ "We can see that there are columns that have a minimum value of zero (0).\n", - "On some columns, a value of zero does not make sense and indicates an invalid or missing value." + "On some columns, a value of zero does not make sense and indicates an invalid or missing value.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Specifically, the following columns have an invalid zero minimum value:\n", - "2. Plasma glucose concentration\n", - "3. Diastolic blood pressure\n", - "4. Triceps skinfold thickness\n", - "5. 2-Hour serum insulin\n", - "6. Body mass index\n", + "Specifically, the following columns have an invalid zero minimum value: 2. Plasma glucose concentration 3. Diastolic blood pressure 4. Triceps skinfold thickness 5. 2-Hour serum insulin 6. Body mass index\n", "\n", - "We can confirm this by looking at the raw data and printing out the first 20 rows of data." + "We can confirm this by looking at the raw data and printing out the first 20 rows of data.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Load the dataset and review rows\n" ] }, { @@ -125,11 +229,10 @@ "metadata": {}, "outputs": [], "source": [ - "# load the dataset and review rows\n", - "from pandas import read_csv\n", - "# load the dataset\n", - "dataset = read_csv('pima-indians-diabetes.csv', header=None)\n", - "# summarize the first 20 rows of data\n", + "# Load the dataset\n", + "dataset = pd.read_csv(\"pima-indians-diabetes.csv\", header=None)\n", + "\n", + "# Summarize the first 20 rows of data\n", "print(dataset.head(20))" ] }, @@ -137,7 +240,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can get a count of the number of missing values on each of these columns." 
+ "We can get a count of the number of missing values on each of these columns.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Summarizing the number of missing values for each variable\n" ] }, { @@ -146,13 +256,13 @@ "metadata": {}, "outputs": [], "source": [ - "# Summarizing the number of missing values for each variable\n", - "from pandas import read_csv\n", - "# load the dataset\n", - "dataset = read_csv('pima-indians-diabetes.csv', header=None)\n", - "# count the number of missing values for each column\n", - "num_missing = (dataset[[1,2,3,4,5]] == 0).sum()\n", - "# report the results\n", + "# Load the dataset\n", + "dataset = pd.read_csv(\"pima-indians-diabetes.csv\", header=None)\n", + "\n", + "# Count the number of missing values for each column\n", + "num_missing = (dataset[[1, 2, 3, 4, 5]] == 0).sum()\n", + "\n", + "# Report the results\n", "print(num_missing)" ] }, @@ -163,7 +273,7 @@ "We can see that columns 1, 2 and 5 have just a few zero values, whereas columns 3 and 4\n", "show a lot more, nearly half of the rows. This highlights that different missing value strategies\n", "may be needed for different columns, e.g. to ensure that there are still a sufficient number of\n", - "records left to train a predictive model." + "records left to train a predictive model.\n" ] }, { @@ -175,7 +285,14 @@ "as NaN easily with the Pandas DataFrame by using the replace() function on a subset of\n", "the columns we are interested in. After we have marked the missing values, we can use the\n", "isnull() function to mark all of the NaN values in the dataset as True and get a count of the\n", - "missing values for each column." + "missing values for each column.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Marking missing values with nan values\n" ] }, { @@ -184,14 +301,13 @@ "metadata": {}, "outputs": [], "source": [ - "# Marking missing values with nan values\n", - "from numpy import nan\n", - "from pandas import read_csv\n", - "# load the dataset\n", - "dataset = read_csv('pima-indians-diabetes.csv', header=None)\n", - "# replace '0' values with 'nan'\n", - "dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)\n", - "# count the number of nan values in each column\n", + "# Load the dataset\n", + "dataset = pd.read_csv(\"pima-indians-diabetes.csv\", header=None)\n", + "\n", + "# Replace '0' values with 'nan'\n", + "dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)\n", + "\n", + "# Count the number of nan values in each column\n", "print(dataset.isnull().sum())" ] }, @@ -199,7 +315,14 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can confirm by printing out the first 20 rows of data." 
+ "We can confirm by printing out the first 20 rows of data.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Review data with missing values marked with a nan\n" ] }, { @@ -208,14 +331,13 @@ "metadata": {}, "outputs": [], "source": [ - "# Review data with missing values marked with a nan\n", - "from numpy import nan\n", - "from pandas import read_csv\n", - "# load the dataset\n", - "dataset = read_csv('pima-indians-diabetes.csv', header=None)\n", - "# replace '0' values with 'nan'\n", - "dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)\n", - "# summarize the first 20 rows of data\n", + "# Load the dataset\n", + "dataset = pd.read_csv(\"pima-indians-diabetes.csv\", header=None)\n", + "\n", + "# Replace '0' values with 'nan'\n", + "dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)\n", + "\n", + "# Summarize the first 20 rows of data\n", "print(dataset.head(20))" ] }, @@ -223,8 +345,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "#Missing Values Cause Problems\n", - "Having missing values in a dataset can cause errors with some machine learning algorithms." + "## Missing Values Cause Problems\n", + "\n", + "Having missing values in a dataset can cause errors with some machine learning algorithms.\n", + "\n", + "*Note!* You should see a message about an error occurring when you try to run the following code block." ] }, { @@ -233,44 +358,53 @@ "metadata": {}, "outputs": [], "source": [ - "# example where missing values cause errors\n", - "from numpy import nan\n", - "from pandas import read_csv\n", - "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", - "from sklearn.model_selection import KFold\n", - "from sklearn.model_selection import cross_val_score\n", - "# load the dataset\n", - "dataset = read_csv('pima-indians-diabetes.csv', header=None)\n", - "# replace '0' values with 'nan'\n", - "dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)\n", - "# split dataset into inputs and outputs\n", + "# Load the dataset\n", + "dataset = pd.read_csv(\"pima-indians-diabetes.csv\", header=None)\n", + "\n", + "# Replace '0' values with 'nan'\n", + "dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)\n", + "\n", + "# Split dataset into inputs and outputs\n", "values = dataset.values\n", - "X = values[:,0:8]\n", - "y = values[:,8]\n", - "# define the model\n", + "X = values[:, 0:8]\n", + "y = values[:, 8]\n", + "\n", + "# Define the model\n", + "#\n", "# A classifier with a linear decision boundary, generated by fitting class\n", "# conditional densities to the data and using Bayes' rule.\n", "model = LinearDiscriminantAnalysis()\n", - "# define the model evaluation procedure using K fold cross-valiation\n", + "\n", + "# Define the model evaluation procedure using K fold cross-validation\n", "cv = KFold(n_splits=3, shuffle=True, random_state=1)\n", - "# evaluate the model accuracy score\n", - "result = cross_val_score(model, X, y, cv=cv, scoring='accuracy')\n", - "# report the mean performance\n", - "print('Accuracy: %.3f' % result.mean())" + "\n", + "# Evaluate the model accuracy score, and report the mean performance if it succeeds.\n", + "try:\n", + " result = cross_val_score(model, X, y, cv=cv, scoring=\"accuracy\")\n", + " print(\"Accuracy: %.3f\" % result.mean())\n", + "except ValueError as e:\n", + " print(f\"********************* An error occurred *********************\\n{e}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "#Remove Rows With Missing 
Values\n", + "### Remove Rows With Missing Values\n", "\n", "The simplest approach for dealing with missing values is to remove entire predictor(s)\n", "and/or sample(s) that contain missing values.\n", "\n", "We can do this by creating a new Pandas DataFrame with the rows containing missing values\n", - "removed. Pandas provides the **dropna**() function that can be used to drop either columns or\n", - "rows with missing data. We can use **dropna**() to remove all rows with missing data," + "removed. Pandas provides the `dropna()` function that can be used to drop either columns or\n", + "rows with missing data. We can use `dropna()` to remove all rows with missing data.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Example of removing rows that contain missing values\n" ] }, { @@ -279,18 +413,19 @@ "metadata": {}, "outputs": [], "source": [ - "# example of removing rows that contain missing values\n", - "from numpy import nan\n", - "from pandas import read_csv\n", - "# load the dataset\n", - "dataset = read_csv('pima-indians-diabetes.csv', header=None)\n", - "# summarize the shape of the raw data\n", + "# Load the dataset\n", + "dataset = pd.read_csv(\"pima-indians-diabetes.csv\", header=None)\n", + "\n", + "# Summarize the shape of the raw data\n", "print(dataset.shape)\n", - "# replace '0' values with 'nan'\n", - "dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)\n", - "# drop rows with missing values\n", + "\n", + "# Replace '0' values with 'nan'\n", + "dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)\n", + "\n", + "# Drop rows with missing values\n", "dataset.dropna(inplace=True)\n", - "# summarize the shape of the data with missing rows removed\n", + "\n", + "# Summarize the shape of the data with missing rows removed\n", "print(dataset.shape)" ] }, @@ -299,7 +434,14 @@ "metadata": {}, "source": [ "We now have a dataset that we could use to evaluate an algorithm sensitive to missing values\n", - "like LDA." 
+ "like LDA.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Evaluate model on data after rows with missing data are removed\n" ] }, { @@ -308,30 +450,51 @@ "metadata": {}, "outputs": [], "source": [ - "# evaluate model on data after rows with missing data are removed\n", - "from numpy import nan\n", - "from pandas import read_csv\n", - "from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n", - "from sklearn.model_selection import KFold\n", - "from sklearn.model_selection import cross_val_score\n", - "# load the dataset\n", - "dataset = read_csv('pima-indians-diabetes.csv', header=None)\n", - "# replace '0' values with 'nan'\n", - "dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)\n", - "# drop rows with missing values\n", + "# Load the dataset\n", + "dataset = pd.read_csv(\"pima-indians-diabetes.csv\", header=None)\n", + "\n", + "# Replace '0' values with 'nan'\n", + "dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, np.nan)\n", + "\n", + "# Drop rows with missing values\n", "dataset.dropna(inplace=True)\n", - "# split dataset into inputs and outputs\n", + "\n", + "# Split dataset into inputs and outputs\n", "values = dataset.values\n", - "X = values[:,0:8]\n", - "y = values[:,8]\n", - "# define the model\n", + "X = values[:, 0:8]\n", + "y = values[:, 8]\n", + "\n", + "# Define the model\n", "model = LinearDiscriminantAnalysis()\n", - "# define the model evaluation procedure\n", + "\n", + "# Define the model evaluation procedure\n", "cv = KFold(n_splits=3, shuffle=True, random_state=1)\n", - "# evaluate the model accuracy score\n", - "result = cross_val_score(model, X, y, cv=cv, scoring='accuracy')\n", - "# report the mean performance\n", - "print('Accuracy: %.3f' % result.mean())" + "\n", + "# Evaluate the model accuracy score\n", + "result = cross_val_score(model, X, y, cv=cv, scoring=\"accuracy\")\n", + "\n", + "# Report the mean performance\n", + "print(\"Accuracy: %.3f\" % result.mean())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this tutorial, we learned how to:\n", + "\n", + "- Identify and mark missing values in a dataset\n", + "- Understand the impact of missing values on machine learning algorithms\n", + "- Remove rows with missing data\n", + "- Evaluate a machine learning model on a cleaned dataset\n", + "\n", + "These skills are crucial for preparing real-world datasets for analysis and machine learning tasks.\n", + "\n", + "## Clean up\n", + "\n", + "Remember to shut down your Jupyter notebook and delete any unnecessary resources when you're finished with this tutorial." ] } ],