Merge pull request #16 from udel-cbcb/Submodule3_Tutorial2

Submodule3 tutorial2
udel-cbcb · Oct 20, 2024 · 68d9f8c
2 parents 4fd6c1d + 2b35b8a
commit 68d9f8c

Showing 3 changed files with 318 additions and 140 deletions.
diff --git a/.gitignore b/.gitignore
@@ -176,3 +176,4 @@ cython_debug/
 /Submodule_1/Exercises/*.csv
 /Submodule_1/Tutorials/foo.*
 /Submodule_3/Tutorials/*.csv
+/Submodule_3/Tutorials/*.names
diff --git a/Submodule_3/Tutorials/Submodule_3_Tutorial_1_Basic_Data_Cleaning.ipynb b/Submodule_3/Tutorials/Submodule_3_Tutorial_1_Basic_Data_Cleaning.ipynb
@@ -6,6 +6,8 @@
    "source": [
     "# Basic Data Cleaning\n",
     "\n",
+    "Adapted from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/).\n",
+    "\n",
     "## Overview\n",
     "\n",
     "This tutorial covers basic data cleaning techniques using Python and pandas. We'll explore common data quality issues and learn how to address them effectively.\n",
@@ -47,22 +49,46 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Install required packages\n",
-    "%pip install numpy\n",
     "%pip install pandas\n",
-    "%pip install requests\n",
-    "\n",
-    "# Import necessary libraries\n",
+    "%pip install requests"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Import necessary libraries"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
     "from pathlib import Path\n",
     "\n",
-    "import numpy as np\n",
     "import pandas as pd\n",
-    "import requests\n",
+    "import requests"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Define utility functions\n",
     "\n",
+    "Define a helper function for downloading example datasets.  \n",
     "\n",
-    "# Define helper function for downloading example datasets.\n",
-    "# (It is not essential that you understand the following code--it is just for\n",
-    "#  getting the example data.)\n",
+    "*Note!* It is not essential that you understand the following code.  It is just for getting the example data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
     "def download(url, to_file):\n",
     "    \"\"\"Download content from the given URL and save it to a file.\n",
     "\n",
@@ -72,20 +98,8 @@
     "\n",
     "    \"\"\"\n",
     "    response = requests.get(url, timeout=10)\n",
-    "    Path(to_file).write_bytes(response.content)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In this tutorial, you will learn:\n",
-    "\n",
-    "* How to identify and remove column variables that only have a single value.\n",
-    "* How to identify and consider column variables with very few unique values.\n",
-    "* How to identify and remove rows that contain duplicate observations.\n",
-    "\n",
-    "Adpated from Jason Brownlee. 2020. [Data Preparation for Machine Learning](https://machinelearningmastery.com/data-preparation-for-machine-learning/)."
+    "    Path(to_file).write_bytes(response.content)\n",
+    "    print(f\"downloaded file '{to_file}'\")"
    ]
   },
   {