diff --git a/projects/project2/327290_327304_327308_327311/README.md b/projects/project2/327290_327304_327308_327311/README.md new file mode 100644 index 0000000..ae3ce37 --- /dev/null +++ b/projects/project2/327290_327304_327308_327311/README.md @@ -0,0 +1,50 @@ +# AutoPrep: Preprocessing perfected, Machine Learning simplified 🚀 + +## 👩‍💻 Authors +[Julia Kruk](https://github.com/krukj), [Paweł Pozorski](https://github.com/Pawlo77), [Kasia Rogalska](https://github.com/Katarzynarogalska) & [Gaspar Sekula](https://github.com/GasparSekula) + +## 📥 Repository +[AutoPrep Github](https://github.com/Pawlo77/AutoPrep) + +## 📚 Documentation +[AutoPrep Documentation](https://pawlo77.github.io/AutoPrep/)Documentation +[AutoPrep Documentation](https://pawlo77.github.io/AutoPrep/) + +## 🐍 PyPI +[AutoPrep Package Webpage](https://pypi.org/project/auto-prep/) + +## 🎯 Objective +The goal of **AutoPrep** is to provide users with a fully-automated machine learning package that handles most tasks for them. By emphasizing the significance of preprocessing in ML tasks, AutoPrep ensures a seamless, user-friendly experience for working with tabular data. + +### Key Features: +- Extensive preprocessing using **3000+ pipelines**. +- Automatic task recognition: regression, binary classification, or multiclass classification. +- Hyperparameter tuning and robust modeling. +- Explainable AI with **Shapley Plots**. +- Detailed LaTeX reporting (~20 pages) covering: + - Dataset overview 📊 + - Exploratory Data Analysis 📈 + - Preprocessing, hyperparameter tuning, and modeling details ⚙️ + - Interpretations of the best model with Shapley Plots 🔍. + +## 🛠 Specifications +- **Input**: Tabular data. +- **User-defined Target**: Specify the target column and let AutoPrep handle the rest. +- **Output**: + - A trained ML model optimized for the chosen metric. + - A professional LaTeX report for analysis and sharing. 
+ +## 📂 Resources +- 📄 **Presentation**: [slides.pdf](./slides.pdf) +- 📖 **Guide and Full Description**: [walkthrough.ipynb](./walkthrough.ipynb) (also available online: [see walkthrough notebook](https://github.com/Pawlo77/AutoPrep/tree/main/examples/walkthrough/walkthrough.ipynb)) +- 📜 **Example Report**: [report.pdf](./report.pdf) + +## 🌟 Why Choose AutoPrep? +- Save time with **automated preprocessing**. +- Gain deep insights with **detailed LaTeX reports**. +- Ensure transparency with **explainable AI** tools. +- Benefit from cutting-edge automation with a focus on usually neglected preprocessing steps. + + +--- +🌟 **AutoPrep: Preprocessing perfected, Machine Learning simplified!** diff --git a/projects/project2/327290_327304_327308_327311/report.pdf b/projects/project2/327290_327304_327308_327311/report.pdf new file mode 100644 index 0000000..e8180ac Binary files /dev/null and b/projects/project2/327290_327304_327308_327311/report.pdf differ diff --git a/projects/project2/327290_327304_327308_327311/slides.pdf b/projects/project2/327290_327304_327308_327311/slides.pdf new file mode 100644 index 0000000..d66b504 Binary files /dev/null and b/projects/project2/327290_327304_327308_327311/slides.pdf differ diff --git a/projects/project2/327290_327304_327308_327311/walkthrough.ipynb b/projects/project2/327290_327304_327308_327311/walkthrough.ipynb new file mode 100644 index 0000000..7d3ecf5 --- /dev/null +++ b/projects/project2/327290_327304_327308_327311/walkthrough.ipynb @@ -0,0 +1,855 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# AutoPrep\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Table of contents \n", + "1. [Autoprep's objective](#autopreps-objective) \n", + "2. [Specifications](#specifications) \n", + "3. [Existing solutions](#existing-solutions) \n", + "4. [Preprocessing](#preprocessing)\n", + "5. [Modelling](#modelling)\n", + "6. [Report](#report)\n", + "7. 
[Example usage](#example-usage)\n", + "8. [Conclusion](#conclusion)\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## About AutoPrep\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Autoprep's objective\n", + "The goal of the Autoprep project is to provide users with a fully automated machine learning package that handles most tasks for them. We aim to emphasize the importance of preprocessing steps in machine learning tasks (hence Autoprep's name). We deliver extensive preprocessing as well as detailed reporting in researchers' beloved LaTeX. Additionally, hyperparameter tuning and modelling steps are definitely *not* neglected. Since we provide an *auto*-ML package, the system identifies the task type itself (regression, binary classification, or multiclass classification). Keeping in mind the AI Act, Autoprep delivers explainable solutions using Shapley Plots.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Who is it for?\n", + "Autoprep is a package with an intuitive interface, designed to minimize the user's effort. With its focus on advanced preprocessing techniques and detailed, easy-to-export reports, the package is aimed at:\n", + "\n", + "* Python developers curious about the best preprocessing methods for their data\n", + "* Users who want to analyze every step of the ML process without executing it manually\n", + "* Developers interested in leveraging automated solutions for their everyday tasks\n", + "* Programmers eager to expand their knowledge of available preprocessing techniques \n", + "* Researchers examining preprocessing influence on the machine learning task\n", + "* Developers who still value traditional paper-based reports\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Specifications\n", + "Our package provides an automated machine learning system for **tabular** data. 
Users specify which column is the target, and the process begins. Autoprep emphasizes **preprocessing** by choosing from over 3000 possible pipelines (counting optional steps and models). Not only do we provide users with strong results in terms of the chosen metric, but we also generate an extensive (around 20 pages, depending on the dataset) LaTeX report consisting of: \n", + "- dataset overview, \n", + "- exploratory data analysis, \n", + "- details of the preprocessing, hyperparameter tuning, and modelling steps, \n", + "- best model's interpretations with Shapley Plots.\n", + "\n", + "A LaTeX compiler must be installed on your machine." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Existing solutions\n", + "It is nearly impossible to create something truly *new* these days, yet Autoprep still stands out from the crowd. Let's take a look at existing automated ML solutions and how they differ from Autoprep.\n", + "\n", + "1. **Auto-sklearn**: Focuses on model selection and hyperparameter tuning but lacks extensive preprocessing capabilities and detailed reporting.\n", + "2. **TPOT**: Automates the entire machine learning pipeline but does not provide detailed LaTeX reports or emphasize preprocessing as much as Autoprep.\n", + "3. **H2O.ai**: Offers a comprehensive suite of tools for automated machine learning but does not focus specifically on preprocessing or detailed LaTeX reporting.\n", + "4. **PyCaret**: Focuses on simplifying the machine learning process with a user-friendly interface. While PyCaret provides fantastic prototyping, model comparison, and blending, it does not offer the same level of preprocessing options or detailed LaTeX reporting as Autoprep.\n", + "5. **MLJAR**: Focuses on automating the machine learning pipeline with a range of models for both classification and regression. 
While it offers solid preprocessing capabilities like handling missing values, scaling, and feature importance-based reduction, it lacks some of the advanced preprocessing techniques, such as VIF or UMAP, that are provided by more specialized tools like Autoprep.\n", + "6. **Hyperopt-Sklearn**: Lacks advanced preprocessing capabilities, requiring manual setup for scaling, imputation, and feature selection, and does not support dimensionality reduction methods like PCA or UMAP. In contrast, Autoprep offers a more comprehensive preprocessing pipeline, including advanced techniques like VIF for feature selection and UMAP for dimensionality reduction, along with automated handling of missing data, scaling, and a detailed LaTeX report.\n", + "7. **Google AutoML Tables**: Offers automated preprocessing but lacks fine-grained customization and advanced techniques like VIF for feature selection or UMAP for dimensionality reduction. In contrast, Autoprep provides more flexibility with advanced preprocessing methods and better control over feature engineering and dimensionality reduction.\n", + "8. **EvalML**: Offers typical AutoML features, as well as interpretability plots. A distinctive feature of EvalML is the ability to plot each pipeline as a graph. However, EvalML requires a `problem_type` argument, whereas Autoprep detects the task type automatically.\n", + "9. **MLBox**: Allows users to automatically read data and gather statistics, as well as tune parameters and select the best model. In contrast to Autoprep, MLBox does not generate a PDF report, as most of the information is displayed in the console. What's more, the repository is no longer updated, so it does not work with well-known packages such as sklearn or pandas, and it can be challenging to get it started. \n", + "10. **Ludwig**: Framework that allows multiple input data formats and user specifications, as well as interpretability plots. 
Although it generates many interesting plots and statistics, visualisations are not displayed automatically during training. Autoprep generates a single report containing everything at the end of the process.\n", + "\n", + "\n", + "While these solutions are powerful, Autoprep aims to differentiate itself by focusing extensively on the preprocessing steps and providing detailed LaTeX reports. Our goal is to offer a comprehensive and explainable automated machine learning package that meets the needs of both novice and experienced users.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Technical details \n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Preprocessing \n", + "As the name suggests, preprocessing is Autoprep's core component.\n", + "\n", + "In Autoprep, we distinguish between **required** and **additional** (i.e. non-required) steps. \n", + "\n", + "The obligatory phases consist of:\n", + "- **Missing data imputation**: for numerical data, we impute the median, and for categorical data, we impute the most frequent value or the \"Missing\" string if NAs dominate the column.\n", + "- **Removing columns with 100% unique categorical values**\n", + "- **Categorical features encoding**: if there are fewer than 5 values, One Hot Encoding is used; otherwise, Label Encoding is applied.\n", + "- **Scaling**: three scalers are possible: min-max, robust, and standard scaler.\n", + "- **Removing columns with 0 variance**\n", + "- **Detecting highly correlated features**: if features are highly correlated (default threshold = 0.8), one of them is removed.\n", + "\n", + "The additional phases consist of:\n", + "- **Feature selection**: features may be selected based on their correlation with the target (default threshold = 0.7) or on Random Forest feature importance (default threshold: top 70%). 
\n", + "- **Dimension reduction**: using Principal Component Analysis (PCA) (threshold = 0.95), Uniform Manifold Approximation and Projection (UMAP) (50 components for datasets with over 100 columns or 50% of the features otherwise), or Variance Inflation Factor (VIF).\n", + "\n", + "Multiple pipelines are generated, from which we choose up to 16 (to save time). Then they are scored using a Random Forest (Classifier or Regressor) model: the model is fit on the preprocessed data, the AUC/MSE score is calculated, and each pipeline receives its rank. Subsequently, the 3 best pipelines are saved to a .joblib file and continue their journey to modelling.\n", + "\n", + "*Note:* We choose the top 3 best pipelines instead of just 1, since the Random Forest results might not differ significantly. Presenting this information in the report is beneficial to our business objective.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Modelling\n", + "\n", + "For classification tasks, there are 5 implemented models:\n", + "- K Neighbors Classifier,\n", + "- Logistic Regression,\n", + "- Gaussian Naive Bayes,\n", + "- Support Vector Machine (Classifier),\n", + "- Decision Tree Classifier.\n", + "\n", + "For regression tasks, Autoprep has 6 models:\n", + "- Linear Support Vector Machine (Regressor),\n", + "- K Neighbors Regressor,\n", + "- Random Forest Regressor,\n", + "- Bayesian Ridge,\n", + "- Gradient Boosting Regressor,\n", + "- Linear Regression.\n", + "\n", + "We have chosen simple models for two reasons: they consume less time and are easier to explain.\n", + "\n", + "At this stage, Autoprep uses the three best pipelines (see the [Preprocessing](#preprocessing) section), so 3 different datasets are generated. Each model is fit on each of them. Of course, Autoprep adheres to the train-test split rule. In total, 15 or 18 models (for classification or regression, respectively) are evaluated to ensure the best performance. 
\n", + "\n", + "All models' hyperparameters are tuned using Randomized Search CV with 10 iterations (again, to save time). \n", + "\n", + "Based on the AUC/MSE score on the test set, the three best models are selected and presented in the report.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Report \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Why LaTeX reports?\n", + "\n", + "Using PyLaTeX, we provide users with extensive (approx. 20 pages) LaTeX reports. We chose LaTeX because:\n", + "- it is highly customizable,\n", + "- it supports complex mathematical notation,\n", + "- it provides clear, transparent and adaptive style,\n", + "- it allows for high-quality typesetting.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### What is in the report?\n", + "\n", + "The report includes the following sections:\n", + "1. **Overview**: Provides a summary of the dataset.\n", + " 1. **system information**: results may differ depending on software and hardware (Table 1),\n", + " 2. **dataset overview**: number of samples, number of features (categorical or numerical) (Table 2),\n", + " 3. *[classification only]* **target class distribution** presented in a table,\n", + " 4. **missing values**: counts and percentage of missing values for each feature (from here we do not provide exact Table/Plot identifiers, as they differ depending on task type),\n", + " 5. **description of all features in the dataset**: types, memory usage,\n", + " 6. **description of numerical features in the dataset**: basic statistics like mean, sd, count, min, max, etc.,\n", + " 7. **description of categorical features in the dataset**: count, number of unique instances, most frequent.\n", + "\n", + "2. **Exploratory Data Analysis**: Includes visualizations and statistical summaries to understand the data distribution and relationships between features.\n", + " 1. 
**target variable**: barplot (class distribution incl.) for *classification* or histogram (mean, median incl.) for *regression*,\n", + " 2. **missing values distribution**: on a barplot,\n", + " 3. **distribution of all features**: presented on histograms for numerical features and on barplots for categorical ones,\n", + " 4. **correlation heatmap**: for numerical features,\n", + " 5. **boxplots**: for numerical features.\n", + "\n", + "3. **Preprocessing**: Details the preprocessing steps applied to the data, including missing data imputation, encoding, scaling, and feature selection.\n", + " 1. **list of preprocessing steps**: all possible steps (required and non-required),\n", + " 2. **pipelines**: 16 chosen for examination, with all steps listed,\n", + " 3. **best pipelines**: 3 best pipelines with respect to scoring function, with fit time,\n", + " 4. **best pipelines' details**: the best pipelines' description and parameters,\n", + " 5. **best pipelines' output overview**: lets the user see how the data has changed,\n", + " 6. **preprocessing pipelines runtime statistics**: pipelines fit time and scoring statistics.\n", + "\n", + "\n", + "4. **Modelling**: Describes the models used, their hyperparameters, and the performance metrics.\n", + " 1. **examined models list**\n", + " 2. **hyperparameter grids**\n", + " 3. **best models and pipelines along with their hyperparameters**: (after tuning) information about mean fit time, hyperparameters and test score.\n", + "5. **Model Interpretations**: Uses Shapley Plots to explain the predictions of the best models. 
Waterfall, bar and summary plots are presented (for each class if the task is *classification*).\n", + "\n", + "The report is generated automatically and saved as a .pdf file, providing users with a comprehensive overview of the entire machine learning process.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Example usage\n", + "Here we show how to use Autoprep.\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First, install the Autoprep package. " + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "#pip install auto-prep" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For demonstration purposes, we will use data from OpenML." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import openml\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Import `config` for your custom settings and `AutoPrep` for all functionalities." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "from auto_prep.utils.config import config\n", + "from auto_prep.prep import AutoPrep" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Binary classification\n", + "\n", + "Dataset: titanic\n", + "\n", + "In config, you can set pretty much everything. Here, we set the report name (titanic, after the dataset) and the directory where the report should be saved. All available parameters are listed here: \n", + "[config settings](https://pawlo77.github.io/AutoPrep/auto_prep.utils.html#module-auto_prep.utils.config)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "config.set(\n", + " raport_name=\"titanic\",\n", + " root_dir=\"titanic_report\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Load the dataset, create an `AutoPrep` object and run it. Remember to set `target_column`. Do not worry about the type of the task - AutoPrep will detect it on its own!" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "Fitting pipelines: 48pipeline [00:22, 2.13pipeline/s]\n", + "Scoring pipelines: 48pipeline [00:19, 2.48pipeline/s]\n", + "Tuning models for pipeline number 0: 100%|██████████| 5/5 [00:09<00:00, 1.85s/model]\n", + "Tuning models for pipeline number 1: 100%|██████████| 5/5 [00:07<00:00, 1.54s/model]\n", + "Tuning models for pipeline number 2: 100%|██████████| 5/5 [00:12<00:00, 2.57s/model]\n", + "Re-training best models...: 100%|██████████| 3/3 [00:00<00:00, 5.24model/s]\n", + "ExactExplainer explainer: 66it [00:13, 1.35it/s] \n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Report has been generated and saved in: titanic_report. \n", + "Charts are available in: titanic_report/titanic/charts. \n", + "Pipelines are saved in: titanic_report/titanic/pipelines\n" + ] + } + ], + "source": [ + "data = openml.datasets.get_dataset(40945).get_data()[0]\n", + "data[\"survived\"] = data[\"survived\"].astype(np.uint8)\n", + "\n", + "pipeline = AutoPrep()\n", + "pipeline.run(data, target_column=\"survived\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Wonderful! The report is now saved in the `./titanic_report` directory under the name `titanic.pdf`." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "### Multiclass classification\n", + "\n", + "Dataset: cpu\n", + "\n", + "Similarly to the previous example, we set where the report should be saved. The rest of the process is the same as before.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "config.set(\n", + " raport_name=\"cpu\",\n", + " root_dir=\"cpu_report\"\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "posx and posy should be finite values\n", + "Fitting pipelines: 48pipeline [00:17, 2.68pipeline/s]\n", + "Scoring pipelines: 48pipeline [00:16, 2.84pipeline/s]\n", + "Tuning models for pipeline number 0: 100%|██████████| 5/5 [00:00<00:00, 36.29model/s]\n", + "Tuning models for pipeline number 1: 100%|██████████| 
5/5 [00:00<00:00, 21.70model/s]\n", + "Tuning models for pipeline number 2: 100%|██████████| 5/5 [00:00<00:00, 36.00model/s]\n", + "Re-training best models...: 100%|██████████| 3/3 [00:00<00:00, 12.10model/s]\n" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Report has been generated and saved in: cpu_report. \n", + "Charts are available in: cpu_report/cpu/charts. \n", + "Pipelines are saved in: cpu_report/cpu/pipelines\n" + ] + } + ], + "source": [ + "data = openml.datasets.get_dataset(338).get_data(dataset_format=\"dataframe\")[0]\n", + "\n", + "pipeline = AutoPrep()\n", + "pipeline.run(data, target_column=\"GG_new\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Again, the report has been generated and saved in `./cpu_report` under the name `cpu.pdf`." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "### Regression\n", + "\n", + "Dataset: ailerons\n", + "\n", + "Similarly to the previous two examples, we set where the report should be saved. The rest of the process is the same as before. It does not matter whether your task is classification or regression - the usage of AutoPrep is always easy and seamless!\n" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "config.set(\n", + " raport_name=\"ailerons\",\n", + " root_dir=\"ailerons_report\",\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Fitting pipelines: 48pipeline [00:18, 2.57pipeline/s]\n", + "Scoring pipelines: 48pipeline [00:18, 2.57pipeline/s]\n", + "Tuning models for pipeline number 0: 0%| | 0/6 [00:00