In this project I will look at a dataset of patient data relating to breast cancer, which is available on Kaggle as the Wisconsin Breast Cancer dataset.
The dataset features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. The linear program used to obtain the separating plane in 3-dimensional space is described in: K. P. Bennett and O. L. Mangasarian: "Robust Linear Programming Discrimination of Two Linearly Inseparable Sets", Optimization Methods and Software 1, 1992, 23-34.
The dataset was released in November 1995 and the original source can be found here.
Example images of cells from both malignant and benign tumors, of the kind this data is derived from, can be seen below.
I will develop a machine learning model that aims to predict malignant tumors with the highest possible accuracy.
- In the first project, finished in July 2019, the best result was an overall F1 score across all categories of 0.99.
- In the latest project, finished in December 2020, the best result was an overall F1 score across all categories of 0.96. Although this is lower than in the first project, it is considered a more reliable estimate of model performance because more advanced validation techniques were used. New techniques used in this latest project include additional statistical methods, UMAP dimensionality reduction, and the XGBoost model.
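For readers unfamiliar with the metric quoted above, here is a minimal sketch of how an F1 score over both categories could be computed. It assumes "overall F1 score across all categories" means the macro average (the unweighted mean of the per-class F1 scores); the project may have used a different averaging scheme. The function names `per_class_f1` and `macro_f1` and the ten labels are made up for illustration and are not taken from the project or the dataset.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1 score} for every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correct prediction for this class
        else:
            fp[pred] += 1       # predicted class gets a false positive
            fn[truth] += 1      # true class gets a false negative
    scores = {}
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical labels for illustration only: "M" = malignant, "B" = benign.
y_true = ["M", "M", "M", "M", "B", "B", "B", "B", "B", "B"]
y_pred = ["M", "M", "M", "B", "B", "B", "B", "B", "B", "M"]

print(per_class_f1(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 3))  # 0.792
```

Because the macro average weights both classes equally, a model that misses malignant cases is penalized even when benign cases dominate the data, which is one reason F1 can be more informative than plain accuracy for this kind of task.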