An unsupervised machine learning algorithm that groups data points into k clusters over a user-specified number of iterations.
An implementation of the K-Means clustering algorithm that does not use scikit-learn's functions. The algorithm consists of the following steps:
- Step 1: Read the data from a CSV file into lists
- Step 2: Choose random centroid points from the lists
- Step 3: Assign the rest of the data points to their nearest centroid
- Step 4: Visualise the current clusters in a scatter plot
- Step 5: Find new centroids for each cluster
- Step 6: Repeat steps 3 to 5 for the chosen number of iterations
New centroids (Step 5) are calculated by finding the mean of each cluster. On the next pass through Step 3, the Euclidean distance from each data point to each of the new centroids is calculated, and the data point is assigned to the centroid closest to it. For an even more detailed explanation, check out my blog post.
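The assignment and update steps described above can be sketched with numpy as follows. This is a minimal illustration, not the code in kmeans.py: the function names, the toy data and the fixed starting centroids are all assumptions made for the example (the real script picks its starting centroids at random).

```python
import numpy as np

def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    # distances[i, j] = Euclidean distance from point i to centroid j
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

def update_centroids(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = points[labels == j]
        if len(members) > 0:  # keep the old centroid if a cluster is empty
            new_centroids[j] = members.mean(axis=0)
    return new_centroids

# Hypothetical toy data: columns are (birth rate, life expectancy)
points = np.array([[10.0, 80.0], [12.0, 78.0], [45.0, 50.0], [48.0, 47.0]])
# Fixed starting centroids so the example is reproducible
centroids = points[[0, 2]].copy()

for _ in range(3):  # a user-specified number of iterations
    labels = assign_clusters(points, centroids)
    centroids = update_centroids(points, labels, centroids)

print(labels)
print(centroids)
```

With this toy data the first two points end up in one cluster and the last two in the other, and each centroid settles on the mean of its two members.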
Convergence isn't monitored in this implementation. However, running a few iterations yields interesting insights about the data.
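For readers who want to see what monitoring convergence could look like, here is a hedged sketch. It is not part of this implementation: the helper names, the tolerance `tol` and the toy data are assumptions chosen for illustration. The idea is to stop early once no centroid moves by more than a small amount between iterations.

```python
import numpy as np

def kmeans_step(points, centroids):
    """One assignment step followed by one centroid update."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    new_centroids = np.array([
        # keep the old centroid if no point was assigned to it
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids

def run_until_converged(points, centroids, max_iterations=8, tol=1e-6):
    """Iterate until the largest centroid shift is below tol (assumed threshold)."""
    for iteration in range(1, max_iterations + 1):
        labels, new_centroids = kmeans_step(points, centroids)
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids, iteration

# Hypothetical toy data: columns are (birth rate, life expectancy)
points = np.array([[10.0, 80.0], [12.0, 78.0], [45.0, 50.0], [48.0, 47.0]])
labels, centroids, iterations_used = run_until_converged(points, points[[0, 2]].copy())
print(iterations_used)
```

On this toy data the centroids stop moving on the second iteration, so the loop exits well before `max_iterations`.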
Each of the three .csv files is made up of data points on the life expectancy and birth rate for each country. There is a data set from 1953, one from 2008 and one combining both data sets. The data comes from the excellent Gapminder.
As life expectancy improves worldwide, only a few countries in recent times are still near 1953 levels. These countries have not seen large improvements in life expectancy due to long periods of deep political and civil unrest. They include Afghanistan, Ethiopia, Somalia, Burkina Faso and Zimbabwe.
The three .csv files in this repository are needed. They are:
- data1953.csv
- data2008.csv
- dataBoth.csv
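Reading one of these files into lists (Step 1) might look like the sketch below. The column layout is an assumption: a header row followed by country name, birth rate and life expectancy, in that order. The header text, the sample rows and the function name `read_points` are hypothetical, so check them against the actual files before relying on this.

```python
import csv
import io
import numpy as np

def read_points(file_obj):
    """Return (countries, points) from an open CSV file.

    Assumes a header row, then rows of: country, birth rate, life expectancy.
    """
    reader = csv.reader(file_obj)
    next(reader)  # skip the assumed header row
    countries, rows = [], []
    for row in reader:
        countries.append(row[0])
        rows.append([float(row[1]), float(row[2])])
    return countries, np.array(rows)

# Hypothetical in-memory sample standing in for a real file
sample = io.StringIO(
    "Country,BirthRate,LifeExpectancy\n"
    "Country A,10.5,80.1\n"
    "Country B,45.0,50.2\n"
)
countries, points = read_points(sample)
print(countries)
print(points)

# With a real file, the call would be along the lines of:
# with open("data1953.csv", newline="") as f:
#     countries, points = read_points(f)
```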
Import the following libraries:
- numpy
- csv
- matplotlib
Runs in any Python IDE, provided the .csv files are in the same folder as kmeans.py.
Follow the prompts and enter the following:
- Enter the number corresponding to the data set to use (1, 2 or 3)
- Enter the value of k - a number between 2 and 5 (inclusive)
- Enter the number of iterations to perform. Ideally between 4 and 8.
Each iteration generates a scatter plot of the current centroids and cluster assignments. When the matplotlib window is closed, the following information about each cluster is printed to the console:
- A list of countries
- The number of countries
- The mean birth rate
- The mean life expectancy
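The per-cluster summary listed above could be computed roughly as follows. This is a sketch under assumptions rather than the script's actual code: the function name `summarise_clusters`, the dictionary keys, the column order (birth rate, then life expectancy) and the toy data are all made up for illustration.

```python
import numpy as np

def summarise_clusters(countries, points, labels, k):
    """Build one summary dict per cluster from the final assignments."""
    summaries = []
    for j in range(k):
        mask = labels == j
        members = [name for name, in_cluster in zip(countries, mask) if in_cluster]
        summaries.append({
            "countries": members,
            "count": len(members),
            # column 0 is assumed to be birth rate, column 1 life expectancy
            "mean_birth_rate": float(points[mask, 0].mean()) if members else float("nan"),
            "mean_life_expectancy": float(points[mask, 1].mean()) if members else float("nan"),
        })
    return summaries

# Hypothetical toy data and cluster assignments
countries = ["Country A", "Country B", "Country C", "Country D"]
points = np.array([[10.0, 80.0], [12.0, 78.0], [45.0, 50.0], [48.0, 47.0]])
labels = np.array([0, 0, 1, 1])

summaries = summarise_clusters(countries, points, labels, k=2)
for summary in summaries:
    print(summary)
```

An empty cluster reports `nan` for both means instead of raising a warning, which is one possible design choice among several.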
Nadia Schmidtke get in touch
This project is licensed under the GNU GENERAL PUBLIC LICENSE.