K-Means-clustering-from-scratch

An unsupervised machine learning algorithm to form a k number of clusters over a user-specified number of iterations.

Description

An implementation of the K-Means clustering algorithm without using Sklearn's functions. The algorithm consists of the following steps:

Step 1: Getting the data from a CSV file into lists
Step 2: Choosing random centroid points from the lists
Step 3: Assign the rest of the data points to their nearest centroid
Step 4: Visualise what’s happened in a scatter plot
Step 5: Find new centroids for each cluster
Step 6: Repeat steps 3 – 5.

New centroids (Step 5) are calculated by first finding the mean of each cluster. Then the Euclidean distance of each data point from each of the new centroids is calculated. The data point is then assigned to the centroid that's closest to it. For an even more detailed explanation, check out my blog post.

Convergence isn't monitored in this implementation. However, running a few iterations yields interesting insights about the data.

About the data sets

Each of the three .csv files is made up of data points on the life expectancy and birth rate for each country. There is a data set from 1953, one from 2008 and one consisting of both data sets. The data set is from the excellent Gap Minder.

Insights from running the algorithm

As life expectancy improves worldwide, there are only a few countries in recent times that are still near 1953 levels. These counties have not seen huge improvements in life expectancy due to long periods of deep political and civil unrest. These countries include Afghanistan, Ethiopia, Somalia, Burkina Faso and Zimbabwe.

Getting Started

Dependencies

The three .csv files in this repository are needed. They are:

data1953.csv
data2008.csv
dataBoth.csv

Import the following libraries:

numpy
csv
matplotlib

Installing

Runs on any Python IDE with the .csv files in the same folder as kmeans.py.

Executing program

Follow the prompts and enter the following:

Enter the number corresponding to the data set to use (1, 2 or 3)
Enter the value of k - a number between 2 and 5 (inclusive)
Enter the number of iterations to perform. Ideally between 4 and 8.

Each iteration generated a scatterplot of the current centroids and cluster assignments. When the matplotlib window is closed, the following information about each cluster is displayed to the console:

A list of countries
The number of countries
The mean birth rate
The mean life expectancy

Images

Author

Nadia Schmidtke get in touch

License

This project is licensed under the GNU GENERAL PUBLIC LICENSE.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
1953 and 2008.png		1953 and 2008.png
Iteration 4 Demonstration.png		Iteration 4 Demonstration.png
LICENSE		LICENSE
README.md		README.md
data1953.csv		data1953.csv
data2008.csv		data2008.csv
dataBoth.csv		dataBoth.csv
demonstration screenshot.png		demonstration screenshot.png
kmeans.py		kmeans.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

K-Means-clustering-from-scratch

Description

About the data sets

Insights from running the algorithm

Getting Started

Dependencies

Installing

Executing program

Images

Author

License

Acknowledgments

About

Releases

Packages

Languages

License

Nadia-JSch/K-Means-clustering-from-scratch

Folders and files

Latest commit

History

Repository files navigation

K-Means-clustering-from-scratch

Description

About the data sets

Insights from running the algorithm

Getting Started

Dependencies

Installing

Executing program

Images

Author

License

Acknowledgments

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages