A phishing website is a common social engineering method that mimics trusted uniform resource locators (URLs) and webpages. The objective of this project is to train machine learning models and deep neural networks on a purpose-built dataset to predict phishing websites. Both phishing and benign URLs are gathered to form the dataset, and the required URL-based and content-based features are extracted from them. The performance of each model is measured and compared.
To install the required packages and libraries, fork and clone this repository, then run this command in the project directory:
pip install -r requirements.txt
The system starts by retrieving the URLs to be checked for phishing. These URLs can be collected from user input on the project's webpage. Once the URLs are obtained, the system extracts the relevant features from the corresponding web pages. These features are essential for training and evaluating the machine learning models. The features extracted from the URL database fall into three groups: domain-based, HTML-based, and address-bar-based.
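As an illustration of the address-bar group, the following is a minimal sketch using only the Python standard library. The function name `extract_address_bar_features` and the specific features it returns are assumptions chosen for this example; they are not necessarily the features used in this project's dataset.

```python
import ipaddress
from urllib.parse import urlparse


def extract_address_bar_features(url: str) -> dict:
    """Return a few illustrative address-bar features for one URL.

    Hypothetical helper: the feature names below are examples only.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A raw IP address in place of a domain name is a common phishing signal.
    try:
        ipaddress.ip_address(host)
        uses_ip = True
    except ValueError:
        uses_ip = False

    return {
        "url_length": len(url),
        "uses_https": parsed.scheme == "https",
        "uses_ip_address": uses_ip,
        "has_at_symbol": "@" in url,
        # Dots beyond the one separating the domain from its TLD.
        "subdomain_count": 0 if uses_ip else max(host.count(".") - 1, 0),
        "has_hyphen_in_host": "-" in host,
    }


# Example URL made up for this sketch.
features = extract_address_bar_features("http://192.168.0.1/login@secure")
```

Domain-based and HTML-based features would need extra data sources (for example WHOIS records or the fetched page source), so they are left out of this sketch.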
Several machine learning models are compared, and the one with the highest accuracy is selected to predict whether a URL is a phishing site. It provides a probability score or a binary classification (phishing or not phishing) based on the trained model's decision boundary. The system categorizes URLs as "phishing" or "legitimate", and the result is displayed on the webpage.
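The step from a probability score to a displayed label can be sketched as follows. This is a minimal example under stated assumptions: the helper `label_from_score`, the 0.5 threshold, and the scores themselves are all hypothetical, and the project's trained model may draw its decision boundary differently.

```python
def label_from_score(phishing_probability: float, threshold: float = 0.5) -> str:
    """Map a model's phishing probability to a display label.

    Hypothetical helper: the default threshold of 0.5 is an assumption.
    """
    if not 0.0 <= phishing_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "phishing" if phishing_probability >= threshold else "legitimate"


# Made-up scores standing in for a trained model's probability output.
scores = {
    "http://example.test/login": 0.91,
    "https://example.org/": 0.07,
}
labels = {url: label_from_score(p) for url, p in scores.items()}
```

Lowering the threshold flags more URLs as phishing, which trades extra false alarms for fewer missed phishing sites.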
Accuracy of the various models used for URL detection
Index | ML Model | Accuracy | f1_score | Recall | Precision |
---|---|---|---|---|---|
0 | Gradient Boosting Classifier | 0.974 | 0.977 | 0.994 | 0.986 |
1 | CatBoost Classifier | 0.972 | 0.975 | 0.994 | 0.989 |
2 | Multi-layer Perceptron | 0.969 | 0.973 | 0.995 | 0.981 |
3 | Random Forest | 0.967 | 0.971 | 0.993 | 0.990 |
4 | Support Vector Machine | 0.964 | 0.968 | 0.980 | 0.965 |
5 | Decision Tree | 0.960 | 0.964 | 0.991 | 0.993 |
6 | K-Nearest Neighbors | 0.956 | 0.961 | 0.991 | 0.989 |
7 | Logistic Regression | 0.934 | 0.941 | 0.943 | 0.927 |
8 | Naive Bayes Classifier | 0.605 | 0.454 | 0.292 | 0.997 |
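
For reference, the four metrics in the table are conventionally derived from confusion-matrix counts as sketched below. The function name `classification_metrics` and the counts in the example are hypothetical and are not taken from this project's results; the notebook may compute its figures differently (for example with a library helper or a different averaging mode).

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives
    and true negatives, with "phishing" treated as the positive class.
    """
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one sample is required")

    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=70)
```

A model with very high precision but low recall, like the Naive Bayes row above, rarely mislabels a legitimate URL as phishing but misses many phishing URLs, which is why its F1 score is low.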
- The main takeaway from this project was exploring various machine learning models, performing exploratory data analysis on the phishing dataset, and understanding its features.
- Creating this notebook taught me a lot about the features that affect how models detect whether a URL is safe, and about how to tune models and how tuning affects their performance.
- The final conclusion on the phishing dataset is that some features, such as "HTTPS", "AnchorURL", and "WebsiteTraffic", are more important than others for classifying a URL as phishing or not.
- The Gradient Boosting Classifier correctly classifies URLs into their respective classes with 97.4% accuracy, which reduces the chance that a phishing URL goes undetected.