This repository has been archived by the owner on Jan 14, 2025. It is now read-only.

Fra3zz/Web_Crawler_and_Visualization

What is this?


This is my cornerstone project for the University of Michigan. It can be found in the course's honors track.

Professor: Dr. Charles 'Chuck' Severance

Class: Python for Everybody

References:

Coursera: Python UM
Dr. Chuck's Website: www.dr-chuck.com
Free Python Materials: Python for Everybody

Websites used for research:
Dr. Chuck's Projects Website
Google.com

Description: This project was built on the files Dr. Chuck provides in his course. spider.py was written by hand.


Utilization:

  1. Install all dependencies listed in the provided lock file.
  2. Run spider.py.
  3. spider.py

    • Requests via command line:
      - URL to be spidered
      - Whether to enable the exceptions list
      - Exceptions list text file (example of an exception: https://www.google.com/search... skips all Google URLs containing google.com/search)
      - Whether to save the settings for easy setup
      - When restarted, it asks whether you want to use a new URL or provide an updated exceptions list text file
    • Crawls the designated URL, adding newly found URLs to a spider.sqlite DB (the DB is created automatically)
    • Crawls the next URL in the sqlite DB
    • Records the HTML (if found), the error code (if provided), and the number of attempts on the site (if it cannot be accessed, with a maximum of 3 attempts)
  4. Run sprank.py
  5. sprank.py

    • Requests via command line:
      - Number of iterations used to calculate the ranking of the URLs collected so far (a URL must have been visited by the crawler, not just collected)
    • Cycles through visited sites and ranks them against all other visited sites in the spider.sqlite DB
    • Adds the ranking to the "rank" column
  6. Run spjson.py
  7. spjson.py

    • Pulls the ranking and URL from the DB
    • Creates spider.js for force.html to use for the nodes
  8. Open force.html in browser/web engine
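
The spider.py behaviour described in step 3 (skip excepted URLs, queue new URLs in sqlite, retry a failing site at most 3 times) could look roughly like the sketch below. This is not the actual spider.py. The table name `Pages`, its columns, the helper function names, and the prefix-matching rule for exceptions are all assumptions made for illustration.

```python
import sqlite3

MAX_ATTEMPTS = 3  # the README states a maximum of 3 attempts per site


def init_db(conn):
    # Hypothetical schema; the real spider.sqlite layout may differ.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS Pages (
               id INTEGER PRIMARY KEY,
               url TEXT UNIQUE,
               html TEXT,
               error INTEGER,
               attempts INTEGER DEFAULT 0
           )"""
    )


def is_excepted(url, exceptions):
    # Assumption: each exception is treated as a URL prefix.
    return any(url.startswith(prefix) for prefix in exceptions)


def add_url(conn, url, exceptions=()):
    # Returns False when the URL is skipped because of the exceptions list.
    if is_excepted(url, exceptions):
        return False
    # The UNIQUE constraint plus OR IGNORE keeps duplicates out of the queue.
    conn.execute("INSERT OR IGNORE INTO Pages (url) VALUES (?)", (url,))
    return True


def next_url(conn):
    # Oldest page that has no HTML yet and still has attempts left.
    row = conn.execute(
        "SELECT url FROM Pages WHERE html IS NULL AND attempts < ? "
        "ORDER BY id LIMIT 1",
        (MAX_ATTEMPTS,),
    ).fetchone()
    return row[0] if row else None


def record_result(conn, url, html=None, error=None):
    if html is not None:
        conn.execute(
            "UPDATE Pages SET html = ?, error = ? WHERE url = ?",
            (html, error, url),
        )
    else:
        # Failed fetch: store the error code and count the attempt.
        conn.execute(
            "UPDATE Pages SET error = ?, attempts = attempts + 1 WHERE url = ?",
            (error, url),
        )
```

A crawl loop would call `next_url`, fetch the page, pass every link it finds to `add_url`, and then call `record_result`, until `next_url` returns `None`.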

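
The iterative ranking in step 5 can be illustrated with a simplified PageRank-style loop. This is only a sketch and not the course's sprank.py: every page starts with a rank of 1.0, each iteration splits a page's rank evenly among the visited pages it links to, and any rank that was not handed out is spread evenly over all pages. The function name `rank_pages` and the dictionary input are assumptions; the real script reads its data from spider.sqlite.

```python
def rank_pages(links, iterations):
    """Rank pages; links maps each visited page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    if not pages:
        return {}

    ranks = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_ranks = {page: 0.0 for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                continue  # nothing to hand out; recovered below
            share = ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share
        # Spread any rank that was not handed out evenly over all pages
        # so the total rank stays constant between iterations.
        lost = (sum(ranks.values()) - sum(new_ranks.values())) / len(pages)
        ranks = {page: rank + lost for page, rank in new_ranks.items()}
    return ranks
```

More iterations bring the ranks closer to a stable value, which is why the script asks for the iteration count up front.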

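
Step 7 turns the ranked URLs into a JavaScript data file that force.html loads. A minimal sketch of that export is shown below. The variable name `spiderJson`, the node and link field names, and the function `build_spider_js` are assumptions for illustration; check the actual spjson.py and force.html for the exact format they expect.

```python
import json


def build_spider_js(ranks, links):
    """Return the text of a JS file holding nodes and links.

    ranks maps url -> rank; links is an iterable of (from_url, to_url).
    """
    # Highest-ranked pages first, each given a numeric node id.
    urls = sorted(ranks, key=ranks.get, reverse=True)
    index = {url: i for i, url in enumerate(urls)}
    nodes = [{"id": index[u], "url": u, "rank": ranks[u]} for u in urls]
    # Keep only links whose two ends are both ranked pages.
    edges = [
        {"source": index[src], "target": index[dst]}
        for src, dst in links
        if src in index and dst in index
    ]
    payload = json.dumps({"nodes": nodes, "links": edges}, indent=2)
    return "spiderJson = " + payload + ";\n"
```

Writing the returned string to spider.js next to force.html would give the page a single global object to read the graph from.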

Note: If you are doing the same class/project, please make your own graph and crawl the web yourself. The pictures provided above show what the code does and are not for use in grades, research, etc.
