<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Python for Everybody Cornerstone Project</title>
</head>
<body>
<div style="align-self: center;">
<img src="./img/Graph_of_Dr.Chuck's_Website_Link_Refrence.png">
<br>
<img src="./img/FRWSCP223.png" alt="">
<br>
</div>
<br>
<h1>What is this?</h1>
<br>
<p>This is my cornerstone project for the University of Michigan's Python for Everybody course; it can be found in the course's honors track.</p>
<p>Professor: Dr. Charles 'Chuck' Severance</p>
<p>Class: Python for Everybody</p>
<p>References:</p>
<div>
Coursera: <a href="https://www.coursera.org/specializations/python">Python UM</a>
<br>
Dr. Chuck's Website: <a href="https://www.dr-chuck.com">www.dr-chuck.com</a>
<br>
Free Python Materials: <a href="https://www.py4e.com/">Python for Everybody</a>
<br>
<br>
Websites used for research:
<br>
<a href="http://python-data.dr-chuck.net/">Dr. Chucks Projects Website</a>
<br>
<a href="https://www.google.com">Google.com</a>
</div>
<div>
<p>Description:
    This project was built using the files Dr. Chuck provides in his course. spider.py was written from scratch.
</p>
<br>
<p>Usage:</p>
<ol>
<li>Install all dependencies listed in the provided lockfile.</li>
<li>Run spider.py.</li>
<p style="font-weight: bold;">Spider.py</p>
<ul>
<li>
Requests via command line:
<br>
- URL to be spidered
<br>
- Enable exception list
<br>
- Exceptions list text file (example: https://www.google.com/search... skips all Google URLs containing google.com/search)
<br>
- Enable saving of settings for easy setup
<br>
- When restarted, it will ask whether you want to use a new URL or provide an updated exceptions list text file
</li>
<li>
Crawls the designated URL, adding newly found URLs to a spider.sqlite DB (the DB is created automatically)
</li>
<li>
Crawls the next URL in the SQLite DB
</li>
<li>
Records the HTML (if found), the error code (if provided), and the number of attempts on the site (if it could not be accessed, with a maximum of 3 attempts)
</li>
</ul>
<li>Run sprank.py</li>
<p style="font-weight: bold;">sprank.py</p>
<ul>
<li>
Requests via command line:
<br>
- Number of iterations used to calculate the ranking of the URLs collected so far (a URL must have been visited by the crawler, not just collected)
</li>
<li>Cycles through the visited sites and ranks each one against all other visited sites in the spider.sqlite DB</li>
<li>Adds the ranking to the "rank" column</li>
</ul>
<li>Run spjson.py</li>
<p style="font-weight: bold;">spjson.py</p>
<ul>
<li>Pulls the ranking and url from the DB</li>
<li>Creates spider.js for force.html to use for the nodes</li>
</ul>
<li>Open force.html in browser/web engine</li>
</ol>
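<p>For reference, here is a minimal sketch of the kind of crawl loop spider.py implements: fetch one unvisited URL per pass, queue newly found links in spider.sqlite, and count failures up to 3 attempts. The table schema, column names, and regex-based link extraction below are illustrative assumptions, not the repository's exact code.</p>
<pre><code>
import re
import sqlite3
import urllib.error
import urllib.request

# Illustrative schema -- the real spider.sqlite layout may differ.
conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS Pages
    (url TEXT UNIQUE, html TEXT, error INTEGER, attempts INTEGER DEFAULT 0)''')

def crawl_one(exceptions=()):
    # Pick a URL that is unvisited and has not used up its 3 attempts.
    cur.execute('''SELECT url, attempts FROM Pages
                   WHERE html IS NULL AND error IS NULL AND attempts < 3 LIMIT 1''')
    row = cur.fetchone()
    if row is None:
        return False  # nothing left to crawl
    url, attempts = row
    try:
        html = urllib.request.urlopen(url).read().decode(errors='replace')
        cur.execute('UPDATE Pages SET html = ? WHERE url = ?', (html, url))
        # Queue newly discovered links, skipping anything on the exceptions list.
        for link in re.findall(r'href="(https?://[^"#]+)"', html):
            if not any(link.startswith(prefix) for prefix in exceptions):
                cur.execute('INSERT OR IGNORE INTO Pages (url) VALUES (?)', (link,))
    except urllib.error.HTTPError as err:
        cur.execute('UPDATE Pages SET error = ? WHERE url = ?', (err.code, url))
    except Exception:
        # Count the failed attempt; after 3 the URL is skipped for good.
        cur.execute('UPDATE Pages SET attempts = ? WHERE url = ?', (attempts + 1, url))
    conn.commit()
    return True

# Seed a start URL, then crawl a handful of pages.
cur.execute('INSERT OR IGNORE INTO Pages (url) VALUES (?)', ('https://www.dr-chuck.com/',))
conn.commit()
for _ in range(10):
    if not crawl_one(exceptions=('https://www.google.com/search',)):
        break
</code></pre>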
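<p>And roughly what sprank.py's ranking pass looks like: a simplified PageRank in which each visited page splits its current rank evenly among its outbound links on every iteration. The Links(from_id, to_id) table, the rank column, and the id-based lookups are assumptions for illustration.</p>
<pre><code>
import sqlite3

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

# Assumed schema: Links(from_id, to_id) connects visited pages by id.
nodes = [row[0] for row in cur.execute('SELECT DISTINCT from_id FROM Links')]
links = list(cur.execute('SELECT from_id, to_id FROM Links'))
if not nodes:
    raise SystemExit('No links yet -- run the crawler first.')

ranks = {node: 1.0 for node in nodes}
iterations = int(input('How many iterations? '))

for _ in range(iterations):
    # Each page splits its current rank evenly among its outbound links.
    next_ranks = {node: 0.0 for node in nodes}
    for node in nodes:
        outbound = [to_id for (from_id, to_id) in links
                    if from_id == node and to_id in next_ranks]
        if outbound:
            share = ranks[node] / len(outbound)
            for to_id in outbound:
                next_ranks[to_id] += share
    # Redistribute rank lost to dead ends so the total stays constant.
    lost = len(nodes) - sum(next_ranks.values())
    ranks = {node: rank + lost / len(nodes) for node, rank in next_ranks.items()}

# Write results back to an (assumed) rank column on Pages.
for node, rank in ranks.items():
    cur.execute('UPDATE Pages SET rank = ? WHERE id = ?', (rank, node))
conn.commit()
</code></pre>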
</div>
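<div>
    <p>Finally, the shape of spjson.py's export: pull each visited page's URL and rank, re-index the links to node positions, and write a spider.js file that force.html can load to draw the nodes. The spiderJson variable name and the query columns are assumptions.</p>
<pre><code>
import json
import sqlite3

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

# Assumed columns: pull each visited page's id, URL, and computed rank.
rows = list(cur.execute('SELECT id, url, rank FROM Pages WHERE html IS NOT NULL'))
index_of = {page_id: index for index, (page_id, _url, _rank) in enumerate(rows)}

nodes = [{'url': url, 'rank': rank} for (_page_id, url, rank) in rows]

# Links between visited pages, re-indexed to node positions for the force layout.
links = [{'source': index_of[f], 'target': index_of[t]}
         for (f, t) in cur.execute('SELECT from_id, to_id FROM Links')
         if f in index_of and t in index_of]

# force.html loads this file to draw the node graph.
with open('spider.js', 'w') as handle:
    handle.write('spiderJson = ' + json.dumps({'nodes': nodes, 'links': links}))
</code></pre>
</div>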
<div>
<p style="font-weight: bold;">Note: If you are doing the same class/project, please make your own graph and crawl the web. The pictures provided above are for showing what the code dose and not for
use for grades, research, etc.
</p>
</div>
</body>
</html>
