gliserma/Find_Broken_Hyperlinks

Folders and files

NameName
Last commit message
Last commit date

Latest commit

066c91f · Oct 31, 2021

History

3 Commits
Oct 31, 2021
Oct 31, 2021
Oct 31, 2021

Repository files navigation

Broken Links Webcrawler

Developed to identify broken links on the website of the Gilder Lehrman Institute of American History. It could easily be adapted to crawl other websites by modifying the domain and start URL parameters.
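As a minimal sketch of the domain restriction mentioned above, a crawler can keep itself on one site by only following links whose host matches the configured domain. The domain value and helper name below are illustrative, not taken from the actual script:

```python
from urllib.parse import urlparse

# Illustrative crawl domain (an assumption, not read from the script's config).
DOMAIN = "gilderlehrman.org"

def in_domain(url: str, domain: str = DOMAIN) -> bool:
    """Return True if the URL's host is the crawl domain or a subdomain of it."""
    host = urlparse(url).netloc.lower()
    return host == domain or host.endswith("." + domain)
```

With Scrapy specifically, the same effect is usually achieved by setting the spider's `allowed_domains` attribute.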

Implementation

  • A simple Python script that can be launched from the terminal.

Optional Command Line Arguments:

  • --fname: desired name for the output file
  • --number: how many pages should be searched
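The two optional flags above could be handled with a small argparse setup like the following sketch; the default values here are assumptions, not taken from the actual script:

```python
import argparse

def parse_args(argv=None):
    """Parse the crawler's two optional command-line flags."""
    parser = argparse.ArgumentParser(description="Broken-link webcrawler")
    parser.add_argument("--fname", default="links.csv",
                        help="desired name for the output file")
    parser.add_argument("--number", type=int, default=100,
                        help="how many pages should be searched")
    return parser.parse_args(argv)
```

For example, `python crawler.py --fname out.csv --number 50` (the script name is hypothetical) would crawl up to 50 pages and write results to `out.csv`.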

Output: Two Files

  • fname: all pages visited and all the links they contain, as a CSV
  • broken_fname: all broken links, i.e. origin page, destination page, and anchor text

Requirements

  • Python 3.6+
  • Scrapy 2.5.0

Future Steps

  • Find broken images
  • Find pages with code fragments rendered as visible text
