crawler

A web crawler with different strategies

How to use

At minimum, pass a URL to the constructor, then call crawl():

c = Crawler('https://xkcd.com')
c.crawl()

Set the maximum recursion depth and follow only absolute links:

c = Crawler('https://yahoo.com', relative_urls=False, max_depth=3)
c.crawl()
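
The distinction between relative and absolute links can be illustrated with the standard library alone. This is a minimal sketch, not the crawler's own code: is_absolute and resolve are hypothetical helpers, and example.com is a placeholder site.

```python
from urllib.parse import urljoin, urlparse


def is_absolute(href):
    """Return True when href carries its own scheme and host."""
    parts = urlparse(href)
    return bool(parts.scheme and parts.netloc)


def resolve(base_url, href):
    """Resolve href against the URL of the page it was found on."""
    return urljoin(base_url, href)


# Hypothetical links as they might appear in a page's href attributes.
links = ["https://example.com/about", "/archive", "news/today.html"]

# Keeping only absolute links, roughly what relative_urls=False suggests.
absolute_only = [link for link in links if is_absolute(link)]

# Turning every link into a full URL relative to the current page.
resolved = [resolve("https://example.com/blog/", link) for link in links]
```

Here absolute_only keeps just the first link, while resolved expands "/archive" against the site root and "news/today.html" against the current directory.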

Ignore links that start or end with given strings, and indent the output according to recursion depth:

c = Crawler('https://zedchance.github.io/notes',
            ignore_prefixes=('#',),
            ignore_suffixes=('pdf', 'java', 'docx', 'pptx'),
            indent_depth=True)
c.crawl()

Note that ignore_prefixes and ignore_suffixes each take a tuple of strings. A single-element tuple needs a trailing comma, as in ('#',).
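
One plausible reason for the tuple requirement is that Python's str.startswith and str.endswith accept a tuple of candidates directly. The sketch below assumes that kind of check; should_ignore is a hypothetical helper and not part of the crawler's API.

```python
def should_ignore(href, ignore_prefixes=(), ignore_suffixes=()):
    """Return True when href matches any ignored prefix or suffix."""
    # str.startswith and str.endswith both accept a tuple of strings;
    # an empty tuple never matches, so the defaults ignore nothing.
    return href.startswith(ignore_prefixes) or href.endswith(ignore_suffixes)


prefixes = ('#', 'mailto')
suffixes = ('pdf', 'docx')

for href in ['#top', 'mailto:me@example.com', 'slides.pdf', 'notes/index.html']:
    print(href, should_ignore(href, prefixes, suffixes))
```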

Or build a dictionary of options and unpack it into the constructor:

config = {
    "explore": True,
    "relative_urls": True,
    "max_depth": 3,
    "ignore_prefixes": ('http', 'mailto', '#'),
    "ignore_suffixes": ('docx', ),
    "indent_depth": True,
}
c = Crawler('http://reelradio.com/index.html', **config)
c.crawl()

Strategies

The CrawlerStrategy interface requires an execute method that processes a BeautifulSoup (bs4) object.

class MyStrategy(CrawlerStrategy):
    def execute(self, soup):
        # do stuff with soup
        pass

CrawlerStrategy is the default strategy if no strategy parameter is passed; it does nothing.
PrinterStrategy prints out all text from the page.
TagStrategy generates a list of words and their occurrences.

When testing strategies, it is convenient to pass explore=False so that only the first page is visited.

To print out the page's text:

c = Crawler('https://xkcd.com', strategy=PrinterStrategy(), explore=False)
c.crawl()

To see a list of tags:

c = Crawler('https://xkcd.com', strategy=TagStrategy(), explore=False)
c.crawl()
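
A custom strategy in the spirit of TagStrategy could look roughly like the sketch below. Everything here is illustrative: WordCountStrategy is hypothetical, the CrawlerStrategy class is a local stand-in for the one in crawler.py, and FakeSoup is a stub that mimics only the get_text() method of a BeautifulSoup object so the example runs without network access.

```python
import re
from collections import Counter


class CrawlerStrategy:
    """Stand-in for the project's base class (assumed shape)."""

    def execute(self, soup):
        pass


class WordCountStrategy(CrawlerStrategy):
    """Hypothetical strategy that tallies words across visited pages."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, soup):
        # get_text() returns the page's visible text as one string.
        text = soup.get_text()
        words = re.findall(r"[a-z']+", text.lower())
        self.counts.update(words)


class FakeSoup:
    """Stub exposing only get_text(), standing in for a parsed page."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


strategy = WordCountStrategy()
strategy.execute(FakeSoup("A webcomic of romance, sarcasm, math, and language. Math!"))
print(strategy.counts.most_common(3))
```

With the real package, an instance would presumably be passed as strategy=WordCountStrategy() to the Crawler constructor, and strategy.counts inspected after crawl() returns.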
