A web crawler with different strategies
At minimum, supply a URL to the constructor, then call `crawl()`:
```python
c = Crawler('https://xkcd.com')
c.crawl()
```
Specify the maximum recursion depth, and follow only absolute links:
```python
c = Crawler('https://yahoo.com', relative_urls=False, max_depth=3)
c.crawl()
```
Ignore links starting with or ending with keywords, and indent the output based on recursion depth:
```python
c = Crawler('https://zedchance.github.io/notes',
            ignore_prefixes=('#',),
            ignore_suffixes=('pdf', 'java', 'docx', 'pptx'),
            indent_depth=True)
c.crawl()
```
Note that `ignore_prefixes` and `ignore_suffixes` each take a tuple of strings.
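In particular, a single keyword still needs tuple syntax with a trailing comma. A minimal, hypothetical call (the URL is just a placeholder) showing the shape:

```python
# Hypothetical call, only to show the parameter shape: ('pdf',) with the
# trailing comma is a one-element tuple, whereas ('pdf') is just a string.
c = Crawler('https://example.com', ignore_suffixes=('pdf',))
c.crawl()
```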
Or build a configuration dictionary and unpack it into the constructor:
```python
config = {
    "explore": True,
    "relative_urls": True,
    "max_depth": 3,
    "ignore_prefixes": ('http', 'mailto', '#'),
    "ignore_suffixes": ('docx',),
    "indent_depth": True,
}
c = Crawler('http://reelradio.com/index.html', **config)
c.crawl()
```
The `CrawlerStrategy` interface requires an `execute` method that processes a BeautifulSoup object.
```python
class MyStrategy(CrawlerStrategy):
    def execute(self, soup):
        # do stuff with soup
        pass
```
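As a slightly fuller sketch (hypothetical, assuming only that `execute` receives the parsed page as described above), a strategy can keep its own state across pages, here tallying which hosts are linked to:

```python
from collections import Counter
from urllib.parse import urlparse

class LinkHostStrategy(CrawlerStrategy):
    """Hypothetical strategy: count the hosts that each crawled page links to."""
    def __init__(self):
        self.hosts = Counter()

    def execute(self, soup):
        # soup is the BeautifulSoup object for the current page
        for a in soup.find_all('a', href=True):
            host = urlparse(a['href']).netloc or '(relative)'
            self.hosts[host] += 1

strategy = LinkHostStrategy()
c = Crawler('https://xkcd.com', strategy=strategy, explore=False)
c.crawl()
print(strategy.hosts.most_common(5))
```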
The included strategies are:

- `CrawlerStrategy`: the default when no strategy is passed; it does nothing
- `PrinterStrategy`: prints out all text from the page
- `TagStrategy`: generates a list of words and their occurrences
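For intuition, a printing strategy presumably amounts to little more than the following sketch (not the library's actual source; `get_text()` is the standard BeautifulSoup call for extracting a page's visible text):

```python
class SimplePrinterStrategy(CrawlerStrategy):
    """Sketch of a printing strategy, not the built-in PrinterStrategy itself."""
    def execute(self, soup):
        # Print all visible text from the parsed page
        print(soup.get_text())
```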
When testing strategies, it is convenient to pass `explore=False` so that only the first page is visited.
To print out the page's text:
```python
c = Crawler('https://xkcd.com', strategy=PrinterStrategy(), explore=False)
c.crawl()
```
To see a list of tags:
```python
c = Crawler('https://xkcd.com', strategy=TagStrategy(), explore=False)
c.crawl()
```