Web scraping for content acquisition and business process automation is a key enabler for many thriving businesses. For example, this article explains how the average investment firm spends about $900,000 per year on alternative data. But writing and maintaining spiders is hard, and while there are lots of screen scraping utilities available, these utilities tend to solve only a piece of the overall problem. There is a distinct lack of an operationally complete, enterprise-ready web scraping solution which:
- makes it easy to author spiders that are robust enough to survive most types of web site change
- provides an infrastructure that makes it easy to discover and run spiders through a well-defined (RESTful) API
- detects spider failures and gathers enough evidence about each failure that debugging is easy
Further complicating this overall challenge is that while web scraping might be a key enabler for a business, web scraping isn't typically something most companies want to become experts in. Far too many companies are working far too hard on solving the web scraping problem instead of offloading/outsourcing this responsibility to a Web Scraping as a Service (WSaaS) offering.
Why is it so hard to write a spider? When someone uses a web browser to surf the web, the browser generates network traffic to interact with a web server. Traditional spiders are written to mimic the network traffic a browser generates as someone interacts with it. However, it’s hard to write these kinds of spiders, and it’s getting much harder as web sites increasingly leverage AJAX-like patterns. In addition, this approach to spider writing creates spiders that are very brittle - even minor web site changes can cause spiders to break in ways that are hard to debug.
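To make the brittleness concrete, here is a minimal sketch of a traditional, network-level spider. The URL, query parameters and markup are hypothetical; the point is that the spider reproduces the browser's HTTP requests and then picks data out of raw HTML, so it breaks as soon as the site changes its markup or starts rendering the data with JavaScript.

```python
# A hypothetical network-level spider: it mimics the browser's HTTP request
# and then extracts data directly from the returned HTML.
import re

import requests


def scrape_latest_price(symbol):
    # Reproduce the GET request a browser would make, including headers the site expects.
    response = requests.get(
        "https://example.com/quote",
        params={"symbol": symbol},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    )
    response.raise_for_status()

    # The brittle part: the spider is coupled to the page's exact markup.
    # If the site renames the element or renders the price via AJAX,
    # this pattern silently stops matching.
    match = re.search(r'<span class="last-price">([\d.]+)</span>', response.text)
    if match is None:
        raise ValueError("page structure changed - price not found")
    return float(match.group(1))


if __name__ == "__main__":
    print(scrape_latest_price("ABC"))
```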
There are some very important trends which can be leveraged to realize our spidering dreams:
- the widespread adoption of automated testing has fueled some important trends/milestones:
  - web sites are being built to be tested using automated mechanisms
  - automated testing tools have become very robust
  - lots of "QA automation" companies are very familiar with Selenium WebDriver
- the number of IaaS providers continues to increase
- IaaS costs continue to drop
- an increasing number of very capable CI services are available
- efficiently running and isolating a variety of workload types on an IaaS offering at scale has become easier with Docker and Kubernetes
- organizations are increasingly comfortable adopting aaS offerings
Write spiders in a high-level scripting language (Python) using tools/APIs designed for automated testing (Selenium), and package collections of spiders in a Docker image (see the sketch after this list). This means:
- spiders are easy to write
- spiders are reliable even in the face of most web site changes
- it's possible to outsource spider development and maintenance
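As a rough sketch of what this looks like in practice, the spider below drives a headless browser through Selenium WebDriver the same way a QA automation test would. The URL and element ID are hypothetical; the point is that the spider interacts with the rendered page rather than with raw network traffic.

```python
# A hypothetical Selenium-based spider: it drives a real (headless) browser,
# so AJAX-rendered content is available and locators can target stable,
# semantic identifiers instead of raw markup.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def run_spider(symbol):
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")  # typically required when running inside a Docker container

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/quote?symbol=" + symbol)
        # Locate data the way an automated test would - by a stable locator
        # rather than by pattern matching on the raw HTML.
        price_element = driver.find_element(By.ID, "last-price")
        return {"symbol": symbol, "price": price_element.text}
    finally:
        driver.quit()


if __name__ == "__main__":
    print(run_spider("ABC"))
```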
Use a RESTful API for discovering and running spiders. The service implementing the RESTful API is hosted on an IaaS provider. To run a spider, the service locates the Docker image containing the spider, makes sure the latest version of that image is available, and uses it to create a Docker container in which the spider runs. Inside the Docker container a headless browser is started and the spider runs against that headless browser. Use Kubernetes for all orchestration and operation of Docker containers, both to run the service behind the RESTful API and to run the spiders.
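To illustrate the shape of such an API, the client sketch below discovers the available spiders and asks the service to run one. The endpoint paths, spider name and response formats are hypothetical and only illustrate the discover/run pattern; behind the scenes the service would pull the spider's Docker image and run the spider in a container as described above.

```python
# A hypothetical client of the spider service's RESTful API.
import requests

API_BASE = "https://wsaas.example.com/v1"  # hypothetical service endpoint


def list_spiders():
    # Discovery: ask the service which spiders it knows about.
    response = requests.get(API_BASE + "/spiders", timeout=30)
    response.raise_for_status()
    return response.json()


def run_spider(spider_name, **spider_args):
    # Running: the service pulls the latest Docker image containing the spider,
    # creates a container with a headless browser, and runs the spider in it.
    response = requests.post(
        API_BASE + "/spiders/" + spider_name + "/runs",
        json={"args": spider_args},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    print(list_spiders())
    print(run_spider("example-quotes", symbol="ABC"))
```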
Questions often surface about the legality of web scraping. The following offers a perspective on this topic.