Web scraping for content acquisition and business process automation is a key enabler for many thriving businesses. For example, this article explains how the average investment firm spends about $900,000 per year on alternative data. But writing and maintaining spiders is hard, and while there are lots of screen scraping utilities available, these utilities tend to solve only a piece of the overall problem. There is a distinct lack of an operationally complete, enterprise-ready web scraping solution which:
- makes it easy to author spiders that are robust enough to survive most types of web site change
- provides an infrastructure that makes it easy to discover and run spiders through a well-defined (RESTful) API
- detects spider failures and gathers enough evidence about each failure that debugging is easy
Further complicating this overall challenge is that while web scraping might be a key enabler for a business, web scraping isn't typically something most companies want to become experts in. Far too many companies are working far too hard on solving the web scraping problem instead of offloading/outsourcing this responsibility to a Web Scraping as a Service (WSaaS) offering.
Why is it so hard to write a spider? When someone uses a web browser to surf the web, the browser generates network traffic to interact with a web server. Traditional spiders are written to mimic the network traffic a browser generates as someone interacts with it. However, it’s hard to write these kinds of spiders, and it’s getting much harder as web sites increasingly leverage AJAX-like patterns. In addition, this approach to spider writing creates spiders that are very brittle - even minor web site changes can cause spiders to break in ways that are hard to debug.
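To make the brittleness concrete, here is a minimal sketch of a traditional, network-level spider. The URL, query parameters and markup are hypothetical; the point is that the spider reproduces the browser's HTTP requests and then picks data out of raw HTML, so it breaks as soon as the site changes its markup or starts rendering the data with JavaScript.

```python
# A hypothetical network-level spider: it mimics the browser's HTTP request
# and then extracts data directly from the returned HTML.
import re

import requests


def scrape_latest_price(symbol):
    # Reproduce the GET request a browser would make, including headers the site expects.
    response = requests.get(
        "https://example.com/quote",
        params={"symbol": symbol},
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=30,
    )
    response.raise_for_status()

    # The brittle part: the spider is coupled to the page's exact markup.
    # If the site renames the element or renders the price via AJAX,
    # this pattern silently stops matching.
    match = re.search(r'<span class="last-price">([\d.]+)</span>', response.text)
    if match is None:
        raise ValueError("page structure changed - price not found")
    return float(match.group(1))


if __name__ == "__main__":
    print(scrape_latest_price("ABC"))
```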
There are some very important trends which can be leveraged to realize our spidering dreams:
- the widespread adoption of automated testing has fueled some important trends/milestones:
  - web sites are being built to be tested using automated mechanisms
  - automated testing tools have become very robust
  - lots of "QA automation" companies are very familiar with Selenium WebDriver
- the number of IaaS providers continues to increase
- IaaS costs continue to drop
- an increasing number of very capable CI services are available
- efficiently running and isolating a variety of workload types on an IaaS offering at scale has become easier with Docker and Kubernetes
- organizations are increasingly comfortable adopting aaS offerings
Write spiders in a high-level scripting language (Python) using tools/APIs designed for automated testing (Selenium), and package collections of spiders in a Docker image (see the sketch after this list). This means:
- spiders are easy to write
- spiders are reliable even in the face of most web site changes
- it's possible to outsource spider development and maintenance
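As a rough sketch of what this looks like in practice, the spider below drives a headless browser through Selenium WebDriver the same way a QA automation test would. The URL and element ID are hypothetical; the point is that the spider interacts with the rendered page rather than with raw network traffic.

```python
# A hypothetical Selenium-based spider: it drives a real (headless) browser,
# so AJAX-rendered content is available and locators can target stable,
# semantic identifiers instead of raw markup.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def run_spider(symbol):
    options = Options()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")  # typically required when running inside a Docker container

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/quote?symbol=" + symbol)
        # Locate data the way an automated test would - by a stable locator
        # rather than by pattern matching on the raw HTML.
        price_element = driver.find_element(By.ID, "last-price")
        return {"symbol": symbol, "price": price_element.text}
    finally:
        driver.quit()


if __name__ == "__main__":
    print(run_spider("ABC"))
```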
Use a RESTful API for discovering and running spiders. The service implementing the RESTful API is hosted on an IaaS provider. To run a spider, the service locates the Docker image containing the spider, makes sure the latest version of that image is available, and uses it to create a Docker container in which the spider runs. Inside the Docker container a headless browser is started and the spider runs against that headless browser. Use Kubernetes for all orchestration and operation of Docker containers, both to run the service behind the RESTful API and to run the spiders.
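To illustrate the shape of such an API, the client sketch below discovers the available spiders and asks the service to run one. The endpoint paths, spider name and response formats are hypothetical and only illustrate the discover/run pattern; behind the scenes the service would pull the spider's Docker image and run the spider in a container as described above.

```python
# A hypothetical client of the spider service's RESTful API.
import requests

API_BASE = "https://wsaas.example.com/v1"  # hypothetical service endpoint


def list_spiders():
    # Discovery: ask the service which spiders it knows about.
    response = requests.get(API_BASE + "/spiders", timeout=30)
    response.raise_for_status()
    return response.json()


def run_spider(spider_name, **spider_args):
    # Running: the service pulls the latest Docker image containing the spider,
    # creates a container with a headless browser, and runs the spider in it.
    response = requests.post(
        API_BASE + "/spiders/" + spider_name + "/runs",
        json={"args": spider_args},
        timeout=300,
    )
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    print(list_spiders())
    print(run_spider("example-quotes", symbol="ABC"))
```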
Questions often surface about the legality of web scraping. The following offers a perspective on this topic.