This program scrapes book data from a list of Amazon links stored in a spreadsheet and generates a CSV file with fields such as:
- ASIN
- Name
- Author
- Stars
- ISBN10 and ISBN13 number
- Item category, subcategory and specific category
- Item availability
- Original price and sale price
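As a rough illustration of the output, one CSV row could be laid out as in the sketch below. The column names, the `rows_to_csv` helper, and the sample values are assumptions made for this example, not the program's actual headers or code. The standard `csv` module stands in for pandas here only to keep the snippet free of dependencies.

```python
import csv
import io

# Hypothetical column names, one per field listed above.
COLUMNS = [
    "asin", "name", "author", "stars", "isbn10", "isbn13",
    "category", "subcategory", "specific_category",
    "availability", "original_price", "sale_price",
]


def rows_to_csv(rows):
    """Serialize a list of book dicts to CSV text with a fixed column order."""
    buffer = io.StringIO()
    # Fields missing from a row are written as empty cells (restval="").
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder record; the values are made up for illustration.
sample = {
    "asin": "B000000000",
    "name": "Example Book",
    "author": "Jane Doe",
    "stars": 4.5,
}
print(rows_to_csv([sample]))
```

Fixing the column order up front keeps every row aligned even when a page is missing some fields, such as a sale price.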
Input links:
Usage:
Repeated links: The program detects repeated books by their ASIN and skips them.
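A minimal sketch of ASIN-based duplicate detection follows. It assumes the ASIN can be read from the link itself, as the 10-character code after `/dp/` or `/gp/product/`. The program may obtain the ASIN differently, and the names `extract_asin` and `drop_repeated_links` are hypothetical.

```python
import re

# Assumption: the 10-character ASIN follows /dp/ or /gp/product/ in the URL.
ASIN_PATTERN = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})(?:[/?]|$)")


def extract_asin(url):
    """Return the ASIN found in an Amazon product URL, or None if there is none."""
    match = ASIN_PATTERN.search(url)
    return match.group(1) if match else None


def drop_repeated_links(urls):
    """Keep the first link for each ASIN and skip later links to the same book."""
    seen = set()
    unique = []
    for url in urls:
        asin = extract_asin(url)
        # In this sketch, links without a recognizable ASIN are skipped too.
        if asin is None or asin in seen:
            continue
        seen.add(asin)
        unique.append(url)
    return unique


links = [
    "https://www.amazon.com/dp/1234567890",
    "https://www.amazon.com/Some-Title/dp/1234567890/ref=sr_1_1",
    "https://www.amazon.com/gp/product/B00ABCDEFG?psc=1",
]
print(drop_repeated_links(links))
```

In this example the second link is dropped because it points to the same ASIN as the first, even though the two URLs differ.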
Output data:
- Clone the repo
git clone [email protected]:guidosantillan01/amazon-books-web-scraper.git
- Run the modules/main.py file:
PYTHON_PATH/python.exe "FOLDER_PATH/amazon-books-web-scraper/modules/main.py"
Example:
C:/ProgramData/Anaconda3/python.exe f:/Downloads/amazon-books-web-scraper/modules/main.py
- You will need Python. The Anaconda Distribution is recommended.
- Install the VSCode Python extension.
- Install these Python libraries:
pip3 install pandas certifi urllib3
- pandas 0.23.4 or greater is required. If you have an older version, upgrade it with:
pip3 install --upgrade pandas
- See the variables.py file to change the program's behavior, for example:
  - Number of books to scrape
  - Names and directory of the input and output files
  - Regex filtering queries
- Useful web scraping resources
- XPATHS for web scraping Amazon
- Python 3.7
- Pandas 0.23.4
- certifi 2018.8.24
- urllib3 1.23
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
- Guido Santillan Arias - [email protected] - www.guidosantillan.com
This project is licensed under the MIT license - see the LICENSE file for details.