
ubvu/abstract_scraper

 
 


Abstract Scraper

abstract-scraper is a Python command-line tool that fetches the abstracts of scientific articles from their DOIs. It uses the PyAlex library to query metadata from the OpenAlex API and processes large datasets in parallel.
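For background, OpenAlex does not return abstracts as plain text; it returns an inverted index that maps each word to the positions where it occurs. PyAlex performs this conversion for you. The sketch below only illustrates what that conversion involves. The function name and the sample data are made up for illustration and are not part of this repository.

```python
def abstract_from_inverted_index(inverted_index):
    """Rebuild plain text from an OpenAlex-style inverted index.

    The index maps each word to the list of positions where it occurs.
    Returns None when no index is available (no abstract on record).
    """
    if not inverted_index:
        return None
    word_at = {}
    for word, positions in inverted_index.items():
        for position in positions:
            word_at[position] = word
    # Emit the words in positional order to recover the original text.
    return " ".join(word_at[position] for position in sorted(word_at))


# Made-up example data: "open" occurs at positions 0 and 3.
example = {"open": [0, 3], "data": [1], "needs": [2], "tools": [4]}
text = abstract_from_inverted_index(example)
```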

Features

  • Fetches abstracts for scientific articles using their DOIs.
  • Supports parallel processing with configurable worker count.
  • Periodically saves progress to prevent data loss.
  • Simple and intuitive command-line interface (CLI).

Installation

Prerequisites

  • Python 3.12 or later
  • Poetry for dependency management

Steps

  1. Clone the repository:

    git clone https://github.com/HuberNicolas/abstract_scraper
    cd abstract_scraper
  2. Install dependencies using Poetry:

    poetry install

Usage

Run the script with the following command:

python -m main <input_file> <output_file> [--num_workers <int>] [--save_interval <int>]

Required Arguments

  • <input_file>: Path to a CSV file with a doi column containing the article DOIs.
  • <output_file>: Path where the updated CSV with fetched abstracts will be saved.

Optional Arguments

  • --num_workers: Number of parallel workers to use (default: 4).
  • --save_interval: Save progress after processing this many rows (default: 50).
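For reference, the argument layout above could be expressed with argparse as in the sketch below. Whether the tool actually uses argparse is an assumption (it may rely on a different CLI library), and build_parser is a hypothetical name. Only the argument names and defaults are taken from this README.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented arguments and defaults."""
    parser = argparse.ArgumentParser(
        prog="main",
        description="Fetch abstracts for the DOIs listed in a CSV file.",
    )
    parser.add_argument("input_file", help="CSV file with a 'doi' column")
    parser.add_argument("output_file", help="path for the CSV with abstracts")
    parser.add_argument("--num_workers", type=int, default=4,
                        help="number of parallel workers (default: 4)")
    parser.add_argument("--save_interval", type=int, default=50,
                        help="save progress after this many rows (default: 50)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(
    ["data/sdg_data.csv", "data/sdg_data_abstracts.csv", "--num_workers", "2"]
)
```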

Example

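poetry shell # activate the Poetry environment (Poetry 2.x needs the shell plugin; alternatively prefix the next command with "poetry run")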
python -m main data/sdg_data.csv data/sdg_data_abstracts.csv --num_workers 2 --save_interval 10

Input File Format

The input CSV file must include a doi column with the DOIs of the articles. Example:

doi
10.1234/example-doi-1
10.5678/example-doi-2
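Before starting a long run, it can help to confirm that the input file has the required column. The sketch below shows one way to do that with the standard csv module. The function read_dois is hypothetical and is not part of this repository.

```python
import csv
import io


def read_dois(csv_file):
    """Return the non-empty DOIs from an open CSV file.

    Raises ValueError early if the required 'doi' column is missing.
    """
    reader = csv.DictReader(csv_file)
    if "doi" not in (reader.fieldnames or []):
        raise ValueError("input CSV must contain a 'doi' column")
    dois = []
    for row in reader:
        doi = (row["doi"] or "").strip()
        if doi:  # skip rows where the DOI cell is empty
            dois.append(doi)
    return dois


# In-memory stand-in for the example input file shown above.
sample = io.StringIO("doi\n10.1234/example-doi-1\n10.5678/example-doi-2\n")
dois = read_dois(sample)
```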

Output File

The script saves the output CSV with the following additional column:

  • abstract: Contains the fetched abstract for each article.

Example output:

doi                    abstract
10.1234/example-doi-1  This is an example abstract.
10.5678/example-doi-2  Another example abstract.

License

This project is licensed under the MIT License. See the LICENSE file for details.
