# abstract-scraper

`abstract-scraper` is a Python-based CLI tool for fetching abstracts of scientific articles from DOIs. It uses the Pyalex library to access metadata from the OpenAlex API, enabling efficient, parallelized processing of large datasets.
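Under the hood, Pyalex can resolve a DOI directly and reconstruct the plain-text abstract from OpenAlex's inverted index. A minimal sketch, assuming Pyalex is installed (`pip install pyalex`); the DOI below is an arbitrary published article used only for illustration, not part of this project:

```python
from pyalex import Works

# Resolve a work by DOI; OpenAlex accepts full DOI URLs as identifiers.
work = Works()["https://doi.org/10.7717/peerj.4375"]

# Pyalex rebuilds the plain-text abstract from OpenAlex's
# abstract_inverted_index field; this is None when no abstract exists.
print(work["abstract"])
```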
## Features

- Fetches abstracts for scientific articles using their DOIs.
- Supports parallel processing with a configurable worker count (see the sketch after this list).
- Periodically saves progress to prevent data loss.
- Simple and intuitive command-line interface (CLI).
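The following is a hedged sketch of how parallel fetching and periodic checkpointing can fit together; the function names and logic are illustrative assumptions, not the tool's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from pyalex import Works


def fetch_abstract(doi: str):
    """Return the abstract for a DOI, or None on any failure. (Hypothetical helper.)"""
    try:
        return Works()[f"https://doi.org/{doi}"]["abstract"]
    except Exception:
        return None


def run(input_file: str, output_file: str, num_workers: int = 4, save_interval: int = 50):
    df = pd.read_csv(input_file)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map yields results in input order, so row i matches DOI i.
        for i, abstract in enumerate(pool.map(fetch_abstract, df["doi"])):
            df.loc[i, "abstract"] = abstract
            # Checkpoint periodically so partial progress survives a crash.
            if (i + 1) % save_interval == 0:
                df.to_csv(output_file, index=False)
    df.to_csv(output_file, index=False)
```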
## Requirements

- Python 3.12 or later
- Poetry for dependency management
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/HuberNicolas/abstract_scraper
   cd abstract-scraper
   ```

2. Install dependencies using Poetry:

   ```bash
   poetry install
   ```
## Usage

Run the script with the following command:

```bash
python -m main <input_file> <output_file> [--num_workers <int>] [--save_interval <int>]
```
- `<input_file>`: Path to the CSV file containing a `doi` column with the DOIs of the articles.
- `<output_file>`: Path where the updated CSV with fetched abstracts will be saved.
- `--num_workers`: Number of parallel workers to use (default: 4).
- `--save_interval`: Save progress after processing this many rows (default: 50).
For example:

```bash
poetry shell  # activate the Poetry environment
python -m main data/sdg_data.csv data/sdg_data_abstracts.csv --num_workers 2 --save_interval 10
```
## Input Format

The input CSV file must include a `doi` column with the DOIs of the articles. Example:

| doi |
| --- |
| 10.1234/example-doi-1 |
| 10.5678/example-doi-2 |
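If you need to produce such a file programmatically, here is a minimal sketch with pandas; the file name and DOIs are the placeholders from above:

```python
import pandas as pd

# Write a minimal input CSV with the required "doi" column.
pd.DataFrame({"doi": ["10.1234/example-doi-1", "10.5678/example-doi-2"]}).to_csv(
    "data/sdg_data.csv", index=False
)
```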
## Output

The script saves the output CSV with the following additional column:

- `abstract`: Contains the fetched abstract for each article.

Example output:

| doi | abstract |
| --- | --- |
| 10.1234/example-doi-1 | This is an example abstract. |
| 10.5678/example-doi-2 | Another example abstract. |
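As an optional follow-up, you can check how many rows came back without an abstract; this assumes the output path from the usage example above:

```python
import pandas as pd

# Load the output CSV and count rows where no abstract was fetched.
df = pd.read_csv("data/sdg_data_abstracts.csv")
missing = df["abstract"].isna().sum()
print(f"{missing} of {len(df)} rows have no abstract")
```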
## License

This project is licensed under the MIT License. See the LICENSE file for details.