
DergiPark Project

Description

DergiPark is one of the largest websites in Turkey that provides electronic hosting for academic, peer-reviewed articles. In this project, I extracted all articles from DergiPark and parsed the data into 8 main headings. I then exported the data to different file formats, such as .jsonl (JSON Lines) and .txt (plain text). The complete data set is available in the DergiPark-Data-Set repository. More output formats can be added by customizing the source code.

DergiPark currently hosts over 25,000 academic articles. I extracted them all through web scraping with Python. Web scraping is the process of extracting large amounts of data from a website by downloading its source code (HTML) and parsing the relevant tags.

The extracted data can be used with AI models, either to derive meaning from the data or to train a model on it. Because the data consists of academic, peer-reviewed articles, it can be used in any formal project.
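
The snippet below is a minimal sketch of how the exported .jsonl file could be loaded for such use. The file name and the record keys are assumptions for illustration; the actual 8 headings are defined by the project's output files.

```python
import json


def load_articles(path="articles.jsonl"):
    """Read one article record per line from a JSON Lines file."""
    articles = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                articles.append(json.loads(line))
    return articles


if __name__ == "__main__":
    articles = load_articles()  # assumed output file name
    print(f"Loaded {len(articles)} articles")
    if articles:
        # The keys of each record correspond to the 8 main headings.
        print(list(articles[0].keys()))
```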


Technologies Used

I used Python as the main programming language.

For web scraping, I used the 'BeautifulSoup' and 'requests' modules. In addition, I used 'json', 'os', and 'time' for writing the output files and adding delays between requests.
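
The snippet below is a minimal sketch of how these modules fit together; the URL, the selector, and the field names are placeholders, and the real scraping and pagination logic lives in main.py.

```python
import json
import time

import requests
from bs4 import BeautifulSoup


def scrape_article(url):
    """Fetch one article page and parse an example field from its HTML."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    title_tag = soup.find("h1")  # placeholder selector; the real tags differ
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "url": url,
    }


if __name__ == "__main__":
    # Hypothetical article URL for illustration only.
    record = scrape_article("https://dergipark.org.tr/en/pub/example")
    with open("articles.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    time.sleep(1)  # wait between requests to avoid overloading the site
```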


Installation

1) Download


Download the project as an executable file from Releases and run the DergiPark.exe file.


2) Clone


Clone the project

git clone https://github.com/Alperencode/DergiPark-Project

Go to the project directory

cd DergiPark-Project

Install the required modules

pip install -r requirements.txt

Run the Python file

python main.py

Usage/Examples

Run main.py in the root directory

python main.py

Example of the program running correctly


Screenshots

Screenshot of the JSON Lines data


Screenshot of the .txt data


Related


Authors