DergiPark is one of the largest platforms providing electronic hosting for peer-reviewed academic articles in Turkey. In this project, I scraped all articles from DergiPark and parsed the data into 8 main headings. Afterwards, I exported that data to different file formats such as .jsonl (JSON Lines) and .txt (plain text). The complete data set is available in the DergiPark-Data-Set repository. The number of output formats can be extended by customizing the source code.
DergiPark currently hosts over 25,000 academic articles. I extracted them all through web scraping with Python. Web scraping is the process of extracting large amounts of data from a website by fetching its source code and parsing its tags.
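The fetch-and-parse loop described above can be sketched as follows. Note that the URL and the CSS selector here are illustrative assumptions, not DergiPark's actual page structure, which may differ:

```python
# Minimal web-scraping sketch: fetch a page's source and parse its tags.
# The URL and the "h5.card-title" selector are hypothetical examples.
import requests
from bs4 import BeautifulSoup


def parse_article_titles(html):
    """Extract article titles from a page's HTML source."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("h5.card-title")]


if __name__ == "__main__":
    response = requests.get("https://dergipark.org.tr/en/search")
    for title in parse_article_titles(response.text):
        print(title)
```

The parsing logic is kept separate from the network call, so it can be tested against a static HTML string without hitting the site.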
The extracted data can be used to train AI models or to give meaning to this data in other ways. Because it consists of peer-reviewed academic articles, it is suitable for use in any formal project.
I used Python as the main programming language.
For web scraping, I used the 'BeautifulSoup' and 'requests' modules. In addition, I used 'json', 'os', and 'time' for outputting the data and for the waiting sections between requests.
Download the project as an executable file from Releases and run the DergiPark.exe file.
Clone the project
git clone https://github.com/Alperencode/DergiPark-Project
Go to the project directory
cd DergiPark-Project
Install the required modules
pip install -r requirements.txt
Run the Python file
python main.py
Run main.py in the root directory
python main.py
Example of the scraper working correctly
Screenshot of the JSON Lines data
Screenshot of the txt data
- BeautifulSoup: My other web scraping projects
- Python: My main Python repository
- Algorithm-Solutions: My algorithm problem solutions in Python (LeetCode, HackerRank, CodeWars)