Scrape shodhganga.inflibnet.ac.in research theses and publish to archive.org #233

tshrinivasan · 2024-09-16T04:51:53Z

https://shodhganga.inflibnet.ac.in/ is the Union Government of India's website for publishing PhD research theses in all Indian languages.

It currently hosts more than 5.5 lakh (550,000) theses, all under the Creative Commons CC BY-NC-SA license.

Write a program to scrape all the PDF files along with their metadata and push them to archive.org.
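A minimal sketch of the upload side, assuming the third-party `internetarchive` Python package is used and that archive.org credentials have already been configured for it. The record keys (`handle`, `title`, `author`, `language`, `url`), the `shodhganga-` identifier prefix, and the 4.0 licence version are illustrative assumptions, not something the site or this issue specifies.

```python
import re

# Assumption: the issue only says "CC BY-NC-SA"; the 4.0 version is a guess
# and should be checked against what each thesis page actually states.
LICENSE_URL = "https://creativecommons.org/licenses/by-nc-sa/4.0/"


def make_identifier(handle: str) -> str:
    """Turn a repository handle (e.g. a hypothetical '10603/12345') into an
    identifier made only of letters, digits and hyphens."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", handle).strip("-")
    return f"shodhganga-{slug}"


def to_ia_metadata(record: dict) -> dict:
    """Map scraped fields (hypothetical keys) to archive.org metadata fields."""
    meta = {
        "mediatype": "texts",
        "title": record["title"],
        "creator": record.get("author", ""),
        "language": record.get("language", ""),
        "source": record.get("url", ""),
        "licenseurl": LICENSE_URL,
    }
    # Drop fields that were not scraped so no empty values are uploaded.
    return {key: value for key, value in meta.items() if value}


def upload_thesis(record: dict, pdf_path: str) -> None:
    """Upload one merged PDF with its metadata to archive.org."""
    # Imported lazily so the helpers above work without the package installed.
    from internetarchive import upload

    upload(
        make_identifier(record["handle"]),
        files=[pdf_path],
        metadata=to_ia_metadata(record),
    )
```

Keeping the metadata mapping separate from the upload call makes it possible to review the mapped fields for a few theses before anything is pushed.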

Note: each thesis is stored there as a multi-part PDF. Merge all the parts of each work into a single file.
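The merge step could be sketched as below, assuming the parts are already downloaded and that their file names start with a sequence number (the names in the usage note are hypothetical). The third-party `pypdf` package is one possible choice for the merge itself; any PDF library with an append operation would do.

```python
import re
from pathlib import Path


def natural_key(path: Path):
    """Sort key that compares the digit runs in a file name as numbers,
    so that '10_x.pdf' sorts after '2_x.pdf'."""
    return [
        int(token) if token.isdigit() else token.lower()
        for token in re.split(r"(\d+)", path.name)
    ]


def order_parts(paths):
    """Return the part files in the order they should appear in the merged PDF."""
    return sorted((Path(p) for p in paths), key=natural_key)


def merge_parts(paths, output_path):
    """Concatenate the ordered parts into a single PDF."""
    # Imported lazily so the ordering helpers work without pypdf installed.
    from pypdf import PdfWriter

    writer = PdfWriter()
    for part in order_parts(paths):
        writer.append(str(part))
    with open(output_path, "wb") as handle:
        writer.write(handle)
    writer.close()
```

For example, `merge_parts(["10_chapter.pdf", "2_certificate.pdf", "1_title.pdf"], "thesis.pdf")` would append the title page first and the chapter last. If the real part names do not carry a sequence number, the order shown on the thesis page would have to be recorded during scraping instead.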

The site is built with DSpace, an open-source software package for digital repositories: https://dspace.lyrasis.org/

Check whether any web scrapers for DSpace already exist.
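Besides HTML scraping, DSpace can expose metadata over OAI-PMH, which may be worth checking first. The sketch below assumes the interface is enabled on this site and reachable at `/oai/request`; that endpoint path is an assumption and needs to be verified. It only builds `ListRecords` URLs and parses the Dublin Core fields from a response, leaving the HTTP fetching and rate limiting out.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Standard OAI-PMH and Dublin Core XML namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Assumption: the endpoint path is not confirmed for this site.
OAI_ENDPOINT = "https://shodhganga.inflibnet.ac.in/oai/request"


def list_records_url(endpoint=OAI_ENDPOINT, resumption_token=None):
    """Build a ListRecords URL; follow-up pages send only the resumption token."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return f"{endpoint}?{urlencode(params)}"


def parse_records(xml_text):
    """Return (records, next_token) parsed from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext(
                "oai:header/oai:identifier", default="", namespaces=NS
            ),
            "title": rec.findtext(".//dc:title", default="", namespaces=NS),
            "creators": [el.text or "" for el in rec.iterfind(".//dc:creator", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    # An absent or empty token means there are no more pages.
    return records, (token or None)
```

A harvester would fetch `list_records_url()`, pass the response body to `parse_records`, and repeat with the returned token until it is `None`. The PDF parts themselves would still have to be located from each item's page.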

tshrinivasan · 2024-09-16T04:58:19Z

ksspam · 2024-11-01T13:52:47Z

I'm interested in this!
Is this the correct metadata that we need?

tshrinivasan added the Programming நிரலாக்கம் label Sep 16, 2024
