Scrape shodhganga.inflibnet.ac.in research theses and publish to archive.org #233

Open
tshrinivasan opened this issue Sep 16, 2024 · 2 comments
Labels
Programming நிரலாக்கம்

Comments

tshrinivasan commented Sep 16, 2024

https://shodhganga.inflibnet.ac.in/ is a website run by the Union Government of India to publish PhD research theses in all Indian languages.

It currently hosts more than 5.5 lakh (550,000) theses, all under a Creative Commons license: CC BY-NC-SA.

Write a program to scrape all of the PDF files, along with their metadata, and push them to archive.org.
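A rough sketch of the upload side, assuming the `internetarchive` Python package (`pip install internetarchive`, credentials set up with `ia configure`). The helper names, the `shodhganga-` identifier prefix, and the choice of metadata fields are illustrative assumptions, not something decided in this issue. The license URL is passed in because the exact CC BY-NC-SA version is not stated here.

```python
import re


def make_identifier(handle: str, prefix: str = "shodhganga") -> str:
    """Turn a DSpace handle such as '10603/12345' into an archive.org-style identifier.

    The prefix and the example handle are hypothetical.
    """
    # Keep letters, digits, '.', '_' and '-'; collapse everything else into '-'.
    slug = re.sub(r"[^A-Za-z0-9._-]+", "-", handle).strip("-")
    return f"{prefix}-{slug}"


def build_metadata(title, author, language, license_url, source_url):
    """Map scraped fields onto archive.org metadata keys (field choice is an assumption)."""
    return {
        "mediatype": "texts",
        "title": title,
        "creator": author,
        "language": language,
        "licenseurl": license_url,
        "source": source_url,
    }


def push_to_archive(identifier, pdf_path, metadata):
    """Upload one merged PDF to archive.org."""
    # Imported lazily so the helpers above can be used without the package installed.
    from internetarchive import upload

    return upload(identifier, files=[pdf_path], metadata=metadata)
```

A run over the whole site would call `push_to_archive` once per thesis, after its parts are merged, and should keep a log of finished identifiers so an interrupted job can resume.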

Note: each thesis is stored there as a multipart PDF. Merge all the parts of each work into a single file.
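A minimal sketch of the merge step, assuming the parts are already downloaded and that their file names begin with a sequence number (for example `1_title.pdf`, `2_certificate.pdf`, `10_chapter.pdf`); that naming is an assumption and should be checked against the site. It uses the `pypdf` package (`pip install pypdf`) for the merge itself.

```python
import re
from pathlib import Path


def natural_key(name: str):
    """Split a file name into text and integer chunks so '2_x' sorts before '10_x'."""
    return [int(tok) if tok.isdigit() else tok.lower() for tok in re.split(r"(\d+)", name)]


def order_parts(paths):
    """Return the part files in reading order, based on their file names."""
    return sorted(paths, key=lambda p: natural_key(Path(p).name))


def merge_parts(paths, output_path):
    """Concatenate all parts of one thesis into a single PDF."""
    # Imported lazily so the ordering helpers work without pypdf installed.
    from pypdf import PdfWriter

    writer = PdfWriter()
    for part in order_parts(paths):
        writer.append(str(part))
    with open(output_path, "wb") as fh:
        writer.write(fh)
    writer.close()
```

Sorting by name is only a fallback. If the item page lists the parts in a fixed order, that order is probably the safer one to follow.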

The site is built with DSpace, an open source software package for digital repositories: https://dspace.lyrasis.org/
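Since the site runs DSpace, the metadata may be available without much HTML guesswork. DSpace item pages commonly embed Dublin Core fields as `<meta name="DC.title" content="...">` tags in the page head; whether Shodhganga's pages do the same is an assumption to verify. A standard-library sketch that collects those tags from an already-downloaded page (the class and function names are made up for illustration):

```python
from collections import defaultdict
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta> tags whose name looks like a Dublin Core or citation field."""

    def __init__(self):
        super().__init__()
        self.fields = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        # A field such as DC.creator can repeat, so every value is kept in a list.
        if content and name.startswith(("DC.", "DCTERMS.", "citation_")):
            self.fields[name].append(content)


def extract_metadata(html: str) -> dict:
    """Return {field name: [values]} for the metadata tags found in one item page."""
    parser = DublinCoreParser()
    parser.feed(html)
    return dict(parser.fields)
```

Fetching the pages is left out on purpose. With this many items, the crawler should be rate limited and should respect the site's robots.txt.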

Check for any existing web scrapers for DSpace.

tshrinivasan added the Programming நிரலாக்கம் label Sep 16, 2024

tshrinivasan commented

Explore this: https://github.com/thenicekat/Scrapers_BPHC

ksspam commented Nov 1, 2024

I'm interested in this!
Is this the correct metadata we need?
