Extract and clean articles from science and social science encyclopedias #238

Open
Natkeeran opened this issue Dec 4, 2024 · 0 comments

The Tamil science and social science encyclopedias have been digitized here:

https://archive.org/details/utsc_tamil?tab=collection&query=encylopedia

The OCR from Internet Archive is quite good. Extract the title, article text, author(s) where available, metadata (volume and page number), and related images.

Prepare one large CSV that contains the full text of every article. Alternatively, store each article as a plain text file and have the CSV point to those files.
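As a rough sketch of the second option (plain text files plus a CSV index), something like the following could work. The `Article` class, the column names, the `article_id` scheme, and the `articles/` and `index.csv` paths are all assumptions made for illustration, not an agreed format; it uses only the Python standard library.

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Article:
    """One extracted encyclopedia article (hypothetical structure)."""
    title: str
    text: str
    volume: int
    page: int
    authors: list[str] = field(default_factory=list)  # empty when not available
    images: list[str] = field(default_factory=list)   # related image file names


# Hypothetical column layout for the index CSV.
FIELDNAMES = ["article_id", "title", "authors", "volume", "page", "images", "text_path"]


def write_dataset(articles, out_dir):
    """Write each article to its own UTF-8 text file and index them in a CSV."""
    out_dir = Path(out_dir)
    text_dir = out_dir / "articles"
    text_dir.mkdir(parents=True, exist_ok=True)
    index_path = out_dir / "index.csv"

    with index_path.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDNAMES)
        writer.writeheader()
        for n, art in enumerate(articles, start=1):
            # Assumed id scheme: volume, page, and a running number.
            article_id = f"v{art.volume:02d}-p{art.page:04d}-{n:05d}"
            text_path = text_dir / f"{article_id}.txt"
            text_path.write_text(art.text, encoding="utf-8")
            writer.writerow({
                "article_id": article_id,
                "title": art.title,
                "authors": "; ".join(art.authors),
                "volume": art.volume,
                "page": art.page,
                "images": "; ".join(art.images),
                # Relative path, so the dataset can be moved as a whole.
                "text_path": text_path.relative_to(out_dir).as_posix(),
            })
    return index_path
```

Keeping the text outside the CSV avoids very large cells and lets proofreaders edit an article without touching the index; for the single-CSV option, the `text_path` column could be swapped for a `text` column.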

This data can then be proofread and cleaned further, either manually or with help from tools such as a grammar checker or an LLM. For example, one column could list the mistakes identified by an LLM or a grammar checker. This will be useful for the (human) editor or proofreader.
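To illustrate what such a "mistakes" column might hold, here is a minimal sketch that uses a simple script-mixing heuristic as a stand-in for a real grammar checker or LLM. It flags tokens that mix Tamil letters with letters from another script, which is one common kind of OCR error. The function name `flag_suspect_tokens` is hypothetical, and the heuristic is only an assumption about what would be useful to flag.

```python
import unicodedata

# Tamil Unicode block.
TAMIL_START, TAMIL_END = 0x0B80, 0x0BFF


def is_tamil(ch):
    """True if the character lies in the Tamil Unicode block."""
    return TAMIL_START <= ord(ch) <= TAMIL_END


def flag_suspect_tokens(text):
    """Return whitespace-separated tokens that mix Tamil and non-Tamil letters."""
    suspects = []
    for token in text.split():
        # Only look at letters and combining marks; ignore digits and punctuation.
        letters = [ch for ch in token if unicodedata.category(ch)[0] in ("L", "M")]
        has_tamil = any(is_tamil(ch) for ch in letters)
        has_other = any(not is_tamil(ch) for ch in letters)
        if has_tamil and has_other:
            suspects.append(token)
    return suspects
```

The returned list could be joined into a single cell (for example with `"; "`) and written to an extra column next to each article, so the proofreader sees where to look first.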

This will be a very useful dataset for reference as well as machine learning.

Related Tickets:
#198
