Extract and clean articles from science and social sciences encyclopedias #238

Natkeeran · 2024-12-04T15:59:46Z

The Tamil science and social science encyclopedias have been digitized here:

https://archive.org/details/utsc_tamil?tab=collection&query=encylopedia

The OCR from Internet Archive is quite good. Extract the title, article text, author(s) where available, metadata (volume and page number), and related images.

Prepare a large CSV containing the text of all articles. Alternatively, store each article as a plain text file and have the CSV point to those files.
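
A minimal sketch of the second option (one text file per article, with the CSV holding metadata and a pointer), to make the proposed layout concrete. The column names, the `Article` fields, and the `text/000001.txt` naming are all assumptions for illustration, not an agreed schema.

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Article:
    # Hypothetical record; field names are illustrative only.
    title: str
    text: str
    volume: int
    page: int
    authors: list[str] = field(default_factory=list)
    images: list[str] = field(default_factory=list)


def write_articles(articles, out_dir):
    """Write each article to its own .txt file and index them in articles.csv."""
    out_dir = Path(out_dir)
    (out_dir / "text").mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "articles.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(
            ["id", "title", "authors", "volume", "page", "images", "text_file"]
        )
        for i, a in enumerate(articles, start=1):
            # Pointer stored relative to out_dir so the dataset stays portable.
            rel = f"text/{i:06d}.txt"
            (out_dir / rel).write_text(a.text, encoding="utf-8")
            writer.writerow(
                [i, a.title, "; ".join(a.authors), a.volume, a.page,
                 "; ".join(a.images), rel]
            )
    return csv_path
```

Keeping the article body out of the CSV avoids very large cells with embedded newlines, which some spreadsheet tools handle poorly; the single-CSV option would simply add a `text` column instead of `text_file`.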

This data can then be further proofread and cleaned, either manually or with assistance from tools such as a grammar checker or an LLM. For example, one column could list the mistakes identified by an LLM or a grammar checker. This would be useful for the (human) editor/proofreader.
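
As a rough sketch of the "mistakes column" idea, the snippet below adds a `possible_issues` column to each row. The checker shown is only a stand-in heuristic (it flags tokens that mix Tamil characters with Latin letters or digits, a common OCR artifact); it is not a grammar checker or an LLM, and a real one could be passed in via the `checker` argument. The column name and the `text` key are assumptions.

```python
import re

# Tamil Unicode block; used only by the stand-in heuristic below.
TAMIL = re.compile(r"[\u0B80-\u0BFF]")
LATIN_OR_DIGIT = re.compile(r"[A-Za-z0-9]")


def flag_suspect_tokens(text):
    """Stand-in checker: return tokens mixing Tamil with Latin letters or digits."""
    return [
        tok for tok in text.split()
        if TAMIL.search(tok) and LATIN_OR_DIGIT.search(tok)
    ]


def add_issue_column(rows, checker=flag_suspect_tokens):
    """Yield copies of each row dict with a 'possible_issues' column added."""
    for row in rows:
        out = dict(row)
        out["possible_issues"] = "; ".join(checker(row["text"]))
        yield out
```

The flagged tokens are hints for the human editor, not corrections; legitimate mixed tokens (for instance a year followed by a Tamil suffix) would also be flagged by this heuristic.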

This will be a very useful dataset for reference as well as machine learning.

Related Tickets:
#198
