The Tamil science and social science encyclopedias have been digitized here:
https://archive.org/details/utsc_tamil?tab=collection&query=encylopedia
The OCR from Internet Archive is quite good. Extract the title, article text, author(s) where available, metadata (volume and page number), and related images.
Prepare a large CSV containing the full text of every article. Alternatively, store each article as a plain text file and keep pointers to those files in the CSV.
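The second layout (plain text files with pointers in the CSV) could look roughly like the sketch below. It is only an illustration: the `Article` fields, the column names, the `articles/` directory, and the `index.csv` file name are assumptions made here, not an agreed schema.

```python
import csv
from dataclasses import dataclass, field
from pathlib import Path

# Hypothetical column names; the real schema is still to be decided.
FIELDS = ["title", "authors", "volume", "page", "images", "text_file"]


@dataclass
class Article:
    title: str
    text: str
    volume: int
    page: int
    authors: list[str] = field(default_factory=list)  # may be empty
    images: list[str] = field(default_factory=list)   # related image file names


def write_dataset(articles, out_dir):
    """Write one text file per article plus a CSV index pointing to them."""
    out_dir = Path(out_dir)
    (out_dir / "articles").mkdir(parents=True, exist_ok=True)
    index_path = out_dir / "index.csv"
    with index_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for n, article in enumerate(articles, start=1):
            # The pointer is a path relative to the dataset root.
            rel_path = f"articles/{n:06d}.txt"
            (out_dir / rel_path).write_text(article.text, encoding="utf-8")
            writer.writerow({
                "title": article.title,
                "authors": "; ".join(article.authors),
                "volume": article.volume,
                "page": article.page,
                "images": "; ".join(article.images),
                "text_file": rel_path,
            })
    return index_path
```

Keeping the article text out of the CSV keeps the index small and easy to open in a spreadsheet, at the cost of one extra file read per article.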
This data can then be proofread and cleaned further, either manually or with help from tools such as a grammar checker or an LLM. For example, one column could list the mistakes identified by an LLM or a grammar checker, which would be useful for the (human) editor or proofreader.
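As a rough sketch of the review-column idea, the snippet below adds a `flagged` column to each row. The checker is a deliberately simple stand-in, not a real grammar checker or LLM: it assumes that a token mixing Tamil characters with ASCII letters or digits is a likely OCR error. The function names and the `flagged` column name are hypothetical.

```python
import re

# Tamil Unicode block and ASCII letters/digits.
TAMIL = re.compile(r"[\u0B80-\u0BFF]")
ASCII_ALNUM = re.compile(r"[A-Za-z0-9]")


def flag_suspect_tokens(text):
    """Return whitespace-separated tokens that mix Tamil and ASCII alphanumerics."""
    return [
        token for token in text.split()
        if TAMIL.search(token) and ASCII_ALNUM.search(token)
    ]


def add_review_column(rows, text_key="text", column="flagged"):
    """Return copies of the rows with an extra column listing suspect tokens."""
    reviewed = []
    for row in rows:
        new_row = dict(row)  # leave the input row untouched
        new_row[column] = "; ".join(flag_suspect_tokens(row[text_key]))
        reviewed.append(new_row)
    return reviewed
```

A grammar checker or LLM could replace `flag_suspect_tokens` later without changing the shape of the output, since the editor only needs the extra column.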
This will be a very useful dataset for reference as well as machine learning.
Related Tickets:
#198